Towards Task Understanding in Visual Settings

11/28/2018 ∙ by Sebastin Santy, et al. ∙ Spotify, UCL

We consider the problem of understanding real world tasks depicted in visual images. While most existing image captioning methods excel at producing natural language descriptions of visual scenes involving human tasks, there is often a need to understand the exact task being undertaken rather than to obtain a literal description of the scene. We leverage insights from real world task understanding systems, and propose a framework composed of convolutional neural networks and an external hierarchical task ontology to produce task descriptions from input images. Detailed experiments highlight the efficacy of the extracted descriptions, which could potentially find their way into many applications, including image alt text generation.





A substantial portion of real world images depict a human task; for example, Figure 3 shows tasks like pitching a baseball, building lego toys or playing a guitar. While humans can understand and describe the task intent with just a quick glance at a visual scene, most image captioning systems are only able to generate a plain description of the different visual elements in the image. Especially for highly complex tasks, predicting task intent may involve more than just generating a plain description of the scene. Understanding the task context in visual scenes is important in a number of application settings, including image alt text generation, image suggestions and image search.

While a lot of work has gone into generating image descriptions, most prior work [Bernardi and Cakici2016] has used visual and multi-modal spaces to assist the generation of dense natural language descriptions, which allow for a more expressive prediction. Often such detailed descriptions of the different visual elements are not required, and a minimal explanation of what is happening in the image suffices. Such task-based captions help keep the description more technical and contextual. For example, “Pitch a Baseball” is an apt minimal replacement for “baseball player is throwing ball in game.” in Figure 3, being more technical and contextual while maintaining brevity. Our method aims to overcome this limitation of existing methods by precisely predicting the task present in the scene while preserving the context, as opposed to synthesizing a verbose description.

We jointly leverage insights from recent advancements in deep convolutional architectures and hierarchical task ontologies, and propose a two phase model to suggest scene task descriptions. The convolutional architecture generates contextual labels from the image, while the task extractor maps these labels to real world tasks. We leverage the TaskHierarchy138K ontology, which contains ‘tasks’ and the keywords associated with each of these ‘tasks’ in a hierarchical structure, with a complex task often decomposed into simpler sub-tasks. Detailed qualitative and quantitative experiments demonstrate that our method not only helps in extracting task information, but also provides more useful descriptions when compared with state-of-the-art image description approaches.


In order to extract the tasks depicted in an image, we propose a two phased model: i) Multi-label classification of scenes to generate input labels for the task extractor and ii) Leveraging external hierarchical ontology for task identification by task extractor.

For image classification, we train the deep Inception v4 architecture [Szegedy et al.2017], which is capable of detecting 1000 categories, and fine-tune the network for multi-label contextual classification of the scene. The input image is fed to this classifier to obtain contextual labels along with their respective confidence scores. The labels generated by the classifier are further processed to filter out redundant information, and the resulting filtered set of labels is then passed to the task extractor module. Given the labels of the image, the task extractor probes a task hierarchy to suggest tasks. To infer the task, we leverage TaskHierarchy138K, an external task ontology that uses WikiHow articles to provide task information for over 100k real world tasks. Each node in this hierarchy represents a WikiHow category, with its child nodes representing its subcategories. A node $n$ contains a Representative Embedding $R_n$ and an Average Embedding $A_n$. $A_n$ is the average embedding of the articles present in node $n$. $R_n$ is calculated recursively for all nodes except leaf nodes. For leaf nodes, $R_n = A_n$.
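
As a rough illustration of how such a hierarchy of node embeddings might be organised, here is a minimal Python sketch. The exact recursive rule for the representative embedding is not recoverable from the text above, so the sketch assumes an unweighted mean of a node's own average embedding and its children's representative embeddings, and assumes that a leaf's representative embedding equals its average embedding. The names `Node`, `mean` and `compute_embeddings`, and the toy vectors, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


@dataclass
class Node:
    name: str
    article_embeddings: List[Vector]  # assumed: at least one article per node
    children: List["Node"] = field(default_factory=list)
    average: Optional[Vector] = None
    representative: Optional[Vector] = None


def compute_embeddings(node: Node) -> Vector:
    """Fill in average and representative embeddings, bottom-up."""
    node.average = mean(node.article_embeddings)
    if not node.children:
        # ASSUMPTION: base case, a leaf is represented by its own articles.
        node.representative = node.average
    else:
        # ASSUMPTION: unweighted mean of the node's own average embedding
        # and its children's representative embeddings.
        child_reps = [compute_embeddings(child) for child in node.children]
        node.representative = mean([node.average] + child_reps)
    return node.representative


# Toy usage with made-up 2-D embeddings.
leaf_a = Node("baseball", [[1.0, 0.0]])
leaf_b = Node("guitar", [[0.0, 1.0]])
root = Node("root", [[1.0, 1.0]], children=[leaf_a, leaf_b])
compute_embeddings(root)
```

Computing the embeddings once, bottom-up, means the later search over the hierarchy only needs to compare a query vector against precomputed node vectors.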

Given a task hierarchy and a set of labels produced from the classifier, we start with the root of the tree, and then trickle the labels down through the hierarchy. This trickling process is divided into two steps in order to achieve speed while simultaneously making it robust to noise.

  1. First Order Trickling: We pass a weighted average embedding of all the labels down the hierarchy, starting from the root node and recursively trickling to the child node whose embedding has the maximum cosine similarity with this weighted average. The trickling stops at a node when the maximum cosine similarity does not exceed a threshold.

  2. Second Order Trickling: This step trickles further down from the node reached in the previous step. Some specific low-weighted labels belonging to a subcategory of the reached node get buried in the weighted average. Hence, we calculate the cosine similarity between each individual label and the embeddings of the child nodes. The labels are trickled down to the node that returns the maximum cosine similarity, iff it is higher than the threshold defined in the previous step.
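
The two trickling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each node is assumed to carry a single precomputed embedding, the threshold value and all vectors are made up, and second order trickling is assumed to repeat until no child exceeds the threshold. The names `Node`, `first_order` and `second_order` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_average(vectors: List[Vector], weights: List[float]) -> Vector:
    """Average of label embeddings, weighted by classifier confidence."""
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]


@dataclass
class Node:
    name: str
    embedding: Vector
    children: List["Node"] = field(default_factory=list)


def best_child(node: Node, query: Vector) -> Tuple[Optional[Node], float]:
    """Child most similar to the query, with its similarity."""
    best, best_sim = None, float("-inf")
    for child in node.children:
        sim = cosine(query, child.embedding)
        if sim > best_sim:
            best, best_sim = child, sim
    return best, best_sim


def first_order(root: Node, labels: List[Vector], confidences: List[float],
                threshold: float) -> Node:
    """Trickle the weighted average of all labels down from the root."""
    query = weighted_average(labels, confidences)
    node = root
    while node.children:
        child, sim = best_child(node, query)
        if sim <= threshold:
            break  # no child is similar enough: stop here
        node = child
    return node


def second_order(node: Node, labels: List[Vector], threshold: float) -> Node:
    """Trickle further using each label on its own (ASSUMPTION: repeated)."""
    while node.children:
        best, best_sim = None, float("-inf")
        for label in labels:
            child, sim = best_child(node, label)
            if sim > best_sim:
                best, best_sim = child, sim
        if best_sim <= threshold:
            break
        node = best
    return node


# Toy hierarchy; axes are (sport, bat, water, instrument), all values made up.
baseball = Node("baseball", [0.2, 1.0, 0.0, 0.0])
swimming = Node("swimming", [0.2, 0.0, 1.0, 0.0])
sports = Node("sports", [1.0, 0.3, 0.3, 0.0], [baseball, swimming])
music = Node("music", [0.0, 0.0, 0.0, 1.0])
root = Node("root", [1.0, 1.0, 1.0, 1.0], [sports, music])

# Two confident generic labels and one low-confidence specific label ("bat").
labels = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
confidences = [0.9, 0.8, 0.2]

coarse = first_order(root, labels, confidences, threshold=0.5)
fine = second_order(coarse, labels, threshold=0.5)
```

In this toy setup the weighted average is dominated by the two generic labels, so the first step settles on the broad category, while the low-confidence specific label is what carries the search one level deeper in the second step.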

After the node is captured by the trickling process, we rank the tasks it contains using cosine similarity over their respective article knowledge, to suggest the appropriate task.
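
The final ranking step might look like the sketch below. It assumes that each task in the captured node has a single article embedding and that the query is a label embedding in the same space; the function name `rank_tasks`, the task list and the vectors are hypothetical toy data.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_tasks(query: Vector,
               task_embeddings: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Sort the tasks of a node by similarity to the query, best first."""
    scored = [(task, cosine(query, emb)) for task, emb in task_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage: made-up 2-D article embeddings for tasks in one node.
tasks = {
    "Play a Guitar": [0.0, 1.0],
    "Catch a Baseball": [0.5, 0.5],
    "Pitch a Baseball": [0.9, 0.1],
}
ranking = rank_tasks([1.0, 0.0], tasks)
suggested_task = ranking[0][0]
```

Returning the full ranking rather than only the top task leaves room for suggesting several candidate tasks.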

Results and Discussion

To the best of our knowledge, this is the first work on predicting the task being undertaken in a given scene. Generating expressive image descriptions is the closest line of work to task suggestion. We compare our results with one of the best image descriptors, NeuralTalk2 [Karpathy and Fei-Fei2015], in Figure 3. It should be noted that our work does not compete with the image descriptor. Rather, we show that we are able to suggest tasks fairly accurately with a less complex model by leveraging existing task information.

We conducted a crowd-sourced study on Amazon Mechanical Turk. In the study, workers assess 10 randomly picked images along with image descriptions generated by NeuralTalk2, im2txt [Vinyals and Toshev2015] and a multi-label classifier [Szegedy et al.2017] (as baselines), and by our method. NeuralTalk2 uses convolutional and recurrent neural networks in a multimodal space to generate image descriptions. im2txt is similar to NeuralTalk2 with a better classifier. We evaluate on the basis of 4 metrics: Task Relevance, Usefulness, General Preference and Technicality. As seen in Figure 3, our method outperforms NeuralTalk2 and im2txt captions on the task relevance metric by a large margin and performs almost equally on the other three metrics. As expected, multi-label classifier tags perform poorly due to non-aesthetic descriptions. This study supports our assertion about the task suggestion capability of our method.

Conclusion and Future Work

In this work, we propose a novel method for a scene task suggestion system. These descriptions can be used for applications like image alt text generation, or as priors for existing image description models to build their descriptions upon, rather than generating them from scratch. However, this kind of system is constrained to work on scenes where the task being done is a prominent part of the scene. We intend to extend this work to aid existing dense image description generation, making models intrinsically more task-aware by injecting task coherence scores within their architecture.


This project was partially funded by the EPSRC Fellowship titled “Task Based Information Retrieval”, grant reference number EP/P024289/1.


  • [Bernardi and Cakici2016] Bernardi, R., and Cakici, R. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. JAIR.
  • [Karpathy and Fei-Fei2015] Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
  • [Szegedy et al.2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI.
  • [Vinyals and Toshev2015] Vinyals, O., and Toshev, A. 2015. Show and tell: A neural image caption generator. In CVPR.