The VGG Image Annotator (VIA)

by Abhishek Dutta et al.
University of Oxford

Manual image annotation, such as defining and labelling regions of interest, is a fundamental processing stage of many research projects and industrial applications. In this paper, we introduce a simple and standalone manual image annotation tool: the VGG Image Annotator (VIA, available at http://www.robots.ox.ac.uk/~vgg/software/via). This is a lightweight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. Owing to its light footprint and flexibility, the VIA software has quickly become an essential and invaluable research support tool in many academic disciplines. It has also proved immensely popular in several industrial sectors, which have invested in adapting this open source software to their requirements. Since its public release in 2017, the VIA software has been used more than 500,000 times and has nurtured a large and thriving open source community.



1 Introduction

Manual annotation of a digital image, audio or video is a fundamental processing stage of many research projects and industrial applications. It requires human annotators to define and describe spatial regions associated with an image or still frame from a video and temporal segments associated with audio or video. Spatial regions are defined using standard region shapes such as rectangle, circle, ellipse, point, polygon, polyline, freehand drawn mask, etc. while the temporal segments are defined by delineating start and end timestamps (e.g. video segment from 3.1 sec. to 9.2 sec.). These spatial regions and temporal segments are described using textual metadata.
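The spatial regions and temporal segments described above can be sketched as plain data records. The following is an illustrative sketch only (the field names are assumptions, not VIA's actual file format):

```javascript
// Illustrative sketch: spatial regions and temporal segments as plain
// records. Field names are hypothetical, not VIA's internal schema.

// A spatial region: a shape name, its geometry, and textual metadata.
function makeRegion(shape, geometry, metadata) {
  return { shape, geometry, metadata };
}

// A temporal segment: start/end timestamps in seconds plus metadata.
function makeSegment(start, end, metadata) {
  if (end <= start) throw new Error("segment must end after it starts");
  return { start, end, metadata };
}

const faceBox = makeRegion("rect",
  { x: 10, y: 20, width: 80, height: 60 },
  { label: "face" });

// e.g. the video segment from 3.1 sec. to 9.2 sec. mentioned above.
const speech = makeSegment(3.1, 9.2, { speaker: "pilot" });
```

The same two record shapes cover all six region types (rectangle, circle, ellipse, point, polygon, polyline) by varying the `geometry` payload.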

A manual annotation tool allows human annotators to define and describe such spatial regions and temporal segments. In this paper, we present a simple and standalone manual annotation tool, the VGG Image Annotator (VIA), that runs in a web browser and does not require any installation or setup. The complete VIA software fits in a single self-contained HTML page of small size that runs as an offline application in most modern web browsers. This light footprint allows the VIA software to be easily shared (e.g. by email) and distributed amongst manual annotators. VIA can be downloaded from http://www.robots.ox.ac.uk/~vgg/software/via.

VIA software is an open source project created solely using HTML, Javascript and CSS. This choice of platform has allowed us to build a flexible manual annotation tool with the following capabilities: a) Up and running in a few seconds, b) No installation or setup required, c) Lightweight, portable and offline, d) Simple and easy to use. Since VIA requires no installation or setup, non-technical users can begin annotating their images, audio and video very quickly; consequently, we have seen widespread adoption of this software in a large number of academic disciplines and industrial sectors. A minimalistic approach to user interface design and rigorous testing (both internally and by our vibrant open source community) has allowed the VIA software to become an easily configurable, simple and easy-to-use manual annotation tool.

The development of VIA software began in August 2016 and the first public release of Version 1 was made in April 2017. Many new advanced features for image annotation were introduced in Version 2, which was released in June 2018 [1]. The recently released Version 3 supports annotation of audio and video. As of May 2019, the VIA software has been used more than 500,000 times (measured in unique pageviews).

This paper is organised as follows. We describe different use cases of the VIA software in Sections 2 to 4, and the software design principles in Section 5. A brief overview of the open source ecosystem thriving around the VIA software is included in Section 6. The impact of VIA software on several academic and industrial projects is discussed in Section 7. Finally, we describe our planned directions for extensions in Section 8.

2 Image Annotation

VIA software allows human annotators to define and describe regions in an image. The manually defined regions can have one of the following six shapes: rectangle, circle, ellipse, polygon, point and polyline. Rectangular regions are very common and are mostly used to define the bounding box of an object. Polygon regions are used to capture the boundary of objects with a complex shape. The point shape is used to define feature points such as facial landmarks, keypoints in MRI images, or the locations of particles in microscopy images. The circle, ellipse and polyline shapes are less common but are essential for some projects. Examples are given in Figure 2.

Figure 2: The VIA software is being used in a wide range of academic disciplines and industrial sectors to define and describe regions in an image. For example, (a) actor faces are annotated using the rectangle shape and identified using a predefined list; (b) the boundary of arbitrarily shaped objects in a scanning electron microscope image is defined using circle and polygon shapes by [2]; (c) 15th-century printed illustrations are annotated using the rectangle shape; and (d) the point shape has been used by [3] to manually define the locations of particles in a cryo-electron microscopy image.

The textual description of each region is essential for many projects. Such textual descriptions often describe the visual content of the region. While a plain text input element is sufficient to update the textual description, VIA supports the following additional input types with predefined lists: checkbox, radio, image and dropdown. The predefined list ensures label naming consistency across human annotators.
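A predefined list can be sketched as an attribute definition plus a small validator that rejects labels outside the list. This is a hypothetical illustration (the attribute layout and option values are assumptions, not VIA's internal schema):

```javascript
// Hypothetical sketch of an attribute with a predefined option list.
const attribute = {
  name: "actor",
  type: "dropdown",
  options: ["Sherlock", "Watson", "unknown"]
};

// Reject any label outside the predefined list, which is how such a
// list enforces naming consistency across human annotators.
function setLabel(region, attr, value) {
  if (!attr.options.includes(value)) {
    throw new Error(value + " is not in the predefined list for " + attr.name);
  }
  region.metadata[attr.name] = value;
  return region;
}

const region = { shape: "rect", metadata: {} };
setLabel(region, attribute, "Watson"); // accepted: in the list
```

A free-text attribute would simply skip the `options` check, which is why plain text input alone cannot guarantee consistent label names.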

3 Image Group Annotation

Annotation of large image datasets is rarely accomplished solely by human annotators. More usually, a two stage process is used to reduce the burden on human annotators: a) Automatic Annotation: computer vision algorithms are applied to the image dataset to produce a preliminary (but possibly imperfect) annotation of the images; b) Manual Filtering, Selection and Update: human annotators review the annotations produced by the automatic stage and filter, select and update them to retain only the good quality annotations. This two stage process off-loads the burden of image annotation from human annotators, who only need to filter, select and update the automatic annotations. The VIA software supports this two stage model of image annotation using its Image Grid View feature, which is designed to help human annotators filter, select and update metadata associated with a group of images. The image groups are based on the metadata and regions produced by the automatic computer vision algorithms.
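The two stages above can be sketched in a few lines. The record layout, score threshold and label values here are illustrative assumptions, not VIA's actual data model:

```javascript
// Stage 1 (illustrative): automatic annotations with confidence scores,
// as a computer vision algorithm might produce them.
const automatic = [
  { file: "a.jpg", label: "face", score: 0.97 },
  { file: "b.jpg", label: "face", score: 0.32 },
  { file: "c.jpg", label: "cat",  score: 0.88 }
];

// Stage 2a: discard low-confidence annotations (a crude stand-in for
// the human filtering step).
function filterByScore(annotations, threshold) {
  return annotations.filter(a => a.score >= threshold);
}

// Stage 2b: group the retained annotations by label so that a human
// can review each group at once, grid-view style.
function groupByLabel(annotations) {
  const groups = {};
  for (const a of annotations) {
    (groups[a.label] = groups[a.label] || []).push(a);
  }
  return groups;
}

const groups = groupByLabel(filterByScore(automatic, 0.5));
```

In practice the filtering decision is made by the human annotator rather than a fixed threshold; the point is that humans only review and correct, rather than annotate from scratch.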

Figure 3: A set of automatically detected face tracks in consecutive video frames from BBC Sherlock series is assigned metadata (is_good_track and name) quickly by human annotators using the image grid view feature of VIA. A face track containing incorrect detections can also be easily filtered out by setting is_good_track to “No”.

To illustrate the image grid view feature of VIA, consider the task of face track annotation, which involves delineating and identifying the face region of an individual in consecutive frames of a video – also called a face track. Such annotated datasets are often used to train face detection and recognition algorithms. For face track annotation, an automatic face detector (e.g. Faster R-CNN [4]) is used to detect face regions in consecutive frames of a video, and a face track identification system (e.g. VGG Face Tracker [5]) identifies unique face tracks from the automatically detected face regions. The automatically generated face track annotations are imported into the VIA software and human annotators – using the image grid view feature of VIA – review these face tracks and select or filter the ones that are correct, as shown in Figure 3. Furthermore, the VIA image grid view also allows bulk update of the attributes associated with each group. This capability allows human annotators to quickly annotate a large number of images that have been partially annotated by automatic computer vision algorithms.
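The bulk-update step can be sketched as one operation applied to every frame-level detection in a track. The record layout is hypothetical; the `is_good_track` and `name` attributes mirror those in Figure 3:

```javascript
// Hypothetical face track: one detection record per consecutive frame,
// all sharing a track id, each carrying its own metadata object.
const track = [
  { frame: 101, track_id: 7, metadata: {} },
  { frame: 102, track_id: 7, metadata: {} },
  { frame: 103, track_id: 7, metadata: {} }
];

// Bulk update: apply the same metadata to every detection in the group,
// as the image grid view does for a selected group.
function bulkUpdate(detections, updates) {
  for (const d of detections) Object.assign(d.metadata, updates);
  return detections;
}

bulkUpdate(track, { is_good_track: "Yes", name: "Sherlock" });
```

Setting `is_good_track` to "No" for a whole track is the same single operation, which is what makes filtering out incorrect tracks fast.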

The grid view also enables annotators to easily remove erroneous images from a group. This functionality is very useful for re-training an existing image classifier by identifying images that have been incorrectly classified.

4 Audio and Video Annotation

The VIA software also allows human annotators to define temporal segments of an audio or video and describe those segments using textual metadata. Such manually annotated audio or video segments are useful for many projects. For instance, a large number of human annotators are using VIA to collaboratively define temporal segments containing the speech of an individual (i.e. speaker diarisation) in videos taken from [6], as shown in Figure 4. Such a manually annotated dataset is essential for assessing the accuracy of automatic speaker diarisation tools. In a similar way, researchers from the Oxford Anthropology department are using VIA to identify video segments containing a particular chimpanzee in videos captured at a forest site. Such annotated video segments are used to train and test computer vision algorithms that can automatically detect and identify chimpanzees in videos captured in “the wild”. As an illustrative example for audio, we show the results of speaker diarisation on an audio recording containing a conversation between an air traffic controller (ATC) and a pilot in Figure 1 (middle).

Figure 4: VIA software being used to perform speaker diarisation for a video containing a conversation between two individuals. Human annotators manually identify the segments of the video that contain the speech of an individual.
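A diarisation result like the ATC/pilot example is just a list of labelled temporal segments. The following sketch (timestamps and speaker names are invented for illustration) shows one simple use of such segments, totalling speech time per speaker:

```javascript
// Illustrative diarisation output: non-overlapping temporal segments,
// each labelled with the speaker heard in that interval.
const segments = [
  { start: 0.0,  end: 4.5,  speaker: "ATC" },
  { start: 4.5,  end: 9.0,  speaker: "pilot" },
  { start: 9.0,  end: 11.0, speaker: "ATC" }
];

// Sum the duration of each speaker's segments.
function speechTimePerSpeaker(segs) {
  const totals = {};
  for (const s of segs) {
    totals[s.speaker] = (totals[s.speaker] || 0) + (s.end - s.start);
  }
  return totals;
}

const totals = speechTimePerSpeaker(segments);
```

Comparing such manually defined segments against an automatic tool's output is what makes the dataset useful for assessing diarisation accuracy.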

5 Software Design

The user interface of VIA is made using standard HTML components and therefore the VIA software looks familiar to most new users. These components are styled using CSS to achieve a greyscale colour scheme which helps avoid distractions and focus attention on the visual content that is being manually annotated using the VIA software. We follow the minimalist approach for the user interface, and strive for simplicity both in design and implementation. We resist adding new features or updating existing user interface components if we feel that such change leads to complexity in terms of usability and implementation. Most of our design decisions are influenced by feedback from the open source community thriving around the VIA software.

The HTML and CSS based user interface of VIA is powered by Javascript code that relies solely on standard features available in modern web browsers. VIA does not depend on any external libraries. These design decisions have helped us create a very lightweight and feature rich manual annotation software package that can run in most modern web browsers without requiring any installation or setup. The full VIA software sprouted from an early prototype of VIA which implemented a minimal – yet functional – image annotation tool in a single short HTML/CSS/Javascript file that runs as an offline application in most modern web browsers. This early prototype provides a springboard for understanding the current codebase of VIA, which is an extension of that prototype. Detailed source code documentation is available for existing developers and potential contributors to the VIA open source project.
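One consequence of relying only on standard language features is that saving and restoring a whole annotation project needs nothing beyond `JSON.stringify` and `JSON.parse`. The project layout below is an invented illustration, not VIA's actual export format:

```javascript
// Hypothetical project object: per-file annotation records keyed by
// filename, serialisable with standard features only (no libraries).
const project = {
  files: {
    "img1.jpg": { regions: [{ shape: "point", x: 12, y: 34 }] }
  }
};

// Save: serialise to a plain text string (e.g. for an offline download).
function saveProject(p) { return JSON.stringify(p); }

// Load: restore the project from that string.
function loadProject(txt) { return JSON.parse(txt); }

const restored = loadProject(saveProject(project));
```

In the browser the resulting string can be offered as a downloadable file, so the whole save/load path works offline.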

Many existing manual annotation software tools (e.g. [7, 8]) require installation and setup. This requirement is often a barrier for non-technical users who cannot deal with the variability of software installation and setup procedures across different types of computing systems. The VIA software and other recent annotation software tools like [9] have overcome this challenge by using the web browser as a platform for deploying offline manual annotation software. Since a standard web browser is already installed on most computing systems, users can get up and running with such browser based manual annotation software in a few seconds.

6 Open Source Ecosystem

We have nurtured a large and thriving open source community which not only provides feedback but also contributes code to add new features and improve existing features in the VIA software. The open source ecosystem of VIA thrives around its source code repository hosted on the GitLab platform. Most of our users report issues and request new features for future releases using the issue portal. Many of our users not only submit bug reports but also suggest a potential fix for these software issues. Some of our users also contribute code to add new features to the VIA software using the merge request portal. Thanks to the flexibility provided by our BSD open source software license, many representatives from commercial industry have contacted us by email to seek advice for their engineering teams tasked with adapting the VIA software for internal or commercial use.

7 Impact on Academia and Industry

The VIA software has quickly become an essential and invaluable research support tool for many academic disciplines. For example, in Humanities, VIA has been used to annotate hundreds of 15th-century printed illustrations [10] and annotate images “which are meant to be read as texts” [11]. In Computer Science, large numbers of image and video datasets have been manually annotated by groups of human annotators using the VIA software [12]. In the History of Art, VIA was used to manually annotate a multilayered 14th-century cosmological diagram containing many different elements [13]. In Physical Sciences, VIA is being used to annotate particles in electron microscopy images [2, 3]. In Medicine, the VIA software has allowed researchers to create manually annotated medical imaging datasets [14, 15, 16].

VIA has also been very popular in several industrial sectors which have invested in adapting this open source software to their specific requirements. For example, Puget Systems (USA), Larsen & Toubro Infotech Ltd. (India) and Vidteq (Bangalore, India) have integrated the VIA software into their internal workflows. Trimble Inc. (Colorado, USA) adapted VIA for large scale collaborative annotation by running VIA on the Amazon Mechanical Turk platform.

8 Summary and Future Development

In this paper, we described our manual annotation tool called VGG Image Annotator (VIA). We continue to develop and maintain this software according to the principles of open source software development and maintenance.

VIA is a continually evolving open source project which aims to be useful for manual annotation tasks in many academic disciplines and industrial settings. This demands continuous improvement and introduction of advanced new features in VIA. In the future releases of VIA software, we will introduce the following two features:

  • Collaborative Annotation: Annotating a large number of images (e.g. a million images) or videos (e.g. thousands of hours of video) requires collaboration between many human annotators. Therefore, we are upgrading VIA to support collaborative annotation, which will allow multiple human annotators to incrementally and independently annotate a large collection of images and videos. The collaborative annotation feature is now being tested internally and will soon be released as part of VIA version 3.

  • Plugins: The state-of-the-art computer vision models are becoming very accurate in common annotation tasks such as locating objects, detecting and recognising human faces, reading text, detecting keypoints on a human body and many other tasks commonly assigned to human annotators. These computer vision models can help speed up the manual annotation process by seeding an image with automatically annotated regions and then letting human annotators edit or update these detections to create the final annotation. Thanks to projects like TensorFlow.js, it is now possible to run many of these models in a web browser. We envisage such computer vision models attached as plugins to VIA and running in the background to assist human annotators.
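The envisaged plugin flow can be sketched as follows. The detector here is a mock standing in for an in-browser model (e.g. one run with TensorFlow.js); the function names and record layout are assumptions, not a released VIA API:

```javascript
// Mock detector standing in for an in-browser computer vision model;
// a real plugin would call the model here and return its detections.
function mockDetector(imageId) {
  return [{ x: 5, y: 5, width: 40, height: 40, score: 0.91, label: "face" }];
}

// Seed an image with editable region records built from automatic
// detections, for a human annotator to correct afterwards.
function seedRegions(imageId, detect) {
  return detect(imageId).map(d => ({
    shape: "rect",
    geometry: { x: d.x, y: d.y, width: d.width, height: d.height },
    metadata: { label: d.label, source: "automatic" } // human-editable
  }));
}

const seeded = seedRegions("img1.jpg", mockDetector);
```

Marking seeded regions with a `source` field would let the tool distinguish automatic suggestions from human-verified annotations during review.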