Manual annotation of digital images, audio and video is a fundamental processing stage of many research projects and industrial applications. It requires human annotators to define and describe spatial regions associated with an image or a still frame of a video, and temporal segments associated with audio or video. Spatial regions are defined using standard region shapes such as a rectangle, circle, ellipse, point, polygon, polyline or freehand-drawn mask, while temporal segments are delineated by start and end timestamps (e.g. a video segment from 3.1 sec. to 9.2 sec.). These spatial regions and temporal segments are then described using textual metadata.
A manual annotation tool allows human annotators to define and describe such spatial regions and temporal segments. In this paper, we present a simple and standalone manual annotation tool, the VGG Image Annotator (VIA), which runs in a web browser and does not require any installation or setup. The complete VIA software fits in a single self-contained HTML page of size less than 400 kilobytes that runs as an offline application in most modern web browsers. This light footprint allows the VIA software to be easily shared (e.g. by email) and distributed amongst manual annotators. VIA can be downloaded from http://www.robots.ox.ac.uk/~vgg/software/via.
The development of the VIA software began in August 2016 and the first public release of Version 1 was made in April 2017. Many new advanced features for image annotation were introduced in Version 2, released in June 2018. The recently released Version 3 supports annotation of audio and video. As of May 2019, the VIA software had been used hundreds of thousands of times.
This paper is organised as follows. We describe different use cases of the VIA software in Sections 2–4, and the software design principles in Section 5. A brief overview of the open source ecosystem thriving around the VIA software is included in Section 6. The impact of VIA software on several academic and industrial projects is discussed in Section 7. Finally, we describe our planned directions for extensions in Section 8.
2 Image Annotation
The VIA software allows human annotators to define and describe regions in an image. Manually defined regions can take one of the following six shapes: rectangle, circle, ellipse, polygon, point and polyline. Rectangular regions are the most common and are typically used to define the bounding box of an object. Polygonal regions capture the boundary of objects with complex shapes. The point shape is used to mark feature points such as facial landmarks, keypoints in MRI images or the locations of particles in microscopy images. The circle, ellipse and polyline shapes are less common but are essential for some projects. Examples are given in Figure 2.
The textual description of each region is essential for many projects, and often describes the visual content of the region. While a plain text input element is sufficient for entering such descriptions, VIA also supports the following input types backed by predefined lists of options: checkbox, radio, image and dropdown. A predefined list ensures label naming consistency across human annotators.
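The combination of a region shape and its textual attributes is stored as a simple JSON record. The following Python sketch illustrates the kind of structure VIA exports; the field names follow the VIA 2 project format, but treat the exact keys here as an assumption for illustration rather than a specification:

```python
import json

# Illustrative annotation for one image: a rectangular region with a
# textual "class" attribute chosen from a predefined dropdown list.
annotation = {
    "image1.jpg": {
        "filename": "image1.jpg",
        "regions": [
            {
                # geometry of the region (a bounding box here)
                "shape_attributes": {
                    "name": "rect", "x": 10, "y": 20,
                    "width": 100, "height": 80,
                },
                # textual metadata describing the region
                "region_attributes": {"class": "face"},
            }
        ],
    }
}

# Round-trip through JSON, as happens when a project is exported and
# later re-imported for further annotation.
restored = json.loads(json.dumps(annotation))
region = restored["image1.jpg"]["regions"][0]
print(region["shape_attributes"]["name"], region["region_attributes"]["class"])
```

Keeping the geometry (`shape_attributes`) separate from the description (`region_attributes`) means the same textual schema works for any of the six region shapes.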
3 Image Group Annotation
Annotation of large image datasets is rarely accomplished solely by human annotators. More commonly, a two-stage process is used to reduce the burden on human annotators: a) Automatic Annotation: computer vision algorithms are applied to the image dataset to produce a preliminary (but possibly imperfect) annotation of the images; b) Manual Filtering, Selection and Update: human annotators review the automatically generated annotations and filter, select and update them to retain only the good quality annotations. This two-stage process off-loads the bulk of the annotation burden from human annotators, who only need to filter, select and update the automatic annotations. The VIA software supports this two-stage model of image annotation through its Image Grid View feature, which is designed to help human annotators filter, select and update metadata associated with a group of images. The image groups are based on the metadata and regions produced by automatic computer vision algorithms.
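The grouping step behind the grid view can be pictured with a small sketch: images are bucketed by an automatically predicted label so that each bucket can be reviewed, and accepted or rejected in bulk, as one grid. The data layout below is purely illustrative and not VIA's internal representation:

```python
from collections import defaultdict

# Hypothetical output of an automatic classifier: filename -> predicted label.
predictions = {
    "a.jpg": "cat", "b.jpg": "dog", "c.jpg": "cat", "d.jpg": "dog",
}

def group_by_label(preds):
    """Bucket images by their automatically predicted label so that each
    bucket can be shown as one grid for bulk accept/reject review."""
    groups = defaultdict(list)
    for filename, label in sorted(preds.items()):
        groups[label].append(filename)
    return dict(groups)

groups = group_by_label(predictions)
print(groups)  # {'cat': ['a.jpg', 'c.jpg'], 'dog': ['b.jpg', 'd.jpg']}
```

A human annotator then only has to scan each grid for outliers instead of labelling every image from scratch.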
To illustrate the image grid view feature of VIA, consider the task of face track annotation, which involves delineating and identifying the face region of an individual in consecutive frames of a video – a face track. Such annotated datasets are often used to train face detection and recognition algorithms. For face track annotation, an automatic face detector (e.g. Faster R-CNN) detects face regions in consecutive frames of a video, and a face track identification system (e.g. VGG Face Tracker) identifies unique face tracks from these automatically detected face regions. The automatically generated face tracks are imported into the VIA software, and human annotators – using the image grid view – review them and select or filter the ones that are correct, as shown in Figure 3. Furthermore, the image grid view also allows bulk update of the attributes associated with each group. This capability allows human annotators to quickly annotate a large number of images that have been partially annotated by automatic computer vision algorithms.
The grid view also enables annotators to easily remove erroneous images from a group. This functionality is very useful when re-training an existing image classifier, as it identifies the images that were incorrectly classified.
4 Audio and Video Annotation
The VIA software also allows human annotators to define temporal segments of an audio or video file and describe those segments using textual metadata. Such manually annotated segments are useful for many projects. For instance, a large number of human annotators are using VIA to collaboratively define temporal segments containing the speech of an individual (i.e. speaker diarisation) in videos taken from the VoxCeleb dataset, as shown in Figure 4. Such a manually annotated dataset is essential for assessing the accuracy of automatic speaker diarisation tools. Similarly, researchers from the Oxford Anthropology department are using VIA to identify video segments containing a particular chimpanzee in videos captured at a forest site. These annotated segments are used to train and test computer vision algorithms that automatically detect and identify chimpanzees in videos captured in “the wild”. As an illustrative example for audio, Figure 1 (middle) shows the result of speaker diarisation on a recording of a conversation between an air traffic controller (ATC) and a pilot.
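A temporal segment is fully described by a start timestamp, an end timestamp and its textual metadata, and downstream tools typically aggregate over these. The sketch below, with made-up segment values, shows one such aggregation (total annotated speech per speaker, as a diarisation evaluation might need):

```python
# Illustrative temporal segments for speaker diarisation: each segment is
# (start_sec, end_sec, speaker). The values are invented for illustration.
segments = [
    (3.1, 9.2, "ATC"),
    (9.8, 14.0, "pilot"),
    (15.5, 18.0, "ATC"),
]

def speech_time(segs):
    """Total annotated speech duration (seconds) per speaker."""
    totals = {}
    for start, end, speaker in segs:
        totals[speaker] = round(totals.get(speaker, 0.0) + (end - start), 2)
    return totals

print(speech_time(segments))  # {'ATC': 8.6, 'pilot': 4.2}
```

The same (start, end, metadata) representation serves both audio and video, which is why VIA can treat them uniformly.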
5 Software Design
The user interface of VIA is built from standard HTML components and therefore looks familiar to most new users. These components are styled with CSS to achieve a greyscale colour scheme, which helps avoid distraction and focuses attention on the visual content being annotated. We follow a minimalist approach to the user interface and strive for simplicity in both design and implementation. We resist adding new features, or updating existing user interface components, if we feel such a change would add complexity in terms of usability or implementation. Most of our design decisions are influenced by feedback from the open source community that has formed around the VIA software.
Many existing manual annotation tools (e.g. [7, 8]) require installation and setup. This requirement is often a barrier for non-technical users, who may struggle with the variability of installation and setup procedures across different computing systems. The VIA software, along with other recent annotation tools, overcomes this challenge by using the web browser as the deployment platform for offline manual annotation software. Since a standard web browser is already installed on most computing systems, users can get up and running with such browser-based annotation software in a few seconds.
6 Open Source Ecosystem
We have nurtured a large and thriving open source community which not only provides feedback but also contributes code to add new features and improve existing ones in the VIA software. The open source ecosystem of VIA thrives around its source code repository (https://gitlab.com/vgg/via) hosted on the GitLab platform. Most of our users report issues and request new features for future releases using the issue portal. Many users not only submit bug reports but also suggest potential fixes, and some contribute code for new features through the merge request portal. Thanks to the flexibility of our BSD open source software licence, many representatives from industry have contacted us by email to seek advice for engineering teams tasked with adapting the VIA software for internal or commercial use.
7 Impact on Academia and Industry
The VIA software has quickly become an essential and invaluable research support tool in many academic disciplines. For example, in the Humanities, VIA has been used to annotate hundreds of 15th-century printed illustrations and to annotate images “which are meant to be read as texts”. In Computer Science, a large number of image and video datasets have been manually annotated by groups of human annotators using the VIA software. In the History of Art, VIA was used to manually annotate a multilayered 14th-century cosmological diagram containing many different elements. In the Physical Sciences, VIA is being used to annotate particles in electron microscopy images [2, 3]. In Medicine, the VIA software has allowed researchers to create manually annotated medical imaging datasets [14, 15, 16].
VIA has also been very popular in several industrial sectors, which have invested in adapting this open source software to their specific requirements. For example, Puget Systems (USA), Larsen & Toubro Infotech Ltd. (India) and Vidteq (Bangalore, India) have integrated the VIA software into their internal workflows. Trimble Inc. (Colorado, USA) adapted VIA for large-scale collaborative annotation by running VIA on the Amazon Mechanical Turk platform.
8 Summary and Future Development
In this paper, we described our manual annotation tool called VGG Image Annotator (VIA). We continue to develop and maintain this software according to the principles of open source software development and maintenance.
VIA is a continually evolving open source project which aims to be useful for manual annotation tasks in many academic disciplines and industrial settings. This demands continuous improvement and the introduction of advanced new features. In future releases of the VIA software, we will introduce the following two features:
Collaborative Annotation: Annotating a large number of images (e.g. a million images) or videos (e.g. thousands of hours) requires collaboration among many human annotators. We are therefore upgrading VIA to support collaborative annotation, which will allow multiple human annotators to incrementally and independently annotate a large collection of images and videos. The collaborative annotation feature is currently being tested internally and will soon be released as part of VIA version 3.
Plugins: State-of-the-art computer vision models are becoming very accurate at common annotation tasks such as locating objects, detecting and recognising human faces, reading text and detecting keypoints on a human body – tasks commonly assigned to human annotators. These models can speed up the manual annotation process by seeding an image with automatically annotated regions, which human annotators then edit or update to create the final annotation. Thanks to projects like TensorFlow.js, many of these models can now run in a web browser. We envisage such computer vision models being attached as plugins to VIA and running in the background to assist human annotators.
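The seeding step in such a plugin amounts to converting raw model detections into editable region records. A minimal sketch follows; the detection tuple format and the confidence threshold are assumptions for illustration, not part of VIA or any particular model:

```python
# Hypothetical detector output: (x, y, width, height, confidence) boxes.
detections = [
    (10, 20, 100, 80, 0.95),
    (200, 40, 50, 50, 0.30),   # low confidence: not worth seeding
    (120, 60, 70, 90, 0.85),
]

def seed_regions(dets, min_confidence=0.5):
    """Keep confident detections and convert them into editable
    rectangular region seeds for a human annotator to review."""
    regions = []
    for x, y, w, h, conf in dets:
        if conf >= min_confidence:
            regions.append({"shape": "rect", "x": x, "y": y,
                            "width": w, "height": h})
    return regions

seeds = seed_regions(detections)
print(len(seeds))  # 2
```

Thresholding before seeding keeps the annotator's task one of light correction rather than wholesale deletion of spurious regions.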
-  Abhishek Dutta and Andrew Zisserman. The VGG image annotator (VIA). arXiv preprint arXiv:1904.10699, 2019.
-  BigParticle.Cloud. How-to: Generate primary object masks. https://www.bigparticle.cloud/index.php/how-to-generate-primary-object-masks/, 2010. Accessed: Mar 2019.
-  Tristan Bepler, Andrew Morin, Julia Brasch, Lawrence Shapiro, Alex J. Noble, and Bonnie Berger. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. arXiv e-prints, Mar 2018.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
-  Qiong Cao, Omkar M. Parkhi, Mark Everingham, Josef Sivic, and Andrew Zisserman. VGG face tracker. http://www.robots.ox.ac.uk/~vgg/software/face_tracker/, 2019. Accessed: Mar 2019.
-  Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
-  Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
-  Chuanhai Zhang, Kurt Loken, Zhiyu Chen, Zhiyong Xiao, and Gary Kunkel. Mask editor: an image annotation tool for image segmentation tasks. arXiv preprint arXiv:1809.06461, 2018.
-  Matthieu Pizenberg, Axel Carlier, Emmanuel Faure, and Vincent Charvillat. Web-based configurable image annotations. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 1368–1371. ACM, 2018.
-  Matilde Malaspina and Cristina Dondi. The 15cbooktrade project. http://15cbooktrade.ox.ac.uk/, 2019. Accessed: Mar 2019.
-  William Pascoe and Kaspar Paseko. Scriptopict. https://c21ch.newcastle.edu.au/scriptopict/, 2019. Accessed: May 2019.
-  Milind Naphade, David C Anastasiu, Anuj Sharma, Vamsi Jagrlamudi, Hyeran Jeon, Kaikai Liu, Ming-Ching Chang, Siwei Lyu, and Zeyu Gao. The NVIDIA AI City Challenge. In 2017 IEEE SmartWorld, pages 1–6. IEEE, 2017.
-  Sarah Griffin. Diagram and Dimension: Visualising Time in the Drawings of Opicinus De Canistris (1296-c. 1352). PhD thesis, University of Oxford, 2018.
-  Michael Ferlaino, Craig A Glastonbury, Carolina Motta-Mejia, Manu Vatish, Ingrid Granne, Stephen Kennedy, Cecilia M Lindgren, and Christoffer Nellåker. Towards deep cellular phenotyping in placental histology. arXiv preprint arXiv:1804.03270, 2018.
-  Alexander Rakhlin and Sergey Nikolenko. Neuromation research: Pediatric bone age assessment with convolutional neural networks. https://medium.com/neuromation-blog/, 2018. Accessed: Mar 2019.
-  Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher. Endoscopy artifact detection (ead 2019) challenge dataset, 2019.