Today, we gather thousands of hours of underwater imagery and video using Remotely Operated Vehicles (ROVs) and other underwater assets, but we remain unable to realize its full value because manual review takes a prohibitive amount of time, and much of the video is consequently never viewed. While this video data is collected by a wide variety of marine science research groups across the globe, its value cannot be fully realized until an exemplar annotated repository of marine life imagery exists. In this paper we propose FathomNet: a baseline image training set that is optimized to accelerate development of modern, intelligent, automated analysis of underwater imagery. In addition to the data set, this paper describes the exploratory experiments with deep learning algorithms that were used to create a baseline architecture for future work. Applications for this body of research span a wide spectrum, from accelerated video review and real-time algorithms for ROVs to large biodiversity analyses. Creating application-primed data sets has been an effective tactic to increase research accessibility and model robustness. Recent work on characterizing large data sets has demonstrated the importance of representative data for targeted applications. Attributes such as background, image viewpoint, and scale have dramatic effects on the generalization of algorithms to real-world scenarios.
We began by focusing on weakly supervised localization because we sought an algorithm that could propose bounding boxes for human experts to correct and verify. This workflow is known to reduce annotation time and to ease the strain on a user of producing so many annotations. The structure of the seed data set was such that most of the frame grabs had image-level labels associated with them. The ultimate goal of the FathomNet data set was to enable modern Convolutional Neural Network (CNN)-based object detection and classification algorithms to be developed for species that existed in the seed data set, which was crucial in demonstrating the viability and potential of the data set.
2 Technical Approach
With the advent of modern deep learning architectures and the abundance of annotated data sets, the importance of representative data for targeted applications has been effectively demonstrated. Niche data sets such as ObjectNet (Barbu et al., 2019) and Fashion-MNIST (Xiao, Rasul, & Vollgraf, 2017) have built upon the generalizable properties of MNIST (LeCun et al., 1998) and ImageNet (Deng et al., 2009). Until now, a baseline training data set for the analysis of underwater imagery has been unavailable. Using algorithms to accurately detect regions of interest (ROIs) in non-iconic imagery is critical for the progress of automated inspection of video that will be collected on expeditions for years to come. We introduce the provenance of the seed data in Section 2.1, and in Section 2.2 we describe the data set and its annotations. Full access to FathomNet and the existing models is available in the Supplemental Material section.
2.1 Seed data set
The seed data set was collected with high-resolution video equipment that records hundreds of remotely and autonomously operated vehicle dives each year. This video library contains detailed footage of the biological, chemical, geological, and physical aspects of each deployment. Our data set consists of more than 26,000 hours of videotape that has been archived, annotated, and maintained as a centralized resource. This resource is enabled by the Video Annotation and Reference System (VARS; Schlining and Jacobsen Stout, 2006), which is a software interface and database system that provides tools for describing, cataloguing, retrieving, and viewing the visual, descriptive, and quantitative data associated with deep-sea video archives. All of the video resources are expertly annotated by members of our team, and there are currently more than 6.8 million annotations and 4,349 concepts (or classes) in the VARS knowledge base, with over 2,000 of those concepts belonging to either genera or species.
Using the VARS Query tool, a list of annotations that describe different genera and geologic features can be obtained. From those annotations, each genus (of midwater and benthic animals) and each geologic feature was ranked by the number of associated frame grabs. In the nomenclature of the Panoptic Segmentation task (Kirillov, He, Girshick, Rother, & Dollár, 2019), animals correspond to “things” and geologic features correspond to “stuff”. We selected the top 18 midwater genera, the top 17 benthic genera, and the top 3 geologic features to incorporate into FathomNet (Figure 1). The initial phases of this effort aimed to automate the classification of “things” (animals) rather than “stuff” (geologic features), and the image set was then divided into Midwater and Benthic classes.
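The ranking and selection step described above can be illustrated with a minimal sketch. It assumes the query results are available as (group, concept) pairs, one per frame grab; the record layout, the function name top_concepts, and the concept names are hypothetical and do not reflect the VARS schema.

```python
from collections import Counter

def top_concepts(annotations, group_sizes):
    """Rank concepts within each group by frame-grab count and keep the top N.

    annotations: iterable of (group, concept) pairs, one per frame grab.
    group_sizes: dict mapping group name to the number of concepts to keep.
    """
    counts = Counter(annotations)  # (group, concept) -> number of frame grabs
    selected = {}
    for group, n in group_sizes.items():
        # most_common() yields entries sorted by descending count
        ranked = [c for (g, c), _ in counts.most_common() if g == group]
        selected[group] = ranked[:n]
    return selected

# Hypothetical records standing in for query results
records = [
    ("midwater", "Genus A"), ("midwater", "Genus A"), ("midwater", "Genus B"),
    ("benthic", "Genus C"), ("benthic", "Genus D"), ("benthic", "Genus D"),
    ("geologic", "Feature E"),
]
selection = top_concepts(records, {"midwater": 1, "benthic": 1, "geologic": 1})
```

With the group sizes set to 18, 17, and 3, the same function would reproduce the selection counts reported above, provided the input records carry the appropriate group tags.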
2.2 Data set description: midwater and benthic classes
The ocean is divided into different zones based on a variety of characteristics that include depth and light. The surface waters of the ocean are commonly described as the region where light penetrates, which generally extends from the surface to 200 m deep. The benthic region extends from the bottom of the ocean to 50 m above it, and the midwater region is the habitat that connects the surface waters and the benthic region. The midwater and benthic regions together make up the largest habitable ecosystem on the planet and arguably the least explored. While animals that live in the benthic region are often associated with a variety of substrates, midwater animals inhabit a region effectively without boundaries, which has implications for the types of imagery that are collected in each place. These visual differences between habitats necessitate separating midwater and benthic classes for our subsequent efforts. The frame grabs and labels for the top midwater concepts include 18 midwater genera, with a mix of iconic and non-iconic views. The benthic imagery consisted of 17 classes, each describing a benthic animal genus. The nature of the frame grabs and the distribution of species meant that images often contained only a single species of interest, which often corresponded to the label that came with the image.
2.3 Evolution of data annotation state
The data science and oceanographic communities use the term "annotation" differently. The seed data set was expertly curated and "annotated", but the annotations lacked localization data for the object(s) of interest. Further, when annotations were associated with a specific video frame or image, those images were not exhaustively annotated. In many cases, a frame grab carried a single annotation (i.e., a single label) even though the image contained additional objects of interest of the same or different classes. For operational purposes, we wanted algorithms that could identify all objects of interest within a scene, a task known as multi-label image classification. Single-label annotations provide training data that is severely mismatched to the desired algorithm outputs (Zhao, Zheng, Xu, & Wu, 2019). Because the majority of our initial data was singly labeled while being strongly multi-label in reality, our early algorithmic experiments revolved around using noisy labels to help identify ways to augment the data. This approach had a limited degree of success, and it soon informed improvements to the annotations, which included exhaustively annotating multiple labels with bounding boxes within single-label imagery. Based on these recommendations, we began generating localized, multi-label data, and we are focused on improving this process going forward.
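The mismatch between single-label annotations and multi-label scenes can be made concrete with a small sketch of the two target encodings. The class names and the helper encode_labels are hypothetical and serve only as an illustration.

```python
import numpy as np

def encode_labels(present, class_names):
    """Return a 0/1 target vector with a 1 for every class listed in `present`."""
    target = np.zeros(len(class_names), dtype=np.float32)
    for name in present:
        target[class_names.index(name)] = 1.0
    return target

classes = ["crab", "sea star", "sponge"]  # hypothetical class list

# The archived annotation names one concept for the frame grab ...
single = encode_labels(["crab"], classes)
# ... while an exhaustive annotation of the same image names every concept.
multi = encode_labels(["crab", "sponge"], classes)

# Entries that are 1 here are concepts the single-label target treats as absent.
missing = multi - single
```

Under these assumptions, every nonzero entry of missing marks an object that a classifier trained on the single-label target would be penalized for detecting, which is one way to view the label noise discussed above.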
Finally, we have started to incorporate imagery with other characteristics to help improve the diversity of the data set. For example, many of the initial frame grabs were cropped or zoomed-in iconic views of objects, and these views are often vastly different from the desired operational use cases for algorithms developed with this data set. Therefore, we have begun including non-iconic imagery collected in both midwater and benthic environments.
3 Algorithm Experiments
In addition to enriching the data with localization annotations, we experimented with three classes of algorithms with the aim of accelerating data set generation: image classification, weakly supervised localization, and object detection. As described in Section 2.3, the initial state of the data included significant label noise at the image level because single labels were assigned to strongly multi-label images.
3.1 Image Classification - ResNet 50
We trained a basic image classification algorithm on the benthic and midwater data sets using an ImageNet pre-trained ResNet50 architecture. To account for the strongly multi-label nature of the data, we used both Top-1 and Top-3 scoring metrics; the results are shown in Figure 3. While there are ways to train networks such as ResNet for multi-label problems (Gardner & Nichols, 2017; Wang, Jia, & Breckon, 2018; Li & Yeh, 2018; Wang et al., 2017), our initial data set only provided single-label imagery. For both subsets of data, 80% was used for training, 10% for validation, and 10% for testing. We report performance on the test split. Further, to test the ability of the algorithms to work in an operational setting, we ran the algorithm on data from video transects. During a transect, an underwater vehicle transits at a consistent speed for a consistent duration using the same imaging field of view at different observational depths. Unlike discovery modes, where an underwater vehicle is piloted to search for and observe animals in close quarters, transect data involves observations of animals or targets of interest that move steadily past. These transects are not balanced across classes, and they provide markedly different views of the data, particularly for the midwater transects.
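The 80/10/10 split and the Top-N scoring described above can be sketched as follows. This is a minimal illustration using NumPy on toy scores, not the training code used in our experiments; the function names, the random seed, and the example values are assumptions.

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 80/10/10 into train/val/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

train_idx, val_idx, test_idx = split_indices(100)

# Toy classifier outputs for three images and four classes
scores = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.5, 0.3, 0.1],
    [0.2, 0.3, 0.1, 0.4],
])
labels = np.array([0, 2, 1])
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```

In this toy case only the first image is correct under Top-1 scoring, while all three labels fall within the three highest-scoring classes, which mirrors how Top-3 scoring can credit a prediction when the assigned label is not the dominant concept in the image.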
For the midwater data, a classifier was trained for 15 classes, which were selected by applying a cutoff of approximately 1,000 images per genus. This resulted in a data set of 33,064 images. The classifier was trained using only the single-class labels, and it was able to identify multiple concepts within an image using a Top-N scoring methodology.
A rigorous application of this technique would produce false positives for each of the Top N concepts; however, the relative accuracy and occurrence of misidentified classes were demonstrated as part of this process. Figure 4 shows the resulting confusion matrix for the FathomNet data set. Using the aforementioned scoring methodology, we found the Top 1 and Top 3 accuracies to be 85.7% and 92.9%, respectively, on the midwater test set.
We then evaluated algorithm performance on a midwater video transect data set. For many midwater taxa of interest, the FathomNet training set consisted of a fair amount of zoomed-in, iconic views of animals. These views contrasted sharply with the limited spatial resolution of targets in the midwater transect footage, where animals passed quickly by the underwater vehicle at a relatively far distance. Because of these differences between the midwater transect data (e.g., scale and resolution of objects) and the FathomNet training data (e.g., iconic imagery), the model performed poorly, which is consistent with work characterizing model generalization difficulties around parameters such as scale and viewpoint (Barbu et al., 2019; Oksuz et al., 2020). The difference in spatial scales was akin to showing someone iconic pictures of sports cars and then asking them to identify these vehicles from aerial photos.
We also trained an algorithm on 15 of the 17 benthic classes, initially setting the minimum number of images for the training set at 700 and later removing this limit to include classes that were abundant in the video transect data. This resulted in a data set with a total of 33,064 images. Top 1 and Top 3 accuracies of 72.4% and 92.8%, respectively, were achieved for the test set. An immediately noticeable difference between the benthic and midwater results is the gap between the Top 1 and Top 3 accuracy metrics: the benthic data set shows a marked improvement when moving to Top 3 accuracy. This is due to the strongly multi-concept nature of benthic imagery in FathomNet, whereas the midwater images tended to be mostly single-concept. In many cases the label assigned to a benthic image was not the dominant concept from the data set within the image. This created a number of challenges, but also opened up interesting research areas in multi-label image algorithms and noisy data sets. An example of a multi-concept image is shown in Figure 4. The remainder of our efforts and discussion on algorithm development therefore focuses on the benthic imagery, because the test transect video is better suited to it and because the video and images are similar in nature to other collaborators’ data.
3.2 Weakly Supervised Localization - GradCAM++
To assess our ability to generate an object proposal algorithm from the single-label imagery, we explored weakly supervised localization techniques (Gao, Li, Yu, Morariu, & Davis, 2018; Najibi, Yang, Wang, & Piramuthu, 2018; Papadopoulos, Uijlings, Keller, & Ferrari, 2016). Specifically, we used GradCAM++ (Chattopadhyay, Sarkar, Howlader, & Balasubramanian, 2018) on the ResNet50 classifier from the previous section. We explored two ways to generate saliency maps and bounding box proposals: class-specific search and dominant class search. Dominant class search uses the class label from the output of the classifier to inform which concept to search for in an image, while class-specific search selects a specific label to search for. Figure 5 shows how this technique nominally works, with the output saliency map being generated for the dominant class label from the classifier.
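The two search modes and the conversion of a saliency map into a bounding box proposal can be sketched as below. The saliency map is assumed to have been computed already (for example by GradCAM++); the thresholding rule, the 0.5 threshold, and the function names are assumptions made for illustration and are not the exact procedure used in our experiments.

```python
import numpy as np

def select_class(class_scores, target=None):
    """Dominant class search uses the classifier's top class;
    class-specific search uses the caller-supplied target class."""
    return int(np.argmax(class_scores)) if target is None else int(target)

def saliency_to_box(saliency, threshold=0.5):
    """Propose one box (x_min, y_min, x_max, y_max) that encloses every pixel
    whose min-max normalized saliency is at least `threshold`."""
    shifted = saliency - saliency.min()
    peak = shifted.max()
    if peak == 0:
        return None  # flat map: nothing to localize
    mask = (shifted / peak) >= threshold
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])

# Toy saliency map with one salient region
saliency = np.zeros((6, 8))
saliency[2:5, 3:6] = 1.0
box = saliency_to_box(saliency)

dominant = select_class(np.array([0.1, 0.7, 0.2]))
specific = select_class(np.array([0.1, 0.7, 0.2]), target=2)
```

Because this sketch returns a single enclosing box, separate salient regions would be merged into one proposal; handling several instances would require an additional step such as connected-component labeling, which is not shown here.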
The results of the experiments with this algorithm showed that it could be useful under certain fairly restrictive conditions. In general, it worked best when the class being searched for was the most prominent class of interest in the image. Additionally, it worked well at localizing up to a few instances of a class, but started to degrade in effectiveness beyond that number. Recent works have identified limitations of CAM methods for visual explanation (citation Neurips 2018 paper) or proposed improvements (Ablation CAM). From an operational perspective, we note that in order to identify and localize multiple concepts within an image, the technique would have to be run using class-specific search for every class of interest, greatly increasing computation time. While this is not infeasible, we identified other areas of effort as higher priority and left these investigations to future efforts.
In addition to our algorithm experiments on the seed data footage, we also obtained video from the National Oceanic and Atmospheric Administration’s (NOAA) ROV Deep Discoverer, as well as the National Geographic Society’s DropCam. We applied GradCAM++ with the trained ResNet50 classifier to videos from each of these external sources and obtained very promising results (Figure 4).
3.3 Object Detection - RetinaNet
The final class of algorithms that we experimented with was object detection and classification techniques. Specifically, we used the RetinaNet (Lin, Goyal, Girshick, He, & Dollár, 2017) single-stage algorithm with a ResNet50 backbone. In order to use this algorithm, the seed data set's Video Lab exhaustively annotated over 3,000 benthic images with bounding box information for over 200 species. This resulted in approximately 23,000 bounding box annotations. As with most data sets of this type, we suffered from the long tail problem, with the large majority of annotations belonging to only a handful of classes.
We considered two approaches to overcome this limitation: 1) collapse all labels to an “object” category and train a strong candidate object detector whose detections are classified by a human expert, and 2) create a hierarchy for the labels and operate at limited taxonomic resolution. We experimented with the second approach to see how we might utilize combinations of algorithms, non-expert humans, and expert humans in the review process. Figure 5 shows an example of how this workflow could work. We identified two promising hierarchies: one based on high-level concepts (e.g., fish and crustacean) and one based on morphological appearance (e.g., fish-like, laterally flattened, fan-like, and plant-like).
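Both approaches amount to remapping fine-grained box labels before training. The sketch below illustrates this with a hypothetical hierarchy and a simple measure of how concentrated the annotations are in the most frequent classes; the label names, the mapping, and the function names are illustrative assumptions and do not reproduce the hierarchies used in our experiments.

```python
from collections import Counter

# Hypothetical mapping from fine labels to a high-level concept
HIERARCHY = {
    "rockfish": "fish",
    "sablefish": "fish",
    "squat lobster": "crustacean",
    "king crab": "crustacean",
}

def collapse(labels, hierarchy=None):
    """Approach 1 (hierarchy=None): map every label to a single 'object' class.
    Approach 2: map each label to its parent in the given hierarchy."""
    if hierarchy is None:
        return ["object"] * len(labels)
    return [hierarchy.get(label, "other") for label in labels]

def head_share(labels, n_head):
    """Fraction of annotations that belong to the n_head most frequent classes."""
    counts = Counter(labels)
    head = sum(count for _, count in counts.most_common(n_head))
    return head / len(labels)

# Toy long-tailed set of box labels
boxes = ["rockfish"] * 6 + ["sablefish", "squat lobster", "king crab", "king crab"]
objects = collapse(boxes)
coarse = collapse(boxes, HIERARCHY)
coarse_counts = Counter(coarse)
fine_share = head_share(boxes, 1)
coarse_share = head_share(coarse, 1)
```

Collapsing labels gives every remaining class more training examples, at the cost of taxonomic resolution that must then be recovered by human reviewers or by a later refinement stage.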
Our results show that we were able to create a strong object of interest detector, but that the automated labels did not provide additional efficiency compared to the existing expert annotators correcting and labeling identified objects of interest. Identifying different hierarchies and potential semi-automated workflows is an ongoing effort.
4 Discussion and Future Steps
The ability of algorithms to work under data set shift and on out-of-distribution (OOD) data has been explored recently in the literature, for example in ObjectNet (Barbu et al., 2019) and in work on data set shift (Ovadia et al., 2019). Typically, explorations have centered either on the ability of the algorithm to generalize under data set shift or on its ability to detect OOD data. Alternatively, the ObjectNet data set showed that inherent biases in common data sets impair the ability of algorithms to generalize to situations that humans find easy (e.g., non-standard poses and differences in scale). These biases are present in many existing underwater imagery data sets. In particular, scale and iconic vs. non-iconic views can be severely mismatched between the training data and the unlabeled data on which algorithms are used for inference.
4.1 Hierarchical labeling process
We have learned that automated image and video annotation should start at a higher taxonomic category. We will be investigating how to use (1) algorithmically generated labels at a higher organizational level and (2) non-expert annotators to accelerate the annotation of large amounts of data, both to augment FathomNet and to aid in further algorithm refinement and development. One promising path forward is the use of few-shot learning (Chen, Liu, Kira, Wang, & Huang, 2019) to help refine taxonomic labels. In this paradigm, an object detection algorithm trained at one level of the hierarchy would create object detections that are filtered to the label that we wish to refine. By using an iterative training and evaluation cycle, we would then rapidly train models for sub-labels of interest. In this way we would be able to rapidly split a hierarchy, even with relatively few labels. These noisy, single-label detectors can then be used in conjunction with the high-performing object detection algorithms to build up a new level of the taxonomy. Once sufficient labels have been produced using this technique, the object detection algorithm can be retrained, thereby providing a new baseline for object detection.
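The refinement step can be sketched under strong simplifying assumptions. Here a nearest-centroid classifier stands in for the few-shot learner (it is not the method of the cited work), detections are assumed to arrive as (label, feature vector) pairs, and all names and values are hypothetical.

```python
import numpy as np

def refine_labels(detections, parent, support_feats, support_labels):
    """Replace the coarse `parent` label with the sub-label whose support-set
    centroid is nearest in feature space; other detections are left unchanged."""
    names = sorted(set(support_labels))
    centroids = np.stack([
        np.mean([f for f, l in zip(support_feats, support_labels) if l == name], axis=0)
        for name in names
    ])
    refined = []
    for label, feat in detections:
        if label != parent:  # only detections filtered to the parent label are refined
            refined.append(label)
            continue
        dist = np.linalg.norm(centroids - np.asarray(feat), axis=1)
        refined.append(names[int(np.argmin(dist))])
    return refined

# A few labeled examples per sub-label (toy two-dimensional features)
support_feats = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
support_labels = ["rockfish", "rockfish", "sablefish", "sablefish"]

# Detections produced at the coarser level of the hierarchy
detections = [("fish", [0.2, 0.4]), ("crustacean", [5.0, 5.0]), ("fish", [5.5, 4.8])]
refined = refine_labels(detections, "fish", support_feats, support_labels)
```

In an iterative cycle, refined labels that pass human review would be added to the support set before the next round, so that each sub-label classifier is trained on progressively more examples.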
4.2 Different types of annotations
The main focus of our algorithm efforts on FathomNet pertained to single-label image annotations or object-level bounding boxes. A variety of other annotation types can inform other workflows. One such avenue is multi-label annotation, which would remove much of the noise associated with training single-label algorithms on multi-label imagery. Multi-label algorithms for very large taxonomies are an open area of research. Another annotation type is the segmentation mask, both for instance segmentation of objects (or “things”) and for semantic segmentation of scenes (or “stuff”). These masks can help characterize benthic scenes, for example, or provide more information for recognition algorithms like Mask R-CNN (He, Gkioxari, Dollár, & Girshick, 2017). One of the difficulties in segmentation approaches is generating the training data. This involves drawing appropriate boundaries around every object of interest, which is a task that is even more tedious than drawing bounding boxes. Recent efforts in this area have provided significant speedups for this workflow (Ling et al., 2019; Acuna, Kar, & Fidler, 2019).
As FathomNet continues to develop and incorporate more imagery from other oceanographic community members, we hope that this effort will ultimately enable scientists, explorers, policymakers, storytellers, and the public to better understand how to be stewards of our oceans. Releasing the images and metadata from the seed data set may have unintended consequences, such as illegal poaching or other exploitation of ocean ecosystems. It has therefore been a priority for our team throughout this research thrust to take great care in how we craft the metadata released with the images. If successful, FathomNet will aid in the democratization of ocean research and increase accessibility to ocean data and analysis tools, especially in global communities that traditionally do not have ready access to them. Similar to the My Deep Sea, My Backyard project that originated at the MIT Media Lab Open Ocean Initiative, FathomNet aims to empower communities "around the globe to explore their own deep-sea backyards using low-cost technology, while building lasting in-country capacity" (Amon et al., 2018).
References
Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. (2019). ObjectNet: A large-scale bias-controlled data set for pushing the limits of object recognition models. Advances in Neural Information Processing Systems 32, pp. 9453–9463.
Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a Novel Image data set for Benchmarking Machine Learning Algorithms. arXiv:1708.07747.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278–2324.
Schlining, B. M., and Jacobsen Stout, N. (2006). The seed data set’s video annotation and reference system. Proceedings of the Marine Technology Society/Institute of Electrical and Electronics Engineers Oceans Conference, pp. 1–5.
Kirillov, A., He, K., Girshick, R. B., Rother, C., and Dollár, P. (2019). Panoptic Segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9396–9405.
Gardner, D., and Nichols, D. (2017). Multi-label Classification of Satellite Images with Deep Learning. Stanford University.
Wang, Q., Jia, N., and Breckon, T. P. (2018). A Baseline for Multi-Label Image Classification Using Ensemble Deep CNN. CoRR, abs/1811.08412.
Li, Y., and Yeh, M. (2018). Learning Image Conditioned Label Space for Multilabel Classification. CoRR, abs/1802.07460.
Wang, Z., Chen, T., Li, G., Xu, R., and Lin, L. (2017). Multi-label Image Recognition by Recurrently Discovering Attentional Regions. IEEE International Conference on Computer Vision (ICCV), pp. 464–472.
Oksuz, K., Cam, B. C., Kalkan, S., and Akbas, E. (2020). Imbalance Problems in Object Detection: A Review. arXiv:1909.00169.
Gao, M., Li, A., Yu, R., Morariu, V. I., and Davis, L. S. (2018). C-WSL: Count-Guided Weakly Supervised Localization. ECCV.
Najibi, M., Yang, F., Wang, Q., and Piramuthu, R. (2018). Towards the Success Rate of One: Real-Time Unconstrained Salient Object Detection. IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1432–1441.
Papadopoulos, D. P., Uijlings, J. R. R., Keller, F., and Ferrari, V. (2016). We Don’t Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 854–863.
Chattopadhyay, A., Sarkar, A., Howlader, P., and Balasubramanian, V. N. (2018). Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. IEEE Winter Conference on Applications of Computer Vision, pp. 839–847.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. (2019). A Closer Look at Few-shot Classification. International Conference on Learning Representations.
Lin, T., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal Loss for Dense Object Detection. IEEE International Conference on Computer Vision, pp. 2999–3007.
Zhao, Z.-Q., Zheng, P., Xu, S.-T., and Wu, X. (2019). Object Detection with Deep Learning: A Review. arXiv:1807.05511.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. (2019). Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty under data set Shift. Advances in Neural Information Processing Systems 32, pp. 13991–14002.
Ling, H., Gao, J., Kar, A., Chen, W., and Fidler, S. (2019). Fast Interactive Object Annotation with Curve-GCN. CoRR, abs/1903.06874.
Acuna, D., Kar, A., and Fidler, S. (2019). Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations. CoRR, abs/1802.07460.
Amon, D., Rotjan, R., Simun, M., Phillips, B., Turchik, A., Bell, K. C., Anta, R., Gjerde, K., and Simun, M. (2018). My Deep Sea, My Backyard. Journal of Open Exploration.