It is now common for marine scientists to assess fish abundance using multiple underwater video cameras. This method of assessing fish populations is a viable alternative to traditional methods (i.e. seine nets, fyke nets, gill nets, electrofishing, rotenone, and trawls) because it is inexpensive and non-lethal. A Remote Underwater Video Station (RUVS)-based approach can also work in complex habitats such as reefs or dense aquatic vegetation, where traditional approaches are ineffective. Videos generated from RUVS are currently analyzed mostly by hand by fish taxonomy experts. These experts estimate fish abundance in different habitats to determine spatial patterns in fish abundance and species composition for a variety of research objectives. The species present and the frequency of occurrence of each species are the most important information in this type of analysis.
However, manual analysis of the large amounts of video produced by clusters of RUVS is tedious, and because it requires experts with specialized domain knowledge, it is also expensive. Automatic processing of the underwater visual data captured by RUVS would be an ideal solution in such circumstances. Automatic detection of fish and other marine species is an essential step for distinguishing them from the background (e.g. ocean floor, plants, rocks). This detection task is made more complex by high levels of occlusion (due to schooling) and by the color and texture of the fish. Fig. 1 shows some sample frames obtained from different marine sites across southeast Queensland, Australia.
Existing works in the literature are mostly semi-automatic [2, 4] and assume a constrained environment. An unconstrained video stream involves more complex environments and challenges such as varying illumination, water turbidity, complex backgrounds, a variable number of species, and changes in orientation and scale caused by freely moving fish. These factors pose a real challenge for species recognition in an unconstrained environment. In this work, we aim to fully automate the process of obtaining, from a captured underwater video, all the information needed for an assessment. The two main components of this automation are (a) automatic detection of species bounding boxes in each frame and (b) classification of every detected bounding box (region of interest) into predefined classes (i.e. species names). The proposed work addresses both challenges in a single pipeline using a deep learning-based end-to-end architecture called ‘Faster R-CNN’.
The objective is to identify all the species present in an underwater video in real time. Beyond being fully automated, the proposed architecture for fish assessment has further advantages: the system can detect and recognise multi-oriented and multi-scale samples of the species in a dataset obtained from an unconstrained environment. A wide range of experiments was conducted using three different deep learning models, and a dataset containing a significant number of surf species was developed.
II Related works
Despite the significant literature on automatic object detection and recognition using deep learning, limited attention has been given to recognising species in underwater video for assessing fish abundance. We provide a brief review of the relevant research and state-of-the-art approaches to fish identification from underwater video footage. The limitations of these approaches are examined to identify the gap in, and the scope for, further work. Existing methods can be categorised into two classes: handcrafted feature-based [5, 6, 4] and machine learning-based approaches. A deformable template matching-based feature extraction technique was proposed by Rova et al. The work in [6] proposed a fish classification technique that could be used for partial automation of underwater video processing. A Kalman filter-based technique was used. However, a constant velocity model was assumed, which is not very compatible with the unpredictability of fish movement (velocity and direction). A shape-based (Fourier) feature extraction technique was employed, which might not perform well as the number of classes increases or for fish of near-identical shape. Only three fish species were considered in the experiments, whereas many more fish classes can be present in undersea environments.
Spampinato et al. proposed two different methods for fish detection in underwater images and videos containing ten different classes of fish. Three different approaches were proposed for image-based fish species recognition based on spooling and sparse coding-based features. A two-step approach was adopted for fish detection and classification in videos: a background subtraction-based approach was used to detect fish, whereas SIFT descriptors and an SVM-based classifier were used for recognition. However, shape context-based features and template matching techniques assume a constrained environment, which limits their applicability to real-time, unconstrained underwater environments. Recently, the latest generation of Convolutional Neural Networks (CNNs) has outperformed approaches based on handcrafted features in computer vision research. The problem of fish classification was addressed by Salman et al. using CNN-based features and an SVM-based classifier. The LifeCLEF fish dataset used in that experiment mainly contains fish templates.
Accurate object detection and classification remain challenging problems in computer vision, despite the significant progress made with deep convolutional neural networks on image classification and detection. Recent advances in deep ConvNets have significantly improved object detection and classification. Object detection is more challenging than image classification, as it requires more advanced and complex methods [10, 9] to achieve good accuracy. Nevertheless, convolutional neural networks (CNNs) have recently been employed successfully [11, 12]. The selective search method merges superpixels based on low-level features, and EdgeBoxes uses edge information to generate region proposals; both are now widely used. A shortcoming of these proposal methods is that they need as much running time as the detection network itself to hypothesize object locations.
Here, recent state-of-the-art object detection methods [10, 13, 14] are discussed. The Region-based Convolutional Neural Network (R-CNN) achieves excellent object detection by using a deep ConvNet to classify object proposals. R-CNN uses the Selective Search (SS) technique to compute multi-scale object proposals, which gives it scale invariance. However, R-CNN is computationally expensive because it processes a large number of object proposals, and it provides only rough localization, which compromises both speed and accuracy.
Fast R-CNN is an improved version of R-CNN with much faster training and testing, and it achieves higher accuracy than R-CNN. R-CNN does not share computation and performs a ConvNet forward pass for each object proposal. Spatial pyramid pooling networks (SPPnets) introduced a computation-sharing technique that speeds up R-CNN, but the fine-tuning algorithm proposed for SPPnets cannot update the layers that precede the spatial pyramid pooling. In addition, because SPPnets deal with a variable pooling window size, single-stage (end-to-end) training was difficult. Fast R-CNN fixes these drawbacks of R-CNN and SPPnet while improving on their speed and accuracy. The single-stage training process in Fast R-CNN can update all network layers using a multi-task loss and does not need disk storage for feature caching. In all of the above approaches, the power of CNNs has been used only for regression and classification. The concept of Fast R-CNN was extended further in Faster R-CNN by introducing a Region Proposal Network (RPN). Faster R-CNN merges the RPN and Fast R-CNN into a single network by sharing their convolutional features. Using the popular terminology of neural networks with ‘attention’ mechanisms, the RPN guides the network towards object regions. The RPN consists of several additional convolutional layers built on top of the convolutional feature map. Although the accuracy of R-CNN and Fast R-CNN was satisfactory, they are computationally expensive, which makes them unsuitable for real-time applications, unlike Faster R-CNN. We therefore selected Faster R-CNN as our approach in this investigation.
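To make the multi-task loss mentioned above concrete, the following is a minimal sketch of the per-region loss as commonly described for Fast R-CNN: a log loss over classes plus a smooth L1 box-regression term that is counted only for foreground regions. The function names, the balancing weight `lam`, and the convention that class 0 is background are illustrative assumptions; this is not the training code used in our experiments.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus box-regression loss for one region.

    class_probs: softmax probabilities; index 0 is assumed to be background.
    pred_box / target_box: four regression offsets (tx, ty, tw, th).
    lam: hypothetical weight balancing the two terms.
    """
    cls_loss = -math.log(class_probs[true_class])
    # The box term is only counted for foreground regions (true_class >= 1).
    if true_class >= 1:
        loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    else:
        loc_loss = 0.0
    return cls_loss + lam * loc_loss


if __name__ == "__main__":
    # A foreground region with one small and one large offset error.
    print(multitask_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 0, 2.0]))
```

Because both terms are differentiable with respect to the shared convolutional features, a single backward pass can update all layers, which is what allows the single-stage training described above.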
The object detector called Faster R-CNN is a particularly successful method for general object detection. It is a single, unified network that consists of two modules: (a) a region proposal module and (b) a region classifier. Fig. 2 shows the Faster R-CNN architecture. A deep fully convolutional network proposes a set of regions, which are then used by the Fast R-CNN detector. The Region Proposal Networks (RPNs) are designed to predict region proposals with a wide range of scales and aspect ratios. Sharing convolutions at test time with the very efficient object detection network significantly reduces the marginal cost of computing proposals.
The RPN model can be combined with a classification model to achieve detection and classification in an end-to-end framework. In our experiments, CNN-based classification models of different sizes (small, medium and large, i.e. differing in the number of layers) were combined with the RPN to obtain Faster R-CNN models of three different sizes. The RPN consists of a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. In Faster R-CNN, the RPN is constructed (see Fig. 2) on top of the convolutional feature map and is trained end-to-end to generate high-quality region proposals. The following three classification models (ZF, CNN-M, and VGG-16) were combined with the RPN in our experiments and their performance compared.
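As an illustration of how an RPN can cover a wide range of scales and aspect ratios at each location of a regular grid, the sketch below enumerates reference boxes (anchors) over a feature map. The stride of 16 pixels, the three scales, and the three aspect ratios are the defaults commonly quoted for Faster R-CNN and are assumed here; the text above does not state the values used in our experiments, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes centred on each grid cell.

    stride, scales and ratios are assumed defaults, not values from the paper.
    """
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Centre of this feature-map cell in input-image pixels.
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s*s while height/width equals r.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    boxes = make_anchors(2, 3)
    print(len(boxes))  # 2 * 3 cells, 9 anchors per cell
```

The RPN then predicts, for every anchor, an objectness score and four offsets that refine the anchor into a region proposal.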
ZF Net : The architecture of this network is similar to AlexNet with minor modifications. The filter size in the first convolutional layer was reduced to $7 \times 7$, compared to $11 \times 11$ in AlexNet, which helps to retain significantly more pixel information from the input data. ZF Net uses ReLUs as activation functions and a cross-entropy loss as the error function, and is trained using mini-batch stochastic gradient descent.
CNN-M : The architecture of this model is similar to the ZF model with some modifications. A smaller receptive field in the first convolutional layer, together with a decreased stride, was shown to be beneficial. However, convolutional layer 2 uses a larger stride (2 instead of 1) to keep the computation time reasonable. The main difference between this model and ZF is that CNN-M uses fewer filters in layer 4 (512 instead of 1024).
VGG-16 Net : This CNN model consists of 16 weight layers and uses only $3 \times 3$ filters with stride 1 and padding 1, along with $2 \times 2$ max-pooling layers with stride 2. The $3 \times 3$ filter size is in contrast to ZF Net's $7 \times 7$ filter. An effective receptive field of $7 \times 7$ is achieved using three back-to-back convolutional layers. The model uses scale jittering as data augmentation during training, and a ReLU layer follows each convolutional layer. Mini-batch gradient descent was used during training.
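The receptive-field argument for stacking small filters can be checked with a few lines of arithmetic. The sketch below computes the effective receptive field of a stack of layers given as (kernel size, stride) pairs; the function name and the layer encoding are illustrative and are not taken from any of the cited implementations.

```python
def receptive_field(layers):
    """Effective receptive field (in input pixels) of stacked layers.

    layers: sequence of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    three_small = receptive_field([(3, 1)] * 3)   # three stacked 3x3 layers
    one_large = receptive_field([(7, 1)])         # a single 7x7 layer
    print(three_small, one_large)
```

Three stacked $3 \times 3$ layers therefore see the same $7 \times 7$ input window as one $7 \times 7$ layer, while using $27C^2$ rather than $49C^2$ weights for $C$ input and output channels and adding two extra non-linearities.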
The Caffe deep learning library was used for all the experiments presented here. Publicly available pretrained Caffe models for the object detectors were used to initialise the weights, so that fine-tuning of each network architecture could take advantage of transfer learning from ImageNet. Transfer learning can give better performance and faster convergence.
Implementation details: All experiments were conducted on a Linux cluster node with an Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz and a GeForce GTX 1080 GPU. The Python interface code was used to conduct all the experiments. The models were trained with a learning rate of 0.001 and a batch size of 128. The RPN batch size was set to 256. The region proposal networks were trained end-to-end using back-propagation and stochastic gradient descent (SGD). Non-maximum suppression (NMS) was applied to the proposals based on their class scores to reduce redundancy among the RPN proposals. The performance of each network architecture at different iterations was also analyzed: during training, a snapshot of the model was saved every 10k iterations. A detection whose Intersection over Union (IoU) with the corresponding ground-truth bounding box exceeds the 50% threshold is considered a true positive, and all other detections are considered false positives. The IoU is calculated as:

$$\mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$$

where $B_p$ and $B_{gt}$ denote the predicted bounding box and the ground-truth bounding box respectively. The Average Precision (AP) is computed for each class, while the mean Average Precision (mAP) denotes the mean over all the computed APs.
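The evaluation steps described above (IoU, non-maximum suppression, true/false positive labelling at the 50% threshold, and AP/mAP) can be sketched as follows. This is a simplified, single-image illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the NMS threshold of 0.3 is a placeholder since the value used is not reported, each ground-truth box is matched at most once, and AP is computed as the area under the interpolated precision-recall curve, whereas the exact PASCAL VOC variant used in the experiments is not specified. All function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, thresh=0.3):
    """Greedy NMS over (score, box) pairs; thresh is an assumed value."""
    kept = []
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(det[1], k[1]) <= thresh for k in kept):
            kept.append(det)
    return kept


def label_detections(detections, gt_boxes, thresh=0.5):
    """Label (score, box) detections as true (True) or false (False) positives.

    Detections are visited by descending score; each ground-truth box
    may be claimed by at most one detection.
    """
    used, labels = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if j not in used and overlap > best:
                best, best_j = overlap, j
        hit = best_j is not None and best > thresh
        if hit:
            used.add(best_j)
        labels.append(hit)
    return labels


def average_precision(labels, num_gt):
    """Area under the interpolated precision-recall curve for one class."""
    tp, precisions, recalls = 0, [], []
    for rank, hit in enumerate(labels, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(aps):
    """mAP: the mean of the per-class AP values."""
    return sum(aps) / len(aps)
```

In a full evaluation, the labelled detections of one class would be pooled over all test images and ranked by score before `average_precision` is applied, and the resulting per-class values would be passed to `mean_average_precision`.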
IV Results and Discussions
Dataset: The fish dataset used for the experiments is described here. The underwater videos used in our experiments were provided by the authors as part of a collaborative research program based at University of Sunshine Coast [19, 20]. The videos show fish communities in marine waters of beaches and estuaries across southeast Queensland and were recorded using baited and unbaited GoPro cameras. In total, 4909 images containing 12365 annotated samples of 50 species of fish and crustaceans were used in our experiments. The Vatic interactive video annotation tool was used to annotate the data, and the annotations were standardized in PASCAL VOC format. The dataset was divided into training, validation, and test sets by random sampling, comprising 70%, 10%, and 20% of the data respectively.
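A random 70/10/20 split of the kind described above can be sketched as follows. The sketch assumes the split is made at the image level with a fixed random seed; neither detail is stated above, and `split_dataset` is a hypothetical helper rather than the script used to prepare the dataset.

```python
import random


def split_dataset(items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items and cut them into train / validation / test lists.

    The test set takes whatever remains after the first two cuts, so the
    three parts always cover the dataset exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Image indices stand in for the 4909 annotated images.
    train, val, test = split_dataset(range(4909))
    print(len(train), len(val), len(test))
```

If frames from the same video are strongly correlated, splitting by video or by site rather than by image would give a stricter test of generalisation; that is a design choice outside what is reported here.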
Detection results: The detection results for several fish species from two sets of experiments are detailed in Table I and Table II, together with the mean Average Precision (mAP). Table I shows the results of the first set of experiments, in which the three network architectures were trained on the whole dataset. The best result obtained over all iterations is presented for each network, and VGG-16 outperformed the other two: mAPs of 0.72, 0.71 and 0.71 were obtained after 70k iterations for VGG-16, CNN-M, and ZF respectively. Accuracy improved in experiment II, in which only species with adequate training samples were considered. Table II shows that a maximum mAP of 82.4% was achieved with the VGG-16 network. The average time taken to process an image during testing was 0.2 seconds (i.e. 5 fps) for VGG-16 and 0.1 seconds (i.e. 10 fps) for the ZF and CNN-M models, which implies that the system is capable of processing video in real time.
Table I

|Species|AP on VGG-16|AP on CNN-M|AP on ZF|
|---|---|---|---|
|Reticulated surf crab|1.00|1.00|1.00|
|White spotted eagle ray|0.82|0.64|1.00|
|White spotted guitarfish|0.82|0.86|0.89|

Table II

|Species|AP on VGG-16|AP on CNN-M|AP on ZF|
|---|---|---|---|
|Smooth golden toadfish|1.000|1.000|1.000|
Fig. 3 shows how the mAP improves over iterations (the x-axis represents iterations in thousands) during testing for the three network architectures; the highest mAP of 0.72 was obtained for VGG-16 at 70k iterations. The class-wise AP analysis is also presented for some sample species in Fig. 4. Fig. 5 shows how the accuracies improved over iterations in experiment II, where an mAP of 82.4% was achieved on the test dataset after 90k iterations. Qualitative detection results for several sample frames are shown in Fig. 6, with the detected region and the species name marked in each frame.
Some previous works on fish identification in the literature are significant as fish classifiers. However, our proposed system is more advanced in that it detects the regions of interest and classifies all the species in a single pipeline. Because the existing works on fish identification were not evaluated on any standard dataset and no public dataset is available, a proper comparative study cannot be performed. For reference, Spampinato et al. reported an accuracy of 54% on a dataset with only 10 species.
An error analysis was performed on frames with incorrect detections. High levels of occlusion within a school of fish were found to be the main cause of error. Fig. 7 shows some sample frames with incorrect detections: two frames with their ground truth and the same frames after detection are shown side by side to aid understanding. Fig. 7(b) shows a false negative, where the ground-truth sample is occluded. Fig. 7(d) shows two false positives, where the pattern of some background surf is nearly identical to that of the ground-truth fish.
Automatic assessment of fish and other species abundance from remote underwater video streams has tremendous potential over traditional approaches in terms of time and cost-effectiveness. The objective of this work was to develop a system for automatic detection and recognition of species in underwater videos. We examined the significance of such a system and found no comparable work towards full automation in the literature on the assessment of fish abundance. An end-to-end deep learning approach was adopted to process a video stream and extract all the information required for the assessment. A range of experiments was conducted using different deep learning models, and a comprehensive analysis of performance is presented. An mAP of 82.4% was achieved across a very wide variety of marine species. The main contributions of our work are, therefore:
- A high-performance fish identification system, obtained by fine-tuning ‘Faster R-CNN’ and adapting it to our problem.
- A wide range of experiments on underwater fish detection and identification using three state-of-the-art classification network models of different sizes (small, medium and large).
- A newly developed fish abundance dataset containing 50 different species from multiple beach and estuarine sites across southeast Queensland, Australia. The number of species considered in these experiments is significantly higher than in previously proposed approaches. The dataset is annotated using the Vatic video annotation tool and standardised in PASCAL VOC format.
In future, we aim to further improve the performance by enhancing the CNN architecture and training the system with more samples in the training dataset.
This research was partly funded by the National Environment Science Program (NESP) Tropical Water Quality Hub Project No 3.2.3. Videos are made available through a collaboration between researchers at University of Sunshine Coast and Griffith University.
- [1] A. Salman, A. Jalal, F. Shafait, A. Mian, M. Shortis, J. Seager, and E. Harvey, “Fish species classification in unconstrained underwater environments based on deep learning,” Limnology and Oceanography: Methods, vol. 14, no. 9, pp. 570–585, 2016.
- [2] K. L. Wilson, M. S. Allen, R. N. M. Ahrens, and M. D. Netherland, “Use of underwater video to assess freshwater fish populations in dense submersed aquatic vegetation,” Marine and Freshwater Research, vol. 66, pp. 10–22, 2015.
- [3] B. Gilby, A. Olds, N. Yabsley, R. M. Connolly, P. S. Maxwell, and T. Schlacher, “Enhancing the performance of marine reserves in estuaries: Just add water,” vol. 210, pp. 1–7, 06 2017.
- [4] C. Spampinato, S. Palazzo, P. H. Joalland, S. Paris, H. Glotin, K. Blanc, D. Lingrand, and F. Precioso, “Fine-grained object recognition in underwater visual data,” Multimedia Tools and Applications, vol. 75, no. 3, pp. 1701–1720, 2016.
- [5] A. Rova, G. Mori, and L. M. Dill, “One fish, two fish, butterfish, trumpeter: Recognizing fish in underwater video,” in MVA, pp. 404–407.
- [6] M. Gundam, D. Charalampidis, G. Ioup, J. Ioup, and C. Thompson, “Automatic fish classification in underwater video,” in Proc. Gulf and Caribbean Fisheries Institute, vol. 66, 2015, pp. 276–282.
- [7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.
- [8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.
- [9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in ICLR, 2014.
- [10] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
- [11] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
- [12] L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
- [13] R. B. Girshick, “Fast R-CNN,” in ICCV, 2015.
- [14] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
- [15] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
- [16] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” CoRR, vol. abs/1311.2901, 2013.
- [17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
- [18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
- [19] H. Borland, T. Schlacher, B. Gilby, R. M. Connolly, N. Yabsley, and A. Olds, “Habitat type and beach exposure shape fish assemblages in the surf zones of ocean beaches,” vol. 570, pp. 203–211, 04 2017.
- [20] “Umbrellas can work under water: Using threatened species as indicator and management surrogates can improve coastal conservation,” Estuarine, Coastal and Shelf Science, vol. 199, pp. 132–140, 2017.
- [21] C. Vondrick, D. Patterson, and D. Ramanan.
- [22] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” IJCV, vol. 111, no. 1, pp. 98–136, 2015.