Temperate Fish Detection and Classification: a Deep Learning based Approach

05/14/2020
by   Kristian Muri Knausgård, et al.
IEEE

A wide range of applications in marine ecology extensively uses underwater cameras. Still, to efficiently process the vast amount of data generated, we need to develop tools that can automatically detect and recognize species captured on film. Classifying fish species from videos and images in natural environments can be challenging because of noise and variation in illumination and the surrounding habitat. In this paper, we propose a two-step deep learning approach for the detection and classification of temperate fishes without pre-filtering. The first step is to detect each single fish in an image, independent of species and sex. For this purpose, we employ the You Only Look Once (YOLO) object detection technique. In the second step, we adopt a Convolutional Neural Network (CNN) with the Squeeze-and-Excitation (SE) architecture for classifying each fish in the image without pre-filtering. We apply transfer learning to overcome the limited training samples of temperate fishes and to improve the accuracy of the classification. This is done by training the object detection model with ImageNet and the fish classifier via a public dataset (Fish4Knowledge), whereupon both the object detection and classifier are updated with temperate fishes of interest. The weights obtained from pre-training are applied to post-training as a priori. Our solution achieves the state-of-the-art accuracy of 99.27% on the pre-training. The percentage values for accuracy on the post-training are good; 83.68% and 87.74% with and without image augmentation, respectively, indicating that the solution is viable with a more extensive dataset.

1 Introduction

Coastal marine ecosystems provide habitats for spawning, nursing, and feeding for a diverse fish community. Due to the highly complex and dynamic nature of this environment, it is challenging to monitor and study ecological processes Perry et al. (2018); Weinstein (2017). High-resolution underwater camera technologies have recently made it possible to obtain large volumes of observations from remote areas and have allowed for better capture of the species’ cryptic behavior and of changes in the environment Pelletier et al. (2011). Although comprehensive image and video data can be collected, the processing of image data in an ecological context is mostly manual and therefore very labor-intensive Lopez-Vazquez et al. (2020). As a result, only a portion of the available recordings can be analyzed, which greatly limits the potential advances that can be made from these data streams. Furthermore, the accuracy of human-based visual assessments is highly dependent on the conditions of the underwater environment and on taxonomic expertise in interpreting the data Francour et al. (1999). Therefore, an objective analytical tool capable of processing image data quickly and efficiently would be most welcome to scientists and resource management.

Figure 1: Architecture.

To relieve the burden of manual processing and to improve the classification accuracy, computer vision-based approaches have increasingly been employed in marine ecology analysis Li et al. (2015); Qin et al. (2016); Jin and Liang (2017). For instance, a commercial product, CatchMeter White et al. (2006), composed of a lightbox with a camera, offers classification of fish and length estimates. Here, fish are classified by evaluating a threshold based on contour detection in the images, with a very high classification accuracy of 98.8%. However, the fish are photographed in a pre-determined and controlled environment, which hinders applying the approach in the wild. The CatchMeter version described in White et al. (2006) does not make use of any AI or machine learning techniques. In natural underwater environments, any classification task is challenged by diversity in background complexity, turbidity and light propagation as the water deepens.

A specific Convolutional Neural Network (CNN) called Fast R-CNN has been applied for object detection to extract the fish from images taken in natural environments while actively ignoring background noise Li et al. (2015). In this approach, an AlexNet Krizhevsky et al. (2012b) is pre-trained on the ImageNet Deng et al. (2009) database and modified to train on a subset of the Fish4Knowledge dataset Huang et al. (2013). In the final step, the Fast R-CNN takes the pre-trained weights and the region proposals made by AlexNet as inputs, and achieves a mean average precision of 81.4%. In another approach Jin and Liang (2017), pre-training is applied to a CNN similar to AlexNet, which has three fully-connected layers and five convolutional layers. Pre-training is carried out using 1000 images from 1000 categories in the ImageNet dataset, and the learned weights are utilized by a CNN after adapting it to the Fish4Knowledge dataset. Post-training is then performed with 50 images per category and 10 categories from Fish4Knowledge. The images from Fish4Knowledge are pre-processed using image de-noising, and the accuracy achieved on 1420 test images is 85.08%.

The highest reported accuracy for Fish4Knowledge in the literature so far is 98.64%, which was achieved by first applying filters to the original images to extract the shape of the fish and remove the background, and then employing a CNN with a Support Vector Machine (SVM) for classification Qin et al. (2016). That approach, named DeepFish, has three standard convolution layers and three fully-connected layers. One common feature of previous solutions is that they usually adopt a pre-processing procedure for the images in order to remove as much noise as possible from the targeted image, and particularly to outline the contour of the fish Jin and Liang (2017); Qin et al. (2016). Although this method can improve the system performance, the pre-processing procedure must be carefully tuned, as it may remove useful information and thereby degrade performance. Understandably, different species may live in distinct environments, which is reflected in the background. Intentionally removing the background in the pre-processing may therefore eliminate useful information. To make use of as much information from the background as possible while keeping the results unaffected by background noise, we need to employ a robust approach that can tolerate noise and accommodate diversity in classification.

In previous work on fish detection, Liu et al. (2018) have presented an online fish tracking system using YOLO and parallel correlation filters, and included detection and categorization in an end-to-end approach Liu et al. (2018). Similar work is carried out by Xu et al. (2018) who trained a YOLO architecture aimed at detecting a variety of fish species with three very different datasets, obtaining a mean average precision score of 0.5392 Xu and Matzner (2018). Pedersen et al. (2019) extended their work to include marine mammals as well as fish and used the same YOLO techniques Pedersen et al. (2019). Common for all of these approaches is that they trained their network end-to-end.

In this paper, we propose a different method, namely a separate deep learning-based approach for temperate fish detection and classification. In more detail, we have used images and videos taken by underwater cameras in natural environments, employed YOLOv3 Redmon and Farhadi (2018) for fish detection, and explored a CNN using the most recent SE architecture for classification. Because it is common to have multiple species in the same frame, the YOLO algorithm was used for fish detection, and once a fish was detected, the classifier assigned it to its particular species. Because the Fish4Knowledge dataset is limited to tropical fish species, we collected a new dataset of temperate fish species for the training samples in the classification phase. Our approach for classification was to train the network on the Fish4Knowledge dataset in order to learn generic features of fish, a step called pre-training. The learned weights were then used as a starting point for further training on the newly collected dataset containing images of temperate fish species, called post-training. This two-step training process is known as transfer learning Yosinski et al. (2014). Note that the proposed approach requires no pre-processing of images, except re-sizing to the appropriate input size for the network. To the best of our knowledge, the adopted techniques have not been applied to temperate fish detection and classification in previous work.

The remainder of the paper is organized as follows: Section 2 describes the datasets adopted for the training process. Section 3 presents the detailed network structure and configurations. In Section 4, the experimental results for the deep learning approach are presented and discussed, before the work is concluded in the last section. An abridged version of this article is published in Olsvik et al. (2019).

2 Datasets and Deep Learning Approaches

Fig. 1 presents the overall architecture of our approach. First, a video stream is sent into an object detection component, which is a YOLOv3 CNN. YOLOv3 is pre-trained on ImageNet and fine-tuned for detecting temperate fish species using a custom dataset. This component detects the presence of fish in a single video frame, and moves the rectangular subframes with fish to a classification component built on a CNN-SENet structure. The latter categorizes the fish species, and the overall architecture is thus able to count the number of fish belonging to each species in each frame. The components are trained individually – the fish detection training is completely independent of the fish species classification training. This separation has two main advantages. First, the training data for categorization and object detection is allowed to be separate. It is tedious to outline every single fish in a video stream. Since object detection of fish requires less data than classification of fish species, the biologists can spend their time mostly on specialist work like categorization, rather than outlining objects. Second, detecting the presence of fish is a more straightforward problem than categorizing species, which means that we can prioritize resources accordingly.
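The data flow described above (detect, crop, classify, count) can be illustrated with a minimal sketch. It assumes a toy grayscale frame stored as nested lists, and the functions `fake_detector` and `fake_classifier` are hypothetical stand-ins for the trained YOLOv3 and CNN-SENet components; none of these names come from the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Frame = List[List[int]]          # toy grayscale frame: rows of pixel values


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the rectangular subframe described by box out of the frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


def count_species(
    frame: Frame,
    detect_fish: Callable[[Frame], Sequence[Box]],
    classify_species: Callable[[Frame], str],
) -> Counter:
    """Detect every fish, classify each crop, and count fish per species."""
    counts: Counter = Counter()
    for box in detect_fish(frame):
        counts[classify_species(crop(frame, box))] += 1
    return counts


# Hypothetical stand-ins for the two independently trained networks.
def fake_detector(frame: Frame) -> List[Box]:
    return [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 2, 4)]


def fake_classifier(subframe: Frame) -> str:
    mean = sum(map(sum, subframe)) / sum(len(row) for row in subframe)
    return "Gadus morhua" if mean > 100 else "Pollachius virens"


frame = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [150, 150, 0, 0],
    [150, 150, 0, 0],
]
species_counts = count_species(frame, fake_detector, fake_classifier)
print(species_counts)
```

Because the detector and the classifier only meet at the cropped subframe, either one can be retrained or replaced without touching the other, which is the separation the architecture relies on.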

Figure 2: Examples from the temperate species dataset used for object detection.

2.1 Object Detection

The object detection component is responsible for detecting the presence of fish in a video stream. The video stream can also be live, which limits the applicability of segmentation algorithms with top-level accuracy. Consequently, YOLOv3 Redmon and Farhadi (2018) was selected as the detection algorithm. This CNN architecture provides a reasonable speed/accuracy tradeoff, and is suitable for real-time implementation. The object detection takes the (live) video stream as input and outputs fish objects without any categorization.

YOLOv3 was initialized with weights trained on ImageNet, and then further specialized by training on a new dataset. Fig. 2 shows examples from this temperate fish species detection training dataset with images containing a total of carefully annotated fish. We deliberately designed the setup to be realistic for the shallow-water fish assemblage found along the coast of Southern Norway, including the fish species most frequently observed in this ecosystem. We collected video data at several different locations, spanning depths from 1-40 meters. We used images captured in different seasons, at different times of day (including some images captured at night) and during various weather conditions. This ensured that the dataset reflects the natural variability in visibility and light conditions. The variability is intended to make the dataset as realistic as possible and thereby ensure high precision when applied in real-life settings.

Further, note that although the detection training dataset is annotated with species, this information is not used in this stage. The object detection solely detects the presence of fish, and the categorization happens in the independent next step. The species information is used as additional data in the subsequent step. Only a fraction of Cod images are used for both detection (YOLO) and classification (CNN-SENet) training, so the datasets could be considered to be nearly non-overlapping. However, including all the temperate species classification training data in annotated form for detection should not be considered difficult, only laborious.

2.2 Classification

For the classification part, two datasets were used: the Fish4Knowledge dataset Li et al. (2015) and a novel dataset with temperate species from Southern Norway, combining images from multiple surveys and field studies. Fish4Knowledge is used in pre-training of the neural network, while the temperate dataset is used in post-training. Some differences between the datasets are: (1) In addition to the fish images, Fish4Knowledge contains images categorized in trajectories, e.g. a sequence of images taken from the same video sequence or stream. (2) In addition to the other species, the temperate dataset has a separate folder for male and female Symphodus melops. Some individuals of male S. melops have also been tracked and captured by camera multiple times.

2.2.1 Fish4Knowledge

The Fish4Knowledge dataset is a collection of images extracted from underwater videos of fish off the coast of Taiwan. There is a total of 27230 images cataloged into 23 different species. The top 15 species account for 97% of the images, and the single top species accounts for around 44% of the images. The number of images per species ranges from 25 to 12112. This creates a very imbalanced dataset. Further, the image size ranges from approximately pixels to approximately pixels. Another observation is that most of the images are taken from a viewpoint along the anteroposterior axis, or slightly tilted from that axis. In that subset, most images are from the left or right lateral side, exposing the whole dorsoventral body plan. There are some images from the anterior view, but few from the posterior end. Among all the images, there are not many from the true dorsal viewpoint. Most of the selected species have a compressed body plan, e.g. dorsoventrally elongated. This creates a very distinct shape when the images are taken from a lateral viewpoint. Hence, images taken from the dorsal view create a thin, short shape. The images also have a background that is relatively light, enhancing the silhouette of the fish.
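The imbalance figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three counts stated in the text; the variable names are illustrative.

```python
total_images = 27230    # images in Fish4Knowledge, as stated above
largest_class = 12112   # images of the most frequent species
smallest_class = 25     # images of the rarest species

# Share of the whole dataset taken up by the single largest class.
top_share = largest_class / total_images

# How many times more images the largest class has than the smallest.
imbalance_ratio = largest_class / smallest_class

print(f"largest class share: {top_share:.1%}")
print(f"largest-to-smallest ratio: {imbalance_ratio:.0f}:1")
```

The computed share agrees with the "around 44%" quoted for the top species, and the ratio of several hundred to one between the largest and smallest class shows how skewed the class distribution is.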

Figure 3: A functional view of the YOLOv3 architecture.
Figure 4: Darknet-53 architecture with input size (based on Redmon and Farhadi (2018)).

2.2.2 Temperate Fish Species

Figure 5: Example images and distribution of the temperate species dataset used for classification.

The temperate dataset is a collection of images from some of the most abundant fish species in coastal areas of Northern Europe. Video recordings from GoPro cameras (HERO4-7+Black) were obtained at three different locations from south to western Norway between 2014 and 2019. In western Norway, Austevoll, the cameras were deployed at 2-5 meters of depth around small reef sites used as breeding sites by many wrasse fishes. The species identified from these videos were Ctenolabrus rupestris, Centrolabrus exoletus and S. melops. In S. melops, most males build nests to care for eggs and are colourful and easily distinguished from the brown coloured females Halvorsen et al. (2016). However, a minority of the males are visually indistinguishable from females and use this camouflage to sneak into other males’ nests to steal fertilizations Halvorsen et al. (2017). Because of the morphological appearances of the different sexes, nest-building males are labelled as “males” in the dataset, whereas females and sneaker males are labelled as “females”. Two of the wrasse species (Ctenolabrus rupestris and S. melops) have high commercial importance as they are used as cleaner fish in the aquaculture industry. In south-eastern Norway, county of Agder, and mid-western Norway, county of Trøndelag, stereo baited remote underwater video (stereo-BRUV) rigs were deployed at 8-35 meters of depth at various shallow coastal habitats. From these videos, we extracted frames showing species from the family Gadidae: Gadus morhua, Pollachius virens, Pollachius pollachius, Molva molva, and Melanogrammus aeglefinus, all with commercial importance. Additionally, some images show Squalus acanthias, a shark classified as vulnerable globally and critically endangered in the Northeast Atlantic by the IUCN red list of threatened species Fordham et al. (2016).

The temperate dataset has a higher image noise level and more variability than the Fish4Knowledge dataset, including differences in depth, visibility, habitat, orientation of the fish, and distance between camera and fish. This ensures a high variability in the pictures of each species and a natural representation of observations in the wild, but it is also expected to reduce the classification accuracy. Furthermore, a single video frame usually contained more than one fish (e.g., of the same species or of different species). All videos were recorded in full HD resolution (1920 × 1080 pixels) with default settings. Fig. 5 illustrates samples of the dataset.

3 Object Detection and Classification

In contrast to the available literature, we have separated object detection from classification. This separation allows both for separate training data for fish detection and species classification, and for different levels of validity in the training data. It also allows for a much more fine-grained classification of species, independent of detecting the fish.

3.1 Fish Detection

Fish are detected independent from species recognition through object detection using YOLOv3. YOLO is a state-of-the-art object detector, originally designed for combined detection and classification. Only the detection part is used in this work. YOLO is efficient, and provides relatively high accuracy at the same time as being moderately computationally expensive Redmon et al. (2015); Redmon and Farhadi (2018). Combined with the speed and accuracy of CNN-SENet for species classification, this should enable real time applications even on embedded devices such as NVIDIA Jetson AGX Xavier and Intel Movidius Myriad variants.

YOLOv3 is configured to detect and classify only one class (), namely “fish”, and uses an input image of dimension with three color channels in RGB order. Default initial values for the nine object detection bounding box priors were used (width × height): , , , , , , , and . These values are recommended for the COCO dataset. By inspection, the fish dataset contains approximately the same kind of variations in object sizes and orientations, with both horizontally and vertically oriented objects. If we intended to use this algorithm in a structured environment where, for example, all the fish were expected to swim through an apparatus, it would have been interesting to explore a prior distribution favoring slender, horizontally oriented rectangular boxes. Note that sizes are given in pixels, relative to the scaled version of any given image.

When training the network, a batch size configuration of and subdivisions was configured. The number of subdivisions required was found experimentally and depends on the available training hardware (GPU RAM). Four NVIDIA V100 GPUs in a DGX-2 computer were used. Convolutional weights were initialized with weights pre-trained on ImageNet Russakovsky et al. (2014) data. Next, the training process was started using a single GPU for iterations as “burn-in”. As a consequence of the number of GPUs available and the relatively small dataset, the default Darknet YOLOv3 learning rate was reduced by a factor of to during this training phase. The effect of the different learning rates is visible in Fig. 9 as increased variability from batch 4000. After “burn-in”, the training was stopped and then restarted from saved weights using four GPUs. Training was configured to run iterations in total. This is equivalent to approximately epochs given a batch size of 64 and 434 training images. The step yielding the best mean average precision (mAP@50) is selected for detection use. Both the original “Darknet” framework from the YOLOv3 authors and an extended, forked version (https://github.com/AlexeyAB/darknet) were used for running the experiments.
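The conversion between training iterations and epochs used above follows from the fact that each iteration processes one batch. A minimal sketch is given below; the batch size (64) and the number of training images (434) come from the text, while the iteration count 10273 is only an example input, taken from the batch index of the mAP peak reported in the caption of Fig. 11. The function name is illustrative.

```python
def iterations_to_epochs(iterations: int, batch_size: int, num_images: int) -> float:
    """One iteration processes one batch; one epoch is one pass over all images."""
    return iterations * batch_size / num_images


# Example input only: the batch index of the mAP peak from the Fig. 11 caption.
epochs_at_peak = iterations_to_epochs(10273, batch_size=64, num_images=434)
print(f"{epochs_at_peak:.0f} epochs")
```

With so few training images, every iteration covers roughly 15% of the dataset, so even a moderate number of iterations corresponds to a large number of epochs.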

3.2 Species Classification

The species of the fish is identified by classification using a Convolutional Neural Network with an added squeeze-and-excitation (SE) element, i.e. the CNN-SENet structure. The SE block is an architectural element that adaptively re-calibrates channel-wise feature responses Hu et al. (2017). The architecture of the CNN-SENet, depicted in Fig. 6, is configured with the following parameters. Image size in height (), width () and depth channels; the number of learnable filters (); the batch size () (default 16), the filter size (), and reduction ratio () as described in Hu et al. (2017). Lastly, the number of fish species classifications needs to be added, as parameter . The input layer takes an image of size with a depth of 3 color channels, R, G, and B. The output is batch normalized before entering the Squeeze-and-Excitation function, called the SE block, depicted in Fig. 7. The SE block performs a feature re-calibration through (1) the squeeze operation, which prevents the network from becoming channel-dependent; it exploits contextual information outside the receptive field and is achieved by global average pooling on each input channel before reshaping, and (2) the excitation operation, which uses the output of the squeeze function to fully capture channel-wise dependencies. The excitation is achieved by the two fully-connected (FC) layers sandwiching the reduction layer, and finally, a sigmoid activation layer. Before exiting the SE block, the output from the excitation function is multiplied with the original batch normalized output. This multiplied output is then passed to a ReLU layer performing an element-wise activation function, rendering the dimension size unchanged. The output is then sent to a Max Pooling layer, which uses a filter to reduce and re-size the height and width spatially, rendering output of . This core portion of the network is stacked to the size of the kernel size, in this case, the size of five. The first iteration has a convolutional layer of 32 filters in . The second and third have 64 filters in , the fourth 128 filters in , and the fifth 256 filters in , with all layers applying a horizontal and vertical stride of 1.
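The squeeze, excitation and re-scaling steps of the SE block can be sketched in plain Python. This is a minimal illustration of the standard formulation in Hu et al. (2017), with an FC layer reducing C channels to C / r, a ReLU, an FC layer restoring C channels, and a sigmoid; it is not the authors' implementation. The toy weights and the `[channel][row][column]` layout are assumptions made for the example.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def squeeze(x: FeatureMap) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def dense(v: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer; weights[i][j] links input i to output j."""
    return [
        sum(v[i] * weights[i][j] for i in range(len(v))) + bias[j]
        for j in range(len(bias))
    ]


def se_block(x: FeatureMap, w1, b1, w2, b2) -> FeatureMap:
    """Squeeze, excite (FC -> ReLU -> FC -> sigmoid), then rescale each channel."""
    z = squeeze(x)
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]  # C -> C / r
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]  # C / r -> C
    return [[[gates[c] * v for v in row] for row in ch] for c, ch in enumerate(x)]


# Toy input: C = 2 channels of size 2x2, reduction ratio r = 2 (one hidden unit).
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
# Hypothetical all-zero weights, so every sigmoid gate evaluates to 0.5.
w1, b1 = [[0.0], [0.0]], [0.0]
w2, b2 = [[0.0, 0.0]], [0.0, 0.0]
y = se_block(x, w1, b1, w2, b2)
print(y)
```

The output has the same shape as the input; only the per-channel scale changes, which is why the block can be inserted after each convolutional layer without altering the rest of the network.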

Furthermore, the network has 3 FC layers. The first, with 256 neurons, takes the flattened output from the last convolutional layer. The output is then batch normalized before being sent to the second FC layer, with 256 neurons. A reduction function is applied after the output from the FC layer is batch normalized. Before entering the last FC layer, with neurons, a dropout layer of 50% is applied. The final layer, softmax, applies a classifier function to obtain the probability distribution over the classes for each input image, using a categorical cross-entropy with the Adam optimizer Kingma and Ba (2014).
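The softmax output and the categorical cross-entropy loss mentioned above can be written out in a few lines. The sketch below uses the standard definitions; the logits are hypothetical values for four classes and do not come from the trained network.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw class scores into a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


logits = [2.0, 0.5, 0.5, -1.0]  # hypothetical scores for four fish classes
probs = softmax(logits)
loss = categorical_cross_entropy(probs, true_class=0)
print(probs, loss)
```

The loss approaches zero as the probability of the correct class approaches one, which is the quantity the optimizer minimizes during training.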

Figure 6: CNN-SENet architecture.
Figure 7: Squeeze-and-Excitation block.

In CNN-SENet, there are specific parameters that need to be configured, including dropout percentage, learning rate, and batch normalization, which are discussed here. The parameters were configured by trial and error. For the dropout percentage, the higher the dropout, the more information is lost during training, because forward- and back-propagation are carried out only on the neurons that remain after dropout is applied. Different dropout percentages were tested, and 50% was chosen in this study because it gave the best overall performance. The learning rate when using the Adam optimizer should be tuned to further optimize the network. After numerous trials, the learning rate was set to 0.001 without decay. Batch normalization was also tested, and the results with batch normalization are slightly better than without it: the accuracy on the testing set is 98.35% without batch normalization and 99.27% with it. With the above parameters, the model trains faster and has a higher validation accuracy, which concludes the architecture of CNN-SENet.

To compare CNN-SENet with DeepFish, Table 1 illustrates the main differences between the two. Clearly, CNN-SENet has a more sophisticated structure than DeepFish.

Property              | CNN-SENet            | DeepFish
Image Size            |                      |
Testing Samples       | 4126                 | 3098
Network Architecture  | Basic with SE blocks | Basic
Classifier            | Softmax              | SVM
Convolutional Layers  | 5                    | 3
Table 1: Differences between CNN-SENet and DeepFish.

4 Experiments, Results and Discussion

The proposed approach was verified in two steps, using separate experiments for fish detection and classification. First, the performance of fish detection was assessed, then the performance of fish classification.

4.1 Fish Detection

Localization of individual fish in each video stream image is performed by the YOLOv3-based object detector described in Section 3.1. Detection accuracy is measured using Intersection over Union (IoU), also known as the Jaccard index. This is a measure of overlap between two sets, and a widely used measure for verification of object detection and segmentation algorithms. The approach reaches an average IoU of 0.6802, and an IoU per class of 0.9934. The latter number means that a tiny percentage of false objects consisting of mere background was erroneously detected as fish.
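For axis-aligned bounding boxes, the IoU is the area of the intersection divided by the area of the union. A minimal sketch follows; it assumes boxes given as `(x1, y1, x2, y2)` corner coordinates, which is a convention chosen for this example rather than one stated in the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union (Jaccard index) of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 10.0, 10.0)
annotated = (5.0, 5.0, 15.0, 15.0)
print(iou(predicted, annotated))
```

An IoU of 1 means the predicted box matches the annotation exactly, while 0 means no overlap. The mAP@50 criterion used for model selection counts a detection as correct when its IoU with an annotated box is at least 0.5.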

The dataset for this experiment was randomly split into 70% for training and 30% for verification. Fig. 9 shows IoU per epoch for the latter. Fig. 10 and 11 show the training loss and mean average precision, respectively. The precision peaks at 86.96%.

(a) Day time correct detection
(b) Dark late evening correct detection
(c) Day time partially correct detection
(d) Fail case
Figure 8: Three Sample frames of correct fish detection, and one erroneous case, extracted from underwater video-stream.

The validity of our approach is further confirmed in a setting different from the training data. This verification uses part of a live stream from an underwater camera located near a semi-submerged restaurant in southern Norway, which provides highly variable lighting conditions and camera angles that are not part of our training data (a recording of real-time detection is available at https://www.youtube.com/watch?v=bZMJEIWo-rQ&t=4298s). Despite the radically different scenarios, the proposed method is still able to detect fish correctly with very high accuracy. Fig. 8 shows samples from the live stream recording. Three of the examples show fish that are correctly detected, and one shows a failed case. The first case in Fig. 8 shows the standard case during day time, the second shows fish detected during dark evenings with artificial light, and the third shows most of the fish detected while the fish in the corner are wrongly ignored. In the last case, seaweed is detected as fish.

Figure 9: Training Intersection over Union (IoU) with moving average.
Figure 10: Total training loss after moving average filter.
Figure 11: Mean Average Precision (mAP) with peak value 86.97 % at batch 10273.

4.2 Species Classification

Classification of species is done by categorizing the fish identified by the object detection. Accuracy and performance of the new fish classification CNN-SENet are quantified and compared with state-of-the-art networks represented by Inception-V3, ResNet-50, and Inception-ResNet-V2. Additionally, a simplified version of the CNN-SENet, without the Squeeze-and-Excitation blocks, is included to explore how the spatial relationship between fish image colors and other feature layers affects the results Hu et al. (2017).

Three different experiments were performed: pre-training with Fish4Knowledge, post-training with the new temperate fish species dataset described in Subsection 2.2.2, and post-training with an extended version of the new dataset using image augmentation techniques. For all three experiments, the relevant dataset was divided into 70% training images, 15% validation images, and 15% testing images. Both training and validation images are integral parts of the training process, while the testing images were kept out-of-the-loop for independent verification of the “end product”.
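A 70/15/15 split of this kind can be sketched as a shuffle followed by two cuts. The function name, the fixed seed and the use of integer IDs in place of image files are assumptions made for the example; the paper does not state how the split was implemented.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence,
    train_frac: float = 0.70,
    val_frac: float = 0.15,
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle once, then cut into training, validation and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over is held out
    return train, val, test


# Integer IDs stand in for image files.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Because the three parts are slices of one shuffled list, no image can appear in more than one of them, which keeps the testing images independent of training.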

All benchmarked networks are trained for 50 epochs with images adapted to their input image size of RGB pixels, with the notable exception of the RGB pixels required by Inception-ResNet-V2.

4.2.1 Pre-training

Figure 12: Confusion matrix for Fish4Knowledge dataset pre-training with CNN-SENet.

Pre-training was performed using a dataset consisting of 19149 Fish4Knowledge images, with an additional 4126 images for verification and 4126 images reserved for testing. The selected training configuration consists of a single run with 50 training epochs and a batch size of 16. Results from pre-training are evaluated using weights from the epoch with the highest validation accuracy, and not necessarily the final epoch.
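Selecting the weights from the epoch with the highest validation accuracy amounts to an argmax over the validation history. A minimal sketch, with a hypothetical accuracy history, is:

```python
from typing import Sequence


def best_epoch(val_accuracy: Sequence[float]) -> int:
    """Return the 1-based epoch with the highest validation accuracy (first on ties)."""
    best_index = max(range(len(val_accuracy)), key=val_accuracy.__getitem__)
    return best_index + 1


history = [0.91, 0.97, 0.99, 0.98]  # hypothetical validation accuracy per epoch
selected = best_epoch(history)
print(selected)
```

In practice, this requires saving a checkpoint whenever the validation accuracy improves, so that the selected weights are still available after the final epoch.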

Network                                   | Testing Accuracy | Time per Epoch
Inception-V3                              | 99.18%           | 923 s
ResNet-50                                 | 98.86%           | 646 s
Inception-ResNet-V2                       | 98.59%           | 2221 s
CNN-SENet                                 | 99.27%           | 197 s
CNN-SENet without Squeeze-and-Excitation  | 99.15%           | 159 s
Table 2: Testing accuracy and time per epoch on pre-training.

4.2.2 Post-training

Figure 13: Confusion matrix for temperate dataset post-training with CNN-SENet.

Post-training was performed using 712 images of four fish classes from the temperate fish species dataset described in Section 2.2.2. An additional 155 images were used for verification during training, and a subset of 155 images of the same classes was reserved for testing. Corkwing wrasse (male), Corkwing wrasse (female), Pollack, and Coalfish were selected for the experiment, as a reasonable number of images of different individuals under varying conditions were available for these species.

The post-training process consists of 50 epochs and a batch size of 8. The batch size was reduced, compared to pre-training, to compensate for the relatively small number of available temperate fish images. Weights from the pre-training step are loaded before initiating post-training, and post-training accuracy is evaluated using the weights from the final epoch.

The rationale for this post-training method is to make use of the more or less generic fish identification features learned from the large Fish4Knowledge dataset. Post-training will then start with the network in a “fish-class-sensitive” state and proceed by learning specific features of the temperate species on top of this.

Fish4Knowledge consists of images of 23 different classes. The selected subset of the temperate dataset consists of 4 classes. To prepare the loaded pre-trained model for post-training, the last fully connected (FC) layer with 23 output neurons, suitable for 23 fish classes, is replaced with a similar layer with four output neurons.
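The head replacement can be illustrated with a framework-free toy model. This is a sketch under stated assumptions, not the authors' implementation: the network is reduced to a list of fully connected layers stored as plain dictionaries, and the layer widths (64 and 32), the helper names, and the initialisation scheme are all hypothetical. Only the 23-output and 4-output head sizes come from the text.

```python
import random

def new_fc_layer(n_inputs, n_outputs, rng):
    """Create a freshly initialised fully connected layer (weights + biases)."""
    scale = (1.0 / n_inputs) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_outputs)]
               for _ in range(n_inputs)]
    return {"weights": weights, "biases": [0.0] * n_outputs}

def replace_head(layers, n_new_classes, rng):
    """Return a new layer list whose final FC layer has n_new_classes outputs.

    All earlier layers are reused unchanged, so they keep their
    pre-trained parameters.
    """
    n_inputs = len(layers[-1]["weights"])  # input width of the old head
    return layers[:-1] + [new_fc_layer(n_inputs, n_new_classes, rng)]

rng = random.Random(0)
# Stand-in for a pre-trained network: one feature layer and a 23-class head.
pretrained = [new_fc_layer(64, 32, rng), new_fc_layer(32, 23, rng)]
post_model = replace_head(pretrained, 4, rng)
print(len(post_model[-1]["biases"]))  # number of output neurons in the new head
```

Because only the last layer is newly initialised, post-training starts from the pre-trained feature layers and has to learn just the mapping from those features to the four temperate classes, in addition to any fine-tuning of earlier layers.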

| Network | Testing accuracy | Time per epoch |
|---|---|---|
| Inception-V3 | 85.42% | 33 s |
| ResNet-50 | 82.39% | 47 s |
| Inception-ResNet-V2 | 78.84% | 91 s |
| CNN-SENet | 83.68% | 9 s |
| CNN-SENet without Squeeze-and-Excitation | 82.32% | 7 s |
Table 3: Average testing accuracy over 10 runs and time per epoch on post-training.

4.2.3 Post-training with Image Augmentation

Data augmentation techniques in machine learning aim to reduce overfitting by expanding a dataset (the base set) with label-preserving transformations. For an image dataset, this means that transformed copies of the original images in the base set are produced. These additional training data enable a network under training to learn more generic features by reducing its sensitivity to augmentation operations that transform the image without severely altering the characterizing visual features of, for example, a fish (Krizhevsky et al., 2012a).

The main algorithm flow is the same as for post-training, but the dataset was expanded using the following transformation operations. Images are rotated randomly within a specific range, according to a uniform distribution. Images are shifted vertically and horizontally by a random fraction of the image size. Scaling and shearing transformations are applied randomly, and lastly, half of the images are flipped horizontally.
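The transformation operations listed above can be sketched as sampling one set of random parameters per image and composing them into an affine transform. The ranges below (20 degrees of rotation, 10% shift and zoom, 10 degrees of shear), the function names, and the image size are assumptions chosen for illustration; the text does not state the values used. The flip is drawn with probability 0.5 per image, so half of the images are flipped on average.

```python
import math
import random

def sample_augmentation(rng, rotation_deg=20.0, shift_frac=0.1,
                        zoom_frac=0.1, shear_deg=10.0):
    """Draw one random set of label-preserving transform parameters."""
    return {
        "rotation_deg": rng.uniform(-rotation_deg, rotation_deg),
        "shift_x": rng.uniform(-shift_frac, shift_frac),  # fraction of width
        "shift_y": rng.uniform(-shift_frac, shift_frac),  # fraction of height
        "zoom": rng.uniform(1.0 - zoom_frac, 1.0 + zoom_frac),
        "shear_deg": rng.uniform(-shear_deg, shear_deg),
        "flip_horizontal": rng.random() < 0.5,
    }

def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def affine_matrix(params, width, height):
    """Compose the parameters into a 2x3 matrix [A | t] that acts on
    pixel coordinates measured from the image centre."""
    theta = math.radians(params["rotation_deg"])
    phi = math.radians(params["shear_deg"])
    rotation = [[math.cos(theta), -math.sin(theta)],
                [math.sin(theta), math.cos(theta)]]
    shear = [[1.0, math.tan(phi)], [0.0, 1.0]]
    zoom = [[params["zoom"], 0.0], [0.0, params["zoom"]]]
    flip = [[-1.0 if params["flip_horizontal"] else 1.0, 0.0], [0.0, 1.0]]
    linear = matmul2(rotation, matmul2(shear, matmul2(zoom, flip)))
    tx = params["shift_x"] * width   # shift in pixels
    ty = params["shift_y"] * height
    return [linear[0] + [tx], linear[1] + [ty]]

rng = random.Random(0)
params = sample_augmentation(rng)
print(affine_matrix(params, width=128, height=128))  # arbitrary image size
```

Applying the resulting matrix to every pixel coordinate, with interpolation, yields one augmented copy; the class label of the image is left unchanged.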

4.3 Results

4.3.1 Pre-training

Results from pre-training on Fish4Knowledge are presented in Table 2. The testing accuracy is on par with or exceeds the accuracy achieved by the previous state-of-the-art solutions described in Section 1.

CNN-SENet with Squeeze-and-Excitation achieves 99.27% test accuracy, almost identical to the 99.18% of Inception-V3. However, the run time for each epoch is roughly 4.7 times longer for Inception-V3 (923 s versus 197 s). This difference in training run time is expected to carry over to prediction time. CNN-SENet without Squeeze-and-Excitation is faster than the SE version, but also slightly less accurate in these tests.

Inception-ResNet-V2 achieves the lowest test accuracy and also the longest time per epoch during training. Its required input image size is larger than that of the other networks under test. As the required resolution is higher than the resolution of most Fish4Knowledge images, the necessary upscaling process may negatively affect accuracy. Additionally, the larger input size dramatically increases the computational complexity and leads to a longer time per epoch.

A confusion matrix for the CNN-SENet pre-training run is included, as shown in Fig. 12. Fish 01 seems to attract more wrong predictions than the other species. The reason for this is unknown, but the imbalance in the dataset could explain some of the behavior, as the ability to learn Fish 01 will be more rewarding during training as it occurs more frequently.
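How many wrong predictions a class "attracts" can be read from a confusion matrix as the off-diagonal sum of that class's column. The sketch below uses toy labels in which class 0 is the majority class; the labels, class count, and function names are illustrative assumptions and are not taken from Fish4Knowledge.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def false_positives_per_class(matrix):
    """Off-diagonal column sums: wrong predictions attracted by each class."""
    n = len(matrix)
    return [sum(matrix[t][p] for t in range(n) if t != p) for p in range(n)]

# Toy labels (illustrative only): class 0 dominates the dataset.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)                             # [[6, 0, 0], [1, 1, 0], [1, 0, 1]]
print(false_positives_per_class(cm))  # [2, 0, 0]
```

In this toy case every misclassified sample is assigned to the majority class, which is the pattern the imbalance explanation would predict.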

4.3.2 Post-training with and without image augmentation

Results from the post-training experiment indicate that this is a more challenging image recognition task. Without image augmentation, the highest average testing accuracy achieved was 85.42%, using the Inception-V3 network, as listed in Table 3. CNN-SENet is about 1.7 percentage points lower, but with a significantly shorter training time per epoch (9 s versus 33 s). All benchmarked algorithms show significantly reduced accuracy compared to the results from pre-training. The temperate species dataset used for post-training is challenging in the sense that it contains few images overall. The dataset also contains pictures of fish under low-visibility conditions and in situations where the fish silhouette is not always prominent.

Image augmentation, as described in Section 4.2.3, improves the results for post-training for all benchmarked algorithms, as shown in Table 4. The ResNet-50 network reaches just above 90% testing accuracy. CNN-SENet accuracy increases approximately four percentage points compared to post-training without image augmentation. The training time for each epoch does not change notably using image augmentation, so the metric was omitted from Table 4.

| Network | Testing accuracy |
|---|---|
| Inception-V3 | 88.45% |
| ResNet-50 | 90.20% |
| Inception-ResNet-V2 | 82.39% |
| CNN-SENet | 87.74% |
| CNN-SENet without Squeeze-and-Excitation | 83.55% |
Table 4: Average testing accuracy over 10 runs on post-training with image augmentation.
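The effect of image augmentation on each network can be made explicit by subtracting the Table 3 accuracies from the Table 4 accuracies. The short script below does only this arithmetic; the accuracy values are copied from the two tables, while the variable and function names are chosen here for illustration.

```python
# Average testing accuracy in percent, copied from Table 3 and Table 4.
without_augmentation = {
    "Inception-V3": 85.42,
    "ResNet-50": 82.39,
    "Inception-ResNet-V2": 78.84,
    "CNN-SENet": 83.68,
    "CNN-SENet without Squeeze-and-Excitation": 82.32,
}
with_augmentation = {
    "Inception-V3": 88.45,
    "ResNet-50": 90.20,
    "Inception-ResNet-V2": 82.39,
    "CNN-SENet": 87.74,
    "CNN-SENet without Squeeze-and-Excitation": 83.55,
}

def augmentation_gain(before, after):
    """Gain in percentage points per network, rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}

gains = augmentation_gain(without_augmentation, with_augmentation)
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

ResNet-50 gains the most (7.81 percentage points) and CNN-SENet without Squeeze-and-Excitation the least (1.23), with CNN-SENet at 4.06.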

5 Conclusions

In this study, we implemented a deep learning-based approach for temperate fish detection and classification. YOLOv3 was used for detection and CNN-SENet for classification. The experimental results show that YOLOv3 can successfully detect individual fish under a range of complex environmental conditions. The object detector reaches a mean average precision of 86.96%, and the CNN-SENet architecture achieves state-of-the-art accuracy of 99.27% on the Fish4Knowledge dataset without any data augmentation or image pre-processing. For temperate fish, the average accuracy obtained is 83.68% without image augmentation and 87.74% with it. The lower accuracy can be explained by the comparatively small temperate species dataset combined with high variation in the image data. The detection algorithm was also tested successfully in real time on a live 25 FPS Full HD underwater video stream. In short, we show that the proposed deep learning approach is a powerful and useful tool for the automatic analysis of fish species, with high potential to relieve the burden on scientists who study videos and pictures from underwater ecosystems.

References

  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §1.
  • S. Fordham, S.L. Fowler, R.P. Coelho, K. Goldman, and M. Francis (2016) Squalus acanthias. The IUCN Red List of Threatened Species 2016: e.T91209505A2898271. Cited by: §2.2.2.
  • P. Francour, C. Liret, and E. Harvey (1999) Comparison of fish abundance estimates made by remote underwater video and visual census. Naturalista Siciliano 23, pp. 155–168. Cited by: §1.
  • K. T. Halvorsen, T. K. Sørdalen, C. Durif, H. Knutsen, E. M. Olsen, et al. (2016) Male-biased sexual size dimorphism in the nest building corkwing wrasse (Symphodus melops): implications for a size regulated fishery. ICES Journal of Marine Science 73 (10), pp. 2586–2594. Cited by: §2.2.2.
  • K. T. Halvorsen, T. K. Sørdalen, L. A. Vøllestad, A. B. Skiftesvik, S. H. Espeland, and E. M. Olsen (2017) Sex- and size-selective harvesting of corkwing wrasse (Symphodus melops)—a cleaner fish used in salmonid aquaculture. ICES Journal of Marine Science 74 (3), pp. 660–669. Cited by: §2.2.2.
  • J. Hu, L. Shen, and G. Sun (2017) Squeeze-and-excitation networks. CoRR abs/1709.01507. Cited by: §3.2, §4.2.
  • P. X. Huang, B. B. Boom, and R. B. Fisher (2013) Fish recognition ground-truth data. Note: [Online; accessed 30.01.2018] Cited by: §1.
  • L. Jin and H. Liang (2017) Deep learning for underwater image recognition in small sample size situations. In OCEANS 2017-Aberdeen, pp. 1–4. Cited by: §1, §1, §1.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §3.2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012a) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, USA, pp. 1097–1105. Cited by: §4.2.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012b) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • X. Li, M. Shang, H. Qin, and L. Chen (2015) Fast accurate fish detection and recognition of underwater images with fast r-cnn. In OCEANS’15 MTS/IEEE Washington, pp. 1–5. Cited by: §1, §1, §2.2.
  • S. Liu, X. Li, M. Gao, Y. Cai, R. Nian, P. Li, T. Yan, and A. Lendasse (2018) Embedded online fish detection and tracking system via yolov3 and parallel correlation filter. In OCEANS 2018 MTS/IEEE Charleston, pp. 1–6. ISSN 0197-7385. Cited by: §1.
  • V. Lopez-Vazquez, J. Lopez-Guede, S. Marini, E. Fanelli, E. Johnsen, and J. Aguzzi (2020) Video image enhancement and machine learning pipeline for underwater animal detection and classification at cabled observatories. Sensors 20, pp. 726. Cited by: §1.
  • E. Olsvik, C. M. Trinh, K. M. Knausgård, A. Wiklund, T. K. Sørdalen, A. R. Kleiven, L. Jiao, and M. Goodwin (2019) Biometric fish classification of temperate species using convolutional neural network with squeeze-and-excitation. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 89–101. Cited by: §1.
  • M. Pedersen, J. B. Haurum, R. Gade, and T. B. Moeslund (2019) Detection of marine animals in a new underwater dataset with varying visibility. In CVPR Workshops, Cited by: §1.
  • D. Pelletier, K. Leleu, G. Mou-Tham, N. Guillemot, and P. Chabanet (2011) Comparison of visual census and high definition video transects for monitoring coral reef fish assemblages. Fisheries Research 107 (1), pp. 84–93. ISSN 0165-7836. Cited by: §1.
  • D. Perry, T. A. B. Staveley, and M. Gullström (2018) Habitat connectivity of fish in temperate shallow-water seascapes. Vol. 4, pp. 440. ISSN 2296-7745. Cited by: §1.
  • H. Qin, X. Li, J. Liang, Y. Peng, and C. Zhang (2016) DeepFish: accurate underwater live fish recognition with a deep architecture. Neurocomputing 187, pp. 49–58. Cited by: §1, §1.
  • J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. Cited by: §3.1.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767. Cited by: §1, Figure 4, §2.1, §3.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. Cited by: §3.1.
  • B. G. Weinstein (2017) A computer vision for animal ecology. Journal of Animal Ecology 87 (3), pp. 533–545. External Links: https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/1365-2656.12780 Cited by: §1.
  • D. White, C. Svellingen, and N. Strachan (2006) Automated measurement of species and length of fish by computer vision. Fisheries Research 80 (2-3), pp. 203–210. Cited by: §1.
  • W. Xu and S. Matzner (2018) Underwater fish detection using deep learning for water power applications. In 2018 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 313–318. Cited by: §1.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. CoRR abs/1411.1792. Cited by: §1.