Large-scale industrialization and urbanization have improved living conditions for human beings. However, they have also brought serious environmental pollution, including water, air and soil pollution Kulwa et al. (2019), which raises the risk of diseases such as lung cancer. To eliminate such pollutants, environmental microbiological methods offer higher efficiency and lower cost than chemical methods, and they are harmless. They involve the use of Environmental Microorganisms (EMs) for monitoring, controlling and decomposing pollutants. For example, Epistylis is employed as an indicator of poor water quality and Actinophrys is used for the decomposition of organic wastes in sludge Kosov et al. (2018). Thus, identification of the proper EMs and their corresponding physiological characteristics is necessary. Generally, four methods are used for the identification of EMs. The first is the chemical method, which is accurate but creates secondary pollution from chemical reagents Li et al. (2019). The second is the physical method, which requires expensive equipment Li et al. (2019). The third is the molecular biological method, which distinguishes EMs by sequence analysis of the genome Yamaguchi et al. (2015). This method needs expensive equipment, is time-consuming and requires professional researchers. The fourth is the morphological method, in which an experienced operator observes EMs under a microscope and identifies them by their shape characteristics Maier and Gentry (2015), Kosov et al. (2018). This approach is laborious, time-consuming and inconsistent, and it can be affected by the mood of the operator.
To eliminate these drawbacks, automatic image processing techniques are used for the identification of EMs. Image segmentation is a crucial stage for feature extraction Mithra and Emmanuel (2019) and classification Song et al. (2016), so we develop a system for the segmentation of EM images. The majority of EM samples are obtained from complex environments containing a large amount of impurities such as rubbish, which leads to noisy images. Moreover, some essential EMs, such as Ceratium and Actinophrys, have transparent bodies. This provides little foreground information for segmentation, which leads to under-segmentation and poor segmentation results. Furthermore, some EM images, such as those of Codosiga and Vorticella, suffer from low contrast between the foreground and background, which also degrades the segmentation results. To jointly overcome all of the segmentation challenges above, we use Pairwise Deep Learning Features (PDLFs) concatenated to a convolutional network. The workflow of the Pairwise Deep Learning Feature Network (PDLF-Net) is shown in Figure 1.
The steps shown in Figure 1, from (a) to (f) respectively, are described below. (a) Weakly visible classes: In this study we use an in-house dataset which is also publicly available in NEUZihan (2021) and published in Li et al. (2021). It contains 21 classes of EMs, from which the eight most weakly visible classes are selected.
(b) Data augmentation: To enlarge the dataset for training the proposed CNN, augmentation is performed on both the original weakly visible images and their corresponding ground truth images. (c) Feature extraction: First, the locations of the Shi and Tomasi interest points are identified on each image, then each image is meshed into patches centred at the interest points. Deep learning features are then extracted from each patch using VGG-16 (pre-trained on the ImageNet dataset) and stored.
(d) Feature pairing: Using Delaunay triangulation, triangles are formed from the interest points, then the middle points of the edges of each triangle are identified and used as reference points for pairing the feature maps (the two ends of each edge correspond to the features extracted at the interest points). (e) Joint pairwise feature map formation: The paired features and the original features (from the interest points) are combined to form a joint pairwise feature map for each image. The size of the resulting joint feature map varies from image to image. (f) Concatenation and training: At this last stage, the augmented images and their corresponding joint pairwise feature maps are concatenated at different input stages of the base model (SegNet) to produce the segmented output image.
The contributions of this paper are threefold, as described below:
1. By extracting deep learning features from small image patches centred at the positions of corner interest points, we integrate the abilities of interest points/descriptors (hand-crafted features) and deep learning features. The Shi and Tomasi corner detector is employed to determine the interest points. This allows the network to focus on fine information related to edges and corners, thus improving the segmentation performance and overcoming the problems of low contrast and transparency in weakly visible EMs.
2. Hypothesizing that the middle point between two nearby patches (interest points) has intermediate spatial features, we pair the feature maps of two nearby interest points to highlight more features around the foreground, which could not be learned by the base SegNet model. The pairing is achieved using Delaunay triangulation, which concentrates the triangles inside the foreground, thus increasing the focus of the network on learning the foreground and helping to overcome the segmentation challenges in weakly visible EMs.
3. We concatenate the joint pairwise feature maps to different input scales of the encoder blocks of the base model (SegNet), which generally improves the segmentation results of the network. The joint pairwise feature maps are formed by combining interest point based features and intermediate pairwise features for each image separately.
This paper is organized as follows: Section 2 reviews related work on microorganism image segmentation methods (subsection 2.1) and on feature extraction and pairwise feature methods (subsection 2.2). Section 3 describes our proposed methods and the key points of our contributions in detail. Experimental results and analysis are discussed in section 4. Lastly, the conclusion and future work are given in section 5.
2 Related Works
In this section, different works related to our work are reviewed. Section 2.1 gives a review on segmentation of microorganisms images. Due to the importance of feature extraction in our work, different related works on feature extraction and pairwise features are reviewed in section 2.2. Finally, the contributions of our work are given at the end of section 2.2.
2.1 Microorganism Image Segmentation
Different techniques have been implemented to achieve good segmentation performance on microorganisms. These techniques can be categorized into classical and machine learning based techniques Kulwa et al. (2019). Table 1 gives a summary of the categories and subcategories of microorganism image segmentation methods.
|Categories||Subcategories||Specific methods examples||Related works|
|Classical||Threshold||Otsu, adaptive and global||Khan et al. (2015)|
|Classical||Edge based||Canny, Sobel||Hiremath et al. (2011)|
|Classical||Region based||Marker watershed||Battenberg and Bischofs-Pfeifer (2006)|
|ML||Unsupervised||k-means, SOM||Raof et al. (2017), Rulaningtyas et al. (2017)|
|ML||Supervised||U-net, SVM, VGG-16||Ito et al. (2018), Matuszewski and Sintorn (2018), Górriz et al. (2018)|
Classical methods are the traditional techniques which have found broad applications. For instance, in Khan et al. (2015) outstanding results are achieved by applying Otsu thresholding in the segmentation of floc and filament. In order to enhance shape feature extraction, an active contour method is used in Hiremath et al. (2011) for the segmentation of Rotavirus-A. A seed watershed algorithm is applied in Battenberg and Bischofs-Pfeifer (2006) for the segmentation of Bacillus subtilis bacteria in clustered biofilm. Generally, classical methods face several challenges: they cannot work directly on colour images, they need pre-processing such as denoising and colour conversion, and they do not perform well on images with uneven background colours. To overcome these challenges, machine learning based methods have been adopted for segmentation.
Unsupervised machine learning (ML) techniques build their mathematical models from a set of data that contains only inputs, without target output labels. (Segmentation can be regarded as pixel-level classification; in that context the target labels are the individual pixel values/ranges in the ground truth mask images, which are not required in unsupervised ML. An example of an unsupervised ML algorithm is k-means clustering.) These techniques discover the patterns in the data without supervision and cluster them into segments Dhanachandra and Chanu (2017). For instance, in order to automate the detection of pulmonary tuberculosis (TB), which is caused by Mycobacterium tuberculosis, k-means and self-organizing map (SOM) clustering were proposed for the segmentation of the bacilli in Ziehl-Neelsen sputum smears Raof et al. (2017) and Rulaningtyas et al. (2017). In Ghosh et al. (2011), a modified fuzzy divergence clustering method based on the Cauchy membership function is leveraged for the segmentation of Plasmodium vivax from the C channel of the CMYK color model of blood smear images containing the parasites. Although unsupervised methods are simple to apply, their ability to learn the patterns in the data is inadequate for transparent images, which is the case for weakly visible EMs.
In recent years the use of supervised methods has shown promising results in segmentation tasks. Supervised machine learning algorithms build mathematical models from a set of labeled data. Examples of supervised techniques are convolutional neural networks (CNNs), support vector machines (SVMs) and the naive Bayes model. Due to the ability of CNNs to capture the patterns of data in challenging datasets, they have been used in many works. For instance, Ito et al. (2018) increases the receptive field by changing the filter size used in a fully convolutional network (FCN), which results in an outstanding segmentation performance of 99.7% accuracy on feline calicivirus images. In Matuszewski and Sintorn (2018), Górriz et al. (2018), in order to tackle the imbalance between the foreground and background, a Dice coefficient is applied as the loss function in U-net for the segmentation of the Rift Valley virus and Leishmania parasites. To fully exploit the benefits of CNNs, a large amount of training data is needed. One of the challenges we face with weakly visible EMs is the scarcity of data. However, the introduction of strong models such as SegNet Badrinarayanan et al. (2017) and U-net Long et al. (2015), which are capable of working with a small amount of data, gives us a suitable option for our dataset. Moreover, SegNet has the advantage of having few parameters and is hence faster to train, because it passes pooling indices to the upsampling layers and does not use heavy deconvolution layers. U-net has been applied in many works for the segmentation of EMs. Nevertheless, to the best of our knowledge no work has been done on the segmentation of EMs using SegNet, except for one work which uses SegNet directly, without any changes to the original network, for the segmentation of yeast cells Aydin et al. (2017). Thus, in this paper we attempt to leverage SegNet for the segmentation of weakly visible microorganisms.
2.2 Feature Extraction and Pairing of Features
Feature extraction is an important stage in the image processing pipeline. In most cases features are used in image classification and object matching works such as Ochoa et al. (2007), Agrawal et al. (2008) and Zhu et al. (2017). There are two main categories of feature extraction methods, hand-crafted features and feature learning Bengio et al. (2013), as indicated in Table 2.
|Categories||Specific feature (techniques) examples||Related works|
|Hand crafted||Geometric features (Area, perimeter), Local features (SIFT, SURF), Colour, Texture||Lindeberg (2013b), Zou et al. (2016b), Li et al. (2013), Zou et al. (2016a)|
|Feature Learning||Deep learning (VGG-16, AlexNet, ResNet), BoVW||Rajaraman et al. (2018), Morioka and Satoh (2010), Morioka and Satoh (2011), Lazebnik et al. (2005)|
Hand-crafted features are manual features which are extracted based on prior knowledge. Examples are color (e.g. the RGB, HSV, LAB and HUE color modes), texture, which is defined by the spatial distribution of pixels in the neighbourhood of an image (e.g. energy, entropy, homogeneity, correlation and contrast Kavitha and Suruliandi (2016)), geometric features (area, perimeter and length), global shape features (e.g. Krawtchouk moments) and local shape features (e.g. SURF and SIFT). Local features are a collection of basic and frequent features that can be used to estimate a class's shape knowledge, as they are learned from finite samples of training data. However, two classes which are fairly similar cannot be distinguished by local features alone. Global features convey greater discriminative information about a class domain by making use of more specific and uncommon features Lim and Galoogahi (2010). Hand-crafted features, particularly local features (SIFT and SURF), are very useful in the detection of interest points. Interest points are distinctive spots/regions that help to distinguish between different objects (images) Lindeberg (2013a). Corner, blob and ridge descriptors are examples of interest points. They play an important role in image classification and matching tasks. For example, in Lindeberg (2013b), Zou et al. (2016b) image matching of EMs is achieved using SIFT features, where these features are derived from corner interest points of 10 channels of different color modes. In Li et al. (2013), edge and Fourier descriptors are applied for the classification of EMs using an SVM classifier. Interest points (descriptors) are useful in classification and image matching because they are invariant to changes of illumination, rotation and translation. Besides, local discriminant information is abundant in the local image structure surrounding an interest point Schmid et al. (2000). Thus, we leverage the locations of corner descriptors to enhance the segmentation of weakly visible EMs. However, corner descriptors (hand-crafted features) are not sufficient to represent the diverse appearance of weakly visible EMs. Therefore, we complement them with deep learning features (feature learning).
Learned features are high-dimensional features generated by the composition of local features such as SIFT. Bag of visual words (BoVW) Bolovinou et al. (2013), sparse coding (which analyses a large number of images to learn a set of bases, each of which expresses a characteristic pattern of a patch Afzali et al. (2016)) and deep learning features are examples of feature learning Kulwa et al. (2022). In most cases deep learning features are generated by training deep (convolutional) neural networks such as VGG-16, ResNet and AlexNet. Deep learning networks represent high-level features composed from low-level ones. They have greater descriptive power than hand-crafted feature methods Kuzovkin et al. (2018), because they replicate the feature extraction capability of the visual cortex in the human brain Hinchey et al. (2007). VGG-16 is among the strongest and most widely used models in segmentation and classification tasks because of its high ability to learn features. For example, in Rajaraman et al. (2018), VGG-16 achieves an outstanding performance on the classification of viral pneumonia and bacteria from x-ray images. In Kosov et al. (2018), a pre-trained VGG-16 is used as the base model for segmentation in Deeplab-VGG. This is achieved by replacing the fully connected layers with average pooling, three convolutions and an interpolation layer, and the resulting network is then used for the initial segmentation of EMs. Leveraging the capability of VGG-16, in this study we employ it to extract deep learning features at every location of a detected corner descriptor. Because of its robustness and simplicity, BoVW is among the most used feature learning techniques. However, because of its orderless representation of local features, it does not achieve maximum performance. To remedy that and improve the performance of BoVW, some studies have considered the spatial arrangement of features to discover higher order structure in BoVW for object matching and classification Savarese et al. (2006), Zhang and Chen (2009). One way of arranging spatial features is to pair close visual words Ling and Soatto (2007). For instance, in Lazebnik et al. (2005) and Liu et al. (2008) pairing is done on visual words (prior to pairing, feature descriptions are mapped to the visual words, and then pairing is carried out on the visual words). Yet, the underlying distribution of pairs of neighboring local feature descriptors appears to be ignored by the pairing of visual words. To address that, Morioka and Satoh (2010) and Morioka and Satoh (2011) suggested that the pairing of spatially close local descriptors (such as SIFT) can be done before the building of the BoVW. This appears to achieve the greatest improvement on the classification of challenging datasets. We are motivated by this concept of pairing features and, to the best of the authors' knowledge, no work has been done on pairing deep learning features for the segmentation task. Thus, in this study we pair deep learning features generated at the locations of corner interest points and concatenate them to the base model for the segmentation of weakly visible EMs.
This section describes in detail the novel techniques used in this paper. The main focus is on tackling the segmentation challenges of weakly visible EMs. These are EMs which show poor segmentation results in our initial tests using the original base model, SegNet. Examples of weakly visible EMs are shown in figure 2.
As observed in figure 2, weakly visible EMs suffer from low contrast, transparency and indistinct boundary between background and foreground. To be able to achieve better segmentation results, the following techniques are applied.
SegNet is one of the powerful models in computer vision for semantic segmentation Badrinarayanan et al. (2017). It consists of an encoder and a decoder, as shown in figure 3.
The encoder of SegNet consists of 13 convolutional layers similar to those of VGG-16, without the last fully connected layers. Thus, the encoder network has far fewer parameters than VGG-16 and can be trained easily. Each of the 13 encoder layers consists of a convolutional layer with 64 filter banks (with a filter size that differs from that of the original SegNet), followed by a ReLU activation, which eliminates negative values. To achieve translation invariance over small spatial shifts of the input images, max-pooling with a window size of $2 \times 2$ and stride 2 (non-overlapping windows) follows, so that the output is sub-sampled by a factor of 2 after each step. The repeated max-pooling and sub-sampling in the encoder achieves more robust pixel-level classification, but at the cost of a loss in the spatial resolution of the feature maps (boundary details). To overcome this, the boundary information in the encoder feature maps is captured before each sub-sampling step by storing the max-pooling indices, which are efficient for restoring boundary information and require little memory. The decoder network (which has similar convolution layers arranged in an upsampling manner) upsamples its input feature maps using the memorized max-pooling indices from the corresponding encoder feature maps. Each upsampling step is followed by convolution and batch normalization layers to produce dense features that are similar in size to the corresponding inputs at the encoder. Finally, softmax is used as the classification layer. We utilize SegNet as the base model for binary segmentation of weakly visible EMs. For all experiments we use binary cross entropy as the loss function and the SGD optimizer with a learning rate of 0.01 and a momentum of 0.9. ReLU has some drawbacks, such as degrading gradient descent because all gradient values are zero when the activation values are zero Goceri (2019), whereas LeakyReLU provides effective learning even when the activation values are zero. We nevertheless opt for ReLU, because in our preliminary experiments on activation functions its average results were slightly higher than those of LeakyReLU, by a margin of 0.19% accuracy.
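The role of the stored max-pooling indices can be illustrated with a small NumPy sketch. This is not the SegNet implementation used in the paper; it is a minimal, single-channel illustration that assumes a 2-D feature map with even height and width, and the function names `max_pool_with_indices` and `max_unpool` are hypothetical.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max-pooling with stride 2 on a 2-D map of even height and width.

    Returns the pooled map and, for every window, the position (0..3,
    row-major inside the window) at which the maximum was found.
    """
    h, w = x.shape
    # Rearrange into (h/2, w/2, 4): the four values of each 2x2 window.
    windows = (x.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    indices = windows.argmax(axis=-1)
    pooled = windows.max(axis=-1)
    return pooled, indices

def max_unpool(pooled, indices):
    """Put each pooled value back at its remembered position, zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, indices] = pooled
    # Undo the window rearrangement to obtain a sparse (2*ph, 2*pw) map.
    return (windows.reshape(ph, pw, 2, 2)
                   .transpose(0, 2, 1, 3)
                   .reshape(2 * ph, 2 * pw))
```

The unpooled map is sparse: only the positions of the original maxima are filled, which is why the decoder follows each upsampling step with convolutions that densify the features.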
3.2 Feature Extraction
Due to the challenges of the weakly visible EM dataset, the base model misses fine information from the images during training, which gives poor segmentation results when SegNet is used alone. Therefore, we use external pairwise features to enhance the performance of the base model by combining the advantages of interest point locations (hand-crafted features) and deep learning features. The specific techniques are described below.
3.2.1 Shi and Tomasi Interest Points' Location
In order to enhance the segmentation results, we choose to use corner interest points, because in our initial experiments the base model misses tiny outer corners and boundaries of the weakly visible EMs due to the low contrast and transparency of the images. A corner is a place or point in the image where a small change in location causes a significant change in intensity along both the horizontal (X) and vertical (Y) axes. It can also be described as the intersection of points on an object's contour edges that preserve significant features of the object Peng et al. (2016). The Shi and Tomasi corner detector is one of the strongest corner detectors Shi and others (1994). In brief, the Shi and Tomasi method operates in three steps:
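First, it finds the window that produces a large variation in intensity for a small shift along the $x$- and $y$-axes. Let the window be centred at $(x, y)$ and let the intensity at this point be $I(x, y)$, where $I(x, y)$ is an individual intensity which varies from 0 to 255 for a gray-level image. When the window is shifted by $(u, v)$, the intensity at the new location is $I(x+u, y+v)$, and $I(x+u, y+v) - I(x, y)$ is the difference in intensity due to the shift. For a corner, this difference must be high. Therefore, we maximize this term by differentiating it with respect to $x$ and $y$. Letting $w(x, y)$ be the weights of the pixels over a rectangular or Gaussian window, $E(u, v)$, which is the difference between the original and the shifted window, is defined as:

$$E(u, v) = \sum_{x, y} w(x, y)\,\bigl[I(x+u, y+v) - I(x, y)\bigr]^2 \tag{1}$$

Applying the Taylor series with only the first-order terms, which is

$$f(x+u, y+v) \approx f(x, y) + u\,\frac{\partial f}{\partial x} + v\,\frac{\partial f}{\partial y} \tag{2}$$

and rewriting the shifted intensity using the above formula gives:

$$I(x+u, y+v) \approx I(x, y) + u\,\frac{\partial I}{\partial x} + v\,\frac{\partial I}{\partial y} \tag{3}$$

Let $I_x = \frac{\partial I}{\partial x}$ and $I_y = \frac{\partial I}{\partial y}$, where $I_x$ and $I_y$ are the image derivatives in the X and Y directions respectively. Then,

$$E(u, v) \approx \sum_{x, y} w(x, y)\,\bigl[u I_x + v I_y\bigr]^2 \tag{4}$$

Expanding the above equation,

$$E(u, v) \approx \sum_{x, y} w(x, y)\,\bigl(u^2 I_x^2 + 2uv\,I_x I_y + v^2 I_y^2\bigr) \tag{5}$$

Taking $u$ and $v$ out and rewriting in matrix notation, the equation becomes:

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} \left( \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix} \tag{6}$$

where $M$ is a symmetric $2 \times 2$ matrix whose eigenvalues are used to determine whether the scanned window contains a corner.

Second, it calculates the score value $R$ associated with the scanned window Shi and others (1994). It is given by:

$$R = \min(\lambda_1, \lambda_2) \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of the matrix $M$.

Third, it determines the points along the shift of the window that can be considered as corners. For a point to be considered a corner, the score value $R$ must be greater than a specified value (that is, both $\lambda_1$ and $\lambda_2$ must be greater than the minimum threshold value).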
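The three steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation used in the paper: it assumes a uniform (box) window as the weighting function, central differences for the image derivatives, and no non-maximum suppression. The score computed at each pixel is the smaller eigenvalue of the symmetric 2x2 matrix built from the windowed sums of the squared derivatives and their product. The function names and the `half_window`, `max_corners` and `min_score` parameters are hypothetical.

```python
import numpy as np

def shi_tomasi_score(image, half_window=1):
    """Map of the smaller eigenvalue of the 2x2 gradient matrix at each pixel.

    `image` is a 2-D grayscale array; the window is a box of side
    2 * half_window + 1 (an assumption made for this sketch).
    """
    img = np.asarray(image, dtype=float)
    # np.gradient returns derivatives along rows (Y) first, then columns (X).
    iy, ix = np.gradient(img)

    def box_sum(a):
        # Windowed sum via zero-padding and adding shifted copies.
        k = half_window
        padded = np.pad(a, k, mode="constant")
        out = np.zeros_like(a)
        for dy in range(2 * k + 1):
            for dx in range(2 * k + 1):
                out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    a = box_sum(ix * ix)   # sum of I_x^2 over the window
    b = box_sum(ix * iy)   # sum of I_x * I_y over the window
    c = box_sum(iy * iy)   # sum of I_y^2 over the window
    # Closed form for the smaller eigenvalue of [[a, b], [b, c]].
    return 0.5 * (a + c) - np.sqrt((0.5 * (a - c)) ** 2 + b ** 2)

def strongest_corners(score, max_corners=15, min_score=1e-6):
    """(row, col) of the highest-scoring pixels whose score exceeds min_score."""
    order = np.argsort(score, axis=None)[::-1][:max_corners]
    rows, cols = np.unravel_index(order, score.shape)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if score[r, c] > min_score]
```

On a flat region both eigenvalues are zero, on a straight edge only one is large, and at a corner both are large, so thresholding the smaller eigenvalue keeps corners only. In practice a library routine with non-maximum suppression would normally be preferred over this sketch.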
The Shi and Tomasi method is stable and invariant to scale changes, translation and rotation Shi and others (1994). Moreover, compared with the Harris corner points which we applied in our previous work Zou et al. (2016b), Shi and Tomasi gives better results and more useful interest points. Thus, we use it to determine the corner points on every image. Examples of images with the corner points indicated on them are shown in figure 4.
As can be seen from figure 4, the interest points identify corner points that contain unique information about the EMs, which was ignored by the base model (SegNet) during our initial tests. It should be noted that in this study we limit the number of corner points to between 10 and 15 (due to the computational complexity of the feature extraction model). The coordinates of each corner point are then identified and stored. We take advantage of the corner points by meshing each image into patches centred at the corner points, as shown in figure 6 parts (a) and (b). Then, from each patch, we extract deep learning features using the convolutional neural network VGG-16.
VGG-16 is a very deep convolutional neural network for image recognition, proposed by Simonyan and Zisserman (2014). It improves on AlexNet by replacing the large kernel filters (11 and 5) with $3 \times 3$ filters. It has achieved high accuracy in many image classification tasks. It contains 21 layers, of which only 16 are weight layers. These include 13 convolution layers with very small receptive fields of $3 \times 3$ (which give it the capability to capture the pattern of tiny information fields), followed by max-pooling layers of size $2 \times 2$ and stride 2, which decrease the spatial resolution of the feature maps. At the end there are three fully connected layers, which combine all the features learned by the previous layers and generalize them for classification. The ReLU activation function is applied to all hidden layers. The last layer is the classifier layer. In order to leverage the fully connected (FC) layers, we extract deep learning features from the last FC layer, so that each extracted feature is a vector of fixed dimension. Figure 5 shows the VGG-16 network layers and the point from which the deep learning features are extracted.
Because the number of weakly visible EM images is too small to train VGG-16 from scratch with good results, we use transfer learning to optimize the features extracted by VGG-16. The VGG-16 network pre-trained on the ImageNet dataset has proven successful in many works when fine-tuned on other datasets for classification Shin et al. (2016). Therefore, we fine-tune the pre-trained VGG-16 using the weakly visible EMs and extract deep learning features. For each image, 10 patches are meshed out, each centred at the coordinates of an interest point, and deep learning features are extracted from each patch. The 10 features of each image are then stored together with their corresponding interest point coordinates. Figure 6 summarizes the process of deep learning feature extraction.
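The patch-meshing step can be sketched as below. This is a hedged illustration rather than the authors' code: the patch size of 32 pixels is a placeholder chosen for the example (the actual size is not reproduced here), border handling by reflection padding is an assumption, and `extract_patches` is a hypothetical name. Interest points are given as (row, column) pairs.

```python
import numpy as np

def extract_patches(image, points, patch_size=32):
    """Crop one square patch centred at each (row, col) interest point.

    The image is reflection-padded so that points near the border still
    yield full-size patches (an assumption made for this sketch).
    """
    half = patch_size // 2
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad = [(half, half), (half, half)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="reflect")
    patches = []
    for r, c in points:
        # After padding, original pixel (r, c) sits at (r + half, c + half),
        # so this slice places it at index (half, half) of the patch.
        patches.append(padded[r:r + patch_size, c:c + patch_size])
    return np.stack(patches)
```

Each patch returned here would then be passed through the fine-tuned VGG-16 to obtain one feature vector per interest point; that step depends on the deep learning framework and is not shown.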
3.3 Feature Pairing
To pair feature maps which have been extracted from the interest points’ coordinates, we use the Delaunay triangulation theorem.
3.3.1 Delaunay Triangulation (DT) Theorem
Delaunay triangulation is one of the most robust graph-based methods for the representation of data. It forms triangles (Delaunay triangles) by connecting each data point (coordinate) to its nearest neighbours, such that the circumcircle associated with each triangle does not contain any data point in its interior Khan et al. (2016). Geometrically, the Delaunay triangulation of a given set A of discrete points in a plane is a triangulation DT(A) such that no point in A is inside the circumcircle of any triangle in DT(A). Delaunay triangulation maximizes the minimum angle of all the angles of the triangles in the triangulation Delaunay et al. (1934). It is very effective for the presentation of scattered data, as it concentrates all data inside the major circumcircle formed by the outermost triangle, as shown in figure 7 (b). Due to its strong representation power, it is used in many image matching works Dou and Li (2014), Flores et al. (2017). Moreover, it is tolerant of spatial displacement of data (image objects), because it keeps the same association between the nearest objects within the image, provided that the distortion is uniform over the whole image.
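The empty-circumcircle property that defines a Delaunay triangulation can be checked with a standard determinant test. The sketch below is illustrative only; the function names `in_circumcircle` and `is_delaunay` are hypothetical, points are plain (x, y) tuples, and triangles are triples of point indices.

```python
def in_circumcircle(a, b, c, p):
    """True if point p lies strictly inside the circumcircle of triangle (a, b, c)."""
    # Signed area term: positive when a, b, c are in counter-clockwise order.
    orient = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    rows = []
    for q in (a, b, c):
        dx, dy = q[0] - p[0], q[1] - p[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    det = (a1 * (b2 * c3 - b3 * c2)
           - a2 * (b1 * c3 - b3 * c1)
           + a3 * (b1 * c2 - b2 * c1))
    # The sign of det flips with the vertex order, so normalise by orientation.
    return det * orient > 0

def is_delaunay(points, triangles):
    """True if no point lies strictly inside the circumcircle of any triangle."""
    for i, j, k in triangles:
        for m, p in enumerate(points):
            if m not in (i, j, k) and in_circumcircle(points[i], points[j], points[k], p):
                return False
    return True
```

For four points arranged as a kite, for example, only one of the two possible diagonals yields a triangulation that passes this check, and it is the one that avoids thin triangles, in line with the maximized minimum angle mentioned above.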
The Delaunay triangle edges are formed by connecting nearest-neighbour data points. This means that two points (vertices) which share the same edge (line) have closely related characteristics (features). Thus, the middle point of the edge contains features which are an average of the features at the two end points of the edge. In our experiments, a few middle points fall outside the EM's body and therefore do not share characteristics with the edge end points, but these points are very few (less than 5% of all the middle points). More than 95% of the middle points are within the main body of the EM (foreground) and have intermediate characteristics between the corresponding edge end points, as can be observed in figure 7 (c). Owing to this, we pair the features which correspond to the vertices sharing the same edge, so as to get the features of the middle points of the edges. By so doing, we increase the foreground's influence during segmentation, as shown in figure 7 (c). The pairing of features uses the geometric principle of the middle point of a straight line, because the edges of the triangles are straight lines. It is done by averaging the two feature vectors (maps) corresponding to the end coordinates of each edge, as described in equations 8 and 9. The edge coordinates are the interest points' coordinates, with their corresponding features extracted from the patches.
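Let the coordinates of the two end points of an edge be represented by $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$, and let the corresponding feature maps of the patches centred at these two points be $F_1$ and $F_2$.

The middle point coordinate $P_m$ is given by:

$$P_m = \left( \frac{x_1 + x_2}{2},\; \frac{y_1 + y_2}{2} \right) \tag{8}$$

The pairwise feature map $F_m$, which corresponds to the middle point $P_m$, is given by:

$$F_m = \frac{F_1 + F_2}{2} \tag{9}$$

On average, 36 to 43 pairwise features ($F_m$) are formed from the 10 original features for each image.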
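The pairing step, in which the middle point of an edge is the average of its two end coordinates and the pairwise feature is the average of the two end features, can be sketched as follows. The function name `pair_features` is hypothetical, and the triangles are assumed to be supplied as triples of point indices (for example the `simplices` attribute of `scipy.spatial.Delaunay`). In this sketch an edge shared by two triangles is counted once; how shared edges are counted is an assumption here, not something taken from the text.

```python
import numpy as np

def pair_features(points, features, triangles):
    """Average the coordinates and features at the two ends of every edge.

    points:    (n, 2) interest point coordinates
    features:  (n, d) one feature vector per interest point
    triangles: iterable of (i, j, k) index triples
    Returns (midpoints, paired) with shapes (m, 2) and (m, d),
    one row per unique triangle edge.
    """
    points = np.asarray(points, dtype=float)
    features = np.asarray(features, dtype=float)
    edges = set()
    for i, j, k in triangles:
        for a, b in ((i, j), (j, k), (k, i)):
            # Store each edge with its smaller index first to avoid duplicates.
            edges.add((int(min(a, b)), int(max(a, b))))
    idx = np.array(sorted(edges))
    midpoints = (points[idx[:, 0]] + points[idx[:, 1]]) / 2.0
    paired = (features[idx[:, 0]] + features[idx[:, 1]]) / 2.0
    return midpoints, paired
```

The returned midpoints are only reference locations; it is the averaged feature vectors that are carried forward to the joint feature map.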
3.4 Joint Pairwise Feature Formation
At this stage, we join the features formed at the interest points' coordinates (…) and the pairwise features (…). On average, each image has between 36 and 43 pairwise features, and 10 features originate from the interest points. We form the joint feature maps by appending these features vertically, a joining style that gave the best results in the tests done during our experiments. Each joint feature map therefore stacks, on average, between 46 and 53 feature vectors (10 original plus 36 to 43 pairwise), so its size varies between images, and each joint feature map corresponds to one original image. Because the dominant features are pairwise features, we name the result joint pairwise features (pairwise features). After formation, the joint features are stored alongside their original images and ground truth images.
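The vertical joining described above amounts to stacking the interest point features on top of the pairwise features. A minimal sketch, assuming every feature is a vector of the same length (the length 8 used in the usage note is a placeholder, not the dimension used in the paper) and using the hypothetical name `joint_pairwise_features`:

```python
import numpy as np

def joint_pairwise_features(original, paired):
    """Stack interest point features (n, d) above pairwise features (m, d)."""
    original = np.asarray(original, dtype=float)
    paired = np.asarray(paired, dtype=float)
    if original.shape[1] != paired.shape[1]:
        raise ValueError("both feature sets must share the same vector length")
    # Rows 0..n-1 are the original features, rows n..n+m-1 the pairwise ones.
    return np.vstack([original, paired])
```

With 10 original features and 36 pairwise features of length 8, the result has shape (46, 8); with 43 pairwise features it has shape (53, 8).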
3.5 Concatenation and Training
Both the original images and their corresponding joint pairwise features point to the same ground truth (GT) images. During training, the original images and their corresponding ground truth images are fed to the input (first block) of the base model (SegNet). The joint pairwise features are resized to fit the spatial dimensions of the first, second, third, fourth and fifth encoder blocks of SegNet respectively. Then we concatenate the joint pairwise features at different blocks of the encoder in SegNet, as shown in figure 8 of the general proposed network.
We apply different options for concatenating the joint feature maps to the SegNet model. Examples of the options are concatenation at block 1 only; block 2 only; block 3 only; blocks 1 and 2; blocks 3 and 5; and blocks 1, 2 and 5.
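The resize-and-concatenate operation at a single encoder block can be sketched in NumPy as below. Several details are assumptions made only for illustration: nearest-neighbour resizing, appending the resized map as one extra channel, and the input side of 224 pixels used in the usage note. The names `encoder_block_sizes`, `resize_nearest` and `concat_at_block` are hypothetical.

```python
import numpy as np

def encoder_block_sizes(input_size, n_blocks=5):
    """Spatial side length at the input of each encoder block,
    given 2x2 max-pooling with stride 2 after every block."""
    return [input_size // 2 ** b for b in range(n_blocks)]

def resize_nearest(feature_map, out_h, out_w):
    """Nearest-neighbour resize of a 2-D map to (out_h, out_w)."""
    in_h, in_w = feature_map.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return feature_map[rows[:, None], cols[None, :]]

def concat_at_block(block_input, joint_features):
    """Append the resized joint feature map as an extra channel.

    block_input:    (H, W, C) tensor entering an encoder block
    joint_features: 2-D joint pairwise feature map of any size
    Returns an (H, W, C + 1) tensor.
    """
    h, w, _ = block_input.shape
    resized = resize_nearest(joint_features, h, w)
    return np.concatenate([block_input, resized[..., None]], axis=-1)
```

For a hypothetical input side of 224 pixels, `encoder_block_sizes(224)` gives the five side lengths 224, 112, 56, 28 and 14, and concatenating at, say, blocks 1, 2 and 5 means calling `concat_at_block` on the inputs of those three blocks only.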
4.1 Experimental Settings
For the experiments, we use the Environmental Microorganism Dataset 5th Version (EMDS-5), which is a newly released version of EMDS. The dataset contains 21 classes of EMs. However, in this research we select only the 8 classes which show poor performance on the base model SegNet during our initial experiments. We name these images weakly visible EMs. The classes are Actinophrys, denoted as weak data class 1 (DC1), Codosiga (DC2), Epistylis (DC3), Paramecium (DC4), Rotifera (DC5), Vorticella (DC6), Keratella quadrala (DC7) and Stylonychia (DC8). Each class contains 20 original microscopic images and their corresponding ground truth (GT) images. Therefore, there are 160 EM images in total. It should be noted that every image contains a single microorganism (not a colony), except for classes DC3 and DC4, where some images contain two microorganisms of the same species. An example of such EMs can be seen in Fig. 11. True corners of the foreground are the most important in this research. Thus, in order to reduce the possibility of false corners, we crop all images which have highlighted square frames at their edges, so that only the true background and foreground remain. All images are then resized to fit the input layer size of SegNet.
4.1.2 Training, Validation and Testing Dataset
The dataset is divided into training, validation and testing sets in the ratio 1:1:2. To reduce overfitting caused by the small dataset and to improve the performance of our segmentation models, we apply augmentation to all original weakly visible EM images and their corresponding GT images. We augment by rotating them by 90, 180 and 270 degrees, and by flipping them vertically and horizontally. This results in 960 images in total, with 30:30:60 images per class for training, validation and testing respectively. Then joint pairwise feature maps are extracted from each RGB image and distributed in the same 30:30:60 ratio for each class.
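The image counts above follow directly from the augmentation scheme, and a small sketch makes the arithmetic explicit. It uses plain nested lists instead of an image library so that it stays self-contained. The function names are hypothetical, and the rotation direction (clockwise) is an assumption, since the text does not specify it. In practice the same transform would have to be applied to an image and its GT mask.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [list(row[::-1]) for row in img]


def augment(img):
    """Original plus rotations by 90, 180, 270 degrees and the two flips."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    original = [list(row) for row in img]
    return [original, r90, r180, r270, flip_vertical(img), flip_horizontal(img)]


def split_counts(n_images, ratio=(1, 1, 2)):
    """Number of training, validation and testing images for a given ratio."""
    total = sum(ratio)
    return tuple(n_images * part // total for part in ratio)


variants = augment([[1, 2, 3], [4, 5, 6]])
per_class = 20 * len(variants)   # 20 original images per class
dataset_size = per_class * 8     # 8 weakly visible classes

print(len(variants), per_class, dataset_size)
print(split_counts(per_class))
```

Each original image yields six variants, so a class grows from 20 to 120 images, which the 1:1:2 ratio splits into 30:30:60.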
4.1.3 Experimental Environments
To conduct the experiments, we use a workstation with an Intel(R) Core(TM) i7-7700 CPU at 3.60 GHz, 32 GB of RAM and an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory. The networks are implemented in Python 3 using the Keras framework with TensorFlow as the backend.
4.2 Evaluation Metrics
To quantitatively evaluate and compare the segmentation results of different approaches, we use accuracy (Acc), Dice, intersection over union (IoU), volumetric overlap error (VOE), sensitivity (Sens), precision (Prec) and specificity (Spec). Accuracy measures the percentage of pixels in an image that are correctly classified. Accuracy and specificity can be misleading when the object of interest is small compared to the background (which is the case for our dataset), because these measures are dominated by how well the negative pixels (background) are predicted. Thus, we use more than one metric to analyse the results. The Dice coefficient, also known as the F1 score, is widely used for the evaluation of segmentation performance. The intersection over union, also known as the Jaccard coefficient, measures the percentage overlap between the target mask and the prediction output. The volumetric overlap error is the complement of the Jaccard coefficient. Table 3 summarizes the definitions of these metrics.
|Metric||Definition||Metric||Definition|
|Acc||$\frac{TP+TN}{TP+TN+FP+FN}$||Dice||$\frac{2\,\lvert V_{pred}\cap V_{gt}\rvert}{\lvert V_{pred}\rvert+\lvert V_{gt}\rvert}$|
|IoU||$\frac{\lvert V_{pred}\cap V_{gt}\rvert}{\lvert V_{pred}\cup V_{gt}\rvert}$||VOE||$1-\frac{\lvert V_{pred}\cap V_{gt}\rvert}{\lvert V_{pred}\cup V_{gt}\rvert}$|
|Sens||$\frac{TP}{TP+FN}$||Prec||$\frac{TP}{TP+FP}$|
|Spec||$\frac{TN}{TN+FP}$|
In Table 3, $V_{pred}$ represents the foreground predicted by the model and $V_{gt}$ represents the foreground in the ground truth image. During segmentation of the EMs, the pixels of an image are partitioned into two classes: the foreground (the EMs) and the background. True positive (TP) is the outcome when the model correctly predicts the positive class. True negative (TN) is the outcome when the model correctly predicts the negative class. False negative (FN) is the outcome when the model predicts negative while the pixel is actually positive. False positive (FP) is the outcome when the model predicts positive while the pixel is actually negative. All the evaluation metrics are defined in terms of TN, TP, FN and FP, as shown in Table 3. For analysis purposes, greater values of accuracy, Dice, IoU, sensitivity, precision and specificity indicate better segmentation results, whereas a smaller value of VOE indicates better results.
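As an illustration of how these metrics follow from the TP, TN, FP and FN counts, the following sketch computes them for a pair of binary masks. The function names and the toy masks are hypothetical, and returning 0 for an undefined ratio is a convention chosen here, not something stated in the text.

```python
def confusion_counts(pred, gt):
    """Count TP, TN, FP, FN over two binary masks given as lists of rows."""
    tp = tn = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a ratio is undefined (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred, gt):
    """Acc, Dice, IoU, VOE, Sens, Prec and Spec as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(pred, gt)
    iou = safe_div(tp, tp + fp + fn)
    return {
        "acc": safe_div(tp + tn, tp + tn + fp + fn),
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": iou,
        "voe": 1.0 - iou,  # complement of the Jaccard coefficient
        "sens": safe_div(tp, tp + fn),
        "prec": safe_div(tp, tp + fp),
        "spec": safe_div(tn, tn + fp),
    }


# Toy 2 x 4 masks: 1 = foreground (EM), 0 = background.
prediction = [[1, 1, 0, 0],
              [0, 1, 0, 0]]
ground_truth = [[1, 0, 0, 0],
                [0, 1, 1, 0]]

metrics = segmentation_metrics(prediction, ground_truth)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

For these toy masks the counts are TP = 2, TN = 4, FP = 1 and FN = 1, so accuracy (75%) is noticeably higher than IoU (50%), which shows in miniature why accuracy alone can be misleading when the background dominates.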
4.3 Evaluation of the Pairwise Deep Learning Features Network (PDLF-Net) on small Dataset Without Augmentation
Because the PDLF-Net originates from SegNet, in this section we compare the segmentation performance of the PDLF-Net and SegNet on a small dataset (each class has 5:5:10 images for training, validation and testing respectively). In our initial experiments we examined the performance of the PDLF-Net for different options of concatenating the joint pairwise features to the encoder blocks shown in Figure 8: at one block only, or at two, three, four or five blocks simultaneously. These options range from block 1 only, block 2 only, block 3 only, block 4 only and block 5 only, through blocks 1 and 2, blocks 1 and 3, and so on, up to blocks 1, 2, 3, 4 and 5 together. We found that the performance is better when the concatenation is at a single block (block 1, 2, 3, 4 or 5). Increasing the number of blocks concatenated simultaneously leads to over-segmentation. Thus, all other experiments presented in this paper use concatenation at one block. We examine the performance of the PDLF-Net on the small dataset of weakly visible classes by treating each class separately. Table 4 shows the performance of the original SegNet and of the PDLF-Net with concatenation at different blocks.
[Table 4. Columns: SegNet [%], Block 2 [%], Block 5 [%]]
Table 4 shows that the application of pairwise features improves the segmentation results. The block 5 and block 2 results of the PDLF-Net are presented because they show consistent improvement in all classes compared to the other block options. This is because the deep layers at block 5 (the bottleneck layers) of the PDLF-Net are responsible for learning specific features of the foreground. Adding the joint pairwise features at block 5 therefore pushes the network to focus more on learning the foreground (EM), which improves the performance. The application of pairwise features at different blocks improves the segmentation performance by 6.21% acc, 2.9% IoU, 2.8% Dice, 14.83% sens and 15.57% spec on weak data class 1 (DC1); by 6.06% acc, 1.95% IoU, 2.08% Dice, 4.98% sens, 7.23% prec and 5.15% spec on DC3; by 6.30% acc, 2.59% IoU, 2.61% Dice, 17.26% sens, 0.25% prec and 15.94% spec on DC4; by 5.00% acc, 8.00% IoU, 8.48% Dice, 19.32% sens, 17.00% prec and 20.6% spec on DC5; by 12.90% acc, 4.32% IoU, 4.30% Dice, 7.45% sens and 8.76% spec on DC6; and by 2.73% IoU and 2.70% Dice on DC8. (This comparison takes the original SegNet result for a particular data class as the reference and compares it with the maximum PDLF-Net result over all blocks for that class.) The average performance of the original SegNet and of the PDLF-Net at blocks 2 and 5 over all classes is given in Figure 9 (obtained by averaging the results of all classes for each method separately and drawing the performance chart for each method).
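The comparison rule used above (SegNet as reference, best PDLF-Net block per metric) can be written down as a short sketch. All numbers below are made-up placeholders and not results from the tables. The function name is hypothetical, the gain is assumed to be a difference in percentage points, and taking the minimum for VOE (where lower is better) is an assumption, since the text only mentions the maximum.

```python
HIGHER_IS_BETTER = {"acc", "iou", "dice", "sens", "prec", "spec"}


def improvement_over_baseline(baseline, per_block):
    """Gain of the best block over the baseline, per metric, in percentage points."""
    gains = {}
    for metric, reference in baseline.items():
        values = [scores[metric] for scores in per_block.values()]
        if metric in HIGHER_IS_BETTER:
            gains[metric] = max(values) - reference
        else:
            # VOE is an error measure: the best block has the smallest value.
            gains[metric] = reference - min(values)
    return gains


# Placeholder values for one data class, in percent (NOT taken from the paper).
segnet = {"iou": 60.0, "dice": 75.0, "voe": 40.0}
pdlf_net = {
    "block2": {"iou": 62.5, "dice": 76.0, "voe": 37.5},
    "block5": {"iou": 61.0, "dice": 77.5, "voe": 39.0},
}

gains = improvement_over_baseline(segnet, pdlf_net)
print(gains)
```

Note that under this rule the best block may differ from metric to metric within the same data class, as it does for IoU and Dice in the placeholder example.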
Figure 9 shows that, even though the training set is very small, the PDLF-Net improves IoU, Dice, VOE, Sens, Prec, Spec and Acc by about 1.66%, 1.50%, 1.55%, 5.34%, 1.65%, 5.37% and 0.28% respectively. Although the PDLF-Net shows an overall improvement, the individual errors (VOE) are still high, as shown in Table 4. This is due to the small dataset, which prevents the networks from generalizing well during training. To reduce such errors and further increase the segmentation performance, we apply augmentation to all weakly visible EM images and their corresponding GT images.
4.4 Evaluation of the PDLF-Net on Augmented Dataset
To enhance the performance of the PDLF-Net, we augment all the weakly visible EM images and their GT images. Then joint pairwise features are extracted from each image and concatenated to different blocks. Each block configuration is trained and tested independently for each data class. Table 5 shows the results of the best-performing concatenation configurations.
[Table 5. Columns: SegNet [%], Block 2 [%], Block 3 [%], Block 4 [%], Block 5 [%]]
Table 5 shows an overall improvement of segmentation performance for all blocks. Compared with the original SegNet, the PDLF-Net improves the results by 2.1% acc, 4.11% IoU, 2.90% Dice, 2.96% sens, 3.15% prec and 2.96% spec on weakly visible data class 1 (DC1); by 1.51% acc, 4.28% IoU, 3.33% Dice, 3.08% sens, 0.29% prec and 3.20% spec on DC2; by 2.34% acc, 0.94% IoU, 0.8% Dice, 4.80% sens, 1.49% prec and 3.76% spec on DC3; by 4.45% acc, 6.71% IoU, 5.26% Dice, 3.32% sens, 3.05% prec and 2.86% spec on DC4; by 3.18% IoU, 2.24% Dice, 0.44% sens, 1.5% prec and 0.71% spec on DC5; by 0.53% IoU, 0.37% Dice, 0.30% sens, 2.00% prec and 1.56% spec on DC6; by 3.19% acc, 3.46% IoU, 2.67% Dice, 4.89% sens, 4.19% prec and 5.27% spec on DC7; and by 5.15% acc, 5.59% IoU, 4.46% Dice, 5.18% sens, 7.92% prec and 5.97% spec on DC8. The average performance over all classes for the original SegNet and the PDLF-Net is given in Figure 10.
Figure 10 shows an average improvement of about 1.09% acc, 2.20% IoU, 1.75% Dice, 2.00% sens, 2.17% prec and 2.15% spec when segmenting with the PDLF-Net compared to the original SegNet. The overall average maximum results achieved by the PDLF-Net are 89.33%, 63.26%, 77.35%, 36.74%, 88.10%, 91.79% and 87.48% for acc, IoU, Dice, VOE, sens, prec and spec respectively. Moreover, a visual comparison of the images segmented by the original SegNet and by the PDLF-Net at blocks 2, 3, 4 and 5 is given in Figure 11.
In the visual comparison of Figure 11, the PDLF-Net shows better segmentation results. For instance, in data classes DC3 and DC4 (third and fourth rows from the top), SegNet in (c) fails to show the foreground, while the PDLF-Net produces a well-segmented output of the same image in (e) and (f). In DC8 (last row), SegNet over-segments the image, while good visual results are obtained by the PDLF-Net when the pairwise features are concatenated at block 2, 3 or 5. Overall, the visual results show a clear improvement in segmentation when using the PDLF-Net.
4.5 Evaluation of the PDLF-Net on Test Dataset
To further evaluate the effectiveness of the PDLF-Net, we examine it on the test dataset. The test dataset contains 480 images, which is twice the size of the training set and of the validation set. The average segmentation performance of the PDLF-Net on the test dataset over all classes is shown in Figure 12. The graph compares the test set and the validation set for each block configuration.
In Figure 12, each pair of bars, from left to right, shows the same configuration (block) applied to the validation and test sets respectively. The performance of the PDLF-Net is almost the same on the validation and test sets, although the test set is twice as large. This shows the effectiveness of the PDLF-Net on unseen data (the test set). The highest average performances of the PDLF-Net on the test dataset are 89.24% accuracy, 63.20% IoU, 77.27% Dice, 35.15% VOE, 89.72% sensitivity, 91.44% precision and 89.30% specificity.
4.5.1 Evaluation of the model performance on more challenging Test Datasets
To evaluate the performance of the PDLF-Net on more challenging data, we test it on images that have been subjected to rotation, illumination change and additional noise, as indicated in Figure 13. To assess the improved capability of the PDLF-Net to learn image features from challenging data, we compare it with the base model SegNet.
The graphs in Figure 13 show that the PDLF-Net performs better on all challenging images, with an average improvement of more than 2.00% over SegNet on each metric. This supports the capability of the PDLF-Net to capture more spatial features in noisy, transparent and low-contrast images.
4.5.2 Training and Testing Time Evaluation
In this section we compare the training and testing times of the PDLF-Net against other well-known CNN-based segmentation models, as shown in Table 6.
|Training time (min)||12.17||11.98||11.18||11.08||10.53||7.78||9.56|
|Testing time (sec)||6.19||5.92||6.15||6.12||5.92||3.21||4.31|
Table 6 shows that, although the training and testing times of the PDLF-Net are slightly higher than those of the other models, they are still low and feasible for practical segmentation tasks.
4.5.3 Comparison of the PDLF-Net Against Other State-of-the-Art Segmentation Networks
We compare the proposed model against U-Net, FCN, SegNet, Canny edge-based segmentation, Otsu thresholding, $k$-means clustering and region growing segmentation techniques on the same test dataset. Because the classical methods (Canny, Otsu and region growing) need post-processing to produce better segmentation results, we use the same post-processing techniques for all of them so that the results are comparable. In these experiments, PDLF-Net, SegNet, U-Net and FCN are all trained on the augmented training EM dataset and tested on the same test dataset. The classical methods are applied to the test dataset only. The results obtained for each method are presented in Table 7.
Table 7 shows that the PDLF-Net performs better than the other methods, achieving the highest values in all metrics. On average, the best-performing blocks for the PDLF-Net are block 3 and block 5.
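For readers unfamiliar with the classical baselines, the following is a minimal sketch of Otsu thresholding, one of the methods listed above. It is a textbook formulation written in plain Python, not the implementation used in the experiments. The function name and the toy pixel values are hypothetical, treating bright pixels as foreground is an assumption, and the post-processing mentioned in the text is not shown.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form one class and pixels with value > t the other.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of the values of those pixels
    for t in range(levels):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixel in the low class yet
        if weight_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_t = variance, t
    return best_t


# Toy flattened greyscale image with a dark and a bright group of pixels.
toy_pixels = [50, 52, 48, 200, 198, 202]
threshold = otsu_threshold(toy_pixels)
mask = [1 if value > threshold else 0 for value in toy_pixels]

print(threshold)
print(mask)
```

Unlike the trained networks, this method has no parameters to learn, which is consistent with the classical methods being applied to the test dataset only.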
4.6 Method’s Limitations
It should be noted that, although the proposed method has shown potential on EMs, it focuses only on the segmentation of one or two microorganisms per image and not on biofilms. Examples of the segmentation of two EMs in the same image can be seen in Fig. 11 for classes DC3 and DC4, while the other classes contain only one EM per image. In future work, we will extend our scope by testing our method on microorganism datasets with more than two microorganisms, clusters and biofilms.
5 Conclusion and Future Work
In this research we propose a Pairwise Deep Learning Feature Network for the segmentation of weakly visible EMs. It combines the advantages of both handcrafted features (by identifying the Shi and Tomasi interest points of the foreground) and deep learning features (by extracting deep learning features from the patches centered on each interest point). Then, in order to learn the intermediate spatial characteristics between nearby interest points, we pair the extracted deep learning features using Delaunay triangulation. The results show that the proposed network improves the performance of the base model SegNet and focuses more on the foreground, which helps to overcome segmentation challenges such as noise and low contrast. Apart from being useful for the segmentation of EMs, the proposed network can also be applied to the segmentation of brain tumor and breast cancer images.
During initial experiments we tested the pairwise deep learning features on the binary classification of two EM classes using an SVM, and obtained promising results. Therefore, the pairwise features can be suitable not only for segmentation tasks but also for classification and image matching.
In future work, we plan to use other convolutional neural networks such as Inception, Xception and DenseNet for the extraction of deep learning features to further improve our segmentation results.
We thank Prof. B. Zhou, Dr. F. Ma (University of Science and Technology Beijing, China), Prof. Y. Zou (Freiburg University, Germany), B.E. X. Zhu (Johns Hopkins University, US) and B.E. B. Lu (Huazhong University of Science and Technology, China) for their previous cooperation in this work. We also thank Miss Z. Li and Mr. G. Li for their important discussions. This work is supported by the “National Natural Science Foundation of China” (No. 61806047).