Deep localization of protein structures in fluorescence microscopy images

10/09/2019 · by Muhammad Tahir, et al.

Accurate localization of proteins from fluorescence microscopy images is a challenging task: inter-class similarities and intra-class disparities raise serious difficulties for multi-class classification. Conventional machine learning-based image prediction relies heavily on pre-processing, such as normalization and segmentation, followed by hand-crafted feature extraction before classification to identify useful, informative, and application-specific features. We propose an end-to-end Protein Localization Convolutional Neural Network (PLCNN) that classifies protein localization images more accurately and reliably. PLCNN directly processes raw imagery without any pre-processing steps and produces outputs without customization or parameter adjustment for a particular dataset. The output of our approach is computed from the probabilities produced by the network. Experimental analysis is performed on five publicly available benchmark datasets. PLCNN consistently outperformed the existing state-of-the-art approaches from machine learning and deep architectures.




I Introduction

Protein subcellular localization refers to the spatial distribution of different proteins inside a cell. To understand various cellular processes, it is crucial to comprehend the functions of proteins, which are in turn highly correlated with their native locations inside the cell [1, 2, 3, 4, 5]. Protein functions can be better grasped by identifying the protein subcellular spatial distributions. For instance, proteins at mitochondria perform aerobic respiration and energy production in a cell [2, 6]. During drug discovery, precise information about the location of proteins can also help in identifying drugs [1]. Similarly, information about the location of proteins before and after using certain drugs can reveal their effectiveness [2, 7]. Proteins residing in different locations are dedicated to performing particular functions, and any change in their native localizations may be a symptom of a severe disease [6]. Therefore, capturing changes in proteins' native locations is significant in detecting abnormal behavior ahead of time, which may be important for diagnosis or treatment.

Microscopy techniques are employed to capture subcellular localization images of proteins in a cell, which were previously analyzed using traditional wet-lab methods. However, advances in microscopy have brought an avalanche of medical images; hence, manual analysis and processing of these images has become nearly impossible for biologists. Moreover, subjective inspection of images may lead to errors in the decision-making process [1, 5, 8]. It is highly likely that images generated for proteins belonging to the same class look visually different, while proteins belonging to two different classes may look alike. Such a situation leads to poor performance of classification systems. These problems are commonly addressed by applying different hand-crafted feature extraction strategies to capture multiple views of the same image [9]; however, this is a cumbersome job and may fail to discriminate with high accuracy.

Due to the reasons mentioned above, automated computational techniques continued to be the focus of attention for many researchers in computational biology and bioinformatics over the last two decades [6]. Consequently, substantial advancement has been observed concerning the automated computational methods, improving the performance of protein subcellular localization from microscopy images.

Our primary contributions are: 1) we introduce a novel architecture, which exploits different features at various levels in distinct blocks; 2) we investigate the effect of different components of the network and demonstrate improved prediction accuracy; and 3) we provide an extensive evaluation on five datasets against a number of traditional and deep-learning algorithms.


Fig. 1: Image datasets for protein localization; each image belongs to a different class. Most of the images are sparse.

II Related Works

Murphy's group instituted pioneering work on machine learning computational methods to accurately predict subcellular locations of proteins from fluorescence microscopy images. In this connection, Boland et al. proposed to utilize Zernike moments and Haralick texture features in conjunction with a neural network to classify protein images from the CHO dataset [10] into five distinct categories. Next, as an extension of their earlier work, Murphy et al. [11] introduced a quantitative approach in which they not only employed Zernike moments and Haralick texture features for the description of protein images but also presented a new set of geometric and morphological features. Back-propagation neural networks, linear discriminators, K-nearest neighbors, and classification trees were investigated for this purpose. Adopting the previous feature extraction techniques, Huang and Murphy [12] formed an ensemble of three classifiers to localize the subcellular proteins.

The approaches proposed by Murphy's group demonstrated significant performance in discriminating protein localization images. However, they had to apply a number of feature extraction techniques, each dedicated to capturing a certain aspect of protein images. Following the works of Murphy et al., a novel feature extraction technique known as Threshold Adjacency Statistics (TAS) was proposed in [13], in which the input image is converted into binary images using different thresholds. In the next step, nine statistics are computed from each binary image, which serve as input to support vector machines (SVMs) for classification. TAS is able to extract meaningful information from protein images with low computational complexity; however, appropriate threshold selection has a great impact on the performance of the generated features. Moreover, Chebira et al. [8] reported a multi-resolution approach using Haar filters to decompose an input image into multi-resolution subspaces, extracting Haralick, morphological, and Zernike features, and performing classification in the respective multi-resolution subspaces. The results obtained in this way are combined through a weighting algorithm to yield a global prediction. The proposed multi-resolution approach demonstrated enhanced performance that comes at the expense of increased computational cost.

Nanni & Lumini [14] presented concatenated features of invariant LBP, Haralick, and TAS in conjunction with a random subspace of Levenberg-Marquardt neural networks. The LBP technique is the choice of many researchers for texture classification due to its intrinsic properties: it is rotation invariant, resistant to illumination changes, and computationally efficient. Despite its simplicity, it is capable of extracting minute details from image texture; however, its noise-sensitive nature may lead to poor performance. Building on the success of LBP features, Nanni et al. [15] put forward variants of LBP for feature extraction, exploiting topological arrangements for neighborhood calculation as well as several encoding schemes for the assessment of local gray-scale differences. The resultant features are then fed to a linear SVM for training. These variants curbed the noise-sensitive behaviour of LBP and enhanced its discriminative power.

In fluorescence microscopy, Li et al. [16] combined concepts from LBP and TAS to develop a novel technique, Local Difference Patterns (LDP), engaging an SVM as classifier. LDPs are invariant to translation and rotation and showed better performance compared to other simple techniques such as Haralick features and TAS. Similar to [17], Tahir and Khan [9] employed the GLCM, which exploits statistical and Texton features; classification is performed through SVMs, and the final prediction is obtained through a majority voting scheme. The proposed technique has shown somewhat better performance through efficient exploitation of two simple methods. Moreover, Tahir et al. [18] enhanced the discriminative power of the TAS technique by incorporating seven threshold ranges, resulting in seven binary images as compared to three [13]. Seven SVMs are trained using features from each binarized image, while a majority voting scheme delivers the final output. Though the performance of this technique is better than that of its classical counterpart, it requires the calculation of additional threshold values, which makes it computationally expensive.

The core issue with classical machine learning methods is the identification of appropriate features for describing protein images with maximum discriminating capability, and the selection of a proper classifier to benefit from those features. Any single feature extraction technique usually captures only one aspect of the essential characteristics of protein images; hence, different feature extraction strategies are applied to extract diverse information from the same image. Additionally, segmentation and feature selection may also be required to obtain more relevant and useful information from protein images, which results in additional computational cost, time, and effort [19]. Even when the extracted features reasonably describe the data under consideration, it cannot be guaranteed that the same technique will work for data other than that for which it was crafted [20].

In recent years, convolutional neural networks (CNNs) have attracted the focus of many researchers in a variety of problem domains [4]. Deep learning provides solutions that avoid the cumbersome tasks of classical machine learning pipelines [4, 5, 19, 20]: deep prediction systems learn features directly from raw images without the need to design and identify handcrafted feature extraction techniques. Similarly, in CNNs, pre-processing is not a primary requirement, in contrast to classical prediction models.

Fig. 2: A glimpse of the proposed network used for localization of the protein structures in the cell. The composition of R1, R2, P1 and P2 is provided below the network structure, where the blocks with subscript 1 have a smaller number of convolutions than those with subscript 2.

In the field of computational biology and bioinformatics, Dürr and Sick [21] applied a convolutional neural network to biological images for classification of cell phenotypes; the input to their model is a segmented cell rather than a raw image. More recently, Pärnamaa and Parts [4] developed a CNN model named DeepYeast for classification of yeast microscopy images and localization of proteins. Shao et al. [5] coupled classical machine learning with CNNs for classification: AlexNet [22] is used to extract features from natural images, followed by a partial parameter transfer strategy to extract features from protein images. Next, feature selection is performed on the last fully connected layer using the Lasso model [23], and the resultant features are fed into an RBF-SVM for the final output.

To classify efficiently, Godinez et al. [24] developed a multi-scale CNN architecture that processes images at various spatial scales over parallel layers. For this purpose, seven differently scaled versions of each image are computed and fed into the network. Each scale is processed through three convolutional layers, establishing a convolutional pathway that works independently of the others and captures relevant features appearing at that particular scale. Next, Kraus et al. [20] trained DeepLoc, a convolutional neural network consisting of convolutional blocks and fully connected layers: the convolutional layers identify invariant features from the input images, while the fully connected layers classify the input images based on those features.

Lately, Xiao et al. [19] analyzed various types of deep CNNs for their performance against conventional machine learning techniques. Comparable to DeepYeast [4], Xiao et al. [19] implemented an 11-layer CNN with batch normalization, similar to VGG [25]. The authors further experimented with and analyzed VGG, ResNet [26], Inception-ResNet [27], a straightened ResNet (a modified version), and CapsNet [28]. Besides, as a separate experiment, image features were extracted using the convolutional layers of VGG (employing batch normalization), with a conventional machine learning classifier replacing the last fully connected layer. The results obtained using various CNN models proved their efficiency compared to conventional machine learning algorithms. Recently, Lao & Fevens [29] employed ResNet [26] and many of its variants for cell phenotype classification from raw imagery without performing any prior image segmentation; they demonstrated the capabilities of WRN [30], ResNeXt [31], and PyramidNet [32].

Our method, the Protein Localization Convolutional Neural Network (PLCNN), employs a multi-branch network with feature concatenation. Each branch of the network computes different image features due to its block structure based on distinct skip connections. Unlike traditional methods, no pre-processing or post-processing is performed to achieve favorable or data-specific results. In the next section, we provide details about our network (code is available online).

III Proposed Network

Recently, plain networks such as VGG [25], residual networks such as ResNet [26], and densely connected networks such as DenseNet [33] have delivered state-of-the-art performance in object classification, recognition, and detection while offering training stability. Inspired by elements of these networks, we design a modular network for localization of protein structures in cell images. The design consists of three types of modules: 1) without any skip connections, 2) with skip connections, and 3) with dense connections. Figure 2 outlines the architecture of our network.

Network elements: Our proposed network has a modular structure composed of different modules. The variation of each module is depicted via the colors employed: orange represents the residual part, the blue blocks are for non-residual learning, and the golden block uses dense connections to extract features from the images. The outputs of the residual and non-residual blocks are concatenated, except for the first blocks. In our experiments, we typically employ two filter sizes in the convolutional layers. Next, we explain the difference between the blocks.

Apart from the noticeable difference between the modules based upon the connection types, the modules are distinct in their composition of elements. To be more precise, our network is governed by four meta-level structures; the connection types in the modules, the number of modules, the elements in the modules, and the number of feature maps.

The high-level architecture of our network can be regarded as a chain of modules of residual and non-residual blocks, where concatenation happens after each block. The output of each concatenation is fed into a convolutional layer followed by ReLU to compress the high number of channels for computational efficiency. At the end of the network, the output features of the residual, non-residual, and dense parts are stacked together, flattened, and passed through the fully connected layer to produce probabilities equal to the number of classes. The class with the highest probability is declared as the protein type present in the image.
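The concatenate-then-compress pattern described above can be sketched in NumPy as follows; the channel counts and the random weight matrix are hypothetical stand-ins, and the matrix product over the channel axis plays the role of a 1×1 convolution:

```python
import numpy as np

# Toy branch outputs: channels-first feature maps from two parallel branches
# (channel counts are hypothetical, chosen only for illustration).
residual_out = np.ones((8, 16, 16))   # 8 channels
plain_out = np.ones((4, 16, 16))      # 4 channels

# Concatenate along the channel axis, as done after each block pair.
stacked = np.concatenate([residual_out, plain_out], axis=0)   # (12, 16, 16)

# Channel compression: a 1x1 convolution is equivalent to a matrix product
# over the channel axis; ReLU follows, as in the text.
rng = np.random.default_rng(0)
w = rng.standard_normal((6, stacked.shape[0])) * 0.1          # 12 -> 6 channels
compressed = np.maximum(np.tensordot(w, stacked, axes=1), 0)  # (6, 16, 16)
```

The same pattern repeats after every block pair; only the channel counts change.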

The simplest of the three modules is the plain one, which comprises convolutional layers, each followed by ReLU, and a final max-pooling operation at the end of the module. Moreover, there are two types of plain modules, i.e. P1 and P2, where the difference lies in the number of convolutional layers: the former contains two and the latter contains three.

The residual modules consist of two convolutions, where batch normalization and ReLU follow the first one, while the second is followed by only batch normalization. The input of the block is added to the output of the second batch normalization. This structure is repeated two times in each R1 residual module, while for R2, a strided convolution is added between the two structures to match the size of the corresponding plain modules before concatenation. The architectures of R1 and R2 are shown in the lower part of Figure 2. The features block takes its inspiration from DenseNet [33], where each layer is stacked with the previous layers. These modules aim at learning kernel weights that predict accurate probabilities. The skip connections in the residual and dense modules help to propagate the loss gradients without a bottleneck in both the forward and backward directions.

Formulation: Let us suppose that an image x_0 is passed through a deep network having L layers, where each layer l implements a non-linear transformation function H_l(·), and l represents the index of the layer. H_l(·) can be composed of compound operations, e.g. convolution, batch normalization, ReLU, or pooling; the output of the l-th layer is then denoted as x_l.

Non-residual modules: Non-residual convolutional networks pass the input through module H_l to obtain the features of layer l, i.e. connecting the output with the input via a single feed-forward path, which gives rise to the following layer transition:

x_l^P = H_l(x_{l-1}),

where x_l^P represents the output of the non-residual (plain) non-linear transformation module.

Residual modules: On the other hand, residual blocks connect the input with the output using a skip, also known as bypass, connection over H_l as

x_l^R = H_l(x_{l-1}) + x_{l-1},

where x_l^R indicates features from the residual module.

Dense connections: The dense modules employ dense connections, which receive the features of all the previous layers as input:

x_l^D = H_l([x_0, x_1, …, x_{l-1}]),

where [x_0, x_1, …, x_{l-1}] represents the concatenation of the feature maps from layers 0, 1, …, l−1. Similarly, x_l^D refers to the output features from the dense module.
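The three layer transitions above can be illustrated with a minimal NumPy sketch, where the function H below is an arbitrary stand-in for the composite transformation (not the network's actual convolutions):

```python
import numpy as np

def H(x):
    # Stand-in for the composite transformation (conv + BN + ReLU);
    # any non-linear map suffices to illustrate the connectivity.
    return np.maximum(0.5 * x + 0.1, 0.0)

def plain_step(x_prev):
    # Non-residual: a single feed-forward path.
    return H(x_prev)

def residual_step(x_prev):
    # Residual: the skip connection adds the input to the transformed output.
    return H(x_prev) + x_prev

def dense_step(features):
    # Dense: concatenate the feature maps of all preceding layers first.
    return H(np.concatenate(features))

x0 = np.array([1.0, -2.0, 3.0])
```

Note how only the connectivity differs: the transformation H is shared, while the input each step receives changes.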

Composite function: Inspired by [26] and [33], we define the composite function H_l(·) as three consecutive operations: convolution, followed by batch normalization and ReLU.

Channels compression: The number of channels are reduced after the final concatenation in the dense module as well as after the concatenation of the feature maps from non-residual and residual modules to improve the model compactness and efficiency.

Label prediction: As a final step, the features of all the modules are stacked and passed through a fully connected layer, after which a softmax produces probabilities equal to the number of classes present in the corresponding dataset. The class with the highest probability is taken as the prediction:

c* = argmax_i softmax(F(x_L))_i,

where x_L represents the output of the last transformation function, F(·) is the fully connected operation, and the argmax operator selects the highest probability and maps it to the predicted class label c*.

Network loss: The output of the fully connected layer is fed into the softmax function, which is the generalization of the logistic function to multiple classes. The softmax normalizes the values into the [0, 1] interval, where the normalized values add up to 1. The softmax can be described as

softmax(z)_i = exp(z_i) / Σ_j exp(z_j),

where z represents the vector of actual values (logits) corresponding to the mutually exclusive classes.
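A minimal softmax implementation (with the standard max-subtraction trick for numerical stability, an implementation detail not discussed in the text) looks like:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result lies in (0, 1)
    # and sums to 1, as required of a probability distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # example logits from the FC layer
probs = softmax(logits)
```

The predicted class is then simply `probs.argmax()`.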

We employ cross-entropy as the loss function, which computes the difference between the predicted probabilities and the actual distribution of the class. One-hot encoding is used for the actual class distribution, where the probability of the real class is 1 and all other probabilities are zero. The cross-entropy loss is given by

L_CE = − Σ_i y_i log(p_i),

where y_i is the actual probability and p_i is the estimated probability of class i.

IV Experimental Settings

First, we detail the training of our network. Subsequently, we discuss the datasets used in our experiments. These include HeLa [34], CHO [10], LOCATE datasets [13], and Yeast [4]. Next, we evaluate our network against conventional algorithms such as SVM-SubLoc [1], ETAS-Subloc [18], and IEH-GT [9] as well as convolutional neural networks such as AlexNet [22], ResNet [26], GoogleNet [35], DenseNet [33], M-CNN [24] and DeepYeast [4]. In the end, we analyze various aspects of the proposed network and present ablation studies.

IV-A Training Details

The input to our network during training consists of images resized to 224×224 from the corresponding datasets. Training and testing are performed via 10-fold cross-validation, and there is no overlap, i.e. the two sets are disjoint in each iteration. We also augment the training data by applying conventional techniques such as horizontal and vertical flipping as well as rotation of the images within a fixed angular range. We also normalized the images using the standard ImageNet mean and standard deviation.

We implemented the network using the PyTorch framework and trained it on P100 GPUs via the SGD optimizer [36]. The initial learning rate was fixed, with weight decay and momentum regularization; the learning rate was halved after every 60 epochs, and the system was trained for about 200 epochs with a fixed batch size. The training time varied per dataset; as an example, training on the CHO dataset [10] took around 14 minutes for 200 epochs. The residual component of the network was initialized from the weights of ResNet [26], the non-residual part from VGG [25], and the densely connected section from DenseNet [33].
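The step schedule described above (halving every 60 epochs) can be expressed as a small function; the initial rate `lr0 = 0.01` here is a placeholder, since the exact value is not critical to the illustration:

```python
def learning_rate(epoch, lr0=0.01):
    # lr0 is a hypothetical initial rate; the schedule halves the rate
    # after every 60 epochs, as used in our training.
    return lr0 * (0.5 ** (epoch // 60))
```

In PyTorch, the equivalent would be `torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.5)`.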

IV-B Datasets

We analyzed the performance of the PLCNN approach on five benchmark subcellular localization datasets, which are described as follows.

Method            HeLa   CHO    Endo   Trans  Yeast
Machine learning
  SVM-SubLoc      99.7   -      99.8   98.7   -
  ETAS-SubLoc     -      -      99.2   91.8   -
  IEH-GT          -      99.7   -      -      -
CNN-specific
  GoogleNet       92.0   91.0   -      -      -
  M-CNN           91.0   94.0   -      -      -
  DeepYeast       -      -      -      -      91.0
  PLCNN (Ours)    93.0   100.0  99.8   99.6   91.0

TABLE I: Performance comparison with machine learning and CNN-specific algorithms. "Endo" and "Trans" are abbreviations for the LOCATE Endogenous and Transfected datasets, respectively. Best results are highlighted in bold.

Dataset   AlexNet   ResNet   DenseNet   PLCNN (Ours)
Yeast     80.9      81.5     81.7       91.0
HeLa      85.1      86.5     87.9       93.0

TABLE II: Performance against traditional CNN methods using Yeast and HeLa datasets. The best results are in bold.
  • HeLa dataset: The HeLa dataset [34] is a repository of 2D fluorescence microscopy images from HeLa cell lines, where each organelle is stained with a corresponding fluorescent dye. Overall, there are 862 single-cell images distributed in 10 categories.

  • CHO dataset: The CHO dataset [10] is developed from Chinese Hamster Ovary cells and contains 327 fluorescence microscopy images distributed in five different classes.

  • LOCATE datasets: LOCATE is a compilation of two datasets [13], i.e. LOCATE Endogenous and LOCATE Transfected, containing 502 and 553 subcellular localization images distributed in 10 and 11 classes, respectively.

  • Yeast dataset: We have used the Yeast dataset developed by Pärnamaa & Parts [4], which consists of 7132 microscopy images distributed over 12 distinct categories. To augment the original dataset, the images were cropped into 64×64 patches, generating 90000 samples in total. These patches are split disjointly into 65000 training, 12500 validation, and 12500 testing patches.
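The patch extraction used to build the Yeast samples can be sketched as follows (a simplified non-overlapping cropping; the original pipeline may differ in how it tiles the micrographs):

```python
import numpy as np

def to_patches(img, size=64):
    # Crop a 2D micrograph into non-overlapping size x size patches;
    # edge remainders are discarded in this sketch.
    h, w = img.shape[:2]
    rows, cols = h // size, w // size
    img = img[:rows * size, :cols * size]
    return (img.reshape(rows, size, cols, size)
               .swapaxes(1, 2)
               .reshape(rows * cols, size, size))

patches = to_patches(np.zeros((256, 256)))  # 16 patches of 64x64
```

A 256×256 image thus yields 16 patches, which is how a few thousand images can expand into tens of thousands of samples.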

IV-C Comparisons

In this section, we provide comparisons against state-of-the-art algorithms on the datasets, as mentioned earlier. The proposed PLCNN results are reported without any customization or parameter adjustment for a particular dataset.

IV-C1 Binary phenotype classification

We employ the BBBC datasets [37], where only two phenotype classes are present, namely the neutral and positive control phenotypes. In our comparison on these datasets, our algorithm, as well as other deep learning methods including M-CNN [24], achieved perfect classification. The problem on these datasets is simple binary classification; hence, reporting detailed results on the BBBC datasets [37] is trivial.

Fig. 3: The average quantitative results of ten executions for each method on the HeLa dataset. Our PLCNN consistently outperforms the others by a significant margin.

IV-C2 Multi-class subcellular organelle classification






Fig. 4: Visualization results from Grad-CAM [38]. The visualizations are computed for the last convolutional outputs; the corresponding algorithms are labeled, and the left column shows the input images.

Traditional models: We present a comparative analysis of the PLCNN model against conventional machine learning models. For a fair comparison, we train and test our network using the same dataset configurations. Table I highlights the performance of PLCNN and the machine learning-based models. Although SVM-SubLoc [1] achieved 99.7% accuracy on the HeLa dataset [34], its pre-processing requires extensive effort and is highly time-consuming. Likewise, identifying suitable representative features is also a cumbersome job. Moreover, SVM-SubLoc [1], ETAS-SubLoc [18], and IEH-GT [9] are ensemble methods, i.e. combinations of multiple traditional classification algorithms, which is why their performances are higher than those of many more complex systems [14, 39, 40] on the HeLa and CHO datasets.

Tahir et al. [1] captured multi-resolution subspaces of each image before extracting features. Similarly, the ETAS-SubLoc model [18] performs extensive pre-processing for feature extraction, producing multiple thresholded images from a single protein image. Likewise, IEH-GT [9] achieved 99.7% accuracy on the CHO dataset, where the authors had to employ several handcrafted pre-processing and feature extraction steps to reach such efficient classification.

The comparative analysis in Table I reveals that PLCNN outperforms all methods on all datasets except HeLa [34], even though it does not require any pre-processing. PLCNN achieved 93.0% accuracy on HeLa [34], which is lower than that of SVM-SubLoc [1]; the latter employs an ensemble of classifiers. Furthermore, the traditional algorithms [1, 9] are usually tailored for specific datasets and hence only perform well on the particular datasets for which they are designed; they fail to deliver on other datasets, indicating limited generalization capability.

On the other hand, PLCNN performs well across multiple subcellular localization datasets. In particular, the LOCATE Transfected dataset has been observed to be one of the most challenging: the highest accuracy reported so far using conventional machine learning algorithms with careful feature extraction is 98.7%. PLCNN achieved 99.6% accuracy on this dataset, improving upon the traditional techniques by 0.9%.

Dataset       Split      SVM    ETAS   Ours
CHO           90%-10%    99.6   47.0   100.0
CHO           80%-20%    99.6   50.4   99.3
CHO           70%-30%    99.3   57.1   98.9
CHO           60%-40%    98.7   86.8   99.0
Endogenous    90%-10%    99.0   98.0   99.8
Endogenous    80%-20%    98.8   97.8   99.7
Endogenous    70%-30%    95.8   96.2   99.7
Endogenous    60%-40%    95.8   96.2   99.7
Transfected   90%-10%    98.0   93.4   99.6
Transfected   80%-20%    97.8   93.6   99.2
Transfected   70%-30%    96.2   93.8   99.3
Transfected   60%-40%    95.1   92.5   97.9

TABLE III: The effect of decreasing the training dataset. It can be observed that the performance of the traditional ensemble algorithms decreases with the decrease in training data, while PLCNN gives a consistent performance with a negligible difference.

CNN-specific models: Here, we discuss models specifically designed for protein localization; the results are presented in Table I. Our algorithm is the best performer on all datasets among the CNN-specific models. It should be noted that although GoogleNet [35] is not a CNN-specific model, M-CNN [24] compared against it, so we have also reproduced those numbers from [24]. Our model improves top-1 accuracy on HeLa [34] and CHO [10] by 2% and 6%, respectively, over the existing deep learning models. Most CNN algorithms ignore the LOCATE Endogenous and Transfected datasets; here, we also present results on both, which can serve as a baseline for future algorithms. Moreover, on Yeast the performance of our network is similar to that of the DeepYeast algorithm due to the small 64×64 patches and the limited information they contain.

CNN-agnostic models: We provide comparisons in Table II for the HeLa and Yeast datasets against CNN-agnostic models, i.e. networks designed for general classification and detection such as ResNet [26] and DenseNet [33]. The performance of these state-of-the-art algorithms is lower than that of PLCNN, which leads the second-best performing model, DenseNet [33], by 5.1% on the HeLa dataset and 2.9% on Yeast. Although the improvement on Yeast is smaller than on HeLa, the former is a challenging dataset due to the small size (64×64) of its images.

IV-D Ablation Studies

In this section, we investigate and analyze various aspects of our PLCNN model.

Fig. 5: Confusion matrix for the CHO dataset. The rows present the actual organelle class while the columns show the predicted ones. The results are aggregated over the 10-fold cross-validations. The accuracies for each class are summarized in the last row and column.

Influence of dataset size: To show that our model is robust, we start from a 90%:10% training:testing split and then reduce the training partition by 10% each time, increasing the testing set by 10%. Figure 3 presents the performance of each model on the HeLa dataset [34]. The results of ResNet [26] and AlexNet [22] fall below 80%, while ours is the highest when the split is 60%. Meanwhile, when 90% of the dataset is reserved for training and 10% for testing, the accuracy of our method is 5.1% higher than that of the second-best performing model, DenseNet [33]. Overall, our method leads for all training and testing partitions, which indicates the robustness of our architecture.

Effect of overfitting: Next, we investigate the effect of overfitting in our model and in the classical algorithms. The performances of the classical algorithms on the HeLa and CHO datasets are very high; however, this could be due to the small amount of data reserved for testing, usually between 5% and 10%. We present the effect of decreasing the training data size in Table III, which suggests that the high results of the classical methods are due to overfitting: when the training data decreases, their accuracy drops. Note that the results shown in Table III for the traditional algorithms are ensemble-based accuracies, which may partially mask this effect (detailed results for the individual members of each ensemble are given in the supplementary materials). On the other hand, our PLCNN performs consistently well on all three datasets, as the drop in performance is almost negligible.

Image attentions: Attention mechanisms [41, 42] are used in many computer vision applications to learn where networks focus. Though we have not explicitly applied attention in our network, we illustrate here that our method focuses on the object of interest. We utilize Grad-CAM [38] to visualize the attention of the networks: the features before the last layer are collected and provided to Grad-CAM [38]. Figure 4 illustrates the focus of each CNN method on sample images from the CHO and Yeast datasets. Our method provides the best results due to the correct identification of the proteins present in the images.
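The core combination step of Grad-CAM can be sketched in NumPy; the feature maps and gradients below are random placeholders, since in practice they come from a forward and backward pass through the trained network:

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations: (K, H, W) feature maps of the last conv layer;
    # gradients: (K, H, W) gradients of the class score w.r.t. those maps.
    weights = gradients.mean(axis=(1, 2))             # global-average-pool grads
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0)                          # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize for display
    return cam

rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))   # placeholder activations
G = rng.random((8, 7, 7))   # placeholder gradients
heatmap = grad_cam(A, G)
```

The resulting heatmap is upsampled to the input resolution and overlaid on the image, as in Figure 4.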

Fig. 6: Confusion matrix for the Yeast dataset. The predicted organelles are shown in the columns while the true values are in the rows. Summaries of the accuracies are given in the last row and column.

Confusion matrices: We present the confusion matrix of PLCNN for the CHO dataset [10] in Figure 5, aggregating the results over all cross-validation folds. The correctly classified organelle classes lie along the diagonal of the matrix, while the off-diagonal elements show the misclassified organelles; most off-diagonal elements are zeros. The overall accuracy is in the final element of the diagonal, while the individual accuracies are summarized in the right column and last row. Our PLCNN perfectly classifies four out of the five organelle types. The only incorrect classification is for Golgi, where the accuracy is 97.7%, as our PLCNN confuses one image of Golgi with Lysosome. These results are consistent with the traditional classifiers, and the incorrect classification may be due to the very similar patterns in these images.

Our results are better than those of the previous best-performing method, i.e., M-CNN [24]; for example, PLCNN accuracy on “Nucleolus” is 100% while M-CNN [24] achieves only 81%. Similarly, our method is also superior in performance to Ljosa et al. [43], which requires manual intervention.

Figure 6 displays the confusion matrix computed for the Yeast dataset [4]. Again, the correctly classified elements lie along the diagonal while the incorrect ones occupy the off-diagonal cells. The PLCNN performs relatively better on six protein types, i.e., “Cell”, “Mitochondria”, “Nuclear”, “Nucleus”, “Peroxisome”, and “Spindle”, while for the remaining classes the accuracy is above 82%; this may be due to the low number of training images. The greatest confusion is between the “Cell” and “Cytoplasm” protein types, amounting to 0.4%.

Prediction confidence: We compare the prediction confidence of traditional classifiers trained on the Yeast and HeLa datasets against our PLCNN on four images, as shown in Figure 7. Each image has the prediction probability for each algorithm underneath; red indicates an incorrect prediction, whereas green indicates a correct one. It can be observed that our method predicts the correct labels with high confidence, while the probability is very low when the prediction is incorrect. The image in the first column of Figure 7 is very challenging due to its minimal texture and almost complete lack of structure. All the methods failed to correctly identify the type of protein in this image; however, the competing methods’ prediction scores are much higher than ours. Similarly, our algorithm’s confidence is always high when the prediction is correct and low when it is incorrect. This shows the learning capability of our network.

          Image 1  Image 2  Image 3  Image 4
AlexNet      0.32     0.57     0.79     0.82
ResNet       0.69     0.68     0.49     1.00
DenseNet     0.60     0.79     0.75     0.90
PLCNN        0.26     0.88     1.00     1.00
Fig. 7: Prediction confidence of each network on four sample images. Correct predictions are highlighted in green while red depicts incorrect ones. Our method’s prediction score is high for correct outcomes and vice versa.
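The confidences reported above are simply the maximum class probability produced by each network's final softmax layer. A minimal sketch of that computation, where the logits and class names are hypothetical:

```python
import numpy as np

def softmax(logits):
    """Convert raw network outputs (logits) into a probability distribution."""
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def prediction_with_confidence(logits, labels):
    """Return the predicted label and its probability (the reported confidence)."""
    p = softmax(np.asarray(logits, dtype=float))
    i = int(np.argmax(p))
    return labels[i], float(p[i])

# Hypothetical logits for a three-class problem.
label, conf = prediction_with_confidence(
    [2.0, 0.1, -1.0], ["Golgi", "Lysosome", "Nucleolus"])
print(label, round(conf, 2))
```

A low maximum probability, as in the first column of Figure 7, signals that the network itself is unsure of its prediction.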

V Conclusion

We have proposed the PLCNN approach to analyze protein localization images. Our approach predicts subcellular locations from fluorescence microscopy images using raw intensity values as input to the network. PLCNN consistently outperformed the existing state-of-the-art machine learning and deep learning models over a diverse set of protein localization datasets. Our approach computes the output probabilities of the network to quantify protein localization predictions.

The image attention analysis reveals that the PLCNN network captures objects of interest in protein imagery while ignoring irrelevant and unnecessary details. The generalization capability of our approach is validated by its consistent performance across all the utilized datasets over many images with different backgrounds. Comparative analysis reveals that our approach is either better than or comparable to the current state-of-the-art models.


  • [1] M. Tahir, A. Khan, and A. Majid, “Protein subcellular localization of fluorescence imagery using spatial and transform domain features,” Bioinformatics, 2011.
  • [2] E. Glory and R. F. Murphy, “Automated subcellular location determination and high-throughput microscopy,” Developmental cell, 2007.
  • [3] Y. T. Chong, J. L. Koh, H. Friesen, S. K. Duffy, M. J. Cox, A. Moses, J. Moffat, C. Boone, and B. J. Andrews, “Yeast proteome dynamics from single cell imaging and automated analysis,” Cell, 2015.
  • [4] T. Pärnamaa and L. Parts, “Accurate classification of protein subcellular localization from high-throughput microscopy images using deep learning,” G3: Genes, Genomes, Genetics, 2017.
  • [5] W. Shao, Y. Ding, H.-B. Shen, and D. Zhang, “Deep model-based feature extraction for predicting protein subcellular localizations from bio-images,” FCS, 2017.
  • [6] Y.-Y. Xu, L.-X. Yao, and H.-B. Shen, “Bioimage-based protein subcellular location prediction: a comprehensive review,” FCS, 2018.
  • [7] D. N. Itzhak, S. Tyanova, J. Cox, and G. H. Borner, “Global, quantitative and dynamic mapping of protein subcellular localization,” Elife, 2016.
  • [8] A. Chebira, T. Merryman, G. Srinivasa, Y. Barbotin, C. Jackson, R. Murphy, and J. Kovacevic, “A multiresolution approach to automated classification of subcellular protein location images,” BMC Bioinformatics, 2007.
  • [9] M. Tahir and A. Khan, “Protein subcellular localization of fluorescence microscopy images: Employing new statistical and texton based image features and svm based ensemble classification,” Inform. Sciences, 2016.
  • [10] M. V. Boland, M. K. Markey, and R. F. Murphy, “Automated recognition of patterns characteristic of subcellular structures in fluorescence microscopy images,” Cytometry: The Journal of the International Society for Analytical Cytology, 1998.
  • [11] R. F. Murphy, M. V. Boland, M. Velliste, et al., “Towards a systematics for protein subcellular location: Quantitative description of protein localization patterns and automated analysis of fluorescence microscope images.” in ISMB, 2000.
  • [12] K. Huang and R. F. Murphy, “Boosting accuracy of automated classification of fluorescence microscope images for location proteomics,” BMC Bioinformatics, 2004.
  • [13] N. A. Hamilton, R. S. Pantelic, K. Hanson, and R. D. Teasdale, “Fast automated cell phenotype image classification,” BMC Bioinformatics, 2007.
  • [14] L. Nanni and A. Lumini, “A reliable method for cell phenotype image classification,” AI in medicine, 2008.
  • [15] L. Nanni, A. Lumini, and S. Brahnam, “Local binary patterns variants as texture descriptors for medical image analysis,” AI in medicine, 2010.
  • [16] C. Li, X.-h. Wang, L. Zheng, and J.-f. Huang, “Automated protein subcellular localization based on local invariant features,” The protein journal, 2013.
  • [17] B. Zhang and T. D. Pham, “Phenotype recognition with combined features and random subspace classifier ensemble,” BMC bioinformatics, 2011.
  • [18] M. Tahir, B. Jan, M. Hayat, S. U. Shah, and M. Amin, “Efficient computational model for classification of protein localization images using extended threshold adjacency statistics and support vector machines,” Computer methods and programs in biomedicine, 2018.
  • [19] M. Xiao, X. Shen, and W. Pan, “Application of deep convolutional neural networks in classification of protein subcellular localization with microscopy images,” Genetic epidemiology, 2019.
  • [20] O. Z. Kraus, B. T. Grys, J. Ba, Y. Chong, B. J. Frey, C. Boone, and B. J. Andrews, “Automated analysis of high-content microscopy data with deep learning,” Molecular systems biology, 2017.
  • [21] O. Dürr and B. Sick, “Single-cell phenotype classification using deep convolutional neural networks,” JBS, 2016.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [23] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 1996.
  • [24] W. J. Godinez, I. Hossain, S. E. Lazic, J. W. Davies, and X. Zhang, “A multi-scale convolutional neural network for phenotyping high-content cellular images,” Bioinformatics, 2017.
  • [25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [27] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, 2017.
  • [28] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in NIPS, 2017.
  • [29] Q. Lao and T. Fevens, “Cell phenotype classification using deep residual network and its variants,” IJPRAI, 2019.
  • [30] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv, 2016.
  • [31] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017.
  • [32] D. Han, J. Kim, and J. Kim, “Deep pyramidal residual networks,” in CVPR, 2017.
  • [33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
  • [34] M. V. Boland and R. F. Murphy, “A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of hela cells,” Bioinformatics, 2001.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
  • [36] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv:1609.04747, 2016.
  • [37] V. Ljosa, K. L. Sokolnicki, and A. E. Carpenter, “Annotated high-throughput microscopy image sets for validation.” Nature methods, vol. 9, no. 7, pp. 637–637, 2012.
  • [38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
  • [39] L. Nanni, S. Brahnam, and L. Alessandra, “Selecting the best performing rotation invariant patterns in local binary/ternary patterns,” in International Conference on IP, Computer Vision, and Pattern Recognition, 2010.
  • [40] C.-C. Lin, Y.-S. Tsai, Y.-S. Lin, T.-Y. Chiu, C.-C. Hsiung, M.-I. Lee, J. C. Simpson, and C.-N. Hsu, “Boosting multiclass learning with repeating codes and weak detectors for protein subcellular localization,” Bioinformatics, 2007.
  • [41] S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” arXiv:1906.12021, 2019.
  • [42] ——, “Real image denoising with feature attention,” arXiv:1904.07396, 2019.
  • [43] V. Ljosa, P. D. Caie, R. Ter Horst, K. L. Sokolnicki, E. L. Jenkins, S. Daya, M. E. Roberts, T. R. Jones, S. Singh, A. Genovesio, et al., “Comparison of methods for image-based profiling of cellular morphological responses to small-molecule treatment,” JBS, 2013.