Deep Learning-Based Strategy for Macromolecules Classification with Imbalanced Data from Cellular Electron Cryotomography

08/27/2019 ∙ by Ziqian Luo, et al. ∙ Tsinghua University, Carnegie Mellon University

A deep learning model trained on imbalanced data may perform poorly, since it can be dominated by the major classes and ignore the classes with only a small amount of data. In this paper, we apply deep learning based imbalanced data classification, for the first time, to cellular macromolecular complexes captured by cryo-electron tomography (Cryo-ET). We adopt a range of strategies to cope with imbalanced data, including data sampling, bagging, boosting, and a Genetic Programming based method. In particular, inspired by the Inception 3D network, we propose a multi-path CNN model that combines focal loss with mixup, which expands the Cryo-ET dataset. Each path is chosen to perform best on one type of data, and the network learns how to combine the paths to improve classification performance. In addition, extensive experiments show that the proposed method is flexible enough to cope with different numbers of classes by adjusting the number of paths in the multi-path model. To our knowledge, this work is the first application of deep learning methods for imbalanced data to the classification of cellular macromolecular complexes, which opens up a new path for cell classification in the field of computational biology.


I Introduction

Biological pathways depend on the function of macromolecular complexes, whose structure and spatial organization are critical to the function and dysfunction of pathways. Due to the limitations of data acquisition techniques, the native structural information of macromolecular complexes is extremely difficult to obtain [1]. With the development of biotechnology, Cryo-electron tomography (Cryo-ET) enables 3D visualization of cellular tissue in near-native state and sub-molecular resolution [2, 3, 4], making it a powerful tool for analyzing macromolecular complexes and their spatial organization within single cells [5].

However, the macromolecular complex data collected is often imbalanced. Protein concentrations can differ by as much as seven orders of magnitude [6]. One type of macromolecular complex may dominate the others, resulting in low accuracy on the rare types. In fact, data imbalance also occurs in most real-world classification problems: the collected data follows a long-tail distribution, i.e., data for a few object classes is abundant while data for the others is scarce. This phenomenon is termed the data-imbalanced classification problem [7]. Although data-imbalanced classification occurs frequently in computer vision, research on this topic has been rare in recent years. Almost all competitive datasets avoid data imbalance during training and evaluation. For instance, for the popular image classification datasets (such as CIFAR-10/100, ImageNet, Caltech-101/256, and MIT-67), the collectors have made efforts to ensure either that all classes have a minimum representation with sufficient data, or that the experimental protocols are reshaped to use an equal number of images for all classes during training and testing [8, 9].

In this paper, we conduct extensive experiments and explore various methods for dealing with data-imbalanced classification, such as data sampling, bagging, boosting, and a Genetic Programming based method. We show empirically that these methods are effective, apply them to the classification of macromolecular complexes in single cells, and achieve competitive classification performance. In particular, we make the following key contributions:

  • We summarize various well-known methods for dealing with data-imbalanced classification problems in order to improve classification performance, and we identify which of these methods combine best with our own model.

  • We apply methods for dealing with imbalanced data to the classification of cellular macromolecular complexes for the first time.

  • We propose a novel method for the data-imbalanced problem, termed the multi-path convolutional neural network (CNN), which improves classification performance over traditional methods. Moreover, our model is flexible enough to cope with different numbers of classes by adjusting the number of paths in the multi-path CNN.

II Related Work

II-A Cryo-electron tomography (Cryo-ET)

Cryo-electron tomography (Cryo-ET) is an imaging technique used to produce high-resolution three-dimensional views of samples, typically biological macromolecules in cells [9]. Cryo-ET is a specialized application of transmission electron cryomicroscopy (CryoTEM) in which samples are imaged as they are tilted, resulting in a series of 2D images that can be combined to produce a 3D reconstruction, similar to a CT scan of the human body. In contrast to other electron tomography techniques, samples are immobilized in non-crystalline ("vitreous") ice and imaged under cryogenic conditions, allowing them to be imaged without dehydration or chemical fixation, which could otherwise disrupt or distort biological structures [9, 10]. Cryo-ET [11, 12, 13] enables the 3D visualization of structures in a close-to-native state and at sub-molecular resolution within single cells [14, 15, 16, 17].

II-B Inception3D Network

[18] propose a tailored 3D variant of the Inception network [19], denoted Inception3D. The Inception network is a recent, successful CNN architecture that achieves competitive performance at relatively low computational cost [18]. CNNs [17] are well known for extracting features from an image by using convolutional kernels and pooling layers to emulate the response of an individual to visual stimuli. The work in [20] is the first application of deep learning to the systematic structural discovery of macromolecular complexes among large numbers (millions) of structurally highly heterogeneous particles captured by Cryo-ET. It represents an important step towards large-scale systematic detection of the native structures and spatial organization of large macromolecular complexes inside single cells.

II-C Mixup: Data-Dependent Data Augmentation

Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. [21] propose mixup, a simple learning principle to alleviate these issues. Essentially, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples; the training loss is the usual classification loss evaluated on the mixed inputs and mixed labels.

II-D Focal loss

[22, 49] find that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause of their reduced accuracy. They address this class imbalance by reshaping the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. The resulting Focal Loss [23] focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

II-E Sampling

II-E1 Oversampling

Oversampling balances the classes by increasing the number of minority-class samples. The most straightforward way is to simply duplicate minority samples to form multiple records. The disadvantage of this method is that, if the samples have few features, it may lead to over-fitting. Improved oversampling methods instead generate new synthetic minority samples by adding random noise or interference data, or by following certain rules, as in the SMOTE algorithm. The SMOTE procedure is described in Algorithm 1.

  Algorithm 1 SMOTE(T, N, k)

  Input: number of minority class samples T; amount of SMOTE N%; number of nearest neighbors k
  Output: (N/100) * T synthetic minority class samples

  (* If N is less than 100%, randomize the minority class samples as only a random percent of them will be SMOTEd. *)
  if N < 100 then
      Randomize the T minority class samples
      T = (N/100) * T
      N = 100
  end if
  N = (int)(N/100) (* The amount of SMOTE is assumed to be in integral multiples of 100. *)
  k = Number of nearest neighbors
  numattrs = Number of attributes
  Sample[ ][ ]: array for original minority class samples
  newindex: keeps a count of number of synthetic samples generated, initialized to 0
  Synthetic[ ][ ]: array for synthetic samples
  (* Compute k nearest neighbors for each minority class sample only. *)
  for i = 1 to T do
      Compute k nearest neighbors for i, and save the indices in the nnarray
      Populate(N, i, nnarray)
  end for

  Populate(N, i, nnarray) (* Function to generate the synthetic samples. *)
  while N ≠ 0 do
      Choose a random number between 1 and k, call it nn.
      This step chooses one of the k nearest neighbors of i.
      for attr = 1 to numattrs do
          Compute: dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
          Compute: gap = random number between 0 and 1
          Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
      end for
      newindex++
      N = N - 1
  end while
  return (* End of Populate. *)
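The SMOTE procedure of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `smote`, the argument names, the use of Euclidean distance for the nearest-neighbor search, and the optional `rng` argument are assumptions made for the example.

```python
import math
import random


def smote(samples, n_percent, k, rng=None):
    """Return synthetic minority samples (a sketch of Algorithm 1).

    samples   : list of minority-class feature vectors (lists of floats)
    n_percent : amount of SMOTE in percent, e.g. 200 gives 2 new points per sample
    k         : number of nearest neighbors considered for interpolation
    rng       : optional random.Random instance (hypothetical convenience argument)
    """
    rng = rng or random.Random()
    seeds = list(range(len(samples)))
    if n_percent < 100:
        # Only a random fraction of the minority samples is SMOTEd.
        rng.shuffle(seeds)
        seeds = seeds[: int(n_percent / 100 * len(samples))]
        n_percent = 100
    per_seed = int(n_percent / 100)

    synthetic = []
    for i in seeds:
        # Indices of the k nearest minority neighbors of sample i (Euclidean, assumed).
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(samples[i], samples[j]))
        nnarray = others[:k]
        if not nnarray:
            continue
        for _ in range(per_seed):
            nn = rng.choice(nnarray)
            new_point = []
            for attr in range(len(samples[i])):
                dif = samples[nn][attr] - samples[i][attr]
                gap = rng.random()  # drawn per attribute, as in Algorithm 1
                new_point.append(samples[i][attr] + gap * dif)
            synthetic.append(new_point)
    return synthetic
```

Because every new coordinate lies between a sample and one of its neighbors, the synthetic points stay inside the region already occupied by the minority class.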

 

II-E2 Undersampling

Undersampling is another popular method [24, 46] that results in the same number of examples in each class. However, as opposed to oversampling, examples are removed randomly from the majority classes until all classes have the same number of examples. While it might not appear intuitive, there is some evidence that in some situations undersampling can be preferable to oversampling [25]. A significant disadvantage of this method is that it discards a portion of the available data. To overcome this shortcoming, some modifications select the examples to be removed more carefully. For example, one-sided selection identifies redundant examples close to the boundary between classes [26]. A more general approach than undersampling is data decontamination, which can involve relabeling some examples [27].
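Random undersampling as described above can be sketched in a few lines of Python. The function name `random_undersample` and its arguments are hypothetical, and the sketch covers only the plain random variant, not one-sided selection or data decontamination.

```python
import random
from collections import defaultdict


def random_undersample(features, labels, rng=None):
    """Randomly drop examples until every class has as many as the smallest class."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)

    # Size of the rarest class: every class is reduced to this count.
    target = min(len(items) for items in by_class.values())

    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, target):  # sampling without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

Applied to class counts such as 1043, 80, 125 and 386, this keeps 80 examples per class and discards the rest, which makes the data loss mentioned above concrete.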

III Our Method

III-A Overview of our model

Fig. 1: Overview of the architecture of our model.

Inspired by the Inception 3D network [28, 44, 45], we propose a novel model termed the multi-path CNN, combined with mixup and focal loss, which significantly improves performance on the imbalanced Cryo-ET data. An overview of the proposed model is shown in Figure 1.

III-B Mixup on the Cryo-ET dataset

Motivated by [29, 42, 43], we introduce a simple and data-agnostic data augmentation routine, termed mixup. In a nutshell, mixup constructs virtual training examples

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $x_i$, $x_j$ are raw input vectors and $y_i$, $y_j$ are one-hot label encodings. $(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random from our training data, and $\lambda \in [0, 1]$. Therefore, mixup extends the training distribution by incorporating the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets.
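The construction of virtual training examples can be sketched as follows. This is only an illustration: the helper names `mixup_pair` and `mixup_batch` are hypothetical, inputs are flattened to plain lists, and drawing the mixing coefficient from a Beta(alpha, alpha) distribution with `alpha=0.2` is an assumption, since the text above only requires the coefficient to lie in [0, 1].

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two examples and of their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def mixup_batch(xs, ys, alpha=0.2, rng=None):
    """Mix every example of a batch with a randomly chosen partner.

    alpha is an assumed hyper-parameter: lam is sampled from Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed = [mixup_pair(xs[i], ys[i], xs[j], ys[j], lam)
             for i, j in enumerate(partners)]
    return [m[0] for m in mixed], [m[1] for m in mixed], lam
```

The mixed labels are soft targets that still sum to one, so they can be used directly with a cross-entropy style loss.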

III-C Multi-path CNN

Subtomogram data from Cryo-ET are often imbalanced because the corresponding complexes occur in different proportions in the cell. However, little work has been done on the problem of imbalanced Cryo-ET data. In this section, we describe in detail our proposed multi-path CNN model, shown in Figure 2, which copes with imbalanced Cryo-ET data in combination with the related methods from Section II.

Fig. 2: Model of multi-path CNN

Unlike a traditional CNN, which has a single path built from a serial combination of convolutional kernels and pooling layers, our multi-path CNN has multiple parallel combinations of convolutional kernels and pooling layers, chosen according to the composition of our imbalanced data. More specifically, the number of parallel paths in our multi-path CNN equals the number of classes in the Cryo-ET data. Each path learns from the imbalanced Cryo-ET data and becomes the best classifier for one type of data before the concatenation layer. We first determine a single path for each type of data because it is easier to find, through deep learning, a classifier that recognizes a single class well, regardless of the other classes.

To find the best classifier for a given type of Cryo-ET data, we first carry out many experiments with serial CNNs and find the structure that best recognizes that single type in the imbalanced dataset. Once all paths have been identified individually, they are concatenated. It is worth mentioning that sampling methods are not used anywhere in this process, since they may lead to overfitting or underfitting. Each single path learns its classification task from the original imbalanced data, and the whole multi-path CNN learns how to balance these paths and make a more precise decision across all types of Cryo-ET data.

III-D Filter Concat

The final model uses the most suitable convolution kernels on each path, so that each path performs as well as possible. The outputs learned on the four paths are then combined by filter concatenation and passed to a new pooling layer. The final classification result is obtained by a softmax after the fully connected layer L.
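The combination step can be illustrated with a deliberately simplified sketch. It assumes each path has already produced a flat feature vector, omits the pooling layer, and uses a single fully connected layer followed by softmax. The function names, the weight matrix, and the biases are hypothetical placeholders, not the trained model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_concat(path_features):
    """Concatenate the feature vectors produced by the parallel paths."""
    merged = []
    for feats in path_features:
        merged.extend(feats)
    return merged


def classify(path_features, weights, biases):
    """Fully connected layer over the concatenated features, then softmax.

    weights has one row per class and one column per concatenated feature.
    """
    merged = filter_concat(path_features)
    logits = [sum(w * f for w, f in zip(row, merged)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)
```

With four paths, `path_features` holds four vectors, and the output is a probability distribution over the four classes.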

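III-E Focal Loss for imbalanced classification

We use an $\alpha$-balanced variant of the focal loss:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t),$$

where $p_t$ is the estimated probability of the ground-truth class, $\alpha_t$ is the class-balancing weight, and $\gamma \ge 0$ is the focusing parameter. We adopt this form in our experiments as it yields slightly improved accuracy over the non-$\alpha$-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing $p$ with the loss computation, resulting in greater numerical stability.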
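For a single binary prediction, the α-balanced focal loss can be written as the short function below. It is a sketch for illustration: the name `focal_loss` is hypothetical, the defaults `alpha=0.25` and `gamma=2.0` are assumed values that are not stated in this section, and the function takes a probability directly instead of fusing the sigmoid into the loss.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Alpha-balanced focal loss for one binary prediction.

    p : predicted probability of the positive class
    y : ground-truth label, 1 for positive and 0 for negative
    alpha, gamma : assumed defaults, chosen only for the example
    """
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)                # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the loss reduces to α-weighted cross entropy. Larger values of `gamma` shrink the contribution of examples that are already classified with high confidence.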

                  Predicted positives   Predicted negatives
Real positives    TP                    FN
Real negatives    FP                    TN
TABLE I: Confusion matrix for binary classification

            proteasome_d   ribosome   TRiC    proteasome_s
Number      1043           80         125     386
Ratio (%)   63.831         4.896      7.650   23.623
TABLE II: Class ratio of Cryo-ET dataset.

III-F Evaluation metrics

Evaluation metrics play an important role in assessing classification performance and guiding model design. Most traditional methods for imbalanced data concentrate on binary classification. In a binary classification problem, class labels are divided into positive and negative. As the confusion matrix in Table I shows, true positive (TP) and true negative (TN) denote the numbers of positive and negative samples that are correctly classified, while false negative (FN) and false positive (FP) denote the numbers of positive and negative samples that are wrongly classified.

Fig. 3: The 2D visualizations of the 3D macromolecules: (a) proteasome_d, (b) ribosome, (c) TRiC, (d) proteasome_s.

Accuracy is the most commonly used metric for evaluating model performance. However, it is not a proper measure in imbalanced classification problems, since the minor class has minimal impact on accuracy compared with the major class. To address this, a pair of metrics, precision and recall, is adopted:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

Meanwhile, the F-score integrates precision and recall into a single metric for convenient model evaluation:

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta$ represents the weight between precision and recall. In our evaluation we set $\beta = 1$, since we give precision and recall the same weight; thus the $F_1$-score is adopted.

In multi-class classification, we use the Macro $F_1$-score to evaluate the result:

$$\mathrm{Macro}\ F_1 = \frac{1}{n} \sum_{i=1}^{n} F_{1,i},$$

where $n$ is the number of classes and $F_{1,i}$ is the $F_1$-score on the $i$-th category.

G-mean is another recognized metric derived from the confusion matrix:

$$\mathrm{G\text{-}mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}.$$

In multi-class classification, we also use the Macro G-mean to measure the balanced performance of a learning model:

$$\mathrm{Macro\ G\text{-}mean} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{G\text{-}mean}_i,$$

where $n$ is the number of classes and $\mathrm{G\text{-}mean}_i$ is the G-mean on the $i$-th category.
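The metrics of this section can be computed from predicted and true labels as sketched below. Each class is treated in turn as the positive class in a one-vs-rest confusion matrix, and the macro scores are unweighted averages over classes. The function names are hypothetical, and returning 0 when a denominator is zero is a convention assumed for the example.

```python
import math


def binary_counts(y_true, y_pred, positive):
    """One-vs-rest confusion-matrix counts (TP, FP, FN, TN) for one class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta=1.0):
    """F-score; beta=1 weights precision and recall equally."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def g_mean(tp, fp, fn, tn):
    """Geometric mean of the true positive rate and the true negative rate."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def macro_scores(y_true, y_pred):
    """Return (Macro F1, Macro G-mean) averaged over the classes in y_true."""
    classes = sorted(set(y_true))
    f1s, gms = [], []
    for c in classes:
        tp, fp, fn, tn = binary_counts(y_true, y_pred, c)
        f1s.append(f_beta(tp, fp, fn))
        gms.append(g_mean(tp, fp, fn, tn))
    return sum(f1s) / len(f1s), sum(gms) / len(gms)
```

Because each class contributes equally to the average, a rare class that is classified poorly lowers the macro scores as much as a frequent one would.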

IV Experiments

IV-A Dataset details

Reference-free classification and averaging were tested on a dataset consisting of 125 TCP-1 ring complex (TRiC) subtomograms, 386 single capped proteasome (proteasome_s) subtomograms, 1043 double capped proteasome (proteasome_d) subtomograms, and 80 ribosome subtomograms extracted from a tomogram of a rat neuron expressing poly-GA aggregates. All subtomograms were binned twice to size

(voxel size: 1.368 nm). The tilt angle range was to .

Our Cryo-ET dataset contains four types of macromolecules: proteasome_d, ribosome, TRiC, and proteasome_s. Their imbalanced ratios are shown in Table II. Figure 3 shows 2D visualizations of the 3D macromolecules.

Classes in Cryo-ET   path 1           path 2           path 3           path 4
                     F1     G-mean    F1     G-mean    F1     G-mean    F1     G-mean
proteasome_d         71.7   81.2      68.3   79.2      69.0   78.5      70.8   80.2
ribosome             69.7   78.1      74.2   82.6      72.3   79.4      71.4   77.6
TRiC                 72.6   81.2      71.7   79.8      74.8   83.8      74.1   82.4
proteasome_s         71.2   80.4      69.2   78.5      72.7   81.9      73.3   83.0
TABLE III: Binary classification by four different paths
                                         Cryo-ET
Model                                    Macro F1   Macro G-mean
Multi-path CNN                           68.1       78.1
Multi-path CNN with boosting             69.3       80.3
Multi-path CNN with bagging              69.0       79.8
Multi-path CNN with SMOTE                69.7       78.6
Multi-path CNN with undersampling        69.1       77.3
Multi-path CNN with GP                   70.4       78.2
Multi-path CNN with mixup                70.9       80.2
Multi-path CNN with focal loss           71.1       80.7
Multi-path CNN with SMOTE + boosting     70.1       78.2
Multi-path CNN with SMOTE + bagging      71.4       80.2
Multi-path CNN with SMOTE + GP           72.6       81.5
Multi-path CNN with mixup + focal loss   73.6       84.7
TABLE IV: Multi-path CNN combined with different strategies on imbalanced Cryo-ET

In order to train and test our multi-path CNN, we shuffle the dataset and split it into two parts: 1307 samples in the training set and 327 samples in the testing set.
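A shuffle-and-split step of this kind can be sketched as follows. The helper name `shuffle_split`, the fixed seed, and the 80/20 fraction are assumptions. The fraction is not stated in the text, but it reproduces the reported sizes for 1634 samples.

```python
import random


def shuffle_split(items, train_fraction=0.8, seed=0):
    """Shuffle a dataset and cut it into a training part and a testing part."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(train_fraction * len(items))  # samples kept for training
    return items[:cut], items[cut:]


# 1634 subtomograms in total (125 + 386 + 1043 + 80)
train, test = shuffle_split(range(1634))
```

This sketch does not stratify by class, so the class ratios in the two parts can differ slightly from those of the full dataset.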

IV-B Baseline Methods

IV-B1 Bagging

[30] introduced the concept of bootstrap aggregating to construct ensembles. It consists of training different classifiers on bootstrapped replicas of the original training dataset. That is, a new dataset is formed to train each classifier by randomly drawing instances (with replacement) from the original dataset, usually maintaining the original dataset size. Diversity is thus obtained through resampling, since each classifier sees a different data subset. Finally, when an unknown instance is presented to each individual classifier, a majority or weighted vote is used to infer the class [31].
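Bootstrap aggregating with a majority vote can be sketched as below. The base learner `fit_centroid` is a toy one-dimensional nearest-centroid classifier introduced only to make the example runnable. It is not the classifier used in the experiments, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_replica(dataset, rng):
    """Draw len(dataset) instances with replacement from the original data."""
    return [rng.choice(dataset) for _ in dataset]


def fit_centroid(data):
    """Toy base learner: nearest class centroid on 1-D inputs (illustrative only)."""
    groups = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(v) / len(v) for y, v in groups.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def train_bagging(dataset, fit, n_estimators=7, seed=0):
    """Train one classifier per bootstrapped replica of the training data."""
    rng = random.Random(seed)
    return [fit(bootstrap_replica(dataset, rng)) for _ in range(n_estimators)]


def predict_majority(models, x):
    """Infer the class of x by an unweighted majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each replica keeps the original dataset size but leaves out some instances and repeats others, which is what makes the individual classifiers differ.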

IV-B2 Boosting

Boosting (also known as ARCing, adaptive resampling and combining) was introduced by Schapire in 1990 [32, 33, 34]. Schapire proved that a weak learner (one that is slightly better than random guessing) can be turned into a strong learner in the sense of the probably approximately correct (PAC) learning framework. AdaBoost [35] is the most representative algorithm in this family: it was the first practical boosting approach, and it has been named one of the top ten data mining algorithms [36].
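The core of AdaBoost is the re-weighting of training examples after each weak learner. The sketch below shows one round of the standard discrete AdaBoost update for labels in {-1, +1}. It illustrates the general algorithm, not the boosting configuration used in the experiments, and the function name is hypothetical.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One discrete AdaBoost update.

    weights     : current example weights, assumed to sum to 1
    labels      : true labels in {-1, +1}
    predictions : weak-learner outputs in {-1, +1}
    Returns the learner's vote weight and the renormalized example weights.
    """
    # Weighted error of the weak learner, clipped to keep the logarithm finite.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    error = min(max(error, 1e-12), 1.0 - 1e-12)
    vote = 0.5 * math.log((1.0 - error) / error)

    # Misclassified examples (y * h = -1) gain weight, correct ones lose weight.
    updated = [w * math.exp(-vote * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return vote, [w / total for w in updated]
```

After the update, the misclassified examples together carry half of the total weight, so the next weak learner is pushed to focus on them.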

                                         Cryo-ET
Model                                    Macro F1   Macro G-mean
Multi-path CNN with mixup + focal loss   73.6       84.7
GSVM-RU                                  69.5       80.3
TABLE V: Comparison between multi-path CNN model and traditional method
                                         Cryo-ET
Model                                    Macro F1   Macro G-mean
Four-path CNN with mixup + focal loss    73.6       84.7
GSVM-RU (based on four classes)          69.5       80.3
Three-path CNN with mixup + focal loss   74.3       85.5
GSVM-RU (based on three classes)         70.1       81.6
Two-path CNN with mixup + focal loss     76.4       87.2
GSVM-RU (based on two classes)           71.2       81.9
TABLE VI: Comparison of different number of paths in multi-path CNN

IV-B3 Genetic Programming (GP)

GP [37] is an evolutionary algorithm technique, inspired by biological evolution, for finding computer programs that perform a user-defined task. It can evolve biased classifiers when datasets are unbalanced. In GP, programs representing different solutions to a problem are combined with other programs to create new, hopefully better programs; this process is repeated over a number of generations until a good solution is evolved [38, 39, 40]. The GP methods proposed in [41] use the unbalanced data "as is" in the learning phase, requiring no prior knowledge about the problem domain, to evolve classifiers with good classification ability on both minority and majority classes.

IV-C Identification of each path in Multi-path CNN

Before carrying out the experiments on the whole model in Figure 2, we carried out many experiments to identify a suitable structure for each path in the multi-path CNN. We name the four paths path 1, path 2, path 3 and path 4, from left to right in Figure 2. Each path has the best binary classification result on one of the four classes in the Cryo-ET data. For example, as shown in Table III, path 1 performs best on proteasome_d, while path 3 has the best result on the TRiC class. In the experiments on each single path, we reduce the multi-class classification problem to binary classification: the CNN branch in path 1 only decides whether the input belongs to the proteasome_d class, and path 3 only decides whether the input belongs to the TRiC class.

Fig. 4: Multiclass classification achieved by binary classifier
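One common way to obtain a multi-class decision from binary classifiers is the one-vs-rest scheme sketched below, where each class has its own binary scorer and the class with the highest score wins. This is an assumption made for illustration, and the exact scheme in Figure 4 may differ. The function name and the scorers in the usage note are hypothetical stand-ins.

```python
def one_vs_rest_predict(x, binary_scorers):
    """Pick the class whose binary classifier gives the highest score.

    binary_scorers maps a class name to a function that returns a score
    (for example a probability) that x belongs to that class.
    """
    scores = {name: scorer(x) for name, scorer in binary_scorers.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

For example, with one toy scorer per class such as `{"proteasome_d": lambda x: 1 - abs(x - 0), "ribosome": lambda x: 1 - abs(x - 1)}`, an input of `0.9` is assigned to `ribosome` because that scorer returns the larger value.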

IV-D Multi-path CNN with traditional strategies on imbalanced dataset

We combined the multi-path CNN with recognized and effective strategies for imbalanced data, such as boosting, bagging, SMOTE, focal loss and mixup. The results are shown in Table IV.

We also compared the multi-path CNN with a traditional classifier, the Granular Support Vector Machine with Repetitive Undersampling (GSVM-RU) [47, 48]. Besides SVM modeling, GSVM-RU adds another hyper-parameter G, the number of negative granules. To handle multi-class classification with a binary classifier, we use the method in Figure 4 to decompose the multi-class problem into binary problems. The experimental results are shown in Table V.

Several pairs of experiments were also carried out to show that our model can be adjusted to two-class or three-class classification problems, rather than being restricted to four classes. We run GSVM-RU on the same numbers of classes as a baseline for comparison. The results are shown in Table VI.

IV-E Discussion

Cryo-ET has become a powerful tool for 3D visualization of cellular components at sub-molecular resolution and in a near-native state [50]. However, imbalanced classification in cellular tomograms is difficult due to the high complexity of the image content and imaging limitations. To complement existing methods, we propose a multi-path CNN combined with mixup and focal loss, which achieves the best classification result on the imbalanced Cryo-ET data among the strategies we tested. The experimental results above demonstrate the power of our approach and also indicate that, by changing the number of paths, the model can be adapted to imbalanced classification problems with different numbers of classes. The work provides useful steps for imbalanced classification in cell tomography. To the best of our knowledge, our work is the first application of a CNN-based network with focal loss and mixup to Cryo-ET data analysis. Our approach is a useful complement to current technology.

V Conclusion

In this paper, we apply methods for dealing with imbalanced data to the classification of cellular macromolecular complexes for the first time, which opens up a new path for cell classification in the field of computational biology. To address the imbalanced data problem in Cryo-ET, we propose a multi-path CNN model and combine it with recognized strategies for data imbalance, such as sampling, bagging, boosting and genetic programming. We also evaluate combinations of these methods with each other and with our model. The multi-path CNN model consists of several independent paths, each of which performs best on one class. By adjusting the number of paths in the model, we can handle classification problems with different numbers of classes. Experiments and comparisons with traditional classifiers show that the model works effectively on the imbalanced Cryo-ET data. In the future, we will consider further problems at the intersection of computing and biology to promote the development of computational biology.

VI Acknowledgement

This work was supported in part by U.S. National Institutes of Health (NIH) grant P41 GM103712.

References