
Weak-supervision for Deep Representation Learning under Class Imbalance

by Shin Ando, et al.

Class imbalance is a pervasive issue among classification models, including deep learning models, whose capacity to extract task-specific features deteriorates in imbalanced settings. However, the challenge of handling imbalance among a large number of classes, which deep learning commonly addresses, has not received significant attention in previous studies. In this paper, we propose an extension of the deep over-sampling framework that exploits automatically-generated abstract-labels, a type of side-information used in weak-label learning, to enhance deep representation learning against class imbalance. We exploit the labels to guide the deep representations of instances towards different subspaces, inducing a soft-separation of the inherent subtasks of the classification problem. Our empirical study shows that the proposed framework achieves substantial improvements on image classification benchmarks with imbalance among large and small numbers of classes.





1 Introduction

Advances in deep learning models, which enable automatic extraction of discriminative features from massive amounts of labeled data, have eased the burden of hand-engineering features in many classification applications. The preparation of ground-truth labels, in turn, has become critical for those applications, and in cases where its cost is too steep, techniques that exploit additional information, such as transfer learning and weakly-supervised learning, are employed.

The problem of class imbalance can occur in cases where the preparation of labeled data is difficult for specific classes. Imbalanced settings typically degrade the retrieval measures for the minority classes [2, 7], as well as the representation learning of deep neural nets [1].

On the topic of class imbalance, a large portion of previous studies have focused on binary cases and multi-class cases with fewer than ten classes. However, deep learning models commonly address problems with a much larger number of classes, where the impact of imbalance is much more difficult to handle. In this paper, we attempt to leverage a type of weak-label to address class imbalance over a large number of classes, e.g., up to one hundred. Weak-labels are side-information used in weakly-supervised learning to complement a limited amount of labeled data. They are generated automatically or by inexpensive means such as labeling functions [10] and crowdsourcing, and are usually of low quality or abstract level [13].

In this paper, we consider external knowledge in the form of abstract-labels assigned to every training instance, providing a categorization of the original classes. For example, training instances with classes {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck} may be categorized with abstract-labels {animals, vehicles} using a set of rules: {bird, cat, deer, dog, frog, horse} → animals, {airplane, automobile, ship, truck} → vehicles. We assume that, as in this example, each class is categorized by only one label.
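Such a rule set amounts to a deterministic mapping from classes to abstract-labels. A minimal sketch, with the hypothetical names `TO_ABSTRACT` and `abstract_label` (not from the paper), could look like:

```python
# Hypothetical labeling rules mapping each CIFAR-10 class to one abstract-label.
TO_ABSTRACT = {
    "bird": "animals", "cat": "animals", "deer": "animals",
    "dog": "animals", "frog": "animals", "horse": "animals",
    "airplane": "vehicles", "automobile": "vehicles",
    "ship": "vehicles", "truck": "vehicles",
}

def abstract_label(y):
    """Deterministic labeling function: class -> abstract-label."""
    return TO_ABSTRACT[y]
```

Because the mapping is a function of the class alone, each class falls under exactly one abstract-label, as assumed above.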

Such a categorization can contain relevant information regarding the hierarchical structure of the classes, and provide useful guidance for learning the structure of the deep representation to counter the effect of class imbalance, which may manifest as over- or under-estimation of class boundaries. Our intuition for exploiting such abstract-labels, to this end, is to acquire deep representations which induce a separation of the subtasks, i.e., discriminating among the subset of classes categorized under each label. We implement a framework that associates an independent subspace of the deep features with each label and guides the instances towards targets projected onto the corresponding subspaces.

We build on the framework of Deep Over-sampling (DOS) [1], which drew inspiration from classic synthetic minority over-sampling [3], which re-balances the class distribution by augmenting the training data with synthetic minority-class instances sampled from the neighborhood of existing ones. DOS integrates re-sampling into deep learning by implementing an additional back-propagation for the output of the embedding layers, to directly guide its representation learning.

The caveat of weakly-supervised learning is that the weak-labels may not always have the ideal granularity or hierarchical structure. That is, enforcing instances onto orthogonal subspaces that reflect the abstract-label categorization may not be entirely beneficial. Instead, we attempt to induce a soft-separation of subtasks, using the gradient of the squared-sum error to gradually separate the representations of different labels in terms of cosine distance. The proposed framework also benefits from the multi-task learning setting, in which the network parameters are simultaneously trained on the standard class prediction, which can counteract the detrimental aspects of weak-supervision.

2 Background

2.1 Class imbalance

Class imbalance is a practical issue in which a large discrepancy in the number of samples among classes causes the learning algorithm to over-generalize for the majority classes. Its effect on the retrieval measures for the minority classes, which are of primary interest in many applications, is critical.

The typical approaches to counter class imbalance include re-sampling, instance-weighting, and cost-sensitive learning. The re-sampling approach directly addresses the imbalance by over- or under-sampling the training data. Synthetic Minority Over-sampling (SMOTE) is a popular over-sampling method, which has worked successfully with many traditional classification models.
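As a reminder of the mechanism DOS builds on, SMOTE generates each synthetic sample on the line segment between a minority instance and one of its in-class nearest neighbors. A minimal sketch (the function name is ours, not SMOTE's reference implementation):

```python
import numpy as np

def smote_sample(x, x_neighbor, rng):
    """One SMOTE-style synthetic sample: a random point on the line
    segment between a minority instance and an in-class neighbor."""
    lam = rng.uniform()                 # interpolation factor in [0, 1)
    return x + lam * (x_neighbor - x)
```

Repeating this over many instance/neighbor pairs inflates the minority class until the class distribution is balanced.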

It was demonstrated in [1] that applying SMOTE to the deep representation acquired by a convolutional neural net (CNN) under the effect of imbalance does not yield as much merit as it does with hand-engineered features. To address this issue, they proposed Deep Over-sampling (DOS), which incorporates the sampling of the minority class into the deep learning process. The synthetic samples are used as supervising targets for the representation learning, providing additional feedback to the embedding layers of the CNN to improve the in-class and inter-class separation of the deep representation.

With regard to the number of classes, previous studies on class imbalance have mainly focused on binary classification and cases with a small number of classes. Imbalance among a large number of classes, meanwhile, has received limited attention. In recent surveys, class imbalance problems have been categorized as binary or multi-class, and only a few multi-class methods have addressed more than ten classes [6, 7, 5].

2.2 Weakly-supervised Learning

Supervised learning, deep learning especially, requires a large amount of labeled data, which can be costly in some applications. Weakly-supervised learning exploits various side-information from crowdsourcing, heuristic labeling, external knowledge bases, etc., to achieve better performance. Techniques such as semi-supervised learning, multi-instance learning, and learning with noisy labels are employed to address various conditions of weak-labels, including coverage, granularity, or accuracy.

Heuristic labeling, as opposed to hand-curated annotation, requires little cost for assigning weak-labels to the training data. In [11], labeling functions written by domain experts were used to generate labels specifying the hierarchy of sub-tasks in image and document classification. The relation between the weak-labels and the ground-truth labels was captured using an exponential-family generative model, which was integrated into training the discriminative model.

As mentioned in the previous section, we consider abstract-labels that categorize the original classes and are relevant to the task at hand, such that dividing the classes into subsets by abstract-label can reduce the complexity of the classification problem. As with other weakly-supervised learning scenarios, however, we take into account that the hierarchical structure of the labels may not be ideal and may introduce added complexity.

2.3 Preliminary Results

To demonstrate the motivation of our study, we conducted an experiment with artificially imbalanced settings. We modified the CIFAR-10 image benchmark [8] by removing 80% of the samples from selected classes. Then, we trained a CNN consisting of VGG16 layers with pre-trained weights [12] and two fully-connected layers with randomly initialized weights.

The following settings were compared in our analysis: (1) maintaining 100% of the samples of all classes (Full Data), (2) maintaining 100% of the samples in six classes and removing 80% of the samples from four classes (Imbalanced), (3) removing 80% of the samples from all classes (Balanced). In setting (2), we refer to the four classes from which we removed samples as minority classes and the rest as majority classes. For evaluation, each experiment was repeated ten times from different initial parameters. The four minority classes and the removed samples were chosen randomly for each repetition.

In Table 1, we compare the accuracies, which are the standard measure under the full data. The drop-off from (1) to (2) indicates the impact of class imbalance. We note that the parameters of the VGG16 net were pre-trained on a balanced dataset, which reduces the impact of imbalance; the discrepancy is much larger when all parameters are learned from the imbalanced training set. Furthermore, the significant difference between (2) and (3) suggests that imbalance in sample sizes can be detrimental even when the total number of samples is larger.

Table 1: Accuracies — columns: (1) Full Data, (2) Imbalanced Reduction, (3) Balanced Reduction

Table 2: Class-wise precision/recall — rows: Majority, Minority, Balanced; columns: precision, recall

For further investigation, we evaluated the class-wise precision and recall in (2) and (3), as shown in Table 2. The first two rows show the measurements on the majority and the minority classes from (2), and the third row shows the measurements from (3). The impact of class imbalance shows strongly in the recall of the minority classes and the precision of the majority classes, which are significantly worse than in (3), despite (3) having a smaller but balanced number of samples.

From the preliminary results, we found that class imbalance can affect the discriminative power over the majority classes as well as the minority classes, and that addressing the imbalance can matter more than collecting additional majority-class samples.

3 Deep Subspace Sampling

3.1 Basic Definitions

We denote the layers of the CNN as two groups: the embedding layers and the classification layers. The former projects the input onto the deep feature space, and the latter makes the class prediction from the feature vectors. We denote the function of the embedding layers as $f: \mathcal{X} \to \mathbb{R}^d$, where $\mathcal{X}$ is the domain of input data and $d$ is the dimensionality of the deep feature space. The function of the classification layers, whose output is the vector of class probabilities over $C$ classes, is denoted as $g: \mathbb{R}^d \to [0,1]^C$.

The training data is a set of input/output pairs denoted by $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ and the output $y_i$ takes a value from a set of classes $\mathcal{Y}$. Additionally, weak-supervision is provided by a deterministic labeling function $\ell: \mathcal{Y} \to \mathcal{A}$, where $\mathcal{A}$ denotes a set of abstract-labels. We denote the abstract-label of the $i$-th instance by $a_i$, i.e., $a_i = \ell(y_i)$.

Let $Z = \{z_i = f(x_i)\}_{i=1}^{n}$ denote the set of projections of the input by the tentative embedding function $f$. We define the subset of $Z$ belonging to class $y$ as $Z_y = \{z_i : y_i = y\}$. The subset of projections with abstract-label $a$ is denoted by $Z^a = \{z_i : \ell(y_i) = a\}$.

3.2 Deep Over-sampling

Following the DOS framework [1], we implement an in-class neighborhood sampling over the deep projections to generate a multi-task training set. For each $z_i$ in $Z$, a subset of $k$ in-class neighbors, $N(z_i) = \{z_{i,1}, \dots, z_{i,k}\} \subset Z_{y_i}$, is selected. We generate each instance of the multi-task training set as a tuple $(x_i, y_i, N(z_i), w_i)$, where $(x_i, y_i)$ is the original input/output, $N(z_i)$ its neighbors, and $w_i = (w_{i,1}, \dots, w_{i,k})$ random weights which sum to 1, i.e., $\sum_{j=1}^{k} w_{i,j} = 1$.

By randomly sampling the weights $w_i$, we can generate different tuples from a common original input/output $(x_i, y_i)$, and we generate the training set such that the sample sizes are balanced among all classes.
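The two ingredients of this sampling step can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; it assumes the query vector is itself a row of the in-class set, and draws the simplex weights via normalized exponentials:

```python
import numpy as np

def in_class_neighbors(z, Z_class, k):
    """k nearest in-class neighbors of z (Euclidean distance); assumes z is
    a row of Z_class, so the closest match (z itself) is skipped."""
    dist = np.linalg.norm(Z_class - z, axis=1)
    order = np.argsort(dist)
    return Z_class[order[1:k + 1]]

def random_simplex_weights(k, rng):
    """Random non-negative weights w_1..w_k with sum(w) = 1."""
    w = rng.exponential(size=k)
    return w / w.sum()
```

Drawing fresh weights for each tuple is what lets a single minority-class instance yield many distinct training tuples.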

The network architecture for multi-task learning includes two outputs: one at the classification layers and the other at the embedding layers. The back-propagation for the former output is implemented with a standard cross-entropy loss on the class prediction. For the latter, a squared-sum loss is defined as follows:

$$\mathcal{L}(x_i) = \Big\| f(x_i) - \sum_{j=1}^{k} w_{i,j}\, z_{i,j} \Big\|^2 \qquad (1)$$

The loss (1) is minimized when the deep representation of the original input is at an interpolation of its neighbors. It thus sets a target for the embedding function, which guides the representation towards the class mean, as the local means distribute closer to the class mean than the original samples. The DOS framework induces smaller in-class variance by iteratively updating the representation with this process.
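The loss above is a plain squared distance to a convex combination of neighbors. A minimal numpy sketch (function name ours):

```python
import numpy as np

def embedding_loss(f_x, neighbors, w):
    """Squared-sum loss of Eq. (1): distance from f(x) to the weighted
    interpolation of its in-class neighbors."""
    target = w @ neighbors              # sum_j w_j * z_j
    return float(np.sum((f_x - target) ** 2))
```

When `f_x` already equals the interpolation, the loss is zero, which is exactly the fixed point the iterative updates push towards.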

3.3 Subspace Selection

The proposed framework builds on DOS to exploit abstract-labels, by guiding instances of different labels towards independent subspaces. This section describes two approaches for selecting such subspaces: fixed subspace allocation and supervised subspace selection.

Let $S^a$ denote the subspace corresponding to the label $a$ and $B^a$ its basis. For simplicity, we define the dimensionality of all subspaces to be $d' = \lfloor d / |\mathcal{A}| \rfloor$, such that $d'|\mathcal{A}| \le d$. The fixed subspace allocation simply assigns a disjoint subset of $d'$ variables to each label, defining, in turn, a subspace in which all complementary variables are zero. The basis $B^a$ consists of the standard basis vectors of the variables assigned to label $a$.
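Under the fixed allocation, projecting onto a label's subspace reduces to a coordinate mask. A small sketch under the assumptions above (contiguous blocks of $d' = d/|\mathcal{A}|$ coordinates per label; the function name is ours):

```python
import numpy as np

def fixed_allocation_mask(d, num_labels, a):
    """0/1 mask selecting the block of coordinates allocated to abstract-label
    a (0-based), with d // num_labels variables per label."""
    d_sub = d // num_labels
    mask = np.zeros(d)
    mask[a * d_sub:(a + 1) * d_sub] = 1.0
    return mask
```

Multiplying a deep feature vector by this mask zeroes the complementary variables, i.e., it projects the vector onto $S^a$.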

In the supervised subspace selection, we attempt to find subspaces that preserve the discriminative information relevant to the subtask of each label. To that end, we adopt a supervised dimensionality reduction method, such as linear discriminant analysis, as a subroutine in the following process. The method takes $Z^a$ as input and returns a primary component $u$, such as the first eigenvector, that maximizes its classification objective for the subset of classes $\{y \in \mathcal{Y} : \ell(y) = a\}$.
A brief description of the supervised subspace selection is given as follows.

  1. Initialize the basis $B^a = \emptyset$ for each $a \in \mathcal{A}$.

  2. Set $t = 1$.

  3. For $t \le d'|\mathcal{A}|$:

  4. Select a label $a$.

  5. Compute a primary component $u_t$ by supervised dimensionality reduction over $Z^a$.

  6. Update $B^a \leftarrow B^a \cup \{u_t\}$.

  7. Update $Z$ with a projection of each $z$ such that $z \leftarrow z - (u_t^\top z)\, u_t$.

  8. Back to step 3.

In essence, this process iteratively allocates a discriminative component to a subspace and removes it from the original or residual representation. In step 4, the label is selected randomly and evenly so that each label is chosen $d'$ times over the entire repetition; thus $t \le d'|\mathcal{A}|$ at step 3. Steps 3-8 are not necessarily repeated until $t = d$, since $d'|\mathcal{A}| \le d$ in general. In such a case, the basis of the residual subspace can simply be appended to the bases of all subspaces.
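The component-then-deflate loop can be sketched with numpy. As a stand-in for the supervised reducer (the paper suggests, e.g., linear discriminant analysis), this sketch uses the leading eigenvector of the between-class scatter; both function names are ours:

```python
import numpy as np

def primary_component(Z, y):
    """Stand-in supervised reducer: leading eigenvector of the between-class
    scatter of Z for the classes under one abstract-label."""
    mu = Z.mean(axis=0)
    Sb = np.zeros((Z.shape[1], Z.shape[1]))
    for c in np.unique(y):
        diff = Z[y == c].mean(axis=0) - mu
        Sb += np.sum(y == c) * np.outer(diff, diff)
    vals, vecs = np.linalg.eigh(Sb)
    return vecs[:, -1]                  # eigenvector of the largest eigenvalue

def deflate(Z, u):
    """Step 7: remove unit component u from every projection, z <- z - (u.z)u."""
    return Z - np.outer(Z @ u, u)
```

After deflation, the residual representations are orthogonal to the allocated component, so later iterations cannot re-select it.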

Fig. 2 illustrates our intuition for guiding the deep representation towards independent subspaces. The blue markers indicate the tentative vector representations in the deep feature space. The instances come from different classes, as indicated by the callout texts and the shapes of the markers. Let us assume that the abstract-label of the airplane and ship classes, indicated by solid markers, is label 1, and that of the fish and bird classes, indicated by circled markers, is label 2. Let the $x$- and $y$-axes represent the subspaces associated with labels 1 and 2, respectively.

The red markers indicate the synthetic targets generated from in-class neighbors. The targets for label 1-classes are generated on subspace 1 and those for label 2-classes are generated on subspace 2. Descending the gradient of the squared-sum error, their representations are updated to be closer to the red markers. The green markers indicate the updated vector representation.

By guiding the representations towards these subspaces, we aim to induce a structure in which the subtasks can be addressed more independently, even if the representations are not strictly orthogonal, as the cosine distances among classes assigned to different labels can grow larger.
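The soft-separation effect can be illustrated numerically: two toy feature vectors, each concentrated (but not exactly contained) in a different coordinate subspace, are already nearly orthogonal in terms of cosine distance. The vectors below are invented for illustration:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy deep features after training: each class concentrated on the subspace
# of its abstract-label (first two vs. last two coordinates).
a = np.array([0.9, 0.8, 0.1, 0.0])    # a label-1 class
b = np.array([0.1, 0.0, 0.7, 0.9])    # a label-2 class
```

Here `cosine_distance(a, b)` is close to 1 even though neither vector lies strictly inside its subspace, which is the soft-separation the framework aims for.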

3.4 Multi-task learning

This section describes the multi-task learning framework which integrates representation learning using the subspaces described in the previous section. The overview of the framework is illustrated in Fig. 1.

Figure 2: Subspace Projections

Figure 1: DS3 Framework

On the left side of the figure, the basic architecture of a deep neural net, comprised of the embedding layers and the classification layers, is shown. Their outputs are the deep representation and the class prediction, respectively. The input and the output make up the original training instance, and single-task learning can be conducted with back-propagation from the prediction error.

On the right side of Fig. 1, the components related to representation learning with re-sampling and weak-supervision are shown. The synthetic target and the weak-label are additionally included in the training instance for multi-task learning. Another back-propagation, explicitly for the embedding layers, is prompted by the difference between the synthetic target and the deep representation. The proposed framework is referred to as Deep SubSpace Sampling (DS3).

The training instance for multi-task learning is a tuple $(x_i, y_i, a_i, N(z_i), w_i)$, where $a_i = \ell(y_i)$. The loss function for the classification function $g$ is the standard cross-entropy loss

$$\mathcal{L}_1(x_i, y_i) = -\log g_{y_i}(f(x_i)) \qquad (3)$$

Additionally, the loss function for the embedding function $f$ is defined as a weighted mean squared-sum error

$$\mathcal{L}_2(x_i, a_i) = \gamma \Big\| f(x_i) - \sum_{j=1}^{k} w_{i,j}\, P^{a_i} z_{i,j} \Big\|^2 \qquad (4)$$

where $P^{a_i} z$ denotes the projection of the deep feature vector $z$ onto the subspace $S^{a_i}$ and $\gamma$ is a trade-off coefficient against the first loss in (3).
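For the fixed allocation, the projection $P^{a}$ is just a coordinate mask, so the embedding loss can be sketched directly (a minimal illustration; the function name is ours):

```python
import numpy as np

def ds3_embedding_loss(f_x, neighbors, w, mask, gamma):
    """Eq. (4): weighted squared error between f(x) and the interpolation of
    in-class neighbors projected onto the label's subspace (a 0/1 mask)."""
    target = w @ (neighbors * mask)     # sum_j w_j * P^a z_j
    return gamma * float(np.sum((f_x - target) ** 2))
```

Unlike Eq. (1), the target now lies inside the label's subspace, so descending this loss pulls the representation both towards its in-class neighborhood and towards the subspace.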

Recalling that $w_i$ is a randomized weight vector, one of the merits of (4) is inducing robustness against overfitting, similar to adding noise to the outputs of different layers. Another merit comes from guiding the instances toward an interpolation of the in-class neighbors and thus closer to the class mean. Subsequently, it can increase the inter-class discrepancies in the deep feature space.

The two back-propagations based on the above loss functions constitute a multi-task learning of standard classification learning and explicitly supervised representation learning. Note that while the propagation from (3) updates the parameters of all layers, (4) only affects those of the embedding layers.

3.5 Subspace Sampling Algorithm

The Deep Subspace Sampling framework combines the merits of over-sampling and explicitly supervised representation learning with weak-supervision by abstract-labels. Its multi-task learning framework allows for (a) augmenting the training set with synthetic projections of the minority-class samples, (b) inducing robustness with randomized targets, and (c) separating subspaces to acquire discriminative information for the different subtasks.

The overview of the algorithm is shown in Algorithm 1.

1:  Input: Training set $D$, class-wise over-sampling size $m$, abstract-labels $\mathcal{A}$, # of training rounds $R$
2:  Output: A trained CNN
3:  function SubspaceSelect: subspace selection method
4:  Method:
5:  Initialize CNN by single-task learning with $D$
6:  for $r = 1, \dots, R$ do
7:     Compute projections $Z$ from $f$
8:     for each $a \in \mathcal{A}$ do
9:        $B^a \leftarrow$ SubspaceSelect$(Z^a)$
10:     end for
11:     Set $D' = \emptyset$
12:     for each class $y \in \mathcal{Y}$ do
13:        Set resampling size $m_y$
14:        for $m_y$ samples $(x_i, y)$ do
15:           $N(z_i) \leftarrow$ NeighborhoodSampling$(z_i, Z_y)$
16:           Generate random weight $w_i$
17:           Add $(x_i, y, a_i, N(z_i), w_i)$ to $D'$
18:        end for
19:     end for
20:     Update CNN by multi-task learning with $D'$
21:  end for
Algorithm 1 Deep Subspace Sampling algorithm

The initial embedding and classifier functions are obtained by standard training of a CNN with the original, imbalanced data at line 5. The basis for each subspace is updated after each update of the CNN at lines 8-10, assuming that the subspace selection method is the supervised selection. If the subspace selection method is the fixed allocation, it can be executed once at initialization, before the start of the outer loop. The synthetic targets are re-computed at each iteration, as the in-class neighbors are updated with the deep representation.
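One detail the algorithm leaves implicit is how the class-wise resampling size (line 13) balances the multi-task training set. A plausible sketch, assuming every class is over-sampled up to the size of the largest class (our reading, not a quote from the paper):

```python
from collections import Counter

def resampling_sizes(y):
    """Class-wise resampling sizes: number of extra tuples to generate per
    class so every class reaches the size of the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    return {c: target - n for c, n in counts.items()}
```

With this choice, majority classes receive few or no synthetic tuples while minority classes receive many, yielding a balanced $D'$.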

The trade-off coefficient $\gamma$ is an important hyper-parameter which controls the speed of the descent towards orthogonal representations; it can be detrimental if it is too strong relative to the classification learning. Its value was selected empirically, as described in the next section. The computationally intensive operations in this process, outside of deep learning, are the dimensionality reduction at line 9 and the neighborhood sampling at line 15. A practical run-time analysis is also provided in the next section.

4 Empirical Results

The empirical study is organized in three parts: (1) comparison between the two subspace selection methods, (2) sensitivity analysis on essential parameters, and (3) comparative analysis with baseline methods.

4.1 Datasets

Four image classification benchmarks were used in this experiment: CIFAR-10/100 [8], STL-10 [4], and SVHN [9]. The properties of the datasets are summarized in Table 3. All results are reported on the default test split.

Dataset     #channels x size   #images per class (train/test)   #classes/#abstract-labels
CIFAR-10    3 x 32x32          5000/1000                        10/2
SVHN        3 x 32x32          7000/2000                        10/2
STL-10      3 x 96x96          500/800                          10/2
CIFAR-100   3 x 32x32          500/100                          100/20

Table 3: Summary of Datasets

For the CIFAR-10 and STL-10 datasets, the abstract-labels {animals, vehicles} were assigned by semantic rules (CIFAR-10: {airplane, automobile, ship, truck} → vehicle, {bird, cat, deer, dog, frog, horse} → animal; STL-10: {airplane, car, ship, truck} → vehicle, {bird, cat, deer, dog, horse, monkey} → animal). For the CIFAR-100 dataset, the twenty super-classes in the original dataset were used as abstract-labels. For SVHN, the labels {odd, even} were assigned according to the class digits. With this set of labels, we measure the effect of abstract-labels that do not provide relevant information for the task.

The imbalanced settings were set up by randomly selecting 50% of the classes under each abstract-label to be minority classes, and removing 80% of their samples, also at random.
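The construction of this artificial imbalance can be sketched as follows (a minimal illustration with hypothetical names; it returns the indices of the kept samples and the chosen minority classes):

```python
import numpy as np

def make_imbalanced(y, label_of, keep_frac=0.2, seed=0):
    """Artificial imbalance: under each abstract-label, half of the classes
    become minority classes, keeping only keep_frac of their samples."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    minority = []
    for a in sorted({label_of[int(c)] for c in classes}):
        cs = [int(c) for c in classes if label_of[int(c)] == a]
        minority += list(rng.choice(cs, size=len(cs) // 2, replace=False))
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if c in minority:
            idx = rng.choice(idx, size=int(len(idx) * keep_frac), replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep)), sorted(int(c) for c in minority)
```

Selecting minority classes per abstract-label, rather than globally, ensures that every subtask contains both majority and minority classes.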

4.1.1 Settings and Evaluation

The same CNN architecture is trained by standard single-task learning, by the DOS framework, and by the DS3 framework. The first two models provide the baselines for the comparative analysis. For the CIFAR-10/100 and STL-10 datasets, we employed the architecture of VGG16 [12] joined to two fully-connected layers of randomly initialized weights, or C64-C128-C256-C512-C512-F$d$-F$C$, where $d$ is the dimensionality of the deep representation vectors and $C$ is the number of classes. For the SVHN dataset, we employed the architecture used in [1]: two convolutional layers with 6 and 16 filters, respectively, joined to two fully-connected layers, or C6-C16-F400-F120.

The number of training rounds $R$ was set to 8, after the empirical analysis described in a later subsection. The neighborhood sampling size $k$ was set to five for the DOS and DS3 frameworks. Deep learning was conducted on an NVIDIA TITAN V graphics card with 2560 cores and 12 GB of global memory. For evaluation, we measured the overall accuracy and three retrieval measures: precision, recall, and F1-score, averaged over the majority and the minority classes, respectively. We include the accuracy for comparison against full-data performances.

4.2 Subspace Selection

We first compare the two subspace selection methods: the supervised selection and the fixed allocation. Table 4 summarizes their retrieval measures on the artificially imbalanced CIFAR-10 and STL-10 datasets.

              Fixed (min/maj)   Supervised (min/maj)
CIFAR-10  Pr  0.919/0.809       0.928/0.801
          Re  0.792/0.914       0.778/0.920
          F1  0.851/0.857       0.845/0.854
STL-10    Pr  0.882/0.666       0.886/0.666
          Re  0.576/0.907       0.571/0.903
          F1  0.694/0.772       0.692/0.766

Table 4: Subspace selection comparison

Figure 5: Convergence of DS3 on CIFAR-100

The two result columns show the measurements for the Fixed allocation and the Supervised selection, respectively. The precision (Pr), recall (Re), and F1 are averaged over the minority (min) and the majority (maj) classes, over ten repetitions.

From the results on F1 and accuracy, the fixed allocation yielded slightly better averages than the supervised selection, as well as smaller deviations. In addition, we observed that the fixed allocation showed a small but significant advantage in the average recall of the minority classes. Overall, we found the fixed allocation to be the preferable approach with regard to both retrieval performance and computational complexity. In the following, we report the results of DS3 with the fixed subspace allocation.

4.3 Sensitivity and Convergence Analysis

Next, we evaluated the sensitivity of the DS3 framework to the trade-off weight $\gamma$ with a grid search over five values, {0.0025, 0.005, 0.01, 0.02, 0.04}, in ten repetitions over the imbalanced CIFAR-10 and STL-10.

We observed that all measures show similar tendencies, and we only report the average accuracies for brevity. The results are summarized in Table 6. On both datasets, the accuracies were reasonably robust over the tested values. We report the results in the following section for $\gamma = 0.01$.

$\gamma$   0.0025   0.005   0.01    0.02    0.04
CIFAR-10   0.852    0.853   0.854   0.853   0.852
STL-10     0.748    0.757   0.757   0.753   0.752

Table 6: Sensitivity Analysis

Additionally, we empirically analyzed the convergence of the accuracy. Fig. 5 shows a typical run of DS3 on CIFAR-100, with the $x$- and $y$-axes indicating the iteration and the accuracy, respectively. We observed that the performance can start to decline after eight rounds, and selected $R = 8$ for the following experiment.

4.4 Comparative Analysis

In the comparative analysis, we report the average F1 scores on the minority and majority classes for the proposed and the baseline methods, as all retrieval measures showed similar tendencies over the four datasets. The accuracies are also reported for comparison with standard full-data training. The results are summarized in Table 7.

            CNN                 DOS                 DS3                 CNN (Full)
CIFAR-10    0.829/0.826/0.835   0.853/0.851/0.854   0.854/0.851/0.857   0.857
SVHN        0.520/0.462/0.648   0.758/0.746/0.763   0.755/0.744/0.762   0.850
STL-10      0.723/0.667/0.757   0.741/0.694/0.772   0.757/0.702/0.786   0.801
CIFAR-100   0.558/0.355/0.629   0.594/0.355/0.629   0.612/0.390/0.635   0.638

Table 7: Comparative Analysis (Accuracy/minority F1/majority F1)

Each of the first three columns shows three measurements: the accuracy, the average F1 over the minority classes, and the average F1 over the majority classes, for the baseline methods and DS3. The last column, CNN (Full), shows the accuracy of the baseline CNN trained with the full set of original data. The difference between the first and the last columns thus provides a reference for the impact of class imbalance. We observed that DOS was able to make up for a large portion of the impact on CIFAR-10 and SVHN. Meanwhile, the margin of improvement by DOS is relatively much smaller for CIFAR-100 and STL-10.

DOS showed a substantial setback on CIFAR-100, which may be attributed to the large number of classes. That is, the usefulness of the synthetic samples may diminish with a larger number of minority classes to supplement. With regard to STL-10, the difficulty may be attributed to the smaller number of training samples.

DS3 exhibited large advantages over DOS on CIFAR-100 and STL-10, and a slight improvement on CIFAR-10. In the case of CIFAR-10, the performance of DOS is already very close to that of the CNN trained with the full data, leaving little room for improvement.

In the case of SVHN, improvements by DS3 were not expected, as the abstract-labels did not provide information relevant to the task. The results showed that the difference between DS3 and DOS was not significant; thus, DS3 did not suffer negative effects from the irrelevant labels.

To summarize, the DS3 framework made larger improvements over the baseline DOS on the more difficult problems, i.e., those with a larger number of classes or fewer samples. On the other problems, it achieved better or nearly equivalent performance compared to the baseline, even in cases where the abstract-labels were not relevant. Finally, the run time of DS3, compared to the standard training of the CNN, increased by 24% to 41% across the four datasets. While the increase in computational time is inevitable due to the neighborhood sampling, the trade-off is justifiable in cases of substantial imbalance.

5 Conclusion

We proposed the Deep Subspace Sampling framework for utilizing automatically-generated abstract-labels in deep representation learning to enhance its robustness against class imbalance. It exploits the abstract-labels to learn deep representations such that the discriminative information for each subset of classes is acquired in a separate subspace, which can help reduce the effect of class imbalance on the structure of the deep feature space.

In the empirical study, the proposed framework showed advantages over the previous work on difficult problems with a larger number of classes and/or a smaller number of samples, and also maintained competitive performance given weak-labels which were not relevant to the task at hand.

In this paper, we limited the description of the proposed approach to handling one set of abstract-labels. However, it can naturally exploit multiple sets of labels by designating a subspace for each combination of labels.


  • [1] Ando, S., Huang, C.Y.: Deep Over-sampling Framework for Classifying Imbalanced Data. In: Machine Learning and Knowledge Discovery in Databases. pp. 770–785. LNCS 10534, Springer International Publishing, Cham (2017).

  • [2] Branco, P., Torgo, L., Ribeiro, R.P.: A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49(2), 31:1–31:50 (2016).
  • [3] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res. 16(1), 321–357 (2002).
  • [4] Coates, A., Lee, H., Ng, A.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR W&CP, vol. 15, pp. 215–223 (2011).

  • [5] Fernández, A., del Río, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3(2), 105–120 (2017).
  • [6] He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE Trans. on Knowl. and Data Eng. 21(9), 1263–1284 (2009).
  • [7] Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232 (2016).
  • [8] Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Master’s thesis, University of Toronto (2009).
  • [9] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
  • [10] Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: Rapid Training Data Creation with Weak Supervision. In: Keynote for the Third International Workshop on Declarative Learning Based Programming, in conjunction with the Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • [11] Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: Rapid training data creation with weak supervision. Proc. VLDB Endow. 11(3), 269–282 (2017).
  • [12] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014).
  • [13] Zhou, Z.H.: A brief introduction to weakly supervised learning. National Science Review 5(1), 44–53 (2018).