1 Introduction
Progressive Neural Network Learning (PNNL) [8, 25, 3, 4, 9, 21, 22, 19] aims to build the network's topology incrementally, depending on the training set given for the specific problem at hand. At each incremental training step, a PNNL algorithm adds a new set of neurons to the existing network topology and optimizes the new synaptic weights using the entire training set. Thus, throughout the network's topology progression, the number of times a PNNL algorithm iterates through the entire training set is very high. For large datasets, this approach leads to an enormous computational cost and a long training process. In this paper, we propose to perform the optimization at each incremental training step using only a subset of the training data. Our motivation in doing so is twofold. First, optimizing with respect to a subset of the training data lowers the overall computational cost. Second, using different subsets of data at each incremental step promotes the specialization of different sets of neurons at capturing different patterns in the data.
The idea of subset sampling for training machine learning methods has been proposed in different contexts in the literature. With the motivation of reducing the expense of labeling data for training, methods following the active learning paradigm [15] seek to define a sampling strategy that selects, from a large pool of unlabeled data, a sample to be labeled for the next learning round. While the active learning paradigm considers the problem of data selection in an (initially) unsupervised setting, in the context of PNNL we take advantage of the available labeling information for subset selection. Directly related to our work are methods selecting a subset of data formed by the most representative samples [14, 2, 16, 1]. These methods, however, perform the data selection process only once, based on the input data representations and the available labels; the selected subset of data is then used to train a model with fixed capacity. Different from this line of work, we propose to perform subset sampling at every incremental step of the PNNL process, with selection strategies that can also take into account the data representations learned by the current network's topology.

When building a learning system, the development process often requires running multiple experiments to select the best values for the hyperparameters associated with the learning model. For neural networks, such hyperparameters correspond to the values used, e.g., for the weight decay coefficient or the dropout percentage. In existing PNNL algorithms, the value of each hyperparameter is fixed throughout the entire training process, and the best combination of hyperparameter values is usually selected by a grid search strategy, training multiple models, each corresponding to a different combination of hyperparameter values. Different from that (traditional) approach, we propose to incorporate the hyperparameter selection process into each incremental training step, enabling adaptive hyperparameter assignment during the network's topology progression.
Coupled with the speed up gained from subset sampling, this further accelerates the overall training process and improves generalization performance as indicated by our experimental results.
The remainder of the paper is organized as follows: Section 2 reviews Progressive Neural Network Learning and the subset sampling strategies in different learning contexts. Section 3 describes the proposed progressive network training method. In Section 4, we detail our experimental setup and present empirical results. Section 5 concludes our work.
2 Related works
2.1 Progressive Neural Network Learning
In Progressive Neural Network Learning (PNNL), an algorithm starts with an initial network topology and gradually increases the capacity of the model by adding and optimizing new blocks of neurons following an iterative optimization process [8, 25, 3, 4, 9, 21, 22, 19, 20, 23]. When a new set of neurons is added to the current network topology, different PNNL algorithms determine different rules to form new synaptic connections from the new neurons to the existing ones. For example, in IELM [8] and BLS [4], the progression strategies only allow the algorithms to learn networks with one and two hidden layers, respectively, while other PNNL algorithms such as PLN [3], StackedELM [25] or HeMLGOP [22] can generate multilayer networks.
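The algorithms above differ in their progression rules and optimizers, but they share the same outer loop. A minimal sketch of that loop, where `add_block`, `optimize`, and `converged` are placeholder callables standing in for the algorithm-specific progression rule, optimizer, and stopping criterion (not any specific algorithm's API):

```python
def progressive_train(data, add_block, optimize, converged):
    """Generic PNNL outer loop: grow the topology block by block.

    The network is represented as a list of blocks; each iteration
    creates one new block, fits its weights, and appends it.
    """
    network = []                                 # current topology
    while not converged(network, data):
        block = add_block(network)               # new neurons + connections
        block = optimize(block, network, data)   # fit the new synaptic weights
        network.append(block)
    return network
```

Different PNNL algorithms are then obtained by swapping the three callables, e.g. a single-hidden-layer progression rule for IELM-style methods versus a multilayer rule for PLN- or HeMLGOP-style methods.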
Regarding the adopted optimization strategies, many algorithms, such as [8, 4, 25, 3], employ random hidden neurons to relax the optimization objective to a convex form and use convex optimization techniques to obtain global solutions. While this approach is computationally efficient and often comes with certain theoretical guarantees, most such algorithms are sensitive to hyperparameter selection and require extensive evaluation of a large set of hyperparameter values. Besides, these algorithms often construct very large network topologies to achieve good performance. Recently, the authors in [22] proposed HeMLGOP, a PNNL algorithm that combines both a randomization process and stochastic optimization to progressively train networks of heterogeneous neurons. Since HeMLGOP optimizes not only the network's topology but also the functional form of each neuron, the resulting networks are both compact and efficient. This, however, comes at a much higher training computational cost compared to algorithms employing random neurons and convex optimization.
As a variant of the HeMLGOP algorithm that optimizes only the network's topology with the standard Perceptron, Progressive Multilayer Perceptron (PMLP) yields a good tradeoff between optimization complexity, topology compactness and learning capability. Thus, in this paper we apply our proposal to speed up and enhance PMLP. Although our investigation is limited to PMLP, the proposed method can be generalized to all PNNL algorithms, as will be indicated in the next section.
2.2 Subset Sampling
In Active Learning, query-acquiring or pool-based methods refer to a class of algorithms that use different sampling strategies to select the most informative samples from a pool of unlabeled data. The most representative examples in this category include the information-theoretic method in [11], the ensemble method in [12], and the method based on uncertainty heuristics in [18]. For a comprehensive review of active learning methods, we refer the reader to [15].

Subset sampling methods have also been proposed in other contexts. For example, submodular function optimization for selecting a subset of samples was proposed in [1] to speed up neural network training. To study sample redundancy, [2] performs clustering using representations generated by a pretrained model, while [24] measures the importance of a sample via its gradient information. In the context of dataset compression and distributed learning, [16] optimizes sample selection and the model's parameters iteratively based on convex optimization.
3 Proposed Method
3.1 Subset Sampling
Let us denote by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ the training set formed by $N$ samples, with $\mathbf{x}_i$ and $y_i$ being the $i$-th sample and its label, respectively. Let us also denote by $f_k(\cdot\,; \Theta_k, \Lambda_k)$ the function induced by the neural network's topology at progression step $k$, with $\Theta_k$ representing the set of parameters to optimize and $\Lambda_k$ representing the set of hyperparameters.

At step $k$, instead of optimizing $f_k$ with respect to $\Theta_k$ on $\mathcal{D}$, we propose to solve the optimization problem on a subset $\mathcal{S}_k \subset \mathcal{D}$ having cardinality $M < N$, i.e.:

$$\Theta_k^{*} = \underset{\Theta_k}{\arg\min} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}_k} \mathcal{L}\big(f_k(\mathbf{x}_i; \Theta_k, \Lambda_k), y_i\big) \qquad (1)$$

where $\mathcal{L}$ denotes the loss function. To this end, we evaluate three different sample selection methods, defined based on the following criteria:

Random Sampling: at each progression step $k$, we form $\mathcal{S}_k$ by uniformly selecting $M$ samples from $\mathcal{D}$. Although random sampling has been theoretically shown to be inferior to other sampling strategies in many learning contexts [5, 6, 15, 1], our empirical study shows that this is not necessarily the case for PNNL. Throughout the architecture progression process, random sampling ensures that diverse sets of samples are iteratively presented to the network, thus promoting diversity of the newly added neurons with respect to the existing ones.

Top-$M$ Sampling based on misclassification: at each progression step $k$, this method computes the loss induced by each sample in $\mathcal{D}$ using the network's topology learned at step $k-1$, i.e., $f_{k-1}$, and selects the top $M$ samples which induce the highest loss values. Since the loss values directly provide the supervisory signal when updating the model's parameters, by conditioning on the current model's knowledge expressed via $f_{k-1}$, this strategy enforces the algorithm to learn new blocks of neurons that can correctly classify the most difficult cases.

Top-$M$ Sampling based on diverse misclassification: while the previous sampling method solely considers the most difficult-to-classify samples, this strategy also aims to promote diversity and reduce similarity among the selected samples. To do so, we perform K-Means clustering using the data representations as inputs. The number of clusters $K$, which is predefined, can be set using simple heuristics, such as being equal to the number of classes in classification tasks. We also compute the loss value induced by each sample using $f_{k-1}$ and select the top $m$ samples that induce the highest loss values in every cluster, with $m = M/K$.
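The three strategies can be sketched as follows; a minimal illustration assuming scikit-learn's KMeans for clustering and per-sample losses precomputed with the previous network (the function and its argument names are hypothetical, not the paper's implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_subset(x, losses, M, strategy="random", n_clusters=None, rng=None):
    """Return indices of an M-sample subset for one progression step.

    `losses` holds the per-sample loss under the network learned at the
    previous step; random sampling ignores it.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(losses)
    if strategy == "random":
        # Uniform sampling without replacement: diverse data at every step.
        return rng.choice(n, size=M, replace=False)
    if strategy == "top_loss":
        # The M samples the previous network handles worst.
        return np.argsort(losses)[-M:]
    if strategy == "diverse_top_loss":
        # Cluster the inputs, then take the top-(M/K) losses per cluster.
        k = n_clusters or 3  # heuristic: e.g. the number of classes
        clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        m = M // k
        picked = []
        for c in range(k):
            idx = np.flatnonzero(clusters == c)
            picked.extend(idx[np.argsort(losses[idx])[-m:]].tolist())
        return np.asarray(picked)
    raise ValueError(f"unknown strategy: {strategy}")
```

The returned subset then feeds the optimization of the new block of neurons; note that with random sampling the per-sample losses need not be computed at all.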
3.2 Online Hyperparameter Selection
[Table 1. Test-set recognition accuracy on Caltech256, MIT and CelebA of PMLP-Random, PMLP-TopLoss and PMLP-CTopLoss at subset percentages 10%, 20% and 30%, and of PMLP, StackedELM [25] and PLN [3] trained on the full set; numeric entries not recovered.]
In most existing PNNL algorithms, the value of each hyperparameter is fixed throughout the network's topology progression. An algorithm is run for all combinations of hyperparameter values defined a priori, and the combination leading to the best performance on the validation set is selected for final model deployment.
Since PNNL algorithms gradually increase the complexity of the neural network, it is intuitive that the model might require different degrees of regularization at different stages. Besides, with subset sampling incorporated, we train new blocks of neurons with different subsets of training samples at each step, which might require different hyperparameter configurations. Thus, instead of performing hyperparameter selection in an offline fashion, we propose to incorporate the hyperparameter selection procedure into progressive learning at every incremental step.
Particularly, let $\mathcal{H}$ be the set of all combinations of hyperparameter values, and $H$ be the cardinality of $\mathcal{H}$. At each progression step $k$, after determining $\mathcal{S}_k$, we solve $H$ optimization problems corresponding to the $H$ assignments of hyperparameter values:

$$\Theta_k^{(h)} = \underset{\Theta_k}{\arg\min} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}_k} \mathcal{L}\big(f_k(\mathbf{x}_i; \Theta_k, \lambda_h), y_i\big), \quad \lambda_h \in \mathcal{H} \qquad (2)$$

The algorithm then selects the combination $\lambda_{h^*}$ (and the corresponding parameters $\Theta_k^{(h^*)}$) that achieves the best performance on the validation set for the newly added block of neurons. Online selection not only ensures that the best hyperparameter values are selected for each newly added block of neurons, but also reduces the computational overhead incurred when running individual network progression steps.
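This per-step selection amounts to a small grid search over the candidate values; a minimal sketch, where `train_block` (solving problem (2) for one hyperparameter assignment) and `evaluate` (validation scoring) are hypothetical placeholders for the algorithm-specific routines:

```python
import itertools

def add_block_with_online_hparams(train_block, evaluate, subset, val_set, grid):
    """Train the new block once per hyperparameter combination and keep
    the combination with the best validation performance.

    `grid` maps hyperparameter names to lists of candidate values, e.g.
    {"weight_decay": [0.1, 0.01], "dropout": [0.0, 0.5]}.
    """
    best_params, best_score, best_hp = None, float("-inf"), None
    for values in itertools.product(*grid.values()):
        hp = dict(zip(grid.keys(), values))
        params = train_block(subset, hp)    # solve problem (2) under hp
        score = evaluate(params, val_set)   # validation performance
        if score > best_score:
            best_params, best_score, best_hp = params, score, hp
    return best_params, best_hp
```

Because each candidate block is trained only on the current subset, trying all $H$ combinations at one step is far cheaper than rerunning the whole progression per combination, which is the source of the overhead reduction noted above.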
4 Experiments
[Table 2. Total number of unique training samples on Caltech256, MIT and CelebA selected by PMLP-Random, PMLP-TopLoss and PMLP-CTopLoss at subset percentages 10%, 20% and 30%; numeric entries not recovered.]
To evaluate the effectiveness of the proposed subset sampling and online hyperparameter selection method, we perform experiments on publicly available datasets designed for object recognition (Caltech256 [7]), indoor scene recognition (MIT [13]) and face recognition (CelebA [10]). For the CelebA dataset, we used a subset of the images, corresponding to a subset of the identities. Each dataset was split into training, validation and test sets. The inputs to all PNNL algorithms are deep features (global average pooling of the last convolutional layer) from a network pretrained on the ImageNet dataset [17]. We demonstrate subset sampling with subset percentages of 10%, 20% and 30%, and online hyperparameter selection, on PMLP. As previously mentioned, the adopted PMLP follows the progression rule of HeMLGOP in [22]. Here we should note that subset selection was used only to speed up the network progression process (topology construction); the final topologies were fine-tuned with the full set of training data. We also evaluated other PNNL algorithms, namely StackedELM [25] and PLN [3], which run on the full training set at each step. For detailed information about our experimental protocols and hyperparameter settings, we refer the reader to our publicly available implementation of this work (https://bit.ly/2MshRza).
[Table 3. Average time taken to optimize one block of neurons on Caltech256, MIT and CelebA for PMLP-Random, PMLP-TopLoss and PMLP-CTopLoss at subset percentages 10%, 20% and 30%, and for full-set PMLP, StackedELM [25] and PLN [3]; numeric entries not recovered.]
Table 1 shows the recognition accuracy on the test set of all models on the three datasets. For compact presentation, we refer to the proposed PMLP variants based on Random Sampling, Top-M Sampling based on misclassification, and Top-M Sampling based on diverse misclassification as PMLP-Random, PMLP-TopLoss and PMLP-CTopLoss, respectively. Different from the empirical results obtained in other learning contexts, the best performing subset selection strategy is random sampling at the lowest percentage level (10%). In fact, PMLP-Random at 10% performs better than all other algorithms, including the original PMLP. This can be attributed to the combined effects of random subset sampling and online hyperparameter selection. Random sampling with a small percentage means that different blocks of neurons are optimized with respect to diverse subsets of data; the final network after optimization can be loosely seen as an ensemble of smaller networks. On the other hand, when a subset of data persists in being misclassified throughout the network's topology progression, the loss-based sampling strategies bias the algorithm toward selecting only these samples, reducing the diversity of information presented to the network.
Table 2 shows the total number of unique samples selected by each algorithm under the different sampling strategies. This table reflects the degree of diversity in the inputs observed by the networks trained with different sampling strategies; the numbers are clearly much higher for random sampling. While the original PMLP presents a greater amount of information to the network during progression, every block of neurons in PMLP observes the same set of data, which might lead to overfitting.
Table 3 shows the average time taken to optimize one block of neurons in each algorithm while Table 4 shows the total time taken to perform experiments for a particular setting. Every experiment run was performed on the same node configuration (4 CPU cores, 16 GB of RAM). It is clear that using subset selection, the average time taken at each step of PMLP is greatly reduced (Table 3). Combining subset selection and online hyperparameter selection, the total experiment time is significantly lower (Table 4).
[Table 4. Total experiment time on Caltech256, MIT and CelebA for PMLP-Random, PMLP-TopLoss and PMLP-CTopLoss at subset percentages 10%, 20% and 30%, and for full-set PMLP, StackedELM [25] and PLN [3]; numeric entries not recovered.]
5 Conclusion
In this work, we proposed subset sampling and online hyperparameter selection to speed up and enhance PNNL algorithms. Empirical results demonstrated with PMLP show that the proposed approach can not only accelerate the optimization procedure of PMLP but also improve the generalization performance of the resulting networks.
6 Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). This publication reflects the authors’ views only. The European Commission is not responsible for any use that may be made of the information it contains.
References
[1] (2019) DeepSub: a novel subset selection framework for training deep learning architectures. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1615-1619.
[2] (2019) Semantic redundancies in image-classification datasets: the 10% you don't need. arXiv preprint arXiv:1901.11409.
[3] (2017) Progressive learning for systematic design of large neural networks. arXiv preprint arXiv:1710.08177.
[4] (2017) Broad learning system: an effective and efficient incremental learning system without the need for deep architecture. IEEE Transactions on Neural Networks and Learning Systems 29(1), pp. 10-24.
[5] (1997) Selective sampling using the query by committee algorithm. Machine Learning 28(2-3), pp. 133-168.
[6] (2006) Query by committee made real. In Advances in Neural Information Processing Systems, pp. 443-450.
[7] (2007) Caltech-256 object category dataset.
[8] (2007) Convex incremental extreme learning machine. Neurocomputing 70(16-18), pp. 3056-3062.
[9] (2017) Progressive operational perceptrons. Neurocomputing 224, pp. 142-154.
[10] (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738.
[11] (1992) Information-based objective functions for active data selection. Neural Computation 4(4), pp. 590-604.
[12] (1998) Employing EM and pool-based active learning for text classification. In Proc. International Conference on Machine Learning (ICML), pp. 359-367.
[13] (2009) Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 413-420.
[14] (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489.
[15] (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
[16] (2019) Learning and data selection in big datasets. In International Conference on Machine Learning (ICML).
[17] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[18] (2001) Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2(Nov), pp. 45-66.
[19] (2019) Learning to rank: a progressive neural network learning approach. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8355-8359.
[20] (2019) Data-driven neural architecture learning for financial time-series forecasting. arXiv preprint arXiv:1903.06751.
[21] (2018) Progressive operational perceptron with memory. arXiv preprint arXiv:1808.06377.
[22] (2019) Heterogeneous multilayer generalized operational perceptron. IEEE Transactions on Neural Networks and Learning Systems.
[23] (2019) Knowledge transfer for face verification using heterogeneous generalized operational perceptrons. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1168-1172.
[24] (2018) Are all training examples created equal? An empirical study. arXiv preprint arXiv:1811.12569.
[25] (2014) Stacked extreme learning machines. IEEE Transactions on Cybernetics 45(9), pp. 2013-2025.