Deep neural networks (DNNs) have achieved performance breakthroughs in many real-world applications due to their powerful feature learning capabilities, which are typically realized by sophisticated architectures involving a massive number of parameters. Training a DNN is equivalent to solving a highly complex non-convex optimization task which can easily get stuck in inferior local optima, leading to undesired training results.
A commonly employed way to train a DNN relies on an available training set, where a certain loss function defined on the training set is optimized with respect to the network parameters to derive optimal parameter values. This optimization process, a.k.a. the training process, is iterative and usually terminated when a pre-specified maximum number of training epochs is reached. However, manually specifying the maximum number of training epochs is often too subjective, which increases the risk of over-fitting or under-fitting and accordingly results in undesired generalization performance. To address this issue, another training paradigm has gained much popularity, which partitions the original training set, namely the gross training set, into a training set and a validation set via a certain sampling method and utilizes the validation set to estimate the generalization performance of the trained model as the training process (based on the training set) proceeds [1, 2]. This paradigm may improve the generalization performance of the trained model. However, the validation set may not well represent the potential test data and thus be less effective at providing an unbiased estimate of generalization performance.
To address the above issues, we propose a novel DNN training framework which formulates multiple related training tasks by using a certain sampling method to generate multiple different pairs of training and validation sets from the gross training set, and solves these related tasks simultaneously via a newly emerging multi-task optimization (MTO) technique that allows the useful knowledge (e.g., promising network parameters) obtained from one training task to be transferred to other training tasks. Specifically, this framework generates multiple pairs of training and validation sets from the gross training set via a specific sampling method, trains a DNN model of a pre-specified network structure on each pair while enabling the useful knowledge obtained from one training process to be shared with other training processes via MTO, and finally outputs the trained model which achieves the overall best performance across the validation sets of all pairs. The knowledge transfer and sharing mechanism featured in the proposed framework can not only enhance training effectiveness by helping the training processes escape from local optima but also improve generalization via the implicit regularization that other training processes impose on each training process. It is worth noting that the cross-validation technique commonly used to improve generalization performance when training machine learning (ML) models is for tuning the hyper-parameters of an ML model rather than its model parameters per se. It is therefore irrelevant to this study, which focuses on learning the parameters (i.e., connection weights and biases) of a DNN model with pre-specified hyper-parameters. Another ML technique, ensemble learning, also trains multiple models to make a prediction. However, it aims to achieve the best performance by combining all models in a certain way, while our proposed framework aims to train a single best model with the help of training other models.
We implement the proposed training framework, parallelize the implementation on a GPU cluster, and apply it to training three popular DNN models, i.e., DenseNet-121, MobileNetV2, and SqueezeNet. Performance evaluation and comparison on three classification datasets of different nature demonstrate the superiority of the proposed training framework over the conventional training paradigm in terms of the classification accuracy obtained on the testing set.
In the following, we will first introduce the background of this work in section II, then describe the proposed framework and its implementations in detail in section III, and finally discuss and analyze experimental results in section IV, followed by concluding remarks and future work in section V.
II-A Training Deep Neural Networks
Gradient-descent-based optimization algorithms are widely used for training DNNs, e.g., in supervised learning problems. These algorithms normally use back propagation (BP) to calculate the gradients of the loss with respect to the DNN's parameters and accordingly update parameter values. Specifically, given a training set composed of multiple pairs of an input and its output (i.e., the label of the input), a loss function is formulated which measures the mismatch between the network's output w.r.t. an input and the actual output of that input, summed over all input-output pairs in the training set. Then, BP with stochastic gradient descent is applied to calculate the partial derivative of the loss function with respect to each parameter in the DNN, from the last layer to the first layer. Next, parameter values are updated based on the calculated derivatives via a certain learning rule. This training process is iterated until a certain stopping criterion is met.
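The training loop above can be illustrated with a minimal sketch. As a hedged stand-in for a full DNN, we use logistic regression (a one-layer network), whose gradient computation is the degenerate case of BP; all data and hyper-parameter values here are illustrative, not the paper's.

```python
import numpy as np

# Minimal sketch of gradient-descent training. Logistic regression stands in
# for a DNN; its single-layer gradient is the simplest case of BP.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # inputs
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)             # labels (linearly separable toy data)

w = np.zeros(5)                                # parameters to learn
lr = 0.5                                       # learning rate (illustrative)
for epoch in range(100):                       # stop after a fixed epoch budget
    p = 1.0 / (1.0 + np.exp(-(X @ w)))         # forward pass (sigmoid output)
    grad = X.T @ (p - y) / len(y)              # gradient of the cross-entropy loss
    w -= lr * grad                             # parameter update (learning rule)

acc = np.mean((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == (y == 1))
print(round(float(acc), 2))
```

On this separable toy data, the loop converges to a high training accuracy, mirroring how iterating forward pass, gradient computation, and update reduces the training loss.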
Training DNNs relies on an available training set which may be used in different ways, leading to different training paradigms. One common practice is to directly train a DNN on the gross training set (i.e., the original training set) and terminate the training process after a pre-specified maximum number of training epochs. However, the subjective choice of the maximum number of training epochs is likely to make the trained model overfit the training set and thus lead to inferior generalization. Another popular training paradigm addresses this issue by partitioning the gross training set into a training set and a validation set via random splitting, using the validation set to estimate the generalization performance of the trained model as the training process proceeds, and accordingly stopping the training process if the generalization performance of the trained model cannot be much improved [8, 9, 10, 11, 12].
In ML, cross-validation is a widely used strategy to improve the generalization performance of a trained ML model. However, it is typically applied to tune the hyper-parameters of the model, e.g., the number of layers, the number of hidden neurons, and the learning rate in the context of DNN training. In this work, we focus on learning the parameters (i.e., connection weights and biases) of a DNN model with pre-specified hyper-parameters. Therefore, the cross-validation strategy is irrelevant to this study.
II-B Multi-Task Optimization
MTO investigates how to effectively and efficiently tackle multiple optimization tasks concurrently via online knowledge transfer. This paradigm has been inspired by the well-established concepts of transfer learning and multi-task learning (MTL) in predictive analytics. Existing MTO techniques are mainly developed for Bayesian optimization [16, 17, 18]
or evolutionary computation [19, 20, 21, 22]. Swersky et al. [16] proposed multi-task Bayesian optimization (MTBO), which is based on the well-studied multi-task Gaussian process models. This work can transfer the knowledge gained from prior optimizations to new tasks in order to find optimal hyperparameter settings more efficiently, or optimize multiple tasks simultaneously when the goal is to maximize average performance, e.g., optimizing k-fold cross-validation.
Gupta et al. [19] proposed an MTO-based evolutionary algorithm (EA) where tasks benefit from implicit knowledge transfer during task solving, which can often lead to accelerated convergence on a variety of complex optimization functions. Besides EAs, other variants [24, 25] have also been developed to solve MTO problems. In [20], Feng et al. proposed an evolutionary multitasking algorithm with explicit knowledge transfer via a denoising autoencoder, which demonstrates higher efficacy than implicit knowledge transfer. Zhang et al. [21] proposed an MTO-based framework for generating feature subspaces for ensemble classification.
III Proposed Method
The conventional way of training a DNN corresponds to minimizing a loss function that measures the mismatch between the network's output w.r.t. an input and the actual output, which can be regarded as a single-task optimization (STO) problem. One common training paradigm can be defined as follows: given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ and $y_i$ refer to the input and the corresponding actual output, it trains a DNN $f(\cdot; \boldsymbol{\theta})$ until reaching the pre-specified maximum number of epochs (Fig. 1(a)) to minimize the loss function

$\min_{\boldsymbol{\theta}} L(D; \boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i)$, (1)

where $\ell(\cdot, \cdot)$ denotes the per-sample loss (e.g., the cross-entropy loss). Once the training is completed, the trained DNN is evaluated on the testing set, which is invisible during training. Another popular training paradigm partitions the gross training set $D$ into a training set $D_{tr}$ and a validation set $D_{val}$ via a certain sampling method and uses $D_{val}$ to estimate the generalization performance (evaluated by the validation loss $L(D_{val}; \boldsymbol{\theta})$) of the trained DNN during the training process (Fig. 1(b)). The training process is terminated if the validation loss cannot be much reduced, and the trained DNN with the minimum validation loss is regarded as the final trained model.
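The second paradigm above, validation-based early stopping, can be sketched as follows; `train_step` and `val_loss` are toy stand-ins for real training and validation evaluation, and the patience value is illustrative.

```python
# Sketch of validation-based early stopping: keep training while the
# validation loss improves; stop after `patience` epochs without improvement
# and report the epoch with the lowest validation loss.
def train_with_early_stopping(train_step, val_loss, patience=10, max_epochs=1000):
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)                    # one epoch of training
        loss = val_loss(epoch)               # estimate generalization performance
        if loss < best_loss - 1e-8:          # improvement: remember the best model
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:           # no improvement for `patience` epochs
                break
    return best_epoch, best_loss

# Toy validation-loss curve: decreases, then rises (overfitting) after epoch 30.
curve = lambda e: abs(e - 30) / 30 + 0.1
epoch, loss = train_with_early_stopping(lambda e: None, curve, patience=5)
print(epoch)  # stops shortly after the minimum at epoch 30
```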
Unlike the above conventional training paradigms, our proposed training framework has two modules: related training tasks formulation and MTO (Fig. 1(c)). In the first module, we formulate $K$ related training tasks $\{T_1, \dots, T_K\}$, where each task $T_k$ obtains a distinct pair of training and validation sets $(D_{tr}^k, D_{val}^k)$ via a certain sampling method. Each task $T_k$ then uses $D_{tr}^k$ to train one individual DNN model with a pre-specified network structure and uses $D_{val}^k$ to monitor the generalization performance during the training process.
After that, the MTO module solves all tasks simultaneously and applies knowledge transfer across tasks to help them find better model parameters which produce a lower validation loss on their associated validation sets during the training process. Finally, the trained DNN which achieves the overall best performance across all validation sets is selected as the final trained model. The conventional STO-based training method (with a validation set, Fig. 1(b)) can be regarded as a special case of our proposed framework with a single training task.
In this framework, the training set of each task is different; accordingly, the model learned in each task may contain useful knowledge (e.g., promising parameter values) which can be transferred and shared with other tasks to help their training processes escape from inferior local optima. Meanwhile, the validation sets of different tasks provide estimates of generalization from different perspectives. As a result, knowledge transfer and sharing across tasks may impose implicit regularization on the training process of one task from the training processes of the other tasks, aiming to produce a DNN with improved generalization which performs well on all validation sets.
III-B1 Formulating Multiple Related Training Tasks
In this implementation, we formulate each training task $T_k$ as follows: firstly, we randomly split a certain ratio of samples from the gross training set to serve as the validation set $D_{val}^k$ and keep the remaining samples as the training set $D_{tr}^k$, forming the pair $(D_{tr}^k, D_{val}^k)$; secondly, we use this pair to formulate a training task (2) which aims to train one individual DNN model with a pre-specified network structure via its pair of training and validation sets.
During training, $D_{val}^k$ is used to estimate the change in the generalization ability of the trained model and also provides a way to evaluate whether the knowledge from other tasks is beneficial for improving the generalization ability of task $T_k$.
This process is repeated $K$ times to formulate $K$ training tasks $\{T_1, \dots, T_K\}$. These tasks are highly related since they are all sampled from the same gross training set.
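The task-formulation step can be sketched as generating K distinct random train/validation splits of the gross training set; the split ratio, K, and seed below are illustrative assumptions.

```python
import random

# Sketch of formulating K related training tasks: each task gets its own
# random train/validation split of the same gross training set.
def formulate_tasks(gross_set, k=4, val_ratio=0.1, seed=0):
    rng = random.Random(seed)
    tasks = []
    for _ in range(k):
        samples = list(gross_set)
        rng.shuffle(samples)                  # a fresh random split per task
        n_val = int(len(samples) * val_ratio)
        # (training set, validation set) pair for this task
        tasks.append((samples[n_val:], samples[:n_val]))
    return tasks

tasks = formulate_tasks(range(100), k=4)
print(len(tasks), len(tasks[0][0]), len(tasks[0][1]))  # 4 tasks, 90/10 split each
```

Because every pair is drawn from the same gross training set, the resulting tasks are closely related, which is what makes knowledge transfer between them plausible.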
III-B2 Adaptive Multi-Task Optimization based DNN Training Algorithm
We propose an adaptive MTO-based DNN training algorithm (AMTO) which aims to solve all tasks simultaneously and transfer the intermediately learned knowledge (which we define as the model parameters $\boldsymbol{\theta}$) across all tasks to improve their training performance. To effectively transfer knowledge across tasks, especially when a large number of tasks are solved together, each formulated training task can (i) learn its relationship with other tasks so that knowledge transfer is more likely to occur between related tasks, and (ii) determine whether to accept the transferred knowledge based on whether it helps improve generalization performance during the training process.
Specifically, each task $T_k$ maintains a relationship list $RL_k$ which records how it is related to the other tasks. For task $T_k$, its relationship list is represented as

$RL_k = \{r_{k,j} \mid j = 1, \dots, K, \; j \neq k\}$, (3)

where $r_{k,j}$ represents the degree of relationship of $T_k$ to $T_j$. Then we convert the elements of $RL_k$ to the probabilities of acquiring knowledge from the corresponding tasks, which sum to one, via the softmax function

$p_{k,j} = \exp(r_{k,j}) / \sum_{j' \neq k} \exp(r_{k,j'})$. (4)

Apparently, a higher $p_{k,j}$ represents a higher probability of acquiring knowledge from $T_j$.
At the beginning of the algorithm, all elements of $RL_k$ are initialized to zero. Then task $T_k$ selects another task $T_j$ at random according to the probabilities generated from (4) and acquires its model parameters $\boldsymbol{\theta}_j$ as $\tilde{\boldsymbol{\theta}}_k$. We name this operation knowledge reallocation.
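The knowledge reallocation step can be sketched as follows. For simplicity the relationship list here already contains only the scores toward other tasks (the self-entry is excluded), and the score values are illustrative.

```python
import math
import random

# Sketch of knowledge reallocation: task k converts its relationship list
# into sampling probabilities via softmax, then draws a source task to copy
# model parameters from.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for numerical stability
    z = sum(exps)
    return [e / z for e in exps]               # probabilities summing to one, as in (4)

def reallocate(relationship_list, rng):
    probs = softmax(relationship_list)         # higher score -> higher probability
    return rng.choices(range(len(probs)), weights=probs)[0]

rl = [0.0, 2.0, -1.0]                          # relationship of task k to three other tasks
print(softmax(rl))                             # the task with score 2.0 dominates
src = reallocate(rl, random.Random(0))         # index of the chosen source task
```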
After that, a temporary DNN training task $\tilde{T}_k$ is formulated as (5) to evaluate whether training $\tilde{\boldsymbol{\theta}}_k$ on $D_{tr}^k$ can achieve better generalization, estimated on $D_{val}^k$, than $\boldsymbol{\theta}_k$. Then the model parameters of all tasks, including $T_k$ and $\tilde{T}_k$, are trained for a fixed number of iterations simultaneously via a gradient-descent-based method. Next, for each task $T_k$, we evaluate the validation losses $L(D_{val}^k; \boldsymbol{\theta}_k)$ and $L(D_{val}^k; \tilde{\boldsymbol{\theta}}_k)$ on the corresponding validation set and substitute $\boldsymbol{\theta}_k$ with $\tilde{\boldsymbol{\theta}}_k$ if the latter achieves a lower validation loss. In this way, task $T_k$ can actively accept the knowledge from other tasks if that knowledge helps improve its generalization performance and decline it otherwise. We name this operation determining transfer. Meanwhile, the relationship list is updated. Specifically, $r_{k,j}$ is updated by (6).
In other words, $r_{k,j}$ increases if the transferred $\tilde{\boldsymbol{\theta}}_k$, after training on $D_{tr}^k$, achieves a lower validation loss on $D_{val}^k$ than $\boldsymbol{\theta}_k$, and decreases otherwise. After this operation, the algorithm goes back to the knowledge reallocation operation, or terminates if it reaches the maximum number of training iterations or the validation loss of any task does not reduce for a pre-specified number of consecutive validations.
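The determining-transfer step and the relationship update can be sketched together. The fixed +/- delta used here is an illustrative stand-in for the paper's update rule (6), and all loss values are made up.

```python
# Sketch of determining transfer: keep the transferred parameters only if they
# achieve a lower validation loss, and strengthen or weaken the relationship
# score of the source task accordingly.
def determine_transfer(own_loss, transferred_loss, rl, src, delta=1.0):
    accept = transferred_loss < own_loss       # did the transfer help generalization?
    rl = list(rl)
    rl[src] += delta if accept else -delta     # illustrative stand-in for update (6)
    return accept, rl

# Transferred parameters reach a lower validation loss -> accept and reward source 1.
accept, rl = determine_transfer(0.52, 0.47, [0.0, 0.0, 0.0], src=1)
print(accept, rl)  # transfer accepted; relationship to the source strengthened
```

Declined transfers lower the source's score, so future knowledge reallocation (via the softmax probabilities) gradually favors sources that have actually helped.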
Through periodically performing knowledge reallocation and determining transfer, each task can share its learned knowledge with other tasks and investigate whether the knowledge from other tasks is beneficial for improving its own generalization performance. In this process, once a task gets trapped in inferior local optima (i.e., unable to further reduce the validation loss), the knowledge from other tasks can potentially help it escape. Meanwhile, the knowledge transferred from different training tasks imposes implicit regularization on the trained DNNs, which improves the generalization performance. At the end of the training process, the DNN which achieves the highest harmonic mean accuracy ($HA$) across all validation sets is selected as the final output. Equation (7) defines the $HA$:

$HA = K / \sum_{k=1}^{K} (1 / a_k)$, (7)

where $a_k$ represents the accuracy evaluated on validation set $D_{val}^k$.
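The final model selection by harmonic mean accuracy can be sketched as follows; the accuracy values are illustrative.

```python
# Sketch of final model selection: the harmonic mean accuracy HA = K / sum(1/a_k)
# across all K validation sets rewards models that do well everywhere and
# penalizes a single weak validation set.
def harmonic_accuracy(accs):
    return len(accs) / sum(1.0 / a for a in accs)

val_accs = {
    "model_1": [0.90, 0.88, 0.91, 0.89],   # consistently good
    "model_2": [0.95, 0.70, 0.92, 0.90],   # one weak validation set
}
best = max(val_accs, key=lambda m: harmonic_accuracy(val_accs[m]))
print(best)  # the harmonic mean penalizes model_2's weak 0.70
```

Compared with a plain arithmetic mean, the harmonic mean drops sharply when any single $a_k$ is low, which matches the stated goal of performing well on all validation sets.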
III-B3 Parallelization of the Implementation on a GPU Cluster
The algorithm is well-suited for parallelization to improve efficiency. We implement it on a GPU-enabled supercomputer called OzSTAR (https://supercomputing.swin.edu.au/). As demonstrated in Fig. 2, we define the Computing Unit (CU), which consists of two GPUs, as the basic unit for solving one training task. The two GPUs act as master and slave respectively, where the slave solves the temporary training task $\tilde{T}_k$ and the master solves $T_k$. During training, all formulated training tasks are solved simultaneously, with each CU solving one task. In this case, the efficiency is comparable to that of the conventional STO-based training paradigm. We present the pseudocode of the parallelized AMTO in Algorithm 1.
IV Experiments
This section evaluates the proposed AMTO method on three publicly available image classification datasets, aiming to demonstrate that:
(i) the proposed AMTO algorithm can achieve better generalization performance than the conventional STO;
(ii) the performance of the proposed AMTO algorithm improves with the number of formulated tasks.
In the following, we elaborate on the dataset details, the experimental setting, and the results with analysis.
IV-A Datasets
This dataset (UCMerced) was manually extracted from the USGS National Map Urban Area Imagery collection for various urban areas around the United States. The pixel resolution of this public-domain imagery is 0.3 m. The dataset contains 21 land-use classes with 100 images per class, and each image has a size of 256×256 pixels. We randomly divide this dataset into a gross training set (80%) and a testing set (20%).
This dataset has around 7400 images covering 37 different breeds of pets. It has been pre-partitioned into a gross training set (50%) and a testing set (50%). The relatively small ratio of the training set increases the challenge of training a DNN with good generalization ability.
This dataset contains 2800 remote sensing images from 7 typical land-use classes. There are 400 images per class collected from Google Earth, sampled at 4 different scales with 100 images per scale. All images have the same size. This dataset is rather challenging due to the wide diversity of the scene images, which were captured in different seasons and under various weather conditions and sampled at different scales. As with UCMerced, this dataset is randomly partitioned into a gross training set (80%) and a testing set (20%).
We further generate training and validation sets from the gross training set for training and use the testing set for testing.
IV-B Experimental Setting
We compare our method with the conventional STO training paradigm (training one individual DNN with a validation set) on several popular DNN models, including SqueezeNet, MobileNetV2, and DenseNet-121. In the STO, we randomly split 10% of the gross training set of each dataset for validation and use the remainder for training. In the MTO, we formulate each related training task with a distinct pair of validation and training sets generated from the gross training set, where the ratio for the validation set is also 10%.
Since the datasets we use are small, we initialize these DNNs with parameters pre-trained on ImageNet. The training samples are augmented by random horizontal flipping and resized to 224×224 pixels to match the required input size of the DNN models. Each single task is solved by SGD with Nesterov momentum, where the momentum is 0.9 and the initial learning rate and maximum number of training iterations are pre-specified. The mini-batch size is 64, and the learning rate is dropped by a factor of 0.1 at pre-specified iterations. We apply early stopping to both STO and AMTO: the training process is terminated if the validation loss of any task does not reduce after 10 consecutive validations. For the AMTO method, we formulate four training tasks and apply knowledge reallocation and determining transfer every 100 training iterations.
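The step learning-rate decay used above can be sketched as follows; the milestone iterations and base learning rate here are illustrative, not the paper's values.

```python
# Sketch of a step learning-rate schedule: the rate is multiplied by `gamma`
# (0.1, i.e., "dropped by 0.1") at each milestone iteration already passed.
def lr_at(iteration, base_lr, milestones, gamma=0.1):
    passed = sum(1 for m in milestones if iteration >= m)  # milestones reached so far
    return base_lr * (gamma ** passed)

milestones = [3000, 6000]                                  # illustrative milestones
print(lr_at(0, 0.01, milestones),
      lr_at(3500, 0.01, milestones),
      lr_at(7000, 0.01, milestones))
```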
The experiments are executed for 5 runs on each dataset and DNN model, where the STO and AMTO start from the same random seed in each run. We use the mean Top-1 accuracy on the testing set over the 5 runs as the metric to measure the generalization performance of the trained DNN. All the experiments are implemented with PyTorch and run on the HPC platform OzSTAR, where each node has two Nvidia Tesla P100 GPUs.
IV-C Results and Analysis
IV-C1 Comparison With the Single Task Optimization
Training a DNN for image classification aims to find a DNN model with desirable generalization performance, which is measured by the testing performance after training. In this experiment, we compare our proposed AMTO method with the conventional STO in terms of Top-1 accuracy to verify its effectiveness in improving generalization ability.
Table I compares the mean Top-1 accuracy over 5 runs achieved by STO and AMTO on the three datasets with the three popular architectures. From the table we can make the following observations: (i) the DNNs trained by AMTO perform better than those trained by STO in all cases, which demonstrates that AMTO is effective in training DNNs with better generalization performance; (ii) the networks with a smaller capacity (SqueezeNet and MobileNetV2) generally benefit more from AMTO. This is noteworthy as small networks often perform less desirably due to trade-offs with speed and size. Improving the performance of small networks can greatly enhance their applicability, e.g., on portable devices.
IV-C2 AMTO With Different Numbers of Formulated Tasks
The prior experiments studied AMTO with four formulated tasks. We next investigate how AMTO scales with different numbers of formulated training tasks. Fig. 3 shows AMTO's mean validation loss, as well as the mean Top-1 accuracy, on the three datasets with 1, 2, 4, and 6 formulated tasks, where one formulated task corresponds to the conventional STO. From this figure, we can see that the mean validation loss of the target task reduces as the number of formulated tasks increases in all cases. This demonstrates that the optimization ability of AMTO is enhanced as the number of formulated tasks increases.
On the other hand, the mean Top-1 accuracy on the testing set of the trained DNN is higher than that of STO in all cases, which verifies the effectiveness of AMTO in improving the DNN's generalization performance. It is also noticeable that the mean Top-1 accuracy does not monotonically improve as the mean validation loss decreases. This phenomenon is reasonable since a distribution gap exists between the validation sets and the testing set, so decreasing the validation loss does not guarantee improved testing performance. Another possible reason is the randomness in the algorithm, especially in the knowledge reallocation step, which causes fluctuation. Moreover, as the total number of training iterations is fixed, an increasing number of formulated tasks may leave less chance of transferring useful knowledge across the tasks. To further improve the stability of AMTO, a more effective knowledge transfer method needs to be developed.
V Conclusions and Future Work
We proposed a novel DNN training framework based on MTO which can not only enhance training effectiveness but also improve the generalization performance of the trained DNN model via knowledge transfer and sharing. We implemented the proposed framework, parallelized the implementation on a GPU cluster, and applied it to three popular DNN models. Performance evaluation and comparison demonstrated that the DNN models trained via the proposed framework achieved better generalization performance than those trained via the conventional training paradigm. In the future, we plan to explore more ways to formulate the related training tasks. Furthermore, we will perform an in-depth study on how the number of formulated training tasks influences the performance, so as to devise a way to make the best use of multiple related training tasks. Moreover, we plan to further enhance the related-training-tasks-formulation and MTO modules of the proposed framework based on some of our previous works [31, 32, 33].
This work was performed on the OzSTAR national facility at Swinburne University of Technology. The OzSTAR program receives funding in part from the Astronomy National Collaborative Research Infrastructure Strategy (NCRIS) allocation provided by the Australian Government. This work was supported in part by the Australian Research Council (ARC) under Grant No. LP170100416, LP180100114 and DP200102611, the Research Grants Council of the Hong Kong SAR under Project CityU11202418, and the China Scholarship Council (CSC).
-  L. Prechelt, “Early stopping-but when?” in Neural Networks: Tricks of the trade. Springer, 1998, pp. 55–69.
-  ——, “Automatic early stopping using cross validation: quantifying the criteria,” Neural Networks, vol. 11, no. 4, pp. 761–767, 1998.
-  S. Arlot, A. Celisse et al., “A survey of cross-validation procedures for model selection,” Statistics surveys, vol. 4, pp. 40–79, 2010.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  H. Leung and S. Haykin, “The complex backpropagation algorithm,” IEEE Transactions on Signal Processing, vol. 39, no. 9, pp. 2101–2104, 1991.
-  Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” 2011.
-  J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” arXiv preprint arXiv:1511.06390, 2015.
-  M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan et al., “Population based training of neural networks,” arXiv preprint arXiv:1711.09846, 2017.
-  S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in neural information processing systems, 2017, pp. 3856–3866.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  T. Q. Huynh and R. Setiono, “Effective neural network pruning using cross-validation,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 972–977.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
-  R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, pp. 95–133.
-  K. Swersky, J. Snoek, and R. P. Adams, “Multi-task bayesian optimization,” in Advances in neural information processing systems, 2013, pp. 2004–2012.
-  R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, “Collaborative hyperparameter tuning,” in International conference on machine learning, 2013, pp. 199–207.
-  D. Yogatama and G. Mann, “Efficient transfer learning method for automatic hyperparameter tuning,” in Artificial intelligence and statistics, 2014, pp. 1077–1085.
-  A. Gupta, Y.-S. Ong, and L. Feng, “Multifactorial evolution: toward evolutionary multitasking,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 3, pp. 343–357, 2016.
-  L. Feng, L. Zhou, J. Zhong, A. Gupta, Y.-S. Ong, K.-C. Tan, and A. Qin, “Evolutionary multitasking via explicit autoencoding,” IEEE transactions on cybernetics, no. 99, pp. 1–14, 2018.
-  B. Zhang, A. K. Qin, and T. Sellis, “Evolutionary feature subspaces generation for ensemble classification,” in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2018, pp. 577–584.
-  A. Gupta and Y.-S. Ong, “Genetic transfer or population diversification? deciphering the secret ingredients of evolutionary multitask optimization,” in 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2016, pp. 1–7.
-  A. Gupta, Y.-S. Ong, and L. Feng, “Multifactorial evolution: toward evolutionary multitasking,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 3, pp. 343–357, 2015.
-  L. Feng, W. Zhou, L. Zhou, S. Jiang, J. Zhong, B. Da, Z. Zhu, and Y. Wang, “An empirical study of multifactorial pso and multifactorial de,” in 2017 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2017, pp. 921–928.
-  J. Zhong, L. Feng, W. Cai, and Y.-S. Ong, “Multifactorial genetic programming for symbolic regression problems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–14, 2018.
-  Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems. ACM, 2010, pp. 270–279.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3498–3505.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
-  Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2),” in Doklady AN USSR, vol. 269, 1983, pp. 543–547.
-  A. K. Qin and P. N. Suganthan, “Initialization insensitive LVQ algorithm based on cost-function adaptation,” Pattern Recognition, vol. 38, no. 5, pp. 773–776, 2005.
-  M. Gong, Y. Wu, Q. Cai, W. Ma, A. K. Qin, Z. Wang, and L. Jiao, “Discrete particle swarm optimization for high-order graph matching,” Information Sciences, vol. 328, pp. 158–171, 2016.
-  L. Feng, L. Zhou, J. Zhong, A. Gupta, Y.-S. Ong, K.-C. Tan, and A. K. Qin, “Evolutionary multitasking via explicit autoencoding,” IEEE Transactions on Cybernetics, vol. 49, no. 9, pp. 3457–3470, 2019.