The permanent magnet synchronous motor (PMSM) is the preferred choice in many industry applications due to its high power and torque densities along with its high efficiency. In order to exploit the motor’s maximum utilization, high thermal stress on the motor’s potentially failing components must be taken into account when designing the motor or determining its control strategy. Especially in the automotive sector, competitive pressure and high manufacturing costs drive engineers to find more and more ways to reduce the safety margin in embedded materials. Exploiting the motor’s full capabilities requires precise temperature information at runtime, since overheating results in severe motor deterioration. Among the typical components that are sensitive to excessive heat, e.g., stator end windings and bearings, the permanent magnets in the rotor are especially failure-prone. Cooling the rotor is more intricate than cooling the stator, which adds to the risk of the permanent magnets irreversibly demagnetizing due to overheating. While sensor-based measurements would yield fast and accurate knowledge about the machine’s thermal state, assessing the rotor temperature in this manner is usually not yet economically or technically feasible. In particular, direct rotor monitoring techniques such as infrared thermography [3, 4] or classic thermocouples with shaft-mounted slip rings fall short of entering industrial series production.
Consequently, research focuses on estimating rotor temperatures, and those of permanent magnets in particular, on a model basis. Although computational fluid dynamics (CFD) and heat equation finite element analysis (FEA) enjoy a good reputation for their rigorous modeling capacities, their high computational demand rules them out for real-time monitoring. A real-time capable alternative, the lumped-parameter thermal network (LPTN), approximates the heat transfer process with equivalent circuit diagrams. Being partly based on basic formulations of heat transfer theory, LPTNs are computationally lightweight if reduced to a low-order structure and provide good estimation performance. However, LPTNs must forfeit the physical interpretability of their structure and parameter values, because the real-time requirement significantly curtails their degrees of freedom. Moreover, expert domain knowledge is mandatory for the correct choice of both their parameter values and their structural design.
In the last decades, research efforts have also been made that deviate from thermodynamic theory: Typical lightweight approaches from this domain encompass the setup of electric machine models that provide information about temperature-sensitive electrical model parameters indirectly. There are methods that work with current injection or voltage injection to obtain the stator winding resistance or the magnetization level of the magnets, respectively, as thermal indicators at the cost of additional losses. Moreover, fundamental wave flux observers can be used to assess the reversible demagnetization of the embedded magnets. However, these methods suffer from high electric model parameter sensitivity, such that inaccurate modeling (potentially in the range of manufacturing tolerances) leads to excessive estimation errors.
In an effort to combine the advantages from both domains, Gaona et al. fused an LPTN and a flux-based temperature observer with a Kalman filter. They report increased robustness and estimation accuracy for the full motor speed range, and an additional system failure detection feature.
In contrast to these physically-motivated estimation approaches, machine learning (ML) models that detach from any classic fundamental heat theory approximation will be examined empirically on the task of estimating the magnet temperature in a PMSM in this work. No a priori knowledge will be incorporated, and computational demand at runtime (during inference) scales less drastically with model complexity. Model parameters are fitted on observational data only, making domain knowledge less relevant. Leveraging this generalizability, one can easily transfer insights from this work into neighboring fields of interest, e.g. heating in power electronics, batteries’ state-of-health, etc.
A scheme depicting the idea of fitting an ML model on collected test bench data and having it eventually inform an arbitrary controller is shown in Fig. 1. The more accurately the controller is informed of the thermal state, the better it can watch for critical operation and apply power derating.
Although the magnet temperature is the only target value considered in this work, all considered ML models are also trivially applicable to other continuous or discrete valued quantities of interest, such as torque or stator temperatures [15, 16, 17]. Furthermore, incorporating an increasing number of spatially targeted temperature regions poses only minor design overhead.
Certain ML approaches for the task of temperature profile estimation in a PMSM have been studied before: Recurrent neural networks with memory units, in particular long short-term memory (LSTM) or gated recurrent units (GRU), were evaluated on low-dynamic temperature profiles, with hyperparameters optimized via particle swarm optimization (PSO). Temporal convolutional neural networks (TCN) were also applied to high-dynamic data, and a comparison with recurrent architectures was compiled after tuning hyperparameters with Bayesian optimization. Far simpler ML models such as linear regression were shown to be effective as well, as long as the data has been preprocessed with low-pass filters.
This paper extends the related work by examining the broader field of supervised learning for regression tasks. All previous publications fall into the regime of either fast and simple least-squares regression or computation-heavy and sophisticated deep learning, although there is a rich set of tools in between. This gap is characterized by models of intermediate complexity and expressive power, and is systematically evaluated in this work with real-time capability and achievable estimation accuracy in mind.
II Regression Algorithms
The range of ML models can be categorized by, e.g., their modeling capacities, prior assumptions over the data, number of model parameters and their update rules, or simply their runtime in terms of big-$\mathcal{O}$ notation. Representative classes for the regression task are linear models like ordinary least squares and its regularized derivatives; feed-forward neural networks as non-linear function approximators; decision trees that learn splits in the feature co-domain; support vector machines; k-nearest-neighbors; as well as diverse ensembles of those. All contemplated models are briefly outlined in this section.
II-A Ordinary Least Squares
The model family of linear approximators assumes a linear relationship between a multi-dimensional input of $N$ vectors $\mathbf{x}_i \in \mathbb{R}^{P}$, stacked row-wise into $\mathbf{X} \in \mathbb{R}^{N \times P}$, and the real-valued (possibly multi-dimensional) output vector $\mathbf{y}$,
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}, \qquad (1)$$
with $P$ denoting the number of input features, $N$ being the number of observations, and $\boldsymbol{\beta}$ collecting the model coefficients. See Hastie et al. for a comprehensive overview.
Rearranging the minimization of the residual sum of squares gives a closed-form solution for inferring the model coefficients from the data:
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y},$$
which is known as ordinary least squares (OLS). Popular regularized versions of OLS are ridge regression, which adds an $\ell_2$ penalty term to the cost function, and LASSO, which adds an $\ell_1$ penalty. These are not considered here, as they have been shown to be inefficient for this dataset.
II-B Weighted Least Squares
Weighted least squares (WLS) is a variation of OLS where all observations are weighted by a weight matrix $\mathbf{W}$:
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{y}.$$
WLS is often used to account for heteroscedasticity in the data and to increase the robustness of the estimator. In this work, there is another industry-driven incentive to weight observed data: Since the ultimate goal is to avoid overheating, it is of considerably higher interest to estimate high temperatures accurately than lower thermal states. When deviating from the analytical solution of the least squares method to gradient-descent-based optimization, one can also penalize under-estimates more than over-estimates, which is not covered here. WLS will be compared to OLS in Sec. IV-C.
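As a minimal sketch of the two closed-form estimators above, the following NumPy snippet fits OLS and WLS on synthetic data. The data, the helper names `add_intercept`, `fit_ols` and `fit_wls`, and the linear weighting toward high targets are illustrative assumptions and not the setup used in this work.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones so the first coefficient acts as the bias."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_ols(X, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    Xb = add_intercept(X)
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)


def fit_wls(X, y, w):
    """Solve (X^T W X) beta = X^T W y with W = diag(w)."""
    Xb = add_intercept(X)
    XtW = Xb.T * w  # same as Xb.T @ np.diag(w) without building W
    return np.linalg.solve(XtW @ Xb, XtW @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.5, 2.0, -1.0, 0.3])  # intercept first
y = add_intercept(X) @ true_beta + 0.01 * rng.normal(size=200)

beta_ols = fit_ols(X, y)
# Hypothetical weighting: observations with higher targets count more.
w = 1.0 + (y - y.min()) / (y.max() - y.min())
beta_wls = fit_wls(X, y, w)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit matrix inverse, which is usually the numerically preferable route for the same estimator.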
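II-C Epsilon-Support Vector Regression
Although observations are usually projected into higher dimensions through a kernel function with feature map $\phi(\cdot)$, epsilon-support vector regression ($\varepsilon$-SVR) still constructs a regularized linear model with coefficients $\mathbf{w}$ on the new feature space. Here, regularization means penalizing model complexity by minimizing the squared weights. More specifically, the linear cost function is $\varepsilon$-insensitive, i.e., cost is accumulated only if the prediction error exceeds a threshold $\varepsilon$. Those observations with errors beyond $\varepsilon$ are called support vectors in the regression context. In order to allow for support vector deviation from the $\varepsilon$-band, one can include non-negative slack variables $\xi_i$ and $\xi_i^*$ in the minimization formulation over all $N$ observations $(\mathbf{x}_i, y_i)$:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^*} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{N} \left(\xi_i + \xi_i^*\right)$$
$$\text{s.t.} \quad y_i - \mathbf{w}^\top \phi(\mathbf{x}_i) - b \le \varepsilon + \xi_i, \quad \mathbf{w}^\top \phi(\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0,$$
where $C$ balances coefficient regularization and $\varepsilon$-band alleviation. Solving the dual problem gives the following approximate solution:
$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{N_{\mathrm{SV}}} \alpha_i \, k(\mathbf{x}_i, \mathbf{x}) + b,$$
with $k(\mathbf{x}_i, \mathbf{x}) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x})$ being the kernel, $N_{\mathrm{SV}}$ being the number of support vectors, and $\alpha_i$ denoting the weight for support vector $i$.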
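The following sketch illustrates the $\varepsilon$-insensitive behavior with scikit-learn's `SVR` on a synthetic one-dimensional problem. The data and the values of `C` and `epsilon` are illustrative assumptions, not the tuned settings of this work.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, size=(150, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=150)

# C and epsilon are illustrative values, not the tuned ones.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# Only observations on or outside the epsilon-band become support vectors.
n_sv = svr.support_.size
residuals = np.abs(y - svr.predict(X))
```

Since most samples fall inside the band, only a fraction of the training set is stored as support vectors, which is what determines the model size at inference time.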
II-D K-Nearest Neighbors
In the regression context, $k$-nearest neighbors ($k$-NN) is a method where all training observations are stored, while new samples are estimated by taking the mean of the stored points in the vicinity of that new observation. The design parameter $k$ is the number of neighbors to consider when evaluating the mean. Neighborhood is determined by the Euclidean norm, and the mean can optionally be weighted by the distance of each neighbor to the new sample. $k$-NN is a so-called lazy learning algorithm, since all computation is deferred to the inference phase. This counteracts real-time capability but may be an acceptable price for superior accuracy.
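A minimal NumPy sketch of the $k$-NN prediction step described above, with uniform and distance-based weighting. The function name `knn_predict` and the toy data are assumptions made for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_new, k=3, weighting="uniform"):
    """Predict by averaging the targets of the k nearest stored samples."""
    preds = np.empty(len(X_new))
    for n, x in enumerate(X_new):
        dist = np.linalg.norm(X_train - x, axis=1)  # Euclidean norm
        idx = np.argsort(dist)[:k]
        if weighting == "distance":
            w = 1.0 / np.maximum(dist[idx], 1e-12)  # guard against zero distance
            preds[n] = np.sum(w * y_train[idx]) / np.sum(w)
        else:
            preds[n] = y_train[idx].mean()
    return preds


X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0])
X_new = np.array([[1.1]])

uniform = knn_predict(X_train, y_train, X_new, k=2)
weighted = knn_predict(X_train, y_train, X_new, k=2, weighting="distance")
```

Note that every prediction scans the full training set, which is the inference-time cost referred to above.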
II-E Randomized Trees
Two methods based on ensembling of randomly grown decision trees are included: Random forests (RF) and extremely randomized trees (ET). Single decision trees tend to overfit on the data and, therefore, suffer from high variance. This is mitigated by building an ensemble of decision trees that are fit on random subsets of the given observations with replacement (bootstrapping) and random subsets of features. Both measures lead to diverse predictions that are partially decorrelated from each other, such that averaging over them reduces variance significantly at the cost of additional bias. ETs differ from RFs in that, during training, ETs draw random thresholds for each feature of the considered set in order to determine the next split, whereas RFs search for the most discriminative thresholds. This additional random component in ETs further amplifies the effect of variance reduction and bias increase.
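The two ensemble types can be contrasted with scikit-learn's `RandomForestRegressor` and `ExtraTreesRegressor`, as sketched below. The synthetic data and all hyperparameter values are placeholders chosen for illustration, not the optima found in this work.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 4))
y = X[:, 0] ** 2 + np.sin(3.0 * X[:, 1]) + 0.05 * rng.normal(size=400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

# Hyperparameter values are placeholders, not the optima from Tab. II.
shared = dict(n_estimators=100, min_samples_leaf=2, bootstrap=True,
              max_features=0.5, random_state=0)
models = {
    "RF": RandomForestRegressor(**shared),  # searches best split thresholds
    "ET": ExtraTreesRegressor(**shared),    # draws split thresholds at random
}
# score() returns the coefficient of determination on the held-out part.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```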
II-F Neural Network Architectures
Neural networks are known to be universal approximators, with gradual degrees of complexity that can be adapted to the capacities of the application platform. The vanilla form of a neural network is the multi-layer perceptron (MLP), where regressors are non-linearly transformed over several layers and eventually lead to a prediction through (1). The transformed vector $\mathbf{h}^{(l)}$ after layer $l$ is computed from the preceding transformed vector $\mathbf{h}^{(l-1)}$ of layer $l-1$ by
$$\mathbf{h}^{(l)} = g^{(l)}\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)}\right),$$
with $g^{(l)}$ being an activation function at layer $l$ and the entry $w_{jk}^{(l)}$ of $\mathbf{W}^{(l)}$ denoting the trainable weight between the $k$-th and $j$-th neurons of layer $l-1$ and $l$, respectively. Common activation functions are the rectified linear unit (ReLU) [24] or the exponential linear unit (ELU), as they have been shown to converge faster with similar accuracy compared to the original sigmoid or hyperbolic tangent. Despite the highly non-linear structure of MLPs, they are end-to-end differentiable through the backpropagation rule, making them optimizable by stochastic gradient descent (SGD) and its derivatives.
A widely used technique is batch normalization, which normalizes each new batch of training data after every layer and falls back to fixed statistics during inference. Even though this and alternative normalization schemes, such as weight and layer norm [29, 30], gained popularity in recent years, a new type of activation and dropout function was also proposed that circumvents additional layers: self-normalizing NNs (SNNs). Here, the idea is to normalize neuron activations implicitly through scaled ELUs (SELU) and a new variant of dropout.
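The layer-wise transformation of an MLP can be sketched in a few lines of NumPy. The layer sizes, the random weights, and the helper name `mlp_forward` are hypothetical, and bias terms are omitted for brevity; this is a forward pass only, not a training routine.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def elu(z, alpha=1.0):
    # expm1 on the clipped negative part avoids overflow warnings
    return np.where(z > 0.0, z, alpha * np.expm1(np.minimum(z, 0.0)))


def mlp_forward(x, weights, activation=relu):
    """Propagate x through hidden layers, then apply a linear output layer."""
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)
    return weights[-1] @ h  # final linear read-out


rng = np.random.default_rng(0)
# Hypothetical layer sizes: 5 inputs, two hidden layers of 8 units, 1 output.
sizes = [5, 8, 8, 1]
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=5)
y_relu = mlp_forward(x, weights, activation=relu)
y_elu = mlp_forward(x, weights, activation=elu)
```

The number of stored parameters is simply the total number of weight entries, which is why limiting units and layers directly bounds the model size.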
In contrast to the MLP architecture, recurrent topologies with long short-term memory (LSTM) or temporal convolutional neural networks (TCN) can utilize time dependency between neighboring observations in a temperature profile without explicit feature engineering. However, these variants come with many more model parameters and were extensively optimized in previous work, such that their performance is merely reported for comparison with the benchmarks at the end of this paper.
III Black-Box Thermal Modeling
As in previous work, the available data was recorded on a three-phase PMSM mounted on a test bench at a constant sampling frequency. The data aggregates 139 hours of recordings in total, or around one million multi-dimensional samples, which corresponds to roughly two samples per second. Supervised learning requires such data to be measured on enhanced motor test equipment. Coolant, ambient, and magnet temperatures are recorded with standard thermocouples. The rotor temperature information, represented by the permanent magnets’ surface temperature, is transmitted wirelessly over a telemetry unit. Tab. I compiles the considered quantities that represent the input and output of the following ML models. The listed input signals are commonly accessible in real-world traction drive systems; hence, tuned ML models can be plugged into commercial vehicle controls without further sensor upgrades. Fig. 2 depicts a two-dimensional principal component analysis (PCA) representation of the input features colored according to the target’s thermal state. It becomes evident that no trivial relationship between data with high and low temperatures is inferable.
Tab. I: Considered input and target quantities
|Quantity|Role|
|Liquid coolant temperature|Input|
|Actual voltage $d$-axis component|Input|
|Actual voltage $q$-axis component|Input|
|Actual current $d$-axis component|Input|
|Actual current $q$-axis component|Input|
|Electric apparent power|Input|
|Motor speed and current interaction|Input|
|Motor speed and power interaction|Input|
|Permanent magnet temperature|Target|
III-A Data Preprocessing and Feature Engineering
All representations of the data are standardized to zero sample mean and unit sample variance, as exhibited in the training set. The exponentially weighted moving average (EWMA) and standard deviation (EWMS) are taken into account, so that for every time step $t$ the following terms are computed for each input parameter $x$ and appended to the models’ input (see Tab. I):
$$\mu_t = \frac{\sum_{i=0}^{t} w_i \, x_{t-i}}{\sum_{i=0}^{t} w_i}, \qquad \sigma_t = \sqrt{\frac{\sum_{i=0}^{t} w_i \left(x_{t-i} - \mu_t\right)^2}{\sum_{i=0}^{t} w_i}}, \qquad (5)$$
where $w_i = (1-\alpha)^i$ with $\alpha = 2/(s+1)$ and $s$ being the span that is to be chosen. Multiple values for the span can be applied, leading to different frequency-filtered versions of the raw time-series data. The weights of the $s$ preceding observations describe around 86.5 % of all weights’ total sum. Consecutive calculations of the EWMA can be derived from a more computationally efficient form of (5),
$$\mu_t = (1-\alpha)\,\mu_{t-1} + \alpha\, x_t, \qquad (6)$$
which agrees with (5) once the initial transient has decayed and is highly relevant especially for automotive applications, where embedded systems run on cost-optimized hardware. A computationally efficient form of the EWMS exists likewise.
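The recursive feature computation can be sketched as follows. The span value and the step-shaped test signal are hypothetical, and the variance recursion shown is one common incremental variant that is assumed here for illustration; it is not necessarily the exact form used in this work.

```python
import numpy as np


def ewma_ewms(x, span):
    """Recursive exponentially weighted mean and standard deviation."""
    alpha = 2.0 / (span + 1.0)
    mean = np.empty(len(x))
    var = np.zeros(len(x))
    mean[0] = x[0]
    for t in range(1, len(x)):
        diff = x[t] - mean[t - 1]
        mean[t] = mean[t - 1] + alpha * diff  # (1 - alpha) * mean + alpha * x
        var[t] = (1.0 - alpha) * (var[t - 1] + alpha * diff ** 2)
    return mean, np.sqrt(var)


span = 100  # hypothetical span value
alpha = 2.0 / (span + 1.0)
# Share of the total weight carried by the `span` most recent observations.
weight_share = 1.0 - (1.0 - alpha) ** span

x = np.concatenate([np.zeros(50), np.ones(2000)])  # step input
mean, std = ewma_ewms(x, span)
```

Only the previous mean and variance need to be kept in memory per signal and span, which is what makes the recursive form attractive on embedded hardware.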
Incorporating these additional features is indispensable for the ML approaches considered in this work, as they all assume independent and identically distributed data. This assumption conflicts with the thermal behavior of a PMSM, which is a dynamic system. Nonetheless, adding trend information to the input space helps to overcome this discrepancy and to find approximating functions with sufficient accuracy.
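III-B Analogy to LPTN RC Circuits
Examining LPTNs from previous work that are well-defined for the temperature estimation task reveals that these are characterized by low-pass filters or, equivalently, RC circuits smoothing the raw input data. From signal theory, it is known that RC circuits are infinite impulse response (IIR) filters of the form
$$RC \, \frac{\mathrm{d}y(t)}{\mathrm{d}t} + y(t) = x(t), \qquad (7)$$
with input $x$ and output $y$, which can be discretized to
$$RC \, \frac{y_t - y_{t-1}}{\Delta t} + y_t = x_t, \qquad (8)$$
with $\Delta t$ being the step size. Rearranging (8) gives
$$y_t = \frac{RC}{RC + \Delta t}\, y_{t-1} + \frac{\Delta t}{RC + \Delta t}\, x_t, \qquad (9)$$
which resembles (6) with $\alpha = \Delta t / (RC + \Delta t)$.
Consequently, it is reasonable to directly apply EWMAs on the sensor time-series recordings in order to obtain regressors exhibiting patterns similar to those in LPTNs. This observation was empirically confirmed in previous work.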
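The analogy can be checked numerically: a backward-Euler discretized RC low-pass and a recursive EWMA with $\alpha = \Delta t / (RC + \Delta t)$ produce the same output. The time constant, step size, and random input below are hypothetical values chosen for illustration.

```python
import numpy as np


def rc_lowpass(x, rc, dt):
    """Backward-Euler discretization of RC * dy/dt + y = x."""
    y = np.empty(len(x))
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = (dt * x[t] + rc * y[t - 1]) / (rc + dt)
    return y


def ewma(x, alpha):
    """Recursive exponentially weighted moving average."""
    m = np.empty(len(x))
    m[0] = x[0]
    for t in range(1, len(x)):
        m[t] = (1.0 - alpha) * m[t - 1] + alpha * x[t]
    return m


rc, dt = 120.0, 0.5  # hypothetical time constant and step size in seconds
alpha = dt / (rc + dt)
span = 2.0 / alpha - 1.0  # span that yields this alpha

x = np.random.default_rng(0).normal(size=1000)
max_gap = np.max(np.abs(rc_lowpass(x, rc, dt) - ewma(x, alpha)))
```

Under these assumptions, each span value can be read as a thermal time constant of roughly $RC = \Delta t \,(s - 1)/2$.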
Generalization error is reported by evaluating the prediction error on a test set of seven hours unseen during training. In the case of MLPs, a further portion of the training set is withheld from training and acts as a validation set. That portion is used to apply early stopping, i.e., mitigating overfitting by stopping training once the cost function on this set has not improved by a certain delta for a given number of iterations (epochs).
Scores are reported in terms of the mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination ($R^2$) between predicted sequence and ground truth. Moreover, the maximum deviation occurring in the test set ($\ell_\infty$ norm) is an important indicator for the quality of temperature estimations.
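The four scores can be computed as in the following sketch. The function name `regression_scores` and the example temperatures are made up for illustration.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Return MSE, MAE, R^2, and the maximum absolute deviation."""
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    return {
        "mse": mse,
        "mae": np.mean(np.abs(err)),
        "r2": 1.0 - mse / np.var(y_true),
        "linf": np.max(np.abs(err)),
    }


y_true = np.array([40.0, 60.0, 80.0, 100.0])  # made-up temperatures
y_pred = np.array([42.0, 59.0, 80.0, 96.0])
scores = regression_scores(y_true, y_pred)
```

In this toy example a single miss of 4 degrees dominates the maximum deviation while barely affecting $R^2$, which is why both kinds of score are worth reporting.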
III-D Cross-Validation for Hyperparameter Tuning
For a fair comparison of the different models, all design parameters or hyperparameters are tuned systematically with the same approach: Bayesian optimization. During this sequential optimization technique, a surrogate model (here, a Gaussian process) is trained to find a mapping from the hyperparameter space to the cross-validation error. The surrogate model’s capability to yield uncertainty estimates in the hyperparameter space can be used to trade off exploration against exploitation when determining the next hyperparameter set to evaluate.
More specifically, the chosen objective is the average MSE over all folds during stratified, grouped three-fold cross-validation (CV). Here, stratification refers to distributing recording sessions with similar maximum temperatures evenly over the three folds. Such homogeneous folds reduce the validation error, following the heuristic of avoiding outlier samples concentrated in just a few folds. Grouping denotes that no samples from the same measurement session appear in more than one fold, which prevents an overly optimistic error estimate. In addition, feature value normalization was conducted for every fold with respect to the observations in the respective training folds. Note that the test set from Sec. III-C is not part of this CV strategy and thus does not leak into the optimization of hyperparameters.
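One simple way to obtain stratified, grouped folds is to sort the recording sessions by their peak temperature and deal them round-robin into the folds. The sketch below assumes this strategy and uses synthetic session labels; it is not necessarily the exact procedure applied in this work.

```python
import numpy as np


def stratified_group_folds(session_ids, temps, n_folds=3):
    """Assign whole sessions to folds, balancing their peak temperatures."""
    sessions = np.unique(session_ids)
    peaks = np.array([temps[session_ids == s].max() for s in sessions])
    order = sessions[np.argsort(peaks)]  # coolest to hottest session
    # Dealing sorted sessions round-robin spreads similar peaks over all folds.
    fold_of_session = {s: i % n_folds for i, s in enumerate(order)}
    return np.array([fold_of_session[s] for s in session_ids])


rng = np.random.default_rng(0)
session_ids = np.repeat(np.arange(9), 20)  # 9 sessions, 20 samples each
temps = rng.uniform(30, 60, size=180) + 5.0 * session_ids

folds = stratified_group_folds(session_ids, temps)
for k in range(3):
    train_mask, val_mask = folds != k, folds == k
    # Normalize with training statistics only, then fit and score a model here.
    mu, sigma = temps[train_mask].mean(), temps[train_mask].std()
```

The average validation error over the three folds would then serve as the objective value returned to the hyperparameter optimizer.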
III-E Hyperparameters and Intervals
Tab. II: Hyperparameter search intervals and found optima
|Hyperparameter|Interval|Optimum|
|weighting|uniform or distance|distance|
|estimators||93 / 600|
|max. depth||60 / 53|
|min. samples for split||15 / 20|
|min. samples per leaf||2 / 7|
|bootstrap|yes or no|yes / yes|
|activation|SELU or ReLU|ReLU|
The specific choice of four span values was always part of the optimization, resulting in an independent span set for each model. The number of span values could also be made a hyperparameter, but was deliberately set to four in order to maintain comparability with previous work [16, 17], where it was found to balance modeling accuracy with computational demand. Model-specific hyperparameters are compiled in Tab. II. All hyperparameter interval bounds are chosen manually to a range where modeling performance is likely to converge while constraining runtime.
RFs and ETs come with the same hyperparameters: The number of trees denotes the ensemble size, while the maximum tree depth constrains growth. A higher minimum number of samples for a split helps make splits more robust, and increasing the minimum number of samples per tree leaf can mitigate overfitting.
The MLP model family also offers a wide variety of hyperparameters: The low upper bounds for the number of units and layers curtail the number of model parameters and, thus, the modeling flexibility of each MLP, yet it has been shown in previous work that smaller neural networks also reach satisfactory prediction accuracy.
The particular optimization algorithm was one of Adam, Adamax, Nesterov Adam (NAdam), rectified Adam (RAdam), RMSprop, or vanilla SGD [37, 38, 39]. During training, the learning rate of the optimizer is divided by two once the training set loss has not improved for a fixed number of consecutive epochs. Reducing the learning rate on such loss plateaus is a common heuristic in MLP training for improving convergence and local minima exploration. In addition to this learning rate decay schedule, the mini-batch size is doubled twice at fixed epoch intervals, such that each training run lasts a fixed number of epochs if no early stopping applies. Increasing the batch size has been shown to benefit training convergence similarly to learning rate decay.
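The combined schedule of learning rate halving on plateaus and batch size doubling can be sketched as a small generator. All numeric values (initial learning rate, patience, batch sizes, doubling interval) and the loss trace are hypothetical, since they only serve to illustrate the mechanism.

```python
def training_schedule(losses, lr=1e-3, patience=3, batch_size=32,
                      double_every=4):
    """Yield (epoch, lr, batch_size) for a given sequence of epoch losses."""
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if epoch > 0 and epoch % double_every == 0 and epoch <= 2 * double_every:
            batch_size *= 2  # doubled twice over the whole run
        yield epoch, lr, batch_size
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale == patience:
                lr, stale = lr / 2.0, 0  # halve on a loss plateau


# Made-up loss trace: improves, then stalls for three epochs.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.5, 0.5]
schedule = list(training_schedule(losses))
```

In a real training loop the loss of each epoch would be fed back after the epoch finishes, and the yielded settings would configure the optimizer and data loader for the next one.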
Hyperparameters of SVR and $k$-NN are described in Sec. II.
IV Experimental Results
For all hyperparameter optimizations, the scikit-optimize framework is utilized. The acquisition function is one of upper confidence bound, expected improvement, or probability of improvement, each of which proposes new candidate points independently; the most promising proposal serves as the next evaluation point in every iteration. Each model’s hyperparameter space was searched for at least 100 iterations with 30 initial random selections.
The radial basis function (RBF) kernel is used for the SVR.
The found optima are also listed in Tab. II.
IV-A Model Performance
An overview of the individual predictive performance of each model with optimized hyperparameters is compiled in Tab. III. For models with stochastic optimization, the best experiment out of multiple repetitions with different random number generator seeds is reported in order to alleviate the scatter of the training process. Besides the usual performance metrics, the model size is also highlighted. This quantity represents the number of parameters that must be stored for each model in order to make a prediction on a new observation after training. Note, however, that these numbers do not include the memory that must be reserved for saving the moving averages of the sensor data.
Black-box CNNs show the overall best performance in terms of the MSE, while grey-box LPTNs have the lowest $\ell_\infty$ norm and merely 46 parameters, which is substantially fewer than for the other models. Only OLS lies between those two extremes, with a strong MSE, a low maximum deviation, and only 109 model parameters. This should make OLS the preferable approach among machine learning models in case there are no resources to hand-design an LPTN. Though deep learning models are the most precise, a model size of over a thousand parameters might be difficult to justify given the marginal increase in accuracy, especially in automotive systems. The test set predictions of OLS, ET and MLP are shown in Fig. 3. They can be utilized as intermediate approaches to intelligent temperature estimation until automotive hardware becomes powerful enough for deep learning.
It becomes evident that $k$-NN, RF and SVR could not find a sufficient function approximation, even though their hyperparameter optimization sought to maximize modeling capacity by increasing the model size up to the upper bound of the hyperparameter intervals.
IV-B Learning Curves
Besides the total test error of each model, scalability with more training data is often of equal interest. Fig. 4 illustrates the learning curves of all models. The test set is constant and the same as in the previous experiments, while the training set is increased successively.
It can be seen that all models plateau at half the training set size except for SVR, whose performance seems to diminish. An explanation might be a limited modeling capacity of the SVR, which struggles to map all operating points observed in the data.
In summary, the better-performing algorithms (OLS, MLP, and ET) achieve high performance already with smaller training sets and show no significant performance gains for more data. This naturally suggests their use in applications where only a limited amount of data can be collected.
IV-C Error Residuals
In the following, the error residuals along the value range of the ground truth permanent magnet temperature are examined. In terms of an industrial application, robust and accurate estimation of high temperatures is of significantly higher interest than that of low temperatures. This is due to the purpose of avoiding overheating and the material destruction implied by it. There are several ways to address this more specialized use case:
- subsample data such that more high temperatures occur in the data,
- adjust the cost function to penalize under-estimates, and increase costs for deviations at high temperatures.
While the first point concerns the method of data collection in general and should be studied on its own, the latter point is trivially implemented for most ML models. One example is WLS, whose performance and corresponding error residuals are compared with those of OLS in Fig. 5. All observations are linearly weighted according to their closeness to the overall minimum and maximum PM temperature occurring in the data set, with the lowest weight at the minimum and the highest weight at the maximum. The effect is subtle, but it can be seen that for WLS, outlying predictions are often located in lower regions of the temperature value range, while higher temperatures come with less variance. In order to evaluate the advantage of WLS over OLS on a significant scale, additional load profiles need to be recorded that expose more temperature variance. Specifically, the weight decreases linearly from its maximum at the maximum allowed temperature toward lower temperatures, and under-estimates are additionally weighted by a constant factor.
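The weighting scheme can be sketched as follows. The weight range, the under-estimate factor, and the function names `temperature_weights` and `asymmetric_weighted_mse` are hypothetical values and names chosen only for illustration.

```python
import numpy as np


def temperature_weights(y, w_min=1.0, w_max=2.0):
    """Linearly scale sample weights between the coldest and hottest target."""
    return w_min + (w_max - w_min) * (y - y.min()) / (y.max() - y.min())


def asymmetric_weighted_mse(y_true, y_pred, weights, under_factor=2.0):
    """Weighted MSE that additionally penalizes under-estimates."""
    err = y_pred - y_true
    penalty = np.where(err < 0.0, under_factor, 1.0)
    return np.mean(weights * penalty * err ** 2)


y_true = np.array([40.0, 70.0, 100.0])  # made-up magnet temperatures
w = temperature_weights(y_true)
cost_under = asymmetric_weighted_mse(y_true, y_true - 2.0, w)
cost_over = asymmetric_weighted_mse(y_true, y_true + 2.0, w)
```

The sample weights alone can be passed to a closed-form WLS fit, whereas the asymmetric penalty depends on the sign of the error and therefore requires an iterative, gradient-based optimizer.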
V Conclusion and Outlook
It has been shown that if rich datasets are recorded at a test bench or in production, which is likely in the automotive industry, then engineers can rely on them to build temperature estimators. The use of special domain expertise and motor sheet specifications is circumvented, while monitoring important component temperatures inside a PMSM remains real-time capable for certain classical supervised learning algorithms. Through autonomously conducted hyperparameter searches, it was possible to demonstrate that classical supervised learning algorithms achieve state-of-the-art estimation accuracy also during highly dynamic drive cycles. Ordinary least squares stands out with one of the best accuracies and by far the lowest number of model parameters among ML methods, making it the first choice for this particular application once optimal moving average factors have been found during feature engineering. In the long run, however, deep neural networks are expected to be prevalent in the temperature estimation domain, due to their excellent scalability, the rising availability of measurement data, and increasing computing capabilities finding their way into series production for the sake of (hybrid) electric vehicles.
It is still an open question how well supervised learning algorithms generalize across different motors from the same manufacturer or even among different manufacturers. This can be answered only with a dataset exhibiting this diversity, which is yet to be recorded. Moreover, incorporating domain knowledge at a smaller scale is a promising option. Finally, assessing the uncertainty of predictions in order to enable probabilistic estimations could increase the trust in data-driven temperature estimators.
-  Z. Q. Zhu and D. Howe. Electrical Machines and Drives for Electric, Hybrid, and Fuel Cell Vehicles. Proceedings of the IEEE, 95(4):746–765, apr 2007.
-  D. Huger and D. Gerling. The Effects of Thermal Cycling on Aging of Neodymium-Iron-Boron Magnets. In 11th International Conference on Power Electronics and Drive Systems, pages 389–392, 2015.
-  M. Ganchev, B. Kubicek, and H. Kappeler. Rotor Temperature Monitoring System. The XIX International Conference on Electrical Machines, pages 1–5, 2010.
-  S. Stipetic, M. Kovacic, Z. Hanic, and M. Vrazic. Measurement of Excitation Winding Temperature on Synchronous Generator in Rotation Using Infrared Thermography. IEEE Transactions on Industrial Electronics, 59(5):2288–2298, 2012.
-  C. Mejuto, M. Mueller, M. Shanel, A. Mebarki, M. Reekie, and D. Staton. Improved Synchronous Machine Thermal Modelling. In 8th International Conference on Electrical Machines, pages 1–6, 2008.
-  A. Boglietti, A. Cavagnino, D. Staton, M. Shanel, M. Mueller, and C. Mejuto. Evolution and Modern Approaches for Thermal Analysis of Electrical Machines. IEEE Transactions on Industrial Electronics, 56(3):871–882, 2009.
-  O. Wallscheid and J. Böcker. Global Identification of a Low-Order Lumped-Parameter Thermal Network for Permanent Magnet Synchronous Motors. IEEE Transactions on Energy Conversion, 31(1):354–365, 2016.
-  O. Wallscheid and J. Böcker. Design and Identification of a Lumped-Parameter Thermal Network for Permanent Magnet Synchronous Motors Based on Heat Transfer Theory and Particle Swarm Optimisation. In 17th European Conference on Power Electronics and Applications, pages 1–10, 2015.
-  S. D. Wilson, P. Stewart, and B. P. Taylor. Methods of Resistance Estimation in Permanent Magnet Synchronous Motors for Real-Time Thermal Management. IEEE Transactions on Energy Conversion, 25(3):698–707, 2010.
-  M. Ganchev, C. Kral, H. Oberguggenberger, and T. Wolbank. Sensorless Rotor Temperature Estimation of Permanent Magnet Synchronous Motor. In Industrial Electronics Conference Proceedings, 2011.
-  O. Wallscheid, A. Specht, and J. Böcker. Observing the Permanent-Magnet Temperature of Synchronous Motors Based on Electrical Fundamental Wave Model Quantities. IEEE Transactions on Industrial Electronics, 64(5):3921–3929, 2017.
-  O. Wallscheid, T. Huber, W. Peters, and J. Böcker. Real-Time Capable Methods to Determine the Magnet Temperature of Permanent Magnet Synchronous Motors - A Review. In 40th Annual Conference of the IEEE Industrial Electronics Society, pages 811–818, 2014.
-  D. Gaona, O. Wallscheid, and J. Böcker. Improved Fusion of Permanent Magnet Temperature Estimation Techniques for Synchronous Motors Using a Kalman Filter. IEEE Transactions on Industrial Electronics, 67(3):1708–1717, 2020.
-  O. Wallscheid and J. Böcker. Derating of automotive drive systems using model predictive control. In IEEE International Symposium on Predictive Control of Electrical Drives and Power Electronics (PRECEDE), pages 31–36, 2017.
-  O. Wallscheid, W. Kirchgässner, and J. Böcker. Investigation of Long Short-Term Memory Networks to Temperature Prediction for Permanent Magnet Synchronous Motors. In Proceedings of the International Joint Conference on Neural Networks, pages 1940–1947, 2017.
-  W. Kirchgässner, O. Wallscheid, and J. Böcker. Deep Residual Convolutional and Recurrent Neural Networks for Temperature Estimation in Permanent Magnet Synchronous Motors. In IEEE International Electric Machines Drives Conference, pages 1439–1446, 2019.
-  W. Kirchgässner, O. Wallscheid, and J. Böcker. Empirical Evaluation of Exponentially Weighted Moving Averages for Simple Linear Thermal Modeling of Permanent Magnet Synchronous Machines. In Proceedings of the 28th International Symposium on Industrial Electronics, pages 318–323, 2019.
-  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, 2nd edition, 2009.
-  A. E. Hoerl and R. W. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1):55–67, 1970.
-  R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
-  L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
-  P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, apr 2006.
-  K. Hornik, M. Stinchcombe, and H. White. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5):359–366, 1989.
-  V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML 2010 - Proceedings, 27th International Conference on Machine Learning, 2010.
-  D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). ArXiv e-prints, abs/1511.0, 2015.
-  A. Krogh and J. A. Hertz. A Simple Weight Decay Can Improve Generalization. Advances in Neural Information Processing Systems, 4:950–957, 1992.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
-  S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv e-prints, abs/1502.0, 2015.
-  T. Salimans and D. P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. ArXiv e-prints, abs/1602.0, 2016.
-  J. Lei Ba, J. R. Kiros, and G. E. Hinton. Layer Normalization. ArXiv e-prints, 1607.06450, 2016.
-  E. Hoffer, R. Banner, I. Golan, and D. Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.
-  G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. ArXiv e-prints, abs/1706.0, 2017.
-  F. A. Gers and F. Cummins. Learning to Forget: Continual Prediction with LSTM. In Ninth International Conference on Artificial Neural Networks, volume 2, pages 1–19, 1999.
-  S. Bai, J. Z. Kolter, and V. Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. ArXiv e-prints, abs/1803.0, 2018.
-  I. A. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
-  B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the Human out of the Loop: A Review of Bayesian Optimization. In Proceedings of the IEEE, volume 104, pages 148–175, 2016.
-  D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, abs/1412.6:1–9, 2014.
-  T. Dozat. Incorporating Nesterov Momentum into Adam. ICLR Workshop, 2016.
-  L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the Variance of the Adaptive Learning Rate and Beyond. arXiv e-prints, aug 2019.
-  S. L. Smith, P. Kindermans, and Q. V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. ArXiv e-prints, abs/1711.0, 2017.
-  T. Head and Others. Scikit-Optimize. https://scikit-optimize.github.io/, 2018.
-  F. Chollet and Others. Keras. https://keras.io, 2015.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines. Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
-  E. Gedlu, O. Wallscheid, and J. Böcker. Permanent Magnet Synchronous Machine Temperature Estimation using Low-Order Lumped-Parameter Thermal Network with Extended Iron Loss Model. In Presented at the Int. Conf. on Power Electronics, Machines and Drives, 2020.