Artificial Intelligence (AI) has long been a hot spot, and recent breakthroughs have significantly raised interest in AI in the past few years. For instance, Brenden Lake , in 2015, designed a machine that learns to write handwritten characters, and it passed the Turing test. This work proved that machines can act like humans in handwriting tasks. Another important breakthrough is AlphaGo by Google DeepMind. David Silver , in 2016, published the first paper on AlphaGo and claimed that a computer could defeat professional players of the game of Go for the first time; this paper was actually published after AlphaGo defeated the world champion Lee Se-dol. The success of AlphaGo further proved that machines can not only behave like humans, but can also be smarter than humans. The success of these works brings unprecedented confidence to AI.
The goal of AI is to build machines that have intelligence like humans, and mathematical models and algorithms are essential to achieving this goal. In the works mentioned above, a Bayesian framework was used by Brenden Lake for the handwriting tasks, while AlphaGo is built on neural networks and Monte Carlo tree search. However, these works share another issue: they all need large scale training data to obtain the optimal models. It is also very interesting to see that Google has developed another machine for playing the game of Go, called AlphaGo Zero. This new machine was trained without any game records; only the rules of the game of Go were taught. AlphaGo Zero successfully defeated the champion-beating AlphaGo  by 100-0. In our opinion, this work also proved that it is possible to train machines using extremely small samples, and it reminds us that Big Data should not be the only way to efficient AI.
Actually, in human learning we often encounter cases with very small data. For example, a little child may master the basic tricks of the game of Go within a few lessons. Technically, the main methodologies employed to build AlphaGo are Deep Learning and Reinforcement Learning, which are all based on the idea of taking advantage of human knowledge as much as possible. The variations of the game of Go are practically infinite for computers. Thus, it is impossible to train a smart AlphaGo without methods for efficient learning from small samples.
The grey system theory is primarily developed for small samples. The grey models often have a deterministic structure with free parameters, which are estimated from the samples. From the viewpoint of systems science, the deterministic structure of the grey models is essentially the known part of the system, with the free parameters as the unknown part. In our opinion, such a pattern is quite limited, as in real world applications it is very difficult to discover the full structure of existing systems; in most cases, we can only find out some parts of the systems in a short time or at a low cost. Thus it is natural to ask the question: how can we simulate systems with partially known structure?
Based on this question, we have carried out a series of works to combine the grey modelling techniques with machine learning methods. The main problems considered in these works are systems with a known dynamical structure and an unknown nonlinear relationship between the output and input series. The previous results on real world applications all indicate that the models built on this idea are significantly more effective than the conventional grey models.
In this paper, we give the general formulation of this framework for the first time, and we name it Grey Machine Learning (GML). In order to better express the essence of this new framework, we present the computational details in general formulations; and for easier understanding and more effective applications, we have tried our best to omit deep mathematical theorems and concepts.
The rest of this paper is organized as follows: in order to better explain the main idea of the GML, we first summarize the general forms of the conventional first order grey models in Section 2; the general formulation, including the computational details, is presented in Section 3; short discussions on the main findings and open issues are given in Section 4; and the perspectives are illustrated in Section 5.
2 The typical formulation of the conventional grey models
Most of the existing grey models with a continuous whitening equation share a general linear formulation:

$$\frac{\mathrm{d}y^{(1)}(t)}{\mathrm{d}t} + a y^{(1)}(t) = f(t; \boldsymbol{\theta}), \quad (1)$$

where the series $y^{(1)}$ is often called the feature series (or output series). The function $f$ often varies with time $t$ or with the relevant series (or input series) $\boldsymbol{u}^{(1)}$, and carries the unknown parameters $\boldsymbol{\theta}$.
For the discrete grey models, the general linear formulation can be written as

$$y^{(1)}(k) = \beta y^{(1)}(k-1) + f(k; \boldsymbol{\theta}), \quad (2)$$

where the notations have the same meanings as in the continuous formulation (1).
Most of the existing grey models can be written in the above formulations, and some popular ones with 1-AGO are listed in Tables 1 and 2. Moreover, many grey models without 1-AGO also follow such formulations. The grey model with fractional order accumulation by Wu et al.  is one of the most popular grey models without 1-AGO, and this model still satisfies the formulation if we change the order of accumulation to a fractional order $r$. Similar cases can also be found in other references on fractional grey models [5, 6, 7]. It is also interesting to see that some nonlinear grey models can be transformed into such formulations, e.g. the nonlinear Bernoulli grey models .
|Model|Whitening equation|
|NGM(1, 1, $k$)|$\mathrm{d}y^{(1)}(t)/\mathrm{d}t + a y^{(1)}(t) = kt$|
|NGM(1, 1, $k$, $c$)|$\mathrm{d}y^{(1)}(t)/\mathrm{d}t + a y^{(1)}(t) = kt + c$|
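To make the linear pattern above concrete, the classical GM(1, 1) can be sketched in a few lines (a Python sketch for illustration; the function name is ours, and the use of 1-AGO, trapezoid background values, and least squares follows the standard recipe, not any particular package):

```python
import numpy as np

def gm11_predict(x0, steps=0):
    """Minimal GM(1,1): 1-AGO, least-squares for (a, b), restore by differencing."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                          # 1-AGO series
    z1 = 0.5 * (x1[1:] + x1[:-1])               # background values
    B = np.column_stack([-z1, np.ones(len(z1))])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]   # grey parameters
    k = np.arange(len(x0) + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a  # solution of the whitening equation
    return np.concatenate([[x1_hat[0]], np.diff(x1_hat)])  # inverse 1-AGO
```

For near-exponential series the fitted values track the data closely, which is exactly the regime that the linear structure encodes.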
It is obvious that the solutions of the grey models in the above formulations also share similar formulations. As is well known, most grey models use the initial condition $\hat{y}^{(1)}(1) = y^{(1)}(1)$, with which we can easily obtain the general solution of such grey models.
For the continuous models, the solution can always be presented as a function with a convolution:

$$\hat{y}^{(1)}(t) = y^{(1)}(1)\, e^{-a(t-1)} + \int_{1}^{t} e^{-a(t-\tau)} f(\tau; \boldsymbol{\theta}) \, \mathrm{d}\tau. \quad (3)$$
For the discrete models, the solution can always be presented as a function with a discrete convolution:

$$\hat{y}^{(1)}(k) = \beta^{\,k-1} y^{(1)}(1) + \sum_{j=2}^{k} \beta^{\,k-j} f(j; \boldsymbol{\theta}). \quad (4)$$
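The discrete convolution form can be checked numerically against the underlying recursion $y(k) = \beta y(k-1) + f(k)$; a small sketch with hypothetical values of $\beta$ and of the forcing term:

```python
import numpy as np

beta, y1, n = 0.8, 1.0, 6
f = {k: 0.3 * k for k in range(2, n + 1)}      # hypothetical forcing values f(k)

# recursion: y(k) = beta * y(k-1) + f(k), starting from y(1)
y = [y1]
for k in range(2, n + 1):
    y.append(beta * y[-1] + f[k])

# closed form with a discrete convolution:
# y(k) = beta^(k-1) * y(1) + sum_{j=2}^{k} beta^(k-j) * f(j)
y_conv = [beta ** (k - 1) * y1 + sum(beta ** (k - j) * f[j] for j in range(2, k + 1))
          for k in range(1, n + 1)]

assert np.allclose(y, y_conv)
```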
As shown above, the main structure is often known in a given grey model, while the free parameters $a$ (or $\beta$) and $\boldsymbol{\theta}$ are actually the unknown part of the grey models. The structure of $f$ often plays a very important role in improving the grey models: it is well known that the nonhomogeneous grey model NGM often performs much better than the GM(1, 1) model, and similar examples can be widely seen in the existing studies. But we cannot always find an optimal formulation of the function $f$ in real world applications, and this problem leads to our motivation of introducing Machine Learning into the grey models.
3 The Grey Machine Learning
3.1 The general formulation
The general formulation of the GML models shares the following equations.

The continuous form:

$$\frac{\mathrm{d}y^{(1)}(t)}{\mathrm{d}t} + a y^{(1)}(t) = f\big(\boldsymbol{u}^{(1)}(t)\big). \quad (5)$$

The discrete form:

$$y^{(1)}(k) = \beta y^{(1)}(k-1) + f\big(\boldsymbol{u}^{(1)}(k)\big). \quad (6)$$

Here the function $f$ is totally unknown, and this is the main difference between the GML models and the conventional grey models.
3.2 Estimation of the unknown function
Now we need to estimate the unknown function $f$ in the general formulations of the GML models. The kernel method has mainly been used for this in our previous works; in this subsection we discuss it in a more concise way.
3.2.1 Linear representation in a higher dimensional feature space
According to the Weierstrass approximation theorem, any continuous function on a closed and bounded interval can be uniformly approximated by polynomials to any degree of accuracy. Less formally, we can write the approximation of any such function in the following form:

$$f(u) \approx \sum_{i=1}^{D} w_i \varphi_i(u). \quad (7)$$
Let the mapping be $\varphi(u) = \big(\varphi_1(u), \varphi_2(u), \ldots, \varphi_D(u)\big)^{T}$, and the weights be $\boldsymbol{w} = (w_1, w_2, \ldots, w_D)^{T}$. Then the nonlinear function can be written in a linear form in the higher dimensional feature space as

$$f(u) = \boldsymbol{w}^{T} \varphi(u). \quad (8)$$
Notice that this formulation still holds even when $u$ is a vector, and the dimension $D$ of the feature space can be infinite.
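As a quick numerical illustration of this linear-in-features idea (a sketch of ours, not from the formulation above), a cubic polynomial feature map already approximates $e^{u}$ on $[0, 1]$ to within roughly $10^{-3}$ by ordinary least squares:

```python
import numpy as np

u = np.linspace(0.0, 1.0, 50)
Phi = np.vander(u, 4, increasing=True)          # feature map phi(u) = (1, u, u^2, u^3)
w, *_ = np.linalg.lstsq(Phi, np.exp(u), rcond=None)
max_err = np.max(np.abs(Phi @ w - np.exp(u)))   # uniform error of the linear-in-features fit
```

Raising the degree shrinks the error further, which is exactly the content of the Weierstrass theorem.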
3.2.2 The nonparametric estimation
With the discussion above, we can transform the estimation of an arbitrary nonlinear function into a linear problem. Given the samples $\{(u_k, y_k)\}_{k=1}^{n}$, we want the estimation in the following formulation:

$$\hat{f}(u) = \boldsymbol{w}^{T} \varphi(u) + b. \quad (9)$$
One defines the regularization problem, which also takes the form of ridge regression, as follows:

$$\min_{\boldsymbol{w}, b, e} \; \frac{1}{2}\|\boldsymbol{w}\|^{2} + \frac{\gamma}{2} \sum_{k=1}^{n} e_k^{2}, \quad \text{s.t.} \;\; y_k = \boldsymbol{w}^{T}\varphi(u_k) + b + e_k, \;\; k = 1, \ldots, n, \quad (10)$$
where $\|\cdot\|$ is the 2-norm. The main difference between the regularization problem (10) and the commonly used least squares method is that $\|\boldsymbol{w}\|^{2}$ is also to be minimized. Actually, the available nonlinear mapping is not unique: e.g. if the nonlinear function is differentiable to higher orders, it can be expanded as a Taylor power series, and there are numerous formulations if it is expanded at different points. Thus, mathematically, the ridge regression term is used to ensure that the solution is unique. Also, since the mathematical expression of the function is not fixed, the regularization problem is categorized among the nonparametric estimation formulations. On the other hand, we often want the estimation to be flat enough to generalize better, or to be more stable; to this end the term $\|\boldsymbol{w}\|^{2}$ should be used.
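In a finite feature space, the spirit of problem (10) is plain ridge regression, whose closed-form solution is $(\Phi^{T}\Phi + \lambda I)^{-1}\Phi^{T}\boldsymbol{y}$ with $\lambda$ playing the role of $1/\gamma$; a minimal sketch on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))                  # stacked feature vectors phi(u_k)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = Phi @ w_true + 0.01 * rng.normal(size=30)   # noisy observations

lam = 1e-3                                      # small penalty, plays the role of 1/gamma
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)
```

The penalty keeps the solution unique and stable even when $\Phi^{T}\Phi$ is ill-conditioned, which is the role the term $\|\boldsymbol{w}\|^{2}$ plays above.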
The regularization problem (10) is essentially a constrained quadratic programming problem, thus we first define the Lagrangian as

$$L(\boldsymbol{w}, b, e; \boldsymbol{\alpha}) = \frac{1}{2}\|\boldsymbol{w}\|^{2} + \frac{\gamma}{2}\sum_{k=1}^{n} e_k^{2} - \sum_{k=1}^{n} \alpha_k \left( \boldsymbol{w}^{T}\varphi(u_k) + b + e_k - y_k \right). \quad (11)$$
The solution of the regularization problem (10) can be easily obtained using the KKT conditions as follows:

$$\begin{cases} \dfrac{\partial L}{\partial \boldsymbol{w}} = 0 \;\Rightarrow\; \boldsymbol{w} = \displaystyle\sum_{k=1}^{n} \alpha_k \varphi(u_k), \\[4pt] \dfrac{\partial L}{\partial b} = 0 \;\Rightarrow\; \displaystyle\sum_{k=1}^{n} \alpha_k = 0, \\[4pt] \dfrac{\partial L}{\partial e_k} = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \\[4pt] \dfrac{\partial L}{\partial \alpha_k} = 0 \;\Rightarrow\; \boldsymbol{w}^{T}\varphi(u_k) + b + e_k = y_k. \end{cases} \quad (12)$$
With the first equation in (12), we see that $\boldsymbol{w}$ can be computed from the values of the Lagrangian multipliers $\alpha_k$ and the mappings $\varphi(u_k)$. Thus we can easily obtain the computational formulation of the nonlinear function as

$$\hat{f}(u) = \sum_{k=1}^{n} \alpha_k \varphi(u_k)^{T} \varphi(u) + b. \quad (13)$$
We have not yet given a deterministic form of the nonlinear mapping $\varphi$. But notice that we only need to compute the inner product in the feature space. This is quite simple if we use Mercer's condition, from which we know that for any symmetric positive definite function $K(u, v)$ we have the expansion

$$K(u, v) = \sum_{i=1}^{\infty} \lambda_i \psi_i(u) \psi_i(v), \quad (14)$$
where the $\psi_i$ form an orthogonal basis, and

$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, \quad (15)$$

are the nonnegative eigenvalues. Such functions $K$ are often called kernel functions, or simply kernels.
Thus we can easily define the nonlinear mapping corresponding to a given kernel as

$$\varphi(u) = \left( \sqrt{\lambda_1}\,\psi_1(u), \sqrt{\lambda_2}\,\psi_2(u), \ldots \right)^{T}. \quad (16)$$
Obviously, the inner product of the nonlinear mappings can then be written as the kernel:

$$\varphi(u)^{T} \varphi(v) = \sum_{i=1}^{\infty} \lambda_i \psi_i(u) \psi_i(v) = K(u, v). \quad (17)$$
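This identity can be verified directly for a simple kernel. For the polynomial kernel $K(u, v) = (u^{T}v + 1)^{2}$ on $\mathbb{R}^{2}$, the finite-dimensional feature map is known explicitly, and the inner product of the mapped vectors reproduces the kernel value (a sketch of ours, not from the derivation above):

```python
import numpy as np

def phi(x):
    """Explicit feature map of K(u, v) = (u^T v + 1)^2 for u, v in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
k_direct = (u @ v + 1.0) ** 2       # kernel evaluated directly
k_feature = phi(u) @ phi(v)         # inner product in the feature space
```

For kernels such as the Gaussian the feature space is infinite dimensional, so only the kernel evaluation, never the explicit map, is used in computation.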
Substituting the kernel into the KKT conditions (12), the multipliers $\boldsymbol{\alpha}$ and the bias $b$ are obtained from the linear system

$$\begin{bmatrix} 0 & \boldsymbol{1}^{T} \\ \boldsymbol{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \boldsymbol{y} \end{bmatrix}, \quad (18)$$

where $\Omega_{ij} = K(u_i, u_j)$ is the kernel matrix and $I$ is the identity matrix.
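Putting the pieces together, the nonparametric estimation amounts to solving one linear system in $(b, \boldsymbol{\alpha})$. A minimal Python sketch with a Gaussian kernel (the names are illustrative and the hyperparameters untuned):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(U, y, gamma, sigma):
    """Solve the saddle-point linear system for the bias b and multipliers alpha."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = gaussian_gram(U, U, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]

def lssvm_predict(U, b, alpha, U_new, sigma):
    """Evaluate the kernel expansion sum_k alpha_k K(u_k, u) + b at new points."""
    return gaussian_gram(U_new, U, sigma) @ alpha + b
```

Fitting $y = \sin u$ on a coarse grid and predicting off-grid points reproduces the function closely, which is all the GML framework needs from the kernel machinery.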
3.2.3 The semiparametric estimation
With the nonparametric estimation, we can easily deduce the semiparametric version, whose objective is to obtain an estimation of the following formulation:

$$\hat{f}(u, z) = \boldsymbol{w}^{T} \varphi(u) + \boldsymbol{c}^{T} z + b, \quad (19)$$
where $z$ is a variable in $\mathbb{R}$ or a vector in the linear space $\mathbb{R}^{m}$. This form can be easily transformed into the nonparametric formulation by denoting a new nonlinear mapping as

$$\tilde{\varphi}(u, z) = \begin{bmatrix} \varphi(u) \\ z \end{bmatrix}. \quad (20)$$
Then Eq. (19) can be written as

$$\hat{f}(u, z) = \tilde{\boldsymbol{w}}^{T} \tilde{\varphi}(u, z) + b, \quad (21)$$

where

$$\tilde{\boldsymbol{w}} = \begin{bmatrix} \boldsymbol{w} \\ \boldsymbol{c} \end{bmatrix}. \quad (22)$$
Notice that this formulation is mathematically equivalent to the nonparametric form (9), thus all the computational formulations stay the same, and we can obtain the key formulations for the semiparametric estimation with only some tiny changes.
Firstly, the kernel matrix $\Omega$ can be rewritten as

$$\tilde{\Omega}_{ij} = \tilde{\varphi}(u_i, z_i)^{T}\, \tilde{\varphi}(u_j, z_j) = K(u_i, u_j) + z_i^{T} z_j. \quad (23)$$
If we denote $Z = (z_1, z_2, \ldots, z_n)^{T}$, we can also write (23) in matrix form as

$$\tilde{\Omega} = \Omega + Z Z^{T}. \quad (24)$$
From the above discussions, we can see that the nonlinear function is always estimated in a form with kernels:

$$\hat{f}(u) = \sum_{k=1}^{n} \alpha_k K(u_k, u) + b. \quad (25)$$
This is very important as it ensures the computational feasibility of the grey systems with an unknown nonlinear function, and with it, it is quite simple to derive the estimation of the general forms (5) and (6): e.g., for the discrete form (6), the computational formulations follow directly by substituting the kernel representation of the estimated nonlinear function into the solution.
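As discussed above, the semiparametric case only modifies the kernel matrix: the Gram matrix $ZZ^{T}$ of the linear part is added to the kernel matrix, and the same kind of linear system as in the nonparametric case is solved. A minimal Python sketch (illustrative names, untuned hyperparameters):

```python
import numpy as np

def pl_lssvm_fit(U, Z, y, gamma, sigma):
    """Semiparametric sketch: Gaussian kernel on U plus a linear part on Z,
    folded into the kernel matrix as K(u_i, u_j) + z_i^T z_j."""
    n = len(y)
    d2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2)) + Z @ Z.T     # modified kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]                              # bias b, multipliers alpha
```

The prediction is then the usual kernel expansion, evaluated with the modified kernel.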
3.3 The solutions of the general formulations
Mathematically, the solutions of the general formulations share similar expressions to the linear ones, as follows.
The continuous form of (5):

$$\hat{y}^{(1)}(t) = y^{(1)}(1)\, e^{-a(t-1)} + \int_{1}^{t} e^{-a(t-\tau)}\, \hat{f}\big(\boldsymbol{u}^{(1)}(\tau)\big) \, \mathrm{d}\tau. \quad (26)$$
The discrete form of (6):

$$\hat{y}^{(1)}(k) = \beta^{\,k-1} y^{(1)}(1) + \sum_{j=2}^{k} \beta^{\,k-j}\, \hat{f}\big(\boldsymbol{u}^{(1)}(j)\big). \quad (27)$$
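Once the nonlinear function has been estimated by the kernels, the discrete GML prediction is just the grey recursion driven by that estimate evaluated on the input series; a minimal sketch with a hypothetical $\beta$ and a hypothetical estimated function:

```python
import numpy as np

def gml_discrete_predict(y1, beta, f_hat, U):
    """Discrete GML response: y(k) = beta * y(k-1) + f_hat(u_k), starting from y(1)."""
    y = [y1]
    for u in U[1:]:
        y.append(beta * y[-1] + f_hat(u))
    return np.array(y)

# hypothetical estimated nonlinearity and input series
f_hat = lambda u: 0.5 * u[0]
U = np.array([[1.0], [2.0], [3.0]])
y_hat = gml_discrete_predict(1.0, 0.9, f_hat, U)    # -> [1.0, 1.9, 3.21]
```

In a full GML model, `f_hat` would be the kernel expansion fitted from the samples rather than this toy function.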
The implementation of the GML models is also very simple. Well documented Matlab packages of the following GML models are available online, and one can easily build particular GML models by combining the general framework of GML with the source codes.
- The KGM(1, n) model (source code available online)
- The KNEA model (in the original paper the output series is denoted $q$, as it stands for the oil production; source code available online)
4 Short discussions on the findings and issues
4.1 Main findings
Actually, the nonparametric estimation presented above is essentially the classical Least Squares Support Vector Machine (LSSVM) by Johan Suykens , which represents one of the typical frameworks of the Machine Learning methods. The semiparametric estimation is a variation of the Partially Linear Least Squares Support Vector Machine (PL-LSSVM), proposed by Marcelo Espinoza, Johan Suykens and Bart De Moor in 2004 , which has not received much attention in the past years; however, as seen in this paper, this formulation is quite useful for building the GML models. In fact, a very strong result called the Representer Theorem, proved by Bernhard Schölkopf et al. in 2001 , shows that for any regularized problem of kernel approximation, the estimation always follows the form of (25). The above facts show that the computational and theoretical basis of the kernel method is well established, which makes us more confident in using the kernel methods in the GML framework.
As mentioned above, our objective is to build truly "Grey" models with a partially known part and a partially unknown part. According to the general forms of the GML models presented above, the partially known part is actually the linear differential or difference structure in (5) or (6). Such a formulation often implies that the state or output series of the system declines or increases over time; such systems are often called dynamical systems, and they widely exist in real world applications, such as oil & gas fields, batteries, etc. The nonlinear function estimated by the kernels represents the nonlinear effect of the input series (or relevant series) of the system. The physical meaning of these functions is clear: they essentially describe the nonlinear relationships in the dynamical systems. Thus, the GML models essentially represent nonlinear dynamical systems.
With such properties, the GML models were first shown to be more efficient than the existing linear grey models, and these findings demonstrated the nonlinearity of the new models. In our recent works , the GML models have also been shown to outperform the classical LSSVM, which is essentially a static model; this demonstrated the dynamical property of the GML models. What is more, the comparison to the LSSVM also showed that the use of known information can significantly improve the performance of the conventional Machine Learning models, which is also very important to the existing Machine Learning methods.
4.2 The open issues
The GML is essentially a combination of the existing grey modelling method and the kernel method, thus it also inherits the existing flaws of these methods.
The main computational pattern actually follows the general formulation of the linear grey models. Thus the known issues of the existing grey models may also affect the performance of the GML models, such as the issues on background values [28, 29], initial point optimization , and the inconsistency problems , . However, with the nonlinear formulations of the GML models, it is not clear at present whether the existing remedies are still applicable.
On the other hand, as the GML framework employs the kernel method, it was expected that some important properties of the existing kernel-based models would carry over; however, our works show that this is not always the case.
(1) Selection of kernels. It was reported that different kernels could achieve the same performance with proper kernel parameters . But in our numerical experiments it was found that only the Gaussian kernel is efficient in the GML models , and the efficiency of the GML model with the Gaussian kernel can be much higher than that with other kernels. This implies that we cannot simply reuse the existing knowledge of the established kernel learning theories, and the existing results may need to be revised for the GML models.
(2) Optimization of hyperparameters. It is also well known that the regularization parameter (e.g. $\gamma$ in (10)) and the kernel parameters (the tunable parameters in the kernel functions), which are uniformly called hyperparameters, are very important to the kernel-based models, and sometimes they are even more important than the choice of kernel. The most commonly used method for tuning such parameters is cross validation (CV). However, CV has been shown to be effective in only a few cases [21, 23]. Although some works have been presented to show the effectiveness of CV for time series models, controversies on this point of view still exist. What is more, as the GML is mainly designed for small samples, the theoretical analysis becomes even more difficult.
(3) Training algorithm. There are numerous training algorithms for the kernel-based models, but we have not discussed them above. The main reason is that the problems we aim to solve with the GML involve small samples, so the training task is never difficult. Out of interest, we still carried out some research and obtained some interesting results. The mainstream training algorithms for the kernel-based models can be simply categorized into two classes: the gradient based algorithms and the SMO-like algorithms. The SMO is often reported to be the best choice in applications of the commonly used kernel-based models, such as the standard SVM  and the LSSVM . However, in our experiments the SMO never outperformed the conjugate gradient (CG) method , and the SMO needed millions of iterations even on some very simple data sets. In our opinion, although the CG is enough for training the GML models, it is still very interesting to figure out why the SMO loses its advantage in the GML, and the answer may bring useful knowledge for numerical computation.
5 Conclusions and Perspectives
According to the above discussions, our idea of truly "Grey" models can be implemented using the semiparametric estimation with kernels. With the general formulation of the GML, more efficient GML models can be developed in the future.
Further, the present works on GML do not only build some new grey models, but also prove the possibility of combining the dynamical nature of the grey system models with the nonlinearity of the Machine Learning models. Thus some other Machine Learning methods could also be expected to be used to build the GML models, such as Multilayer Perceptrons (MLP), Gaussian Process Regression (GPR), and Deep Neural Networks.
Finally, as mentioned above, given the wide existence of nonlinear dynamical systems, more works in a wider range of real world applications should be carried out, especially on unconventional oil and gas systems [35, 36], building systems (energy consumption and emissions) , and clean energy systems [5, 38, 39, 40].
This research was supported by the Open Fund (PLN201710) of State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation (Southwest Petroleum University), and the Doctoral Research Foundation of Southwest University of Science and Technology (no. 16zx7140).
-  Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
-  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
-  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
-  Lifeng Wu, Sifeng Liu, Ligen Yao, Shuli Yan, and Dinglin Liu. Grey system model with the fractional order accumulation. Communications in Nonlinear Science and Numerical Simulation, 18(7):1775–1785, 2013.
-  Lifeng Wu, Xiaohui Gao, Yanli Xiao, Yingjie Yang, and Xiangnan Chen. Using a novel multi-variable grey model to forecast the electricity consumption of shandong province in China. Energy, 157:327–335, 2018.
-  Bo Zeng and Sifeng Liu. A self-adaptive intelligence gray prediction model with the optimal fractional order accumulating operator and its application. Mathematical Methods in the Applied Sciences, 40(18):7843–7857, 2017.
-  Huiming Duan, Guang Rong Lei, and Kailiang Shao. Forecasting crude oil consumption in china using a grey prediction model with an optimal fractional-order accumulating operator. Complexity, 2018:1–12, 2018.
-  Xin Ma, Zhibin Liu, and Yong Wang. Application of a novel nonlinear multivariate grey bernoulli model to predict the tourist income of china. Journal of Computational and Applied Mathematics, 347:84–94, 2019.
-  Sifeng Liu, Yi Lin, and Jeffrey Yi Lin Forrest. Grey systems: theory and applications. Springer, 2010.
-  Jie Cui, Si-feng Liu, Bo Zeng, and Nai-ming Xie. A novel grey forecasting model and its optimization. Applied Mathematical Modelling, 37(6):4399–4406, 2013.
-  LQ Zhan and HJ Shi. Methods and model of grey modeling for approximation non-homogenous exponential data. Systems Engineering-Theory & Practice, 3:689–694, 2013.
-  Xin Ma and Zhi-bin Liu. Application of a novel time-delayed polynomial grey model to predict the natural gas consumption in China. Journal of Computational and Applied Mathematics, 324:17–24, 2017.
-  Nai-ming Xie and Si-feng Liu. Discrete grey forecasting model and its optimization. Applied Mathematical Modelling, 33(2):1173–1186, 2009.
-  Nai-Ming Xie, Si-Feng Liu, Ying-Jie Yang, and Chao-Qing Yuan. On novel grey forecasting model based on non-homogeneous index sequence. Applied Mathematical Modelling, 37(7):5059–5068, 2013.
-  T. L. Tien. The indirect measurement of tensile strength of material by the grey prediction model GMC(1, n). Measurement Science and Technology, 16(6):1322–1328, 2005.
-  Z. X. Wang. Nonlinear grey prediction model with convolution integral NGMC and its application to the forecasting of China’s industrial emissions. Journal of Applied Mathematics, 2014:1–9, 2014.
-  X Ma and Z-B Liu. Research on the novel recursive discrete multivariate grey prediction model and its applications. Applied Mathematical Modelling, 40(7-8):4876–4890, 2016.
-  X Ma and Z-B Liu. Predicting the oil field production using the novel discrete GM(1, N) model. Journal of Grey System, 27(4):63–73, 2015.
-  Bo Zeng, Chengming Luo, Sifeng Liu, Yun Bai, and Chuan Li. Development of an optimization method for the GM(1, N) model. Engineering Applications of Artificial Intelligence, 55:353–362, 2016.
-  Zheng-Xin Wang and De-Jun Ye. Forecasting chinese carbon emissions from fossil energy consumption using non-linear grey multivariable models. Journal of Cleaner Production, 142:600–612, 2017.
-  X. Ma, Y.S. Hu, and Z.B. Liu. A novel kernel regularized nonhomogeneous grey model and its applications. Communications in Nonlinear Science and Numerical Simulation, 48:51–62, 2017.
-  Xin Ma and Zhi-bin Liu. The kernel-based nonlinear multivariate grey model. Applied Mathematical Modelling, 56:217–238, 2018.
-  Xin Ma. Research on a novel kernel based grey prediction model and its applications. Mathematical Problems in Engineering, 2016:1–9, 2016.
-  Xin Ma, Zhibin Liu, Yong Wei, and Xinhai Kong. A novel kernel regularized nonlinear gmc (1, n) model and its application. The Journal of Grey System, 28(3):97–110, 2016.
-  Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
-  Marcelo Espinoza, Johan AK Suykens, and Bart De Moor. Kernel based partially linear models and nonlinear identification. IEEE Transactions on Automatic Control, 50(10):1602–1606, 2005.
-  Bernhard Schölkopf, Ralf Herbrich, and Alex Smola. A generalized representer theorem. In Computational learning theory, pages 416–426. Springer, 2001.
-  Bo Zeng, Yongtao Tan, Hui Xu, Jing Quan, Luyun Wang, and Xueyu Zhou. Forecasting the electricity consumption of commercial sector in Hong Kong using a novel grey dynamic prediction model. Journal of Grey System, 30(1):159–174, 2018.
-  Bo Zeng and Chuan Li. Improved multi-variable grey forecasting model with a dynamic background-value coefficient and its application. Computers & Industrial Engineering, 118:278–290, 2018.
-  Xin Ma and Zhibin Liu. The GMC(1, n) model with optimized parameters and its application. Journal of Grey System, 29(4):122–138, 2017.
-  Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
-  John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998. Technical report.
-  S Sathiya Keerthi and Shirish Krishnaj Shevade. SMO algorithm for least-squares SVM formulations. Neural computation, 15(2):487–507, 2003.
-  Xin Ma. Study on dynamical prediction methods based on grey system and kernel method. PhD thesis, Southwest Petroleum University, 2016. Chapter 4, in Chinese.
-  Yisheng Hu, E. J. Mackay, O. Vazquez, and O. Ishkov. Streamline simulation of barium sulfate precipitation occurring within the reservoir coupled with analysis of observed produced water chemistry data to aid scale management. SPE Production & Operations, 33(1):85–101, 2018.
-  Yong Wang and Xiangyi Yi. Flow modeling of well test analysis for a multiple-fractured horizontal well in triple media carbonate reservoir. International Journal of Nonlinear Sciences and Numerical Simulation, 19(5):439–457, 2018.
-  Minda Ma and Weiguang Cai. Do commercial building sector-derived carbon emissions decouple from the economic growth in Tertiary Industry? A case study of four municipalities in China. Science of The Total Environment, 650(Part 1):822–834, 2019.
-  Pei Du, Jianzhou Wang, Wendong Yang, and Tong Niu. Multi-step ahead forecasting in electrical power system using a hybrid forecasting system. Renewable Energy, 122(7):533–550, 2018.
-  J. L. Fan, L. F. Wu, F. C. Zhang, H. J. Cai, W. Z. Zeng, X. K. Wang, and H. Y. Zou. Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: A review and case study in China. Renewable and Sustainable Energy Reviews, 100:186–212, 2019.
-  J. L. Fan, B. Chen, L. F. Wu, F. Zhang, X. Lu, and Y. Xiang. Evaluation and development of temperature-based empirical models for estimating daily global solar radiation in humid regions. Energy, 144:903–914, 2018.