1 Introduction
In recent years, deep learning (a branch of machine learning) has achieved many successes in many fields. However, a clear theoretical framework for deep learning is still missing, and consequently many fundamental questions about it remain open. For example: What is deep learning really doing? Is it really learning, or just a fancy way of approximating a function? Why does it succeed so often? Why does deep learning need big data? For a particular problem, how much data is sufficient for deep learning to learn? Up to now, there is no satisfactory answer to these fundamental questions. Here we try to address them from a new angle.
We introduced the term "mechanical learning" in [1]. Mechanical learning is a computing system that is based on a simple set of fixed rules (this is why it is mechanical), and that can modify itself according to incoming data (this is why it is learning). A learning machine is a system that realizes mechanical learning.
In [2], we described learning machines in detail. By doing so, we gained useful knowledge of and insight into mechanical learning.
In this short article, using that knowledge and insight, we try to view deep learning from a new angle. First, we briefly discuss learning machines, patterns, the internal representation space, Xforms, data sufficiency, and learning strategies and methods. Then, we use the view of learning machines to examine deep learning. We start from the simplest case, the 2-1 RBM, then go to the 3-1 RBM, the N-1 RBM, the N-M RBM, and stacks of RBMs. Then we discuss the learning dynamics of deep learning. By this approach, we see clearly what deep learning is doing, why it works, under what conditions it can learn well, how much data is needed, and what disadvantages it has.
2 Mechanical Learning and Learning Machine
Here we very briefly sum up the discussions in [1] and [2]. A learning machine has 2 major aspects: it is an IPU (information processing unit), i.e. it is able to process information; and it is learning, i.e. its information processing ability changes according to data. Without learning (since it is a machine we design, we can stop learning), a learning machine is very similar to a CPU. However, one major difference between a learning machine and a CPU is this: a learning machine treats incoming data according to its pattern, not bitwise.
Thus, in order to understand a learning machine, it is absolutely necessary to understand patterns well. There are 2 kinds of patterns: objective patterns and subjective patterns. Subjective patterns are crucial for a learning machine. In [2], we proved one theorem: for any objective pattern, we can find a proper subjective pattern that expresses it well and is built upon a least set of base patterns. To describe subjective patterns, it is best to use Xforms; an Xform is an algebraic expression over some base patterns. The Xform is a very important mathematical object. An Xform can have subforms. Xforms and their subforms actually form the fundamental fabric of a learning machine.
We also defined learning by teaching and learning without teaching, and then further specified typical mechanical learning. Learning by teaching requires that we know the learning machine and the pattern to learn well. With this knowledge, we can design a teaching sequence to make the learning machine learn well. We proved that if a learning machine has certain capabilities for learning by teaching, it is universal, i.e. able to learn anything.
However, most learning is not learning by teaching. In order to understand typical mechanical learning, we introduced the internal representation space. Structurally, a learning machine has these components: input space, output space, internal representation space, and learning methods and strategies. The most important part is the internal representation space. We studied the internal representation space in detail, and revealed that it is, in fact, equivalent to a collection of Xforms. This fact tells us that learning is nothing but a dynamics on this collection, moving from one Xform to another. With a clear and reachable internal representation space, learning can be understood much better, and can be done much more efficiently. For example, we can naturally unify all 5 kinds of learning (logical reasoning, connectionism, the probabilistic approach, analogy, and evolution; see [3]) on it.
For mechanical learning, we also need to understand data sufficiency; this is a crucial concept. We use Xforms and their subforms to define data sufficient to support one Xform and data sufficient to bound one Xform. With sufficient data, we can see how learning strategies and learning methods work. There could be many learning strategies and methods. We showed 3 learning strategies: 1. Embed Xforms into a parameter space. 2. Squeeze an Xform from inside to higher abstraction. 3. Squeeze an Xform from inside and outside to higher abstraction. We proved that, with certain capabilities, the last 2 strategies and their methods yield a universal learning machine. Of course, these are theoretical results, since we have not yet designed such a specific learning machine.
Here, we will show that deep learning is actually doing mechanical learning by the first strategy, i.e. embedding Xforms into a parameter space. This fact will help us understand deep learning much better.
3 Seeing Deep Learning from the View of a Learning Machine
According to our definition, if there is no human intervention, deep learning is mechanical learning. Of course, this "if" is a big if: often a deep learning program runs with a lot of human intervention, especially at the time of model setup. We will restrict our discussion to Hinton's original model [4], i.e., a stack of RBMs. Each level of the stack, an RBM, is clearly an N-M learning machine (N and M are the dimensions of the input and output). Hinton's deep learning model is built by stacking RBMs together. Without further human intervention, it is a learning machine. This is the original model of deep learning; other deep learning programs can be thought of as variations on this model. Though deep learning has leaped forward greatly in the past few years, stacked RBMs still reflect its most typical properties.
Thus, we would expect many things from mechanical learning to apply to deep learning. The point is that we are viewing deep learning from a quite different angle, the angle of mechanical learning. For example, we can view Hinton's original deep learning program [4] as one large N-M learning machine, and ask: what is its internal representation space? We expect such an angle and such questions to reveal useful things. The central questions indeed are: what is the internal representation space of deep learning, and what is its learning dynamics? At first it seems quite hard, since learning is conducted on a huge parameter space (the dimension could be hundreds of millions), and the learning methods involve an overwhelmingly big body of mathematics. However, when we apply the basic thoughts of learning machines to deep learning, starting from the simplest RBM, i.e. the 2-1 RBM, we start to see much more clearly.
2-1 RBM
The 2-1 RBM is the simplest case. However, it is also very useful, since we can examine all of its details, and those details are a good guide to more general RBMs.
A 2-1 RBM is an IPU. We know a 2-1 IPU has 16 processings in total (2^{2^2} = 16). But we only consider the processings f with f(0,0) = 0, so 8 processings in total, which we denote p_0, ..., p_7 (see [1]). For the 2-1 RBM, any processing can be written as follows: for input (a_1, a_2), the output is y = 1 if w_1 a_1 + w_2 a_2 > 0, and y = 0 otherwise (for simplicity, no bias term; note this automatically gives f(0,0) = 0).
The parameters w_1, w_2 determine what the processing really is. The parameter space has infinitely many choices of parameters, but there are only 6 processings; thus, for many different parameters, the processing is actually the same. We can see all the processings in the table below:
Input   p_0   p_1   p_2   p_3   p_4   p_5   p_6   p_7
(0,0)    0     0     0     0     0     0     0     0
(1,0)    0     1     0     1     0     1     0     1
(0,1)    0     0     1     0     1     1     0     1
(1,1)    0     0     0     1     1     0     1     1
Region: p_0: w_1<0, w_2<0; p_1: w_1>0, w_2<0, w_1+w_2<0; p_2: w_1<0, w_2>0, w_1+w_2<0; p_3: w_1>0, w_2<0, w_1+w_2>0; p_4: w_1<0, w_2>0, w_1+w_2>0; p_5: None; p_6: None; p_7: w_1>0, w_2>0
Xform: p_0: 0; p_1: b_10; p_2: b_01; p_3: b_10 + b_11; p_4: b_01 + b_11; p_5: b_10 + b_01; p_6: b_11; p_7: b_10 + b_01 + b_11
Tab. 1. Table of all processings of the 2-1 IPU; the Region row gives, for the 6 processings realizable by the 2-1 RBM, the corresponding region of the parameter space (w_1, w_2)
Fig. 1. Parameter space of the 2-1 RBM, cut into 6 regions
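The 6-region claim can be checked numerically. The sketch below is a hypothetical check, assuming the no-bias form y = 1 iff w_1 a_1 + w_2 a_2 > 0 used in this section; it samples the parameter plane on a grid (skipping the three boundary lines) and collects the distinct processings:

```python
import itertools

def processing(w1, w2):
    # 2-1 RBM with no bias: output = 1 iff w1*a1 + w2*a2 > 0
    return tuple(int(w1*a1 + w2*a2 > 0) for (a1, a2) in [(0,0),(1,0),(0,1),(1,1)])

# sample the parameter plane on a grid
found = set()
vals = [x / 4 for x in range(-8, 9)]
for w1, w2 in itertools.product(vals, vals):
    if w1 == 0 or w2 == 0 or w1 + w2 == 0:
        continue  # skip the boundary lines between regions
    found.add(processing(w1, w2))

XOR = (0, 1, 1, 0)
AND = (0, 0, 0, 1)
print(len(found))    # 6 distinct processings
print(XOR in found)  # False
print(AND in found)  # False
```

Only 6 distinct value tables appear, and XOR and AND are indeed among the missing two.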
In the first row of the table, p_0, ..., p_7 are all the processings of the 2-1 IPU. Under the first row is the value table for each processing. We point out some quite familiar processings: p_7 is actually the logical OR gate, p_6 is the logical AND gate, and p_5 is the logical XOR gate. Note that p_5 and p_6 are processings of the 2-1 IPU, but not of the 2-1 RBM. It is well known that the 2-1 RBM has no XOR and no AND (i.e. no p_5 or p_6).
The Region row indicates regions in the parameter space (w_1, w_2), one region for each processing. There are only 6 regions, since the 2-1 RBM has only 6 processings. We briefly discuss how we get these regions; see the illustration in Fig. 1.
Suppose p is a processing. The line w_1 = 0 cuts the parameter space into 2 regions: w_1 > 0 and w_1 < 0. If (w_1, w_2) is in the first region, then p(1,0) = 1; in the second, p(1,0) = 0. The line w_2 = 0 is perpendicular to w_1 = 0, so it cuts the previous 2 regions into 4 regions, the four quadrants: if w_2 > 0, p(0,1) = 1; if w_2 < 0, p(0,1) = 0. The line w_1 + w_2 = 0 can no longer cut the previous 4 regions into 8 regions; it can only cut 2 of them (the 2nd and 4th quadrants) into 4 regions, according to the sign of w_1 + w_2, which determines p(1,1). So, in total, we have 6 regions, and each region corresponds to one processing. This argument about regions is very simple, yet very effective. We can extend it to the N-1 RBM.
Each region corresponds to one processing, so a region can be used to represent a processing; that is the Region row in the table. Yet a much better expression is by Xform [2], which we explain here. b_00, b_10, b_01, b_11 are base patterns; for the 2-dim pattern space, these are the only 4 base patterns. But a base pattern such as b_10 can also be used to represent one processing of the 2-1 IPU: b_10 is the processing whose output is 1 when the input is (1,0), and 0 otherwise. Xforms are expressions built upon the base patterns using operations such as + and ·, composition, and Cartesian product, applied consecutively. For example, b_10 + b_11 and b_10 + b_01 are Xforms. Any processing of the 2-1 IPU can be expressed by at least one Xform [2]. For example, if the region is w_1 > 0, w_2 < 0, w_1 + w_2 > 0, the processing is p_3 and the Xform is b_10 + b_11. Another example: if the region is w_1 > 0, w_2 > 0, the processing is p_7 (the OR gate) and the Xform is b_10 + b_01 + b_11. p_5 is a processing of the 2-1 IPU (the XOR gate), but not of the 2-1 RBM; its Xform is b_10 + b_01. The Xform row of the table shows the Xforms representing the processings. We can say that each processing is represented by a region, and by an Xform as well.
When the 2-1 RBM is learning, clearly, the parameters are adapting. But the processing changes only when the parameters cross a region boundary. Before crossing, the change of parameters is just preparation for crossing (perhaps many parameter changes are simply wasted). Learning is moving from one region to another, or, equivalently, from one Xform to another. This view is crucial. Now we are clear: on the surface, learning in the 2-1 RBM is a dynamics on the parameter space (w_1, w_2), but the real learning dynamics is on the 6 regions (or Xforms). This indirectness causes a lot of problems.
3-1 RBM
Increasing the input dimension by 1, we get the 3-1 RBM. By discussing it, we can gain some insight into general RBMs. For the 3-1 RBM, we can still write: for any input (a_1, a_2, a_3), the output is y = 1 if w_1 a_1 + w_2 a_2 + w_3 a_3 > 0, and y = 0 otherwise.
However, while we can easily write down all possible processings of the 2-1 RBM, it would be hard to do so for the 3-1 RBM. For the 3-1 IPU, we know the number of all possible processings is 2^{2^3} = 256. Since we only consider processings f with f(0,0,0) = 0, the number becomes 2^7 = 128. We expect the 3-1 RBM to have fewer processings. But how many possible processings can the 3-1 RBM have?
Following the guidance that the 2-1 RBM gives us, i.e. considering the hyperplanes, generated by the nonlinearity, that cut the parameter space, we examine the parameter space (w_1, w_2, w_3) and the following 7 planes: w_1 = 0, w_2 = 0, w_3 = 0, w_1 + w_2 = 0, w_1 + w_3 = 0, w_2 + w_3 = 0, w_1 + w_2 + w_3 = 0. These planes are obtained naturally: each is associated with one nonzero input. For example, for the input (1,1,0), it is easy to see that the plane w_1 + w_2 = 0 is where the output value switches between 1 and 0. We can clearly see that within one region cut out by the above 7 planes, the output values are all the same; therefore one region actually represents one processing. So the question of how many possible processings there are becomes the question of how many regions are cut out by the 7 planes. We count the regions below.
First, w_1 = 0 cuts the parameter space into 2 pieces: w_1 > 0 and w_1 < 0. Second, w_2 = 0 is perpendicular to w_1 = 0, so it cuts each of these into 2 pieces; we then have 4 regions. Then w_3 = 0 is perpendicular to both, so we have 8 regions. Then consider w_1 + w_2 = 0. This plane can no longer be perpendicular to all the previous ones, and cannot cut every region; we will not have 16 regions, only about 12 (a multiplying factor of 3/2, as in the 2-1 case). Following the same argument: after w_1 + w_3 = 0, about 18 regions; after w_2 + w_3 = 0, about 27 regions; after w_1 + w_2 + w_3 = 0, about 41 regions.
So, for the 3-1 RBM, there are at most 41 possible processings, compared with the 128 possible processings of the full 3-1 IPU. However, the number of processings may be even less than 41, since it is possible that 2 different regions give the same processing. We do not consider these details here.
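This can be probed numerically. The sketch below, assuming the same no-bias threshold form, samples random weight vectors and counts the distinct processings of the 3-1 RBM it encounters; the count stays well under both 41 and 128, consistent with the bound above and with the remark that the true number may be smaller still:

```python
import itertools
import random

def processing(w):
    inputs = list(itertools.product([0, 1], repeat=3))[1:]  # the 7 nonzero inputs
    return tuple(int(sum(wi * ai for wi, ai in zip(w, a)) > 0) for a in inputs)

random.seed(0)
found = set()
for _ in range(200000):
    w = [random.uniform(-1, 1) for _ in range(3)]
    found.add(processing(w))

print(len(found))  # distinct 3-1 RBM processings found: at most 41, far below 128
```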
Since regions can be represented by Xforms, each processing of the 3-1 RBM can be represented by at least one Xform. b_100, b_010, b_001, ..., b_111 are the Xforms for the base patterns. For example, the Xform b_100 + b_110 + b_101 + b_111 is in the 3-1 RBM, but the XOR-like Xform b_100 + b_011 is not. There are a lot of Xforms that are not in the 3-1 RBM.
The learning dynamics of the 3-1 RBM is of the same kind: on the surface, it is a dynamics on the parameter space (w_1, w_2, w_3), but the real learning dynamics is on the (at most) 41 regions (or Xforms).
N-1 RBM
The argument for the 3-1 RBM can be extended to the N-1 RBM (see details in [2]). We consider the 2^N - 1 hyperplanes, one per nonzero input, and the regions they cut out; the number of these regions is bounded as derived below. Compared with the number of all processings of the N-1 IPU, which is 2^{2^N - 1}, it is easy to see that the N-1 RBM has far fewer processings. This means the N-1 RBM cannot express many processings.
For the N-1 RBM, we can still write: for any input (a_1, ..., a_N), the output is
y = 1 if w_1 a_1 + ... + w_N a_N > 0, and y = 0 otherwise.   (1)
There are N hyperplanes of the form w_i = 0; C(N,2) hyperplanes of the form w_i + w_j = 0; and so on. We also have this: the first N hyperplanes cut the parameter space into 2^N regions; then each later hyperplane cuts out more regions at a multiplying rate of roughly 3/2. Thus, the number of regions is about 2^N (3/2)^K, where K is the number of the later hyperplanes, i.e. those of the form w_i + w_j = 0, w_i + w_j + w_k = 0, etc.
And, we have the equation:
K = (2^N - 1) - N.   (2)
So, the number of regions is about
2^N (3/2)^{2^N - 1 - N}.   (3)
Thus, the number of regions is about 2^N (3/2)^{2^N - 1 - N}. This is a very big number. Yet, compared with the total number of possible processings of the full IPU, 2^{2^N - 1}, it is quite small. See their quotient:
2^{2^N - 1} / (2^N (3/2)^{2^N - 1 - N}) = (4/3)^{2^N - 1 - N}.
It tells us that the full IPU has (4/3)^{2^N - 1 - N} times more processings than the RBM. This is a huge difference. Say, just for N = 10, (4/3)^{2^N - 1 - N} is a number of more than 120 digits, i.e. the number of processings of the full IPU has more than 120 more digits than the number for the RBM.
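The digit count is a one-line computation; the sketch below evaluates the size of the quotient for N = 10 using the estimate above:

```python
from math import log10

N = 10
K = 2**N - 1 - N            # number of "later" hyperplanes, eq. (2)
digits = K * log10(4 / 3)   # log10 of the quotient (4/3)**K
print(round(digits))        # more than 120 digits
```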
Also, each region can be expressed by at least one Xform, e.g. sums of base-pattern Xforms, as in the 3-1 case. The learning dynamics of the N-1 RBM is of the same kind: on the surface, it is a dynamics on the parameter space (w_1, ..., w_N), but the real learning dynamics is on those regions (or Xforms).
N-M RBM
Suppose G_1, ..., G_M are N-1 RBMs. We can form an N-M RBM, denoted G = (G_1, ..., G_M), whose processings are p = (p_1, ..., p_M), where p_i is a processing of G_i. So, p is the Cartesian product of p_1, ..., p_M.
Since the parameter space of each G_i is cut into regions, and within each region the processing is the same, we can see that the parameter space of G is also cut into regions, and each region is a Cartesian product of regions of the G_i: R = R_1 × ... × R_M, where R_i is one region from the i-th RBM G_i. Thus, the number of all possible regions of G is the product of the numbers of regions of the G_i, i.e. about (2^N (3/2)^{2^N - 1 - N})^M. This is a much smaller number than 2^{M(2^N - 1)}, which is the number of all possible processings of the N-M IPU.
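A concrete, hypothetical instance with N = 2, M = 2: a 2-2 RBM built from two 2-1 units has 6 × 6 = 36 region pairs, against 2^{2·3} = 64 processings of the full 2-2 IPU. The sketch below, using one representative weight pair per region of the no-bias 2-1 RBM, confirms the count:

```python
import itertools

# one representative weight pair per region of a 2-1 RBM (no bias)
reps = [(1, 1), (1, -2), (-2, 1), (2, -1), (-1, 2), (-1, -1)]

def proc(w):
    # value table on the nonzero inputs; f(0,0) = 0 automatically
    return tuple(int(w[0]*a1 + w[1]*a2 > 0) for a1, a2 in [(1,0),(0,1),(1,1)])

# a 2-2 RBM processing is the Cartesian product of two 2-1 processings
pairs = {(proc(u), proc(v)) for u, v in itertools.product(reps, reps)}
print(len(pairs))          # 36 = 6 * 6 distinct processings of the 2-2 RBM
print(2 ** (2 * 3))        # 64 processings of the full 2-2 IPU
```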
The Xform for each region of G is actually the Cartesian product of the Xforms for the corresponding regions of the G_i. Suppose R = R_1 × ... × R_M, and E_i is the Xform for the region R_i in G_i; then the Xform for R is E_1 × ... × E_M. For example, (b_10 + b_11) × (b_01 + b_11) is one Xform of a 2-2 RBM.
The learning dynamics of the N-M RBM is of the same kind: on the surface, it is a dynamics on the parameter space of G, but the real learning dynamics is on those regions (or Xforms).
Stacking RBMs
Consider an N-K RBM G_1 and a K-M RBM G_2. Stacking them together, we get one N-M IPU G: a processing of G is the composition of processings of G_1 and G_2, p = p_2 ∘ p_1, and we denote G = G_2 ∘ G_1.
The parameter space of G is clearly the product of the parameter spaces of G_1 and G_2. We know the parameter space of G_1 is cut into regions, within each of which the processing is the same; so is that of G_2. Thus, the parameter space of G is cut into regions, within each of which the processing is the same, and these regions are Cartesian products of regions of G_1 and regions of G_2. So, the number of all possible processings of G equals the number for G_1 times the number for G_2.
We can easily see that if K is large enough, the above number becomes greater than 2^{M(2^N - 1)}, which is the number of all possible processings of the N-M IPU. So, at least potentially, G has enough capacity to become a full N-M IPU. But we will not consider this here; in fact, it is very closely related to the so-called Universal Approximation Theorem. Indeed, stacking RBMs together is powerful.
Xforms can be expressed by composition as well. For example, consider three 2-1 RBMs G_1, G_2, and G_3. Using G_1 and G_2 to form a 2-2 RBM, and stacking G_3 on it, we get a 2-1 IPU G: p = p_3 ∘ (p_1 × p_2). If, for this case, G_1 has Xform b_10, G_2 has Xform b_01, and G_3 has Xform b_10 + b_01 + b_11, then G has Xform (b_10 + b_01 + b_11) ∘ (b_10 × b_01). It is easy to see that this Xform is the processing p_5 (the XOR gate), which is not expressible by one 2-1 RBM. So, by putting three 2-1 RBMs together, more Xforms can be expressed.
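This composition can be checked directly. A minimal sketch, assuming the no-bias threshold units used throughout this section, with weights chosen to realize the three Xforms just named:

```python
def unit(w1, w2):
    # one 2-1 threshold unit without bias
    return lambda a1, a2: int(w1 * a1 + w2 * a2 > 0)

g1 = unit(1, -1)   # Xform b_10: fires only on input (1,0)
g2 = unit(-1, 1)   # Xform b_01: fires only on input (0,1)
g3 = unit(1, 1)    # Xform b_10 + b_01 + b_11: the OR gate

def G(a1, a2):
    # the stacked 2-1 IPU: p = p3 o (p1 x p2)
    return g3(g1(a1, a2), g2(a1, a2))

for a in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(a, G(*a))   # outputs 0, 1, 1, 0: the XOR gate
```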
4 Learning Dynamics of Deep Learning
With this understanding of RBMs, the most essential building blocks of deep learning, we can see how a deep learning model is built up and how its learning dynamics works. Clearly, today's deep learning is much more than Hinton's original model of stacked RBMs (see [4]), but we will first talk about this model.
The deep learning model is built by stacking more RBMs (this is why it is deep). Once several RBMs are put together, a deep learning model is formed. Suppose G_1, ..., G_L are RBMs, where G_i is an N_{i-1}-N_i RBM and N_0, N_1, ..., N_L is a sequence of integers. We can stack these RBMs together to form one N_0-N_L IPU, whose processing can be written as p = p_L ∘ ... ∘ p_2 ∘ p_1, where each p_i is a processing of G_i. We denote this IPU by G. All the parameters of G form a huge Euclidean space, which we denote by P.
Then, clearly, deep learning is conducted on G to reach a good processing by modifying the parameters in P. Of course, it requires skill to modify such a huge number of parameters. Methods such as CD (contrastive divergence), SGD (stochastic gradient descent), etc., were invented for this purpose.
However, no matter what methods are used to modify the parameters, it is the modification of parameters that forms the dynamics of learning. So it seems that the phase space of the learning dynamics is the space P. But this is just the surface. As we discussed in the last section, the true dynamics is conducted not on the parameters but on regions: the regions of P cut out by hyperplanes and Cartesian products. The number of these regions is huge, as estimated in the last section.
More precisely, the situation is this: as learning proceeds, a huge number of parameters are adapting, but only when the parameters cross a region boundary does the processing of G change. Before crossing, the processing remains the same; the changes of parameters can at most be thought of as preparation for crossing (perhaps many such changes are simply wasted). Thus, the learning dynamics moves from one region to another. We also know that each region is associated with one Xform. Thus, the learning dynamics moves from one Xform to another Xform.
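This picture of parameters adapting while the processing changes only at boundary crossings can be made concrete on a single 2-1 unit. The sketch below is a toy walk through parameter space (not an actual CD or SGD run): w_1 drifts upward along a straight path, and the processing changes exactly where the path crosses the lines w_1 + w_2 = 0 and w_1 = 0; every other update leaves the processing untouched.

```python
def proc(w1, w2):
    # value table of a no-bias 2-1 unit on the nonzero inputs
    return tuple(int(w1*a1 + w2*a2 > 0) for a1, a2 in [(1,0),(0,1),(1,1)])

# move the parameters along a straight path; record where the processing changes
w1, w2 = -1.0, 0.3
changes = []
prev = proc(w1, w2)
for step in range(200):
    w1 += 0.01  # small parameter update, standing in for a learning step
    cur = proc(w1, w2)
    if cur != prev:
        changes.append((round(w1, 2), prev, cur))
        prev = cur

print(changes)  # two changes: near w1 = -0.3 and near w1 = 0
```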
Fig. 2. Illustration of the parameter space cut into regions
Fig. 2 gives one illustration of the parameter space being cut into regions. Of course, P is a very high-dimensional Euclidean space, so the regions cannot be shown on paper precisely; Fig. 2 is just an illustration. However, it gives us one clear picture of the deep learning structure.
The deep learning structure is determined by these factors: how many RBMs, the dimensions of each RBM, how they are stacked, and how Cartesian products are taken. Once the structure is formed, if there is no further human intervention (such as manually adjusting numbers or subroutines in the model), the structure will not change. The structure is formed by people at setup time. So, for a fixed structure, we have a fixed region cut (as illustrated in Fig. 2). Further, we have a fixed set of Xforms, and learning is conducted on this set of Xforms.
We can see one example. G_1, G_2, G_3 are three 2-1 RBMs. We put them together like this: G = G_3 ∘ (G_1 × G_2). We have 3 parameter spaces, each with 6 regions; putting them together, we have 6 × 6 × 6 = 216 regions. G is one 2-1 IPU, so G has at most 8 processings. Thus, among those 216 regions, some different regions must have the same processing. But each region has its own Xform. That is to say, one processing can have several Xforms associated with it. For example, consider the region in which G_1 is in the region of b_10, G_2 in the region of b_01, and G_3 in the region of b_10 + b_01 + b_11. This gives the processing p_5 (the XOR gate). Normally, for this processing, we use the Xform b_10 + b_01. But for this region, naturally, the Xform is (b_10 + b_01 + b_11) ∘ (b_10 × b_01). That is to say, this Xform generates the same processing as b_10 + b_01. Another region, with G_1 in the region of b_01 and G_2 in the region of b_10, gives the same processing, and its Xform is (b_10 + b_01 + b_11) ∘ (b_01 × b_10).
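The 216-regions-to-8-processings collapse can be enumerated directly. The sketch below picks one representative weight pair per region of each 2-1 RBM (hypothetical representatives, assuming the no-bias threshold form) and tabulates the processings of G = G_3 ∘ (G_1 × G_2):

```python
import itertools

# one representative weight pair per region of a 2-1 RBM (no bias)
reps = [(1, 1), (1, -2), (-2, 1), (2, -1), (-1, 2), (-1, -1)]

def stacked(w1, w2, w3):
    # G = G3 o (G1 x G2), each Gi a no-bias 2-1 threshold unit
    def G(a1, a2):
        h1 = int(w1[0]*a1 + w1[1]*a2 > 0)
        h2 = int(w2[0]*a1 + w2[1]*a2 > 0)
        return int(w3[0]*h1 + w3[1]*h2 > 0)
    return tuple(G(a1, a2) for a1, a2 in [(0,0),(1,0),(0,1),(1,1)])

procs = {}
for w1, w2, w3 in itertools.product(reps, repeat=3):
    key = stacked(w1, w2, w3)
    procs[key] = procs.get(key, 0) + 1

print(sum(procs.values()))  # 216 region combinations in total
print(len(procs))           # 8 distinct processings: many regions share one
```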
For a deep learning model built by stacking RBMs (Hinton's original model, [4]), as we discussed in the last section, the situation is the same: the parameter space is cut into regions (by hyperplanes, etc.), and each region is associated with one Xform; when the parameters cross the boundary of a region, one Xform moves to another, and the learning dynamics is conducted on this set of Xforms.
This is the learning dynamics of deep learning; this is what deep learning is really doing.
For more complicated deep learning, where convolution is used, connection pruning is done, or the nonlinearity is something other than the sign function (such as ReLU), the situation is more complicated. However, if there is no human intervention, it is surely still mechanical learning. We can still argue that the learning dynamics is the same: the parameter space is cut into regions, each region is associated with one Xform, and when the parameters cross the boundary of a region, one Xform moves to another; the learning dynamics is conducted on this set of Xforms. To prove this for general deep learning mathematically, additional work is needed. We will do this work elsewhere, but we have no doubt it can be done.
Such a learning strategy is exactly what we described in [2]: "Embed Xforms into the parameter space".
5 Remark
Now we know this fact: deep learning uses the strategy "Embed Xforms into the parameter space". This fact is essential, and many consequences can be derived from it. Here we make some comments.
True nature of deep learning:
On the surface, deep learning seems to build a model from the data fed in (by using neural networks, stacked RBMs, and other tools). However, as we revealed in the previous sections, this is not the case. Essentially, a deep learning model does this: at the time of model setup, it cuts the huge parameter space into many regions, each region associated with one Xform, which is one logical statement; then, driven by a big amount of data and following a certain learning dynamics, it moves from region to region, which is equivalent to moving from one Xform to another, eventually reaching a satisfactory Xform, which is the learning result. So, we say that deep learning does not build up a model from the input data; it chooses a good Xform from a set of Xforms established at the time the deep learning model is set up. This is the true nature of deep learning.
Such a view differs from the popular view of deep learning. However, it is true, and it helps us understand deep learning better. For example, [6] might be right that some group renormalization is going on, but it misses this issue: the group renormalization happens at the setup stage, not at the learning stage. Another example: [5] gives a very good explanation of the power of multistage composition. However, it fails to realize that learning is not only about getting a good processing, but about finding the best possible Xform for that processing (since one particular processing can have many associated Xforms, some bad and some good).
Fundamental limitation of deep learning:
The fundamental limitation of a deep learning model comes from its nature: it acts on a preselected set of Xforms, formed at the time the model is set up.
So, most likely, a deep learning model cannot be a universal learning machine [2]. If it were universal, the preselected set would have to contain at least one Xform for every possible processing. This is nearly impossible.
Actually, a deep learning model is set up by humans, and for one particular task. So, most likely, the preselected set only contains Xforms for this task, and the deep learning model is limited by this set.
If the learning target is a particular processing, and the preselected set contains at least one Xform associated with this processing, the deep learning model can possibly reach the target. Otherwise, the deep learning model cannot reach the processing, no matter how hard it tries and no matter how much data it has. In other words, the model is a bad model. But deep learning has no method to tell whether a model is good or bad before trying it out. This is one huge limitation.
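A tiny, hypothetical experiment illustrates the point with the 2-1 RBM of Section 3, whose Xform set lacks any Xform for XOR. The update rule below is a simple perceptron-style stand-in for the real learning methods; no amount of training data moves the parameters to the XOR processing, because no region of the parameter space carries that processing:

```python
import random

def out(w, a):
    # no-bias 2-1 threshold unit
    return int(w[0]*a[0] + w[1]*a[1] > 0)

# XOR target: not among the 6 processings of one 2-1 RBM
data = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]

random.seed(1)
w = [0.5, -0.5]
for _ in range(10000):  # perceptron-style updates, standing in for real methods
    a, y = random.choice(data)
    err = y - out(w, a)
    w[0] += 0.1 * err * a[0]
    w[1] += 0.1 * err * a[1]

learned = tuple(out(w, a) for a, _ in data)
print(learned == (0, 1, 1, 0))  # False: the target is outside the Xform set
```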
Yet, even if the preselected set contains an Xform associated with the desired processing, we still do not know whether that Xform is a good one. As we showed in [2], there are many Xforms associated with one processing, some bad and some good. For example, one Xform may be more robust under certain conditions. If the preselected set does not contain the robust Xform, then no matter how hard we try and how much data we use, learning cannot produce a robust solution. Again, deep learning has no method to tell whether the set contains such an Xform.
These limitations are fundamental, and they derive from the fact that deep learning chooses an Xform from a preselected set, rather than dynamically building an Xform from the input data.
Logical or Probabilistic:
Quite often, people think deep learning is doing probabilistic learning. They think a probability model is essential to deep learning, since a lot of data is fed in and, especially, stochastic gradient descent is a very essential learning method. However, we would like to point out that deep learning fundamentally views its learning target logically. Why? Each Xform in the preselected set is a logical statement, often a very long one (hence deep). So the essential thing is this: when a deep learning model does its information processing, it does so according to a solid logical statement, not probabilistically. Of course, the way to get the Xform might not be purely logical; it may involve many probabilistic views and actions, such as stochastic gradient descent. However, we would point out that even the way to get the Xform could be purely deterministic rather than probabilistic: it is possible to design a purely deterministic learning dynamics, at least in theory.
Why it works well:
Practice shows that deep learning works very well for many problems. Now we can see the reason for such success clearly: when a deep learning model for a problem is set up by experienced people, the desired Xform is already built into the model. More precisely, the desired Xform is already contained in the preselected set of Xforms formed at setup. If so, it is possible to learn the Xform successfully, and hence the processing associated with it. Thus, the success of a deep learning model depends on its setup. With a good setup, the model will work well; otherwise, no matter what data and what efforts, it will not.
Of course, besides the setup of the model, the learning methods are crucial; it is not at all easy to choose the right Xform from the preselected set. We would like to point out that the methods currently used do have some advantages:


The methods act on a Euclidean space, which is the easiest setting for calculation, with many sophisticated algorithms, libraries, packages, and hardware available.

The methods mostly do linear algebra calculations, which are easy to parallelize; and high parallelization is a key to their success. However, this advantage is built on one fact: no dynamic adaptation. If dynamic adaptation is used (such as in the recently introduced Capsule networks), this advantage might be lost.
Data for deep learning:
As discussed above, the logical statement (Xform) is the core of deep learning. Without supporting data, a deep learning model cannot reach a sophisticated logical statement (Xform). We defined data sufficiency in [2], which tells what kind of data can support one Xform (logical statement).
Of course, the data sufficiency we defined is only a first step toward understanding data. Since deep learning approaches the desired Xform by certain learning methods, it is easy to see that we need more data than is merely sufficient to support one Xform. The relationships here could be quite complicated, and they are a topic for further research.
However, we can tell that data sufficient to support and sufficient to bound the desired Xform is a necessary condition for deep learning. In this sense, for the so-called big data of deep learning, we understand its necessity and a lower bound.
Disadvantages of deep learning:
Deep learning has some fundamental disadvantages rooted in its nature. We list some of them below:


It acts on a huge parameter space, but the actual learning dynamics is on a fixed set of regions (equivalent to a set of Xforms). This indirectness makes every aspect of learning harder; in particular, it is nearly impossible to know exactly what is happening in the learning dynamics.

Successful learning needs data sufficient to support and to bound the target. This is very costly.

The structure of learning is set up by humans. Once set up, the structure (how many layers, how big each layer is, how the layers fit together, how convolution is done, etc.) cannot change. This means learning is restricted to a fixed group of regions, equivalently a fixed group of Xforms. If the best Xform is not in this set, deep learning has no way to reach it, no matter how big the data is and how hard we try. Consequently, it is not a universal learning machine.

It is very costly to embed Xforms into a huge parameter space. Perhaps, of all the computation spent on learning, only a very small fraction is used on the critical part, i.e. moving from one Xform to another, and most is simply wasted.

Since there is no clear internal representation space, it is hard to define an initial Xform, which is essential for improving efficiency and for learning in several stages.
Looking forward to universal learning machine:
Since the deep learning model is not a universal learning machine, we naturally look forward to a universal learning machine. We discussed this in [1] and [2]; there, we proved that with certain capabilities, we can build a universal learning machine. We have also designed a concrete universal learning machine, which is in the patent application process. We think the universal learning machine has many advantages over the deep learning model. Much research remains to be done on universal learning machines.
References

[1]
Chuyu Xiong. Discussion on Mechanical Learning and Learning Machine, arxiv.org, 2016.
http://arxiv.org/pdf/1602.00198.pdf
[2]
Chuyu Xiong. Descriptions of Objectives and Processes of Mechanical Learning, arxiv.org, 2017.
http://arxiv.org/pdf/1706.00066.pdf
[3]
Pedro Domingos. The Master Algorithm, Talks at Google.
https://plus.google.com/117039636053462680924/posts/RxnFUqbbFRc
[4]
G. E. Hinton. Learning multiple layers of representation, Trends in Cognitive Sciences, Vol. 11, pp. 428-434, 2007.
http://www.cs.toronto.edu/~hinton/absps/tics.pdf

[5]
Henry W. Lin, Max Tegmark, David Rolnick. Why does deep and cheap learning work so well?, arxiv.org, 2016.
http://arxiv.org/pdf/1608.08225.pdf
[6]
Pankaj Mehta, David J. Schwab. An exact mapping between the Variational Renormalization Group and Deep Learning, arxiv.org, 2014.
http://arxiv.org/pdf/1410.3831.pdfhttp://arxiv.org/pdf/1410.3831.pdf