
# Mathematical Perspective of Machine Learning

We take a closer look at some theoretical challenges of machine learning viewed as function approximation: gradient descent as the default optimization algorithm, the limitations of networks of fixed depth and width, and a different, mathematically motivated approach to RNNs.

11/19/2022


## 1 Introduction

In a nutshell, the idea of training a neural network (NN) is equivalent to the problem of approximating a given function, f, with domain, D, and codomain, C,

 f : D → C, (1)

which depends on some data set of size N of n-dimensional input vectors, →x_j, and m-dimensional output (label) vectors, →y_j, by a composition of functions of the form

 →P_i(→z_{i−1}) = (P_i(→z_{i−1} ⋅ →w_{i,1}), ..., P_i(→z_{i−1} ⋅ →w_{i,k_i})), (2)

where P_i is called the activation function of layer i, →z_i is the output vector of layer i, and →w_{i,j} is called the weight vector of the j-th neuron of layer i. Once the size of each layer and the choice of each activation function are made, one usually uses the so-called back propagation algorithm, adjusting the values of each weight vector according to some type of gradient descent rule. In other words, one is trying to solve an optimization problem, minimizing the "difference norm"

 C := ∥f − ~f∥_d,  where ~f = →P_r(→P_{r−1}(… →P_1)), (3)

and r is the number of layers of the neural network.

So a priori, we are making a choice of ~f as a function of the weights →w_{i,j}. Note that the dimension of each vector →z_i is the size of layer i. To simplify the problem we may always find the maximum, k, of the layer sizes, and assume that each layer has size k. We are implicitly assuming that for each i, P_i(→0) = 0.¹

¹ We shall return to the importance of this condition later to discuss the design and structure of NNs.

## 2 Existence of function f and the toll of cost function C

The first thing to consider, given a labeled data set {(→x_j, →y_j)}, is whether there is a representation function f such that f(→x_j) = →y_j for all j. This important step is often overlooked in practice. In theory, if there is

 →x_j = →x_k  such that  →y_j ≠ →y_k, (4)

then no such function exists. In other words, f is a function of more variables than provided in the data set, and the idea of approximation by ~f is meaningless.

One may always assume some measurement error (noise) in the data set and consider instead a weaker condition,

 →x_j = →x_k ⟹ ∥→y_j − →y_k∥_{l²} < ε, (5)

necessary for the existence of a function f. In such a case one may look at the average values over duplicate points,

 f_ave(→x_j) = (Σ_{→x_k = →x_j} →y_k) / (Σ_{→x_k = →x_j} 1), (6)

and choose to approximate f_ave instead of f. There is no guarantee that a good approximation of f_ave is a good approximation of f itself.
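The consistency check in (4)-(5) and the averaging in (6) are easy to automate. Below is a minimal sketch (the function name `consistent_targets` and the numpy representation are my own choices, not from the text): it groups labels by identical inputs, reports the largest spread among duplicate labels, and returns the averaged data set f_ave.

```python
import numpy as np

def consistent_targets(X, Y):
    """Group targets by identical inputs; return the averaged data set
    f_ave together with the largest label spread among duplicates."""
    groups = {}
    for x, y in zip(X, Y):
        groups.setdefault(tuple(x), []).append(np.asarray(y, dtype=float))
    X_ave, Y_ave, max_spread = [], [], 0.0
    for x, ys in groups.items():
        ys = np.stack(ys)
        mean = ys.mean(axis=0)                       # f_ave at this input
        spread = max(np.linalg.norm(y - mean) for y in ys)
        max_spread = max(max_spread, spread)
        X_ave.append(x)
        Y_ave.append(mean)
    return X_ave, Y_ave, max_spread

# Duplicate input (1, 2) carries labels 0.0 and 1.0: no exact f exists,
# but f_ave assigns their average, 0.5.
X = [(1, 2), (1, 2), (3, 4)]
Y = [[0.0], [1.0], [2.0]]
X_ave, Y_ave, spread = consistent_targets(X, Y)
```

A nonzero `spread` signals that condition (4) fails exactly, while a small spread corresponds to the weaker condition (5).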

One should also consider the norm of the approximation of the function f. It is well known that various classes of "nice" functions are dense in L^p spaces. In particular, the class of functions defined in (3) is dense with respect to convergence in measure and in L^p norm (see [1]). This important fact implies that given any function f ∈ L^p and any ε > 0, there is a function ~f of the form (3) s.t.

 ∥f − ~f∥_{L^p} < ε. (7)

The approximation function ~f is a function of the weight vectors →w_{i,j}. The pursuit of such a function is a two-part problem. The first part, defining the structure of the neural network, is done by a human. The methodology behind the choice of NN structures is at the stage of an experimental science. The second part, weight optimization, is done by a computer, usually capable of trillions of operations per second. Needless to say, the effectiveness of the latter part depends heavily on the former.

In practice one usually does not use an approximation with respect to the L^p norm. Computationally one may only evaluate the function f at finitely many points →y_j and approximate it by ~f at such points, often with respect to the l² norm. Since for all 1 ⩽ p ⩽ q,

 ∥f(→y_j) − ~f(→y_j)∥_{l^q} ⩽ ∥f(→y_j) − ~f(→y_j)∥_{l^p}, (8)

one can choose any p to obtain an l^q approximation at all such points.² 

² Since for any sequence of complex numbers a_k, (Σ|a_k|²)^{1/2} ⩽ Σ|a_k|, an "l1 weights regularization" is also an "l2 regularization", so "(l1 and l2)-regularization" is redundant.
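The footnote's inequality ∥a∥_{l²} ⩽ ∥a∥_{l¹} is easy to confirm numerically; the sketch below checks it on random complex sequences (the sequence length and seed are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.normal(size=8) + 1j * rng.normal(size=8)   # random complex sequence
    l1 = np.abs(a).sum()                               # ||a||_{l1}
    l2 = np.sqrt((np.abs(a) ** 2).sum())               # ||a||_{l2}
    assert l2 <= l1 + 1e-12                            # ||a||_{l2} <= ||a||_{l1}
```

The inequality is tight exactly when at most one term a_k is nonzero, which is why an l1 penalty bounds the l2 penalty but not conversely.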

Here is an interesting question. Given ∥f − ~f_n∥_{L^p} → 0, does it follow that ~f_n(→y_j) → f(→y_j) for any choice of points →y_j? One can easily show it is not the case. What about a sequence of randomly chosen points →y_j? In this case the answer is affirmative.

What about the converse statement? Given a sequence ~f_n with ~f_n(→y_j) → f(→y_j), does it follow that ∥f − ~f_n∥_{L^p} → 0? What if for any sequence of points →y_j in the domain of f, the sequence f(→y_j) belongs to l^p? What if the measure on the domain of f is not a Lebesgue measure?

Things get even more bizarre if the set function μ is only finitely additive. In some cases L^p spaces may not be complete for any p. In Measure Theory, "functions" that agree almost everywhere are indistinguishable. In the case of finitely additive measures the equivalence classes of a "function" are often more complex. Why would anyone care about a finitely (and not countably) additive measure on ℝ? From the point of view of Constructive Mathematics, it is impossible to verify countable additivity of μ in the first place.

## 3 Some Considerations of Feed Forward Neural Networks

Using a gradient descent method, also known as back-propagation, to optimize the weights →w_{i,j}, one obtains critical points of the cost function C. Statistically speaking, saddle points in ℝ^n are more probable than maxima and minima if n is large. Thus a clever enough gradient descent algorithm will not yield a saddle point. There is no guarantee that the function C does not have more than one local minimum, in which case such an algorithm may converge to a local instead of the global minimum of C.

It is also important to know whether a gradient descent algorithm converges to a local minimum of C at all. One may think that this should not be a problem. Unfortunately, most variable learning rate algorithms (e.g. Keras optimizers) are designed to increase the convergence rate without a guarantee of asymptotic convergence to a local minimum. There are a few computational problems with such algorithms. One problem comes from the fact that all weights are updated simultaneously after each instance (epoch), which may cause the subsequent set of weights to yield a larger value of C. Another problem arises from the choice of the learning rate for each instance. It is often proportional to the absolute value of the gradient of C. If the function C is strongly concave up near a local minimum, that value is likely to be too large.
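The overshooting behaviour shows up already on a one-dimensional quadratic. The sketch below is a toy illustration (the cost C(w) = 100 w² and the two step sizes are assumptions of mine, not from the text): with a step proportional to the gradient, a rate that is slightly too large for the curvature makes the iterates diverge instead of settling into the minimum.

```python
def descend(grad, w0, lr, steps):
    """Plain gradient descent: the step is proportional to the gradient."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# C(w) = 100 w^2 is sharply concave up near its minimum; C'(w) = 200 w.
grad = lambda w: 200.0 * w

small = descend(grad, w0=1.0, lr=0.001, steps=50)   # contracts by 0.8 per step
large = descend(grad, w0=1.0, lr=0.011, steps=50)   # multiplies by -1.2 per step
```

With lr = 0.001 the iterate shrinks toward 0; with lr = 0.011 the per-step factor is 1 − 200·lr = −1.2, so each step overshoots the minimum and the magnitude grows.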

Another common practice in supervised machine learning is to partition a given labeled data set into training and validation subsets. The validation set is only used to estimate the accuracy of the model at each stage. The goal of the algorithm is to minimize both the training and the validation error. This idea implicitly assumes some similarity of the function f on these two sets.

Assume, for example, we are trying to approximate the function

 f(x) = sin(1/x),  x > 0. (9)

Try to approximate this function using a neural network of any size so that the mean square error on random points of the interval is less than 0.1. Using TensorFlow 2 with four fully connected layers of size 800, activation function LeakyReLU(alpha=0.01), on randomly generated points in the interval, after training with the Adam optimizer and a validation split, I obtained the following approximation and mean square error.

With sixteen fully connected layers of size 200 and other parameters unchanged, the approximation and mean square error are as follows.
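One way to quantify why (9) is hard: sin(1/x) vanishes at x = 1/(kπ), so the number of its oscillations on (ε, 1) grows like 1/(πε), while a fixed-size piecewise-linear (e.g. LeakyReLU) network realizes only a fixed number of linear pieces. A small counting sketch (the helper name is my own):

```python
import math

def zeros_of_sin_recip(eps):
    """Count the zeros of sin(1/x) on (eps, 1); they sit at x = 1/(k*pi)."""
    k_min = math.floor(1 / math.pi) + 1         # smallest k with 1/(k*pi) < 1
    k_max = math.ceil(1 / (math.pi * eps)) - 1  # largest k with 1/(k*pi) > eps
    return max(0, k_max - k_min + 1)

# Each tenfold approach toward 0 multiplies the oscillation count ~tenfold.
n_coarse = zeros_of_sin_recip(0.01)    # zeros on (0.01, 1)
n_fine = zeros_of_sin_recip(0.001)     # zeros on (0.001, 1)
```

Any approximator with a fixed budget of pieces must therefore fail near 0 once ε is small enough, regardless of training.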

The following example is somewhat challenging, but even more pathological.

###### Example 3.1.

One can construct a measurable subset S of [0, 1] with the following properties. Both S and its complement, S^c, in [0, 1] are totally disconnected (contain no interval) and

 μ(S) = μ(S^c) = 1/2. (10)

Let f be the characteristic function of such a set,

 f(x) = I_S(x),  x ∈ [0, 1]. (11)

Approximation of f by a function ~f of the form (3) on any infinite countable subset of [0, 1] with respect to the mean square norm is computationally impossible, because for each x the probability of f(x) = 0 is equal to the probability of f(x) = 1.

Each domain in the above examples is one dimensional, and it is well known that the approximation problem becomes more challenging as the dimension increases.

## 4 Simple functions and Adaptive Neural Networks

Let us take another look at the machine learning problem of approximating a function f given its values at a countable set of points {→x_j}. If we only assume that f is measurable, it may be impossible to approximate f on such a set with respect to the mean square error (see Example 3.1). On the other hand, if we assume some smoothness conditions (like bounded variation), to predict the value of f at →x, one could take the average of the values of f at points →x_j in some neighborhood of →x. In such a case, do we really need deep neural networks to construct an approximation of f, or is it just an excuse to own the latest Nvidia graphics card?
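The neighborhood-averaging predictor mentioned above fits in a few lines; the sketch below (the function name, radius parameter and test function are my own choices) recovers a smooth f from noisy samples with no network at all.

```python
import numpy as np

def local_average(x0, X, Y, radius):
    """Predict f(x0) as the mean of the observed values with |x - x0| <= radius,
    a sensible estimator when f has some smoothness (e.g. bounded variation)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    mask = np.abs(X - x0) <= radius
    if not mask.any():
        raise ValueError("no samples within the given radius")
    return Y[mask].mean()

# Noisy samples of the smooth function f(x) = x^2 on [0, 1].
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 500)
Y = X ** 2 + rng.normal(0.0, 0.01, 500)
pred = local_average(0.5, X, Y, radius=0.05)   # close to f(0.5) = 0.25
```

For Example 3.1 the same estimator fails by design: the values of I_S in any neighborhood average out to about 1/2 regardless of the point, which is exactly the point of that example.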

Note that the current methods in neural networks approximate a given measurable function f by an almost everywhere continuous function ~f. In Measure Theory one first approximates a measurable function by simple functions. The whole point of Lebesgue integration is to use simple functions instead of piece-wise continuous (step) functions. Neural networks representing simple functions would not consume as much computational power as is required by matrix multiplication. Some work in this direction [2] is known as Lookup Tables, but no connection between simple functions and lookup tables has been made explicitly.
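A simple function in the measure-theoretic sense is precisely a lookup table: a finite partition of the domain plus one stored value per cell, so evaluation is an index lookup rather than a chain of matrix multiplications. A minimal sketch of such an approximator (the binning scheme, names, and test function are assumptions of mine):

```python
import numpy as np

def fit_simple_function(X, Y, bins):
    """Approximate f by a simple (piecewise-constant) function on [0, 1]:
    partition the interval into bins and store the mean of Y in each cell."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(X, edges) - 1, 0, bins - 1)
    table = np.array([Y[idx == b].mean() for b in range(bins)])
    return table, edges

def evaluate(table, edges, x):
    """Evaluate the simple function: a lookup, no matrix multiplication."""
    b = np.clip(np.digitize(x, edges) - 1, 0, len(table) - 1)
    return table[b]

X = np.linspace(0.0, 1.0, 1000)
Y = X ** 2
table, edges = fit_simple_function(X, Y, bins=50)
pred = evaluate(table, edges, 0.5)   # close to 0.25
```

For a fixed partition the fit is a single pass over the data, and the accuracy is controlled by the cell size rather than by an iterative weight optimization.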

Another interesting direction would be an algorithm which designs the structure of a neural network itself. To implement this approach, such an algorithm would have to control the width of the layers and the depth of the network.

To address the width control, one could partition each layer, L_i, of neurons into two subsets, A_i and B_i, with zero weights for the neurons in B_i, and non-zero weights for the neurons in A_i. Here is where the condition

 P_i(→0) = 0 (12)

on an activation function becomes significant. If (12) holds, the neurons from B_i contribute nothing to the value of ~f. One could think of the neurons in B_i as auxiliary neurons. As long as the weight optimization algorithm does not compute gradients for the neurons in B_i, the computational complexity of the network is equivalent to one containing only the neurons in the A_i's.
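The role of condition (12) is easy to verify numerically: with zero weight columns and an activation satisfying P(0) = 0, the auxiliary neurons output exactly zero, so a layer padded with neurons from B_i computes the same active values as the narrower layer. A sketch with ReLU, which satisfies P(0) = 0 (sizes and names are illustrative assumptions):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)       # activation with P(0) = 0

rng = np.random.default_rng(2)
z = rng.normal(size=(5, 4))               # previous-layer outputs, batch of 5
W_active = rng.normal(size=(4, 3))        # weights of the neurons in A_i
W_padded = np.hstack([W_active, np.zeros((4, 2))])  # two auxiliary B_i neurons

out_small = relu(z @ W_active)
out_padded = relu(z @ W_padded)
# The auxiliary columns output relu(0) = 0; the active columns are unchanged.
```

With an activation that violates (12), e.g. a sigmoid with P(0) = 1/2, the padded neurons would inject a nonzero bias into the next layer, which is exactly why the condition matters here.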

To adapt the depth of a neural network, one could partition all layers of neurons into two sets, E and D, such that the layers in E precede the layers in D. If all but one neuron of a layer from D belong to B, and the remaining neuron is assigned a weight vector consisting of ones, then, as long as the activation acts as the identity on the resulting input, such a layer acts as an identity function. If one does not compute the gradients and update the values for the layers in D, the computational burden of such layers is negligible.

In other words, the layers in the set D are dormant and act as the identity function, and each neuron that belongs to a set B_i acts as a placeholder. Increasing the sizes of the B_i's and the size of D would increase the maximum approximation accuracy with little increase in computation.
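A dormant layer of the kind described above can be made transparent. In the sketch below I use an identity weight matrix on the active neurons, a per-coordinate variant of the all-ones weight vector in the text (an assumption of mine); since ReLU(t) = t for t ⩾ 0, the layer passes non-negative activations through unchanged.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

k = 4
W_identity = np.eye(k)                    # dormant layer: identity weights
z = np.array([[0.3, 1.2, 0.0, 2.5]])      # non-negative previous activations
out = relu(z @ W_identity)
# relu(z @ I) == z whenever z >= 0: the layer is a transparent placeholder.
```

Activating such a layer later simply means unfreezing W_identity and letting the optimizer move it away from the identity, so depth grows without restarting training.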

One could then implement a rule for "switching" a layer of neurons in D to a layer in E based on a threshold on the gradient of C. One could similarly implement a rule for switching neurons from a B_i to the corresponding A_i. Implementation of such switches automates the growth of the width and depth of a neural network to accommodate the approximation accuracy without any human interference.

The adaptive algorithm described above is somewhat similar to the learning process of an adult human brain. Such a brain contains a constant number of neurons, but the number of neural connections changes while learning a new skill. One could also consider a reverse switch from a class A_i to B_i to implement the ability to "forget" a skill that is no longer needed. Going a step further, one could allow the neurons in each B_i to be shared by multiple networks working in parallel.

## 5 Computer Vision

One of the promising areas of neural network application is so-called computer vision. In recent years convolutional neural networks have made significant progress in object detection and recognition from images and video data. One of the challenges of object classification is a consequence of the so-called "curse of dimensionality". Each n × n pixel image of an object is often treated as an n²-dimensional vector. Even a moderate size picture without some preprocessing presents a computational challenge for modern machines.

There is another problem with the idea of representing n × n pixel images as n²-dimensional vectors. By converting 2D objects into vectors, one loses the internal structure of the underlying Cartesian space. Assume, for example, we have a gray scale n × n pixel image of a single object O. Let f(x, y) be the grayness intensity value of the pixel with Cartesian coordinates (x, y), with 0 ⩽ x, y ⩽ n. If ε is small, one can usually assume that the functions

 f(x ± 1, y),  f(x, y ± 1),  f(n − x, y),  f(x, n − y),  f(x, y) ± ε, (13)

which correspond to shifts, reflections and intensity changes, would also represent the same object O. One could similarly define small rotation and noise invariance of the function representation of the given object. Enforcing such invariance rules on the structure of a neural network is far more difficult.
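The invariances in (13) are one-line array operations once the image is kept as a 2D array, which makes it all the more striking that they are hard to enforce on a network's weights. A sketch using numpy (the 8 × 8 random image is just a stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.uniform(0.0, 1.0, size=(8, 8))   # toy grayscale image f(x, y)

shifted = np.roll(f, 1, axis=1)          # horizontal shift, f(x - 1, y)
reflected = np.flip(f, axis=1)           # reflection, f(n - x, y)
eps = 0.05
brighter = np.clip(f + eps, 0.0, 1.0)    # intensity change, f(x, y) + eps
```

Each transform is invertible or near-invertible on the array, yet after flattening to an n²-vector a one-pixel shift permutes almost every coordinate, which is the structure loss described above.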

We think of objects as three dimensional, so one could assume that the dimension of the solution space of object classification should be of the same order of magnitude. Perhaps some difficulties of object classification follow from the complexity of the equivalence relation defining each class of objects. Often such a relation is not based only on three dimensional shape. How, for example, would you recognize a bottle? Since bottles come in various shapes and sizes, even if we had a rigorous definition of shape, the equivalence rule of the "bottle" class would be complicated. Yet, when an adult human is presented with a previously unseen and even unusual bottle, he or she would recognise it without much effort. Our notion of bottles comes from their extensive use in daily life. It seems unlikely that the problem of computer vision will have a computationally feasible solution by a narrow AI algorithm trained only on images.

## 6 Recurrent Neural Networks

Another class of networks, called recurrent NNs, is designed to predict the n-th value, →y_n, of a sequence of vectors, given all previous values →y_1, …, →y_{n−1}. One would think this type of problem requires a different approach, yet the common practice is to modify the connections of a feed forward neural network and use good old gradient descent.

Fourier series was the first thing that came to my mind when looking at the above problem. If the sequence is periodic, we would discover this fact within two periods of the sequence. Since not all functions are periodic, one could next assume that the sequence function is almost periodic. For example, the class of Besicovich almost periodic functions on ℝ consists of trigonometric polynomials of the form

 P(x) = Σ_{k=1}^{n} a(η_k) e^{iη_k x},  η_k ∈ ℝ,  a(η_k; P) ∈ ℂ, (14)

and their completion with respect to the norm

 ∥P∥_{B²_ap} = lim_{τ→∞} ( (1/2τ) ∫_{−τ}^{τ} |P(x)|² dx )^{1/2}. (15)

This is a large set of functions that need not be periodic. Even a very simple almost periodic function, such as a sum of two harmonics with incommensurable frequencies, is not periodic. It would be interesting to use recurrent neural networks to approximate such a function.

One can easily generalize Fourier series to this class of functions and use computational power to estimate the Fourier coefficients instead of using gradient descent. Since

 lim_{τ→∞} (1/2τ) ∫_{−τ}^{τ} e^{iηx} e^{−iξx} dx = 1 if η = ξ, and 0 otherwise, (16)

the set of functions {e^{iηx} : η ∈ ℝ} forms an orthonormal system, and the completion of the trigonometric polynomials in (14) is a Hilbert space. One may compute the generalized Fourier coefficients of a Besicovich almost periodic function f using

 a(η; f) = lim_{τ→∞} (1/2τ) ∫_{−τ}^{τ} f(x) e^{−iηx} dx,  η ∈ ℝ. (17)
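The coefficient formula (17) can be estimated directly by replacing the limit with a large finite window, which is the kind of computation proposed above as an alternative to gradient descent. A sketch (the window length τ, the sample count, and the test function are arbitrary choices of mine):

```python
import numpy as np

def besicovich_coeff(f, eta, tau, n=200_000):
    """Estimate a(eta; f) = lim (1/2tau) ∫_{-tau}^{tau} f(x) e^{-i eta x} dx
    by averaging the integrand over the finite window [-tau, tau]."""
    x = np.linspace(-tau, tau, n)
    return (f(x) * np.exp(-1j * eta * x)).mean()

# f(x) = e^{i sqrt(2) x} is almost periodic but not periodic.
f = lambda x: np.exp(1j * np.sqrt(2) * x)
a_match = besicovich_coeff(f, np.sqrt(2), tau=2000.0)   # close to 1
a_miss = besicovich_coeff(f, 1.0, tau=2000.0)           # close to 0
```

Because at most countably many frequencies carry nonzero coefficients, in practice one scans a grid of η values and keeps the few where the estimate is non-negligible.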

Moreover, one can take advantage of Harmonic Analysis theory by first defining a finitely additive measure γ with

 γ(S) = lim_{τ→∞} (1/2τ) ∫_{(−τ,τ)∩S} 1 dx,  S ⊆ ℝ, (18)

and then obtaining the norm in (15) as the usual L² norm with respect to the measure γ.

Only recently have Fourier-type (sinusoidal) representations been used in the design of a new sequence architecture called the "Transformer", introduced by the authors of the paper "Attention is All You Need" [3].

## References

• [1] K. Hornik, M. Stinchcombe, H. White: Multilayer Feedforward Networks are Universal Approximators. Neural Networks, Vol. 2, pp. 359-366, 1989.
• [2] R. Isermann, M. Münchhof: Neural Networks and Lookup Tables for Identification. In: Identification of Dynamic Systems. Springer, Berlin, Heidelberg, 2011.
• [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, I. Polosukhin: Attention is All You Need. 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.