Machine learning (ML) [bishop_pattern_2011; cover_elements_1991; hastie2009elements; hundred] refers to a broad field of study, with multifaceted applications of cross-disciplinary breadth. ML ultimately aims at developing computer algorithms that improve automatically through experience. The core idea, common to all artificial intelligence (AI) technology, is that systems can learn from data, so as to identify distinctive patterns and consequently make decisions, with minimal human intervention. The range of applications of ML methodologies is extremely vast [sutton2018reinforcement; graves2013speech; sebe2005machine; grigorescu2020survey], and still growing at a steady pace due to the pressing need to cope with the efficient handling of big data [chen2014big]. Biomimetic approaches to sub-symbolic AI [rosenblatt1961principles] inspired the design of powerful algorithms. The latter sought to reproduce the unconscious processes underlying fast perception, the neurological paths for rapid decision making, as employed, e.g., for the recognition of faces [meyers2008using] or spoken words [caponetti2011biologically].
An early example of a sub-symbolic, brain-inspired AI was the perceptron [rosenblatt1958perceptron], the influential ancestor of deep neural networks (NN) [bengio2007greedy; Goodfellow-et-al-2016], which define the skeleton of modern AI architectures. The perceptron is an algorithm for the supervised learning of binary classifiers. It is a linear classifier, meaning that its forecasts are based on a linear prediction function which combines a set of weights with the feature vector. In analogy with neurons, the perceptron adds up its inputs: if the resulting sum is above a given threshold the perceptron fires (returns the value 1), otherwise it does not (and the output equals zero). Modern multilayer perceptrons account for multiple hidden layers with non linear activation functions. The learning is achieved via conditioning: single or multilayered perceptrons are trained by examples, rewarded when they fire correctly and punished otherwise [bengio2007greedy; hinton2006fast; rumelhart1988learning]. Supervised learning requires a large set of positive and negative examples, the training set, labelled with their reference categories.
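The conditioning described above can be made concrete with a minimal sketch of the classic perceptron update rule (not the spectral scheme developed in this paper): the weights are nudged whenever the unit fires incorrectly, and a linearly separable toy problem (an AND gate, an illustrative choice) suffices for convergence.

```python
import numpy as np

# Minimal sketch of the classic perceptron learning rule: the unit is
# "rewarded when it fires correctly and punished otherwise" by shifting
# its weights towards (or away from) the misclassified input.

def perceptron_train(X, y, epochs=20, lr=0.1):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # fire if above threshold
            w += lr * (yi - pred) * xi          # reward/punish update
            b += lr * (yi - pred)
    return w, b

# AND gate: linearly separable, so the perceptron rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
preds = [1 if xi @ w + b > 0 else 0 for xi in X]
```

For non-separable data the rule never settles, which is one motivation for the multilayered, non linear generalizations discussed next.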
The perceptrons’ acquired ability to perform classification is eventually stored in a finite collection of numbers, the weights and thresholds that were learned during the successive epochs of the supervised training. To date, it is not clear how such a huge collection of numbers (hundreds of millions of weights in state-of-the-art ML applications) is synergistically interlaced for the deep networks to execute the assigned tasks, with an exceptional degree of robustness and accuracy [xie2020explainable; hinton2015distilling; erhan2010understanding].
Starting from these premises, the aims of this paper are multifold. On the one side, we will develop a novel learning scheme which is anchored to reciprocal space. Instead of recursively adjusting the weights of the edges that define the connections among nodes, we will modify the entries of a properly indented basis which is engineered to allow the information to be processed from input to output, via nested embeddings. The vectors of the basis are the eigenvectors of a transfer matrix in direct space. Training a limited set of eigenvalues enables one, in turn, to recover the weights that link the nodes of the underlying network. The proposed method can be flexibly tailored to return either linear or non linear classifiers. The latter display competitive performance as compared to standard deep learning schemes, while allowing for a significant reduction of the learning parameter space. As an important byproduct of the analysis, working in the spectral domain allows one to gain fresh insight into the process of supervised learning. The input is received and processed by successive arrays of mutually entangled eigenvectors that compose the basis of the examined space: the directed indentation between adjacent stacks yields an effective compression of the information, which is eventually delivered to the detection nodes. We speculate that this is a universal paradigm of learning which holds in general, beyond the specific setting explored here.
The MNIST database. To introduce and test the proposed method we will consider a specific task, i.e. the recognition of handwritten digits. To this end, we will make use of the MNIST database [lecun1998mnist], which has a training set of 60,000 examples and a test set of 10,000 examples. Each image is made of 28×28 pixels and each pixel bears an 8-bit numerical intensity value, see Fig. 1. A deep neural network can be trained using the standard backpropagation [bengio2007greedy] algorithm to assign the weights that link the nodes (or perceptrons) belonging to consecutive layers. The first layer has 28×28 = 784 nodes and the input of each node is set to the corresponding pixel's intensity. The highest error rate reported on the original website of the database [lecun1998mnist] is 12%, which is achieved using a simple linear classifier, with no preprocessing. In early 2020, researchers announced a 0.16% error [byerly2020branching] with a deep neural network made of branching and merging convolutional networks. Our goal here is to contribute to the analysis with a radically different approach to the learning. Our final objective is indeed to generate a network of $N$ nodes, organized in successive layers, tying the training to reciprocal space. More specifically, we will introduce and self-consistently adapt a set of linearly independent eigenmodes, designed to promote the feed-forward transfer of the information, from the input to the detection nodes. The associated eigenvalues are central to the envisaged learning procedure.
Single-layer perceptron trained in reciprocal space. Denote by $N_k$ the number of nodes assigned to layer $k$, with $k=1,\dots,\ell$; then $N=\sum_k N_k$. The output layer is composed of ten nodes ($N_\ell=10$), where recognition eventually takes place. Select one image from the training set and let $m$ be the digit therein displayed. We then construct a column vector $\vec{n}$, of size $N$, whose first $N_1=784$ entries are the intensities displayed on the pixels of the selected image (from the top-left to the bottom-right, moving horizontally), as illustrated in Fig. 1; the remaining entries are set to zero. The last $N_\ell$ elements are the output nodes where reading is performed.
To set the stage, we begin by reporting on a simplified scenario that, as we shall prove in the following, yields a single-layer perceptron. The extension to multi-layered architectures will be discussed in the second part of the paper. Introduce a basis of the $N$-dimensional space to which $\vec{n}$ belongs, and denote by $\vec{\phi}_k$ the $k$-th vector of the basis. Vectors $\vec{\phi}_k$ are the columns of a matrix $\Phi$, which is schematically depicted in Fig. 2 (a). The diagonal of $\Phi$ is filled with unities. Sub-diagonal blocks of size $N_{k+1}\times N_k$, for $k=1,\dots,\ell-1$, are populated with random numbers, uniformly distributed within a bounded interval. These blocks provide an effective indentation between successive slabs of vectors, in the spirit of the reasoning anticipated above. Any image of the database returns an $N$-dimensional vector $\vec{n}$ which can be readily expanded on the introduced basis to yield $\vec{n}=\Phi\vec{c}=\sum_k c_k \vec{\phi}_k$, where $\vec{c}$ stands for the coefficients of the expansion. The non-zero entries of $\vec{n}$ activate the vectors of the first block, and trigger a cascade of successive reactions that eventually hits the final block of vectors. Indeed, the first $N_1$ vectors, which are necessarily engaged to explain the non-zero content of $\vec{n}$, rebound on the successive elements of the basis. The latter need to adjust their associated coefficients to compensate for the echoed perturbation and, in doing so, prompt a reaction that propagates to the successive stack of vectors. Iterating forward, it is immediate to conclude that the characteristics of $\vec{n}$ are eventually stored in the final bunch of coefficients of $\vec{c}$. A successful compression, i.e. an accurate rendering of the supplied vector, can be attained depending on the details of the imposed vectors' indentation.
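The construction of the indented basis and of the expansion coefficients can be sketched in a few lines of numpy (the layer sizes below are arbitrary, illustrative choices): since $\Phi$ is lower triangular with unit diagonal, the coefficients follow from forward substitution, and the non-zero input entries indeed cascade into the final block of $\vec{c}$.

```python
import numpy as np

# Sketch of the indented basis Phi: ones on the diagonal, random
# sub-diagonal blocks coupling each stack of eigenvectors to the next.
# Layer sizes are illustrative stand-ins (N1=4 input, N2=3, N3=2 output).
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
N = sum(sizes)
Phi = np.eye(N)
row = 0
for N_k, N_next in zip(sizes[:-1], sizes[1:]):
    Phi[row + N_k : row + N_k + N_next, row : row + N_k] = \
        rng.uniform(-1, 1, (N_next, N_k))
    row += N_k

# An "image" occupies the first N1 entries; the rest are zero.
x = np.zeros(N)
x[:sizes[0]] = rng.uniform(0, 1, sizes[0])

# Expansion coefficients c solve Phi c = x: the nonzero input entries
# propagate through the indentation into the later blocks of c.
c = np.linalg.solve(Phi, x)
```

Because each sub-diagonal block sits strictly below the diagonal, $\Phi$ is invertible by construction, and the expansion is unique for any input image.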
As a next step of the analysis, we look for a linear operator $A$ that transforms $\vec{n}$ into an output vector $\vec{z}=A\vec{n}$, which can be effectively employed to perform the sought classification task. More specifically, and before commenting on the definition of $A$, we are interested in the final entries of $\vec{z}$, which will be further manipulated with a softmax filter, $\sigma(\vec{z})_i = e^{z_i}/\sum_j e^{z_j}$. In formulae, we will focus on the output elements $z_j$, with $j=N-9,\dots,N$: the position of the largest among these values is used to classify $\vec{n}$. If the maximum is for example found in the first of the ten positions, the image that gives $\vec{n}$ as an input vector is associated to a zero. In general, if the maximum of $z_j$ is positioned at $j=N-10+m$, then the handwritten digit supplied as an input is $m-1$. To define the operator $A$ we use its spectral decomposition in the form $A=\Phi\Lambda\Phi^{-1}$, where $\Lambda$ is a diagonal matrix of entries $\lambda_k$, $k=1,\dots,N$. In other words, $A$ is introduced as the matrix with eigenvectors $\vec{\phi}_k$, relative to eigenvalues $\lambda_k$. The learning strategy that we shall here implement and test amounts therefore to assigning the eigenvalues' and eigenvectors' entries, so that $A$ can accomplish the classification task evoked above. To gain insight into the role exerted by $\Lambda$, we set out to calculate $A\vec{n}=\Phi\Lambda\Phi^{-1}\Phi\vec{c}=\Phi\Lambda\vec{c}$. In particular, $z_j=\sum_k \lambda_k c_k (\vec{\phi}_k)_j$, for $j=N-9,\dots,N$. Recalling the specific form of the introduced eigenvectors (see Fig. 2(a)), only the eigenvectors of the final two blocks bear non-zero components on the output rows, so the above sum is restricted to those blocks. Hence, only the last eigenvalues contribute to the generated output, and can be consequently tuned for the system to discriminate between different input images. This observation forms the basis of the first learning scheme here discussed. We define a loss function which measures the distance between the softmax-filtered output and $\hat{\ell}(\vec{n})$, the label attached to $\vec{n}$ depending on its category. Specifically, the $m$-th entry of $\hat{\ell}(\vec{n})$ is equal to one (and the rest identically equal to zero) if the displayed digit is $m-1$, with $m=1,\dots,10$. The loss function can be minimized by acting on the limited subset of eigenvalues that contribute to the output. Alternatively, one can also modulate the sub-diagonal block entries of $\Phi$, which implement the recursive compression of the information from the input to the output layer. By training eigenvectors and eigenvalues in the reciprocal space, we obtain a linear classifier which acts, in real space, as a single-layer perceptron. Recall in fact that the classification relies on examining the last $N_\ell$ entries of $A\vec{n}$. Hence, one can imagine working in a reduced space of dimension $N_1+N_\ell$, and therein define a vector $\tilde{\vec{n}}$. The first $N_1$ entries of $\tilde{\vec{n}}$ are the intensities on the pixels of the selected image, as for the homologous quantity $\vec{n}$. The other elements are set to zero. Then, we consider the matrix $\tilde{A}$, constructed from $A$ by trimming out all the information that pertains to the intermediate layers, as introduced in the reciprocal space (see Fig. 2(b)). Stated differently, matrix $\tilde{A}$ provides the weighted links that feed from the input to the output layer in direct space, via the linear transformation $\tilde{A}\tilde{\vec{n}}$: this is a single-layer perceptron, shown in Fig. 2(b), which was trained by endowing reciprocal space with an arbitrary number of additional dimensions, the intermediate stacks of vectors responsible for the sequential embedding of the information. In Fig. 3, we report on the performance of the learning algorithm (measured by its accuracy, i.e. the fraction of correctly recognized images) when changing the number of nodes $n$ that compose the intermediate layer. (It is worth stressing that the size of the intermediate layer is indeed central during the learning process: as such, it leaves an imprint on the weights of the edges that define the single-layer perceptron.) The red line refers to the simplified scheme where the eigenvalues are solely tuned (while leaving the eigenvectors fixed at the random realization set by the initial condition). The blue line is instead obtained when the learning extends to the eigenvectors. The horizontal dashed line represents the accuracy obtained when training a single-layer neural network in direct space. This is indeed the correct comparison to be made, as the multilayered processing in reciprocal space implemented above converges to a linear mono-layer perceptron in direct space. Learning on the eigenvalues and eigenvectors allows one to significantly reduce the number of free parameters usually managed in conventional neural networks, while returning competitive performance scores (a 90% accuracy is e.g. obtained training 800 eigenvalues – red line – instead of the $N_1\times N_\ell$ weights that need to be adjusted in direct space). The process here discussed in addition allows one to shed light on the fundamental mechanisms which underlie supervised learning.
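The equivalence between the spectral operator and a trimmed single-layer perceptron can be checked numerically with a small sketch (layer sizes here are arbitrary, illustrative values): because the input vector is zero outside the first block, the output rows of $A=\Phi\Lambda\Phi^{-1}$ only ever read the input entries, and keeping just that sub-block reproduces the classification output exactly.

```python
import numpy as np

# Sketch of the spectral construction A = Phi Lambda Phi^{-1}, with Phi
# the indented basis and the eigenvalues as the trainable parameters.
rng = np.random.default_rng(1)
N1, N2, N3 = 5, 4, 3            # input / intermediate / output stacks
N = N1 + N2 + N3
Phi = np.eye(N)
Phi[N1:N1+N2, :N1] = rng.uniform(-1, 1, (N2, N1))
Phi[N1+N2:, N1:N1+N2] = rng.uniform(-1, 1, (N3, N2))
lam = rng.uniform(-1, 1, N)     # eigenvalues (the trainable quantities)
A = Phi @ np.diag(lam) @ np.linalg.inv(Phi)

# Trimming: keep only the rows of the output block and the columns of
# the input block. This is the weight matrix of the equivalent
# single-layer perceptron acting in direct space.
W = A[N1+N2:, :N1]

x = np.zeros(N)
x[:N1] = rng.uniform(0, 1, N1)  # an "image" in the first block
out_full = (A @ x)[N1+N2:]      # output entries of the full operator
out_trim = W @ x[:N1]           # same numbers from the trimmed matrix
```

The agreement between `out_full` and `out_trim` is exact, not approximate: the intermediate dimensions shape the learned weights but disappear from the final perceptron, which is the point made in the text.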
It is in fact tempting to surmise that the nested arrangement of the eigenvectors of the adjacency matrix which defines the network in direct space is a key property for the neural network to carry out the assigned classification task. In the following, we will discuss how these ideas extend to the more general setting of (linear or non-linear) multi-layered neural networks.
Training non linear multi-layered neural networks in the spectral domain. In analogy with the above, we will formulate the problem in reciprocal space. The weights of the sought network will be recovered from the spectral properties of a suitably engineered transfer operator. The image to be processed is again organized in a column vector $\vec{n}$. This vector undergoes a first linear transformation to yield $\vec{y}_1=A_1\vec{n}$, where $A_1$ is an $N\times N$ matrix that we shall characterize in the following. Introduce the matrix $\Phi_1$: this is the identity matrix modified by the inclusion of a sub-diagonal block of size $N_2\times N_1$, filled with uniformly distributed random numbers defined in a bounded interval. Then, we introduce the diagonal matrix $\Lambda_1$, which is again obtained from the identity matrix by assigning random entries to the diagonal elements that range from position $N_1$ (not included) to $N_1+N_2$ (included). A straightforward calculation returns $\Phi_1^{-1}=2\mathbb{1}-\Phi_1$, with $\mathbb{1}$ the identity matrix. We hence define $A_1=\Phi_1\Lambda_1\Phi_1^{-1}$ as the matrix that transforms $\vec{n}$ into $\vec{y}_1$. Because of the specific structure of the input vector, and owing to the nature of $A_1$, the information stored in the first $N_1$ elements of $\vec{n}$ is passed to the successive $N_2$ entries of $\vec{y}_1$, in a compactified form which reflects both the imposed eigenvectors' indentation and the chosen non-trivial eigenvalues. The latter will be tuned during the learning stage, while the off-diagonal elements of $\Phi_1$ are frozen to the assigned nominal values. The output vector $\vec{y}_1$ can also be filtered via a suitable non linear function $f(\cdot,\beta)$, where $\beta$ is a control parameter which could in principle be self-consistently adjusted all along the learning procedure. This step marks the distinction between, respectively, the linear and non linear versions of the learning schemes. We are now in a position to iterate the above reasoning. We thus introduce the matrix operators $A_k=\Phi_k\Lambda_k\Phi_k^{-1}$, for $k=2,\dots,\ell-1$. In analogy with the above, $\Phi_k$ is the identity matrix modified with a sub-diagonal block of size $N_{k+1}\times N_k$, which extends from row $\sum_{i=1}^{k}N_i$ and touches tangentially the diagonal, as schematically illustrated in Fig. 4 (a). Similarly, $\Lambda_k$ is again obtained from the identity matrix upon mutating to uniformly distributed random entries the diagonal elements that range from $\sum_{i=1}^{k}N_i$ (not included) to $\sum_{i=1}^{k+1}N_i$ (included). Finally, we define $A_k$ as the matrix that transforms $\vec{y}_{k-1}$ into $\vec{y}_k$, with $\vec{y}_k=f(A_k\vec{y}_{k-1},\beta_k)$. At variance with the above, both the non-trivial eigenvalues' and eigenvectors' entries can be self-consistently adjusted by the envisaged learning strategy. The outcome of each linear transformation can go through the non linear filter, where the steepness parameters $\beta_k$ could also be tunable. To implement the learning procedure we introduce a loss function which confronts, for every image of the training set, the softmax-filtered output with the corresponding label:
$$L=\sum_{i}\left\|\hat{\ell}\big(\vec{n}^{(i)}\big)-\sigma\!\left(\big[\vec{y}_{\ell-1}^{(i)}\big]_{j=N-9,\dots,N}\right)\right\|^{2}, \qquad (1)$$
where $\sigma$ is the softmax operation applied to the last $N_\ell$ entries of $\vec{y}_{\ell-1}^{(i)}$, the output produced by the $i$-th image of the training set. The loss function (1) is minimized by acting on the free parameters of the model: the successive indentations of the basis which command the transfer of the information, the blocks of tunable eigenvalues, and the quantities $\beta_k$ that set the steepness of the non linear functions. This eventually yields a fully trained network, in direct space, which can be unfolded into a layered architecture to perform pattern recognition (see Fig. 4 (b)). Remarkably, self-loop links are also present. The limit of a linear single-layer perceptron is recovered when silencing the non linearities: a reduced matrix can be generated from the composition of the operators $A_k$, following the same strategy outlined above. The accuracy of the (linear) single-layer perceptron trained with this scheme returns performance scores that are identical to those reported in Fig. 3.
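The multi-layered feed-forward pass can be sketched in numpy as follows (layer sizes are arbitrary illustrative values, and a tanh filter stands in for the non linear function $f$, an assumption of this sketch). Note that the exact inverse $\Phi_k^{-1}=2\mathbb{1}-\Phi_k$ holds because each single sub-diagonal block is nilpotent.

```python
import numpy as np

# Sketch of the multi-layered spectral scheme: each operator
# A_k = Phi_k Lambda_k Phi_k^{-1} moves the signal one block forward,
# and a non linear filter (tanh here, an assumed choice) acts in between.
rng = np.random.default_rng(2)
sizes = [6, 5, 4, 3]                 # four successive layers
N = sum(sizes)
offsets = np.cumsum([0] + sizes)     # block boundaries inside R^N

def spectral_operator(k):
    # Phi_k: identity plus one sub-diagonal block touching the diagonal.
    Phi = np.eye(N)
    r0, r1 = offsets[k + 1], offsets[k + 2]   # target (row) block
    c0, c1 = offsets[k], offsets[k + 1]       # source (column) block
    Phi[r0:r1, c0:c1] = rng.uniform(-1, 1, (r1 - r0, c1 - c0))
    # Lambda_k: identity with random entries on the target block.
    lam = np.ones(N)
    lam[r0:r1] = rng.uniform(-1, 1, r1 - r0)
    # The block is nilpotent, so Phi_k^{-1} = 2*I - Phi_k exactly.
    return Phi @ np.diag(lam) @ (2 * np.eye(N) - Phi)

y = np.zeros(N)
y[:sizes[0]] = rng.uniform(0, 1, sizes[0])    # the "image"
for k in range(len(sizes) - 1):
    y = np.tanh(spectral_operator(k) @ y)     # linear step + filter
output = y[-sizes[-1]:]                        # read the detection nodes
```

In an actual training run the block entries of each $\Phi_k$ and $\Lambda_k$ would be updated by gradient descent on the loss (1); the sketch only illustrates how the information is shuttled block by block towards the detection nodes.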
We now turn to considering a fully non linear model, using an architecture of four successive layers. In Fig. 5 (a), we report the estimated accuracy of the network trained in reciprocal space, when changing the size of the intermediate layers. The performance of the algorithm ramps up to about 98%. This is a competitive figure as compared to standard ML techniques, which we have reached despite a significant reduction of the free training parameters: the weights of the ensuing network are in fact inherited from the trained spectra and do not need to be adjusted as individual free parameters of the model. Notice that the entries that define the random block of matrix $\Phi_1$ are excluded from the pool of trainable parameters, resulting in a noteworthy contraction of the adjustable parameters, as compared to homologous ML schemes.
To better illustrate this point, we analyze in Fig. 5(b) the relative accuracy of the newly proposed method vs. that obtained with an equivalent standard deep neural network, trained in the space of the nodes. The ratio of the computed accuracies is plotted against the number of training parameters used in our setting, normalized to that employed within conventional ML schemes. Already at very low fractions we reach performances which are almost identical, within statistical errors, to those obtained with usual training protocols.
To build and train the aforementioned models we used TensorFlow and created a custom spectral layer that can be integrated in virtually every TensorFlow or Keras model. That allowed us to leverage the automatic differentiation capabilities and the built-in optimizers of TensorFlow. Recall that we aim at training just a block of the matrix $\Phi_k$ and a portion of the diagonal of $\Lambda_k$. To reach this goal we generated two fully trainable matrices for each layer in the spectral domain, and applied a suitably designed mask to filter out the sub-parts of the matrices to be excluded from the training. This is easy to implement and, although improvable from the point of view of computational efficiency, it works perfectly, given the size of the problem to be handled. We then trained all our models with the AdaMax optimizer [kingma2014adam], using a fixed learning rate for the linear case and a different one for the non linear case. The training proceeded for a set number of epochs, and during each epoch the network was fed with batches of images. These hyperparameters were chosen so as to improve on GPU efficiency, accuracy and stability. However, we did not perform a systematic study to look for the optimal setting. All our models have been trained on a virtual machine hosted by Google Colaboratory. Standard neural networks have been trained on the same machine using identical software and hyperparameters, for a fair comparison. Further details about the implementation, as well as a notebook to reproduce our results, can be found in the public repository of this project [gitrepo].
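The masking trick described above can be illustrated with a framework-agnostic numpy sketch (a conceptual stand-in for the TensorFlow implementation, with arbitrary illustrative sizes): the effective matrix is the sum of a frozen part and an elementwise-masked trainable part, so gradient updates can only ever touch the designated block.

```python
import numpy as np

# Conceptual sketch of the masking strategy: a fully trainable matrix is
# multiplied elementwise by a fixed binary mask, so only the sub-diagonal
# block of Phi receives updates; the frozen entries stay untouched.
N1, N2 = 4, 3
N = N1 + N2

trainable = np.random.default_rng(3).uniform(-1, 1, (N, N))
mask = np.zeros((N, N))
mask[N1:, :N1] = 1.0             # only the indentation block is trainable

frozen = np.eye(N)               # the fixed part: unit diagonal
Phi = frozen + mask * trainable  # effective Phi seen by the model

# A gradient step on `trainable` cannot alter the frozen entries:
grad = np.ones((N, N))           # stand-in for an autodiff gradient
trainable -= 0.1 * (mask * grad) # masked update
Phi_new = frozen + mask * trainable
```

In the TensorFlow version the same effect is obtained automatically: the mask multiplies the trainable tensor inside the layer, and backpropagation therefore zeroes the gradients of all excluded entries.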
Summing up, we have here proposed a novel approach to the training of deep neural networks which is bound to the spectral, hence reciprocal, domain. The eigenvectors and eigenvalues of the adjacency matrices that connect consecutive layers via directed feed-forward links are trained, instead of adjusting the weights that bridge each pair of nodes of the collection, as is customarily done in the framework of conventional ML approaches. This choice results in a considerable reduction of the computational costs, while still returning competitive achievements in terms of classification ability. Training just the eigenvalues requires in fact acting on a set of parameters which scales linearly with $N$, the size of the deep neural network, at variance with the complexity of ordinary schemes, which ramps up as $N^2$. For this reason, the proposed method could also be used in combination with existing ML algorithms for an effective (and computationally fast) pre-training of the deep neural network to be employed. By formulating the learning process in reciprocal space, we have isolated an important aspect which, we believe, could form the basis for a rational understanding of the surprising ability of deep networks to cope with the assigned tasks. In direct space, in fact, the capacity of the network to e.g. handle classification hides in a gigantic pool of apparently uncorrelated, signed parameters. The latter conceal, however, a highly regular structure for the eigenvectors of the associated adjacency matrix: the nested indentation of eigenvectors belonging to successive stacks favors the recursive processing of the data, from the input to the output layer. This observation could be at the heart of algorithmic decision making and help shed novel light on why deep networks work as well as they do.