Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation

10/16/2019
by Elisa Oostwal, et al.

We study layered neural networks of rectified linear units (ReLU) in a modelling framework for stochastic training processes. The comparison with sigmoidal activation functions is in the center of interest. We compute typical learning curves for shallow networks with K hidden units in matching student teacher scenarios. The systems exhibit sudden changes of the generalization performance via the process of hidden unit specialization at critical sizes of the training set. Surprisingly, our results show that the training behavior of ReLU networks is qualitatively different from that of networks with sigmoidal activations. In networks with K >= 3 sigmoidal hidden units, the transition is discontinuous: Specialized network configurations co-exist and compete with states of poor performance even for very large training sets. On the contrary, the use of ReLU activations results in continuous transitions for all K: For large enough training sets, two competing, differently specialized states display similar generalization abilities, which coincide exactly for large networks in the limit K to infinity.


I Introduction

The regained interest in artificial neural networks [hertz, bishop1995, EngelvandenBroeck, hastie, bishop2006] is largely due to the successful application of so-called Deep Learning in a number of practical contexts, see e.g. [goodfellow, naturedeep, esanndeep] for reviews and further references.

The successful training of powerful, multi-layered deep networks has become feasible for a number of reasons including the automated acquisition of large amounts of training data in various domains, the use of modified and optimized architectures, e.g. convolutional networks for image processing, and the ever-increasing availability of computational power needed for the implementation of efficient training.

One particularly important modification of earlier models is the use of alternative activation functions [goodfellow, searching, timetoswish]. Arguably, so-called rectified linear units (ReLU) constitute the most popular choice in Deep Neural Networks [hahnloser, krizhevsky, goodfellow, searching, timetoswish, acousticmodels]. Compared to more traditional activation functions, the simple ReLU and recently suggested modifications warrant computational ease and appear to speed up the training, see for instance [nair, acousticmodels, villmannswish]. The one-sided ReLU function is found to yield sparse activity in large networks, a feature which is frequently perceived as favorable and biologically plausible [hahnloser, goodfellow, sparse]. In addition, the problem of vanishing gradients, which arises when applying the chain rule in layered networks of sigmoidal units, is avoided [goodfellow]. Moreover, networks of rectified linear units have displayed favorable generalization behavior in several practical applications and benchmark tests, e.g. [searching, hahnloser, krizhevsky, timetoswish, acousticmodels].

The aim of this work is to contribute to a better theoretical understanding of how the use of ReLU activations influences and potentially improves the training behavior of layered neural networks. We focus on the comparison with traditional sigmoidal functions and analyse non-trivial model situations. To this end, we employ approaches from the statistical physics of learning, which have been applied earlier with great success in the context of neural networks and machine learning in general [hertz, SST, revmodphys, kinzel, opper, handbook, EngelvandenBroeck]. The statistical physics approach complements other theoretical frameworks in that it studies the typical behavior of large learning systems in model scenarios. As an important example, learning curves have been computed in a variety of settings, including on-line and off-line supervised training of feedforward neural networks, see for instance [SST, revmodphys, EngelvandenBroeck, handbook, schwarzegradient, saadsollashort, saadsollalong, woehlertransient, nestorplateaus, kinzel, opper, retarded] and references therein. A topic of particular interest for this work is the analysis of phase transitions in learning processes, i.e. sudden changes of the expected performance with the training set size or other control parameters, see [kinzel, opper, retarded, Kang, dpg, epl, epj, phasetransitions] for examples and further references.

Currently, the statistical physics of learning is being revisited extensively in order to investigate relevant phenomena in deep neural networks and other learning paradigms, see [monasson, kadmon, pankaj, sohl, egalitarian, esannsession2019, relucapacity, goldt] for recent examples and further references.

In this work, we systematically study the training of layered networks in so-called student teacher settings, see e.g. [SST, revmodphys, opper, EngelvandenBroeck]. We consider idealized, yet non-trivial scenarios of matching student and teacher complexity. Our findings demonstrate that ReLU networks display training and generalization behavior which differs significantly from their counterparts composed of sigmoidal units. Both network types display sudden changes of their performance with the number of available examples. In statistical physics terminology, the systems undergo phase transitions at a critical training set size. The underlying process of hidden unit specialization and the existence of saddle points in the objective function have recently attracted attention also in the context of Deep Learning [kadmon, dauphin, saxe].

Before analysing ReLU networks, we confirm earlier theoretical results which indicate that the transition for large networks of sigmoidal units is discontinuous (first order): For small training sets, a poorly generalizing state is observed, in which all hidden units approximate the target to some extent and essentially perform the same task. At a critical size of the training set, a favorable configuration with specialized hidden units appears. However, a poorly performing state remains metastable and the specialization required for successful learning can delay the training process significantly [Kang, epl, epj, dpg].

In contrast we find that, surprisingly, the corresponding phase transition in ReLU networks is always continuous (second order). At the transition, the unspecialized state is replaced by two competing configurations with very similar generalization ability. In large networks, their performance is nearly identical, and it coincides exactly in the limit K → ∞.

In the next section we detail the considered models and outline the theoretical approach. In Sec. III our results are presented and discussed. We conclude with a summary and outlook on future extensions of this work.

II Model and Analysis

Here we introduce the modelling framework, i.e. the considered student teacher scenarios. Moreover, we outline their analysis by means of statistical physics methods and discuss the simplifying assumption of training at high (formal) temperatures.

II-A Network architecture and activation functions

We consider feed-forward neural networks in which N input nodes represent the feature vectors ξ ∈ ℝ^N. A single layer of K hidden units is connected to the input through adaptive weight vectors w_k ∈ ℝ^N (k = 1, 2, ..., K). The total real-valued output reads

y(ξ) = Σ_{k=1}^{K} v g(x_k)  with  x_k = w_k · ξ.    (1)

The quantity x_k is referred to as the local potential of the k-th hidden unit. The resulting activation is specified by the transfer function g(·), while the hidden-to-output weights are fixed to a constant value v, i.e. they are not adapted in the training. Figure 1 (left panel) illustrates the network architecture.

This type of network has been termed the Soft Committee Machine (SCM) in the literature due to its vague similarity to the committee machine for binary classification, e.g. [EngelvandenBroeck, revmodphys, robertcapacity, holm, opper, retarded, relucapacity]. There, the discrete output is determined by the majority of threshold units in the hidden layer, while the SCM is suitable for regression tasks.

Fig. 1: Left panel: Illustration of the network architecture with an N-dim. input layer, a set of K adaptive weight vectors (represented by solid lines) and total output given by the sum of hidden unit activations with fixed hidden-to-output weights (dashed lines). Right panel: The considered activation functions: the sigmoidal (solid line) and the ReLU activation (dashed line).

We will consider two popular types of transfer functions:

  • Sigmoidal activation
    Frequently, S-shaped transfer functions have been employed, which increase monotonically from zero at large negative arguments and saturate at a finite maximum for x → ∞. Popular examples are based on the hyperbolic tangent or the logistic sigmoid, often with an additional threshold θ as in g(x − θ), or a steepness parameter controlling the magnitude of the derivative.

    We study the particular choice

    (2)

    which is displayed in the right panel of Fig. 1. The relation to an integrated Gaussian facilitates significant mathematical ease, which has been exploited in numerous studies of machine learning models, e.g. [schwarzegradient, saadsollashort, saadsollalong]. Here, the function (2) serves as a generic example of a sigmoidal activation and its specific form is not expected to influence our findings crucially. As we argue below, the specific choice of the limiting values for small and large arguments is also arbitrary and irrelevant for the qualitative results of our analyses.

  • Rectified Linear Unit (ReLU) activation
    This particularly simple, piece-wise linear transfer function has attracted considerable attention in the context of multi-layered neural networks. It is given by

    g(x) = max(0, x),    (3)

    which is illustrated in Fig. 1 (right panel). In contrast to sigmoidal activations, the response of the unit is unbounded for x → ∞.

    The function (3) is obviously not differentiable in x = 0. Here, we can ignore this mathematical subtlety and remark that it is considered irrelevant in practice [goodfellow]. Note also that our theoretical investigation in Sec. II does not relate to a particular realization of gradient-based training.

It is important to realize that replacing the above functions by c g(x) in (a) or in (b), where c is an arbitrary factor, would be equivalent to rescaling the fixed hidden-to-output weights in Eq. (1) by the same factor. Alternatively, we could incorporate the factor in the effective temperature parameter of the theoretical analysis in Sec. II-D. Apart from this trivial re-scaling, our results would not be affected qualitatively.
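For concreteness, the following minimal NumPy sketch implements the forward pass of Eq. (1). Two assumptions are made for illustration only: the fixed hidden-to-output weights are set to 1/√K, and g(x) = erf(x/√2) is used as a representative sigmoidal activation; the paper's specific choice (2) may differ from this by trivial shifts and rescalings, which, as argued above, do not matter qualitatively.

```python
import numpy as np
from scipy.special import erf

def local_potentials(W, xi):
    """Local potentials x_k = w_k . xi for all K hidden units.
    W: (K, N) matrix of weight vectors, xi: (N,) input vector."""
    return W @ xi

def scm_output(W, xi, activation="erf", v=None):
    """Soft committee machine output, Eq. (1).
    v is the fixed hidden-to-output weight; 1/sqrt(K) is an assumption here."""
    K = W.shape[0]
    v = 1.0 / np.sqrt(K) if v is None else v
    x = local_potentials(W, xi)
    if activation == "erf":          # representative sigmoidal (assumption)
        g = erf(x / np.sqrt(2.0))
    elif activation == "relu":       # rectified linear unit, Eq. (3)
        g = np.maximum(0.0, x)
    else:
        raise ValueError("unknown activation")
    return v * np.sum(g)

# example: K = 3 normalized student weight vectors in N = 100 dimensions
rng = np.random.default_rng(0)
N, K = 100, 3
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # |w_k| = 1, cf. Eq. (6)
xi = rng.standard_normal(N)                     # i.i.d. input components
print(scm_output(W, xi, "erf"), scm_output(W, xi, "relu"))
```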

II-B Student and teacher scenario

We investigate the training and generalization behavior of the layered networks introduced above in a setup that models the learning of a regression scheme from example data. Assume that a given training set

D = { ξ^μ, τ^μ }  with  μ = 1, 2, ..., P    (4)

comprises P input output pairs which reflect the target task. In order to facilitate successful learning, P should be proportional to the number of adaptive weights in the trained system. In our specific model scenario the labels τ^μ are thought to be provided by a teacher SCM with M hidden units, representing the target input output relation

τ(ξ) = Σ_{m=1}^{M} v g(y_m)  with  y_m = B_m · ξ.    (5)

The response is specified in terms of the set of teacher weight vectors B_m ∈ ℝ^N (m = 1, 2, ..., M) and defines the correct target output τ(ξ) for every possible feature vector ξ. For simplicity, we will focus on settings with orthonormal teacher weight vectors and restrict the adaptive student configuration to normalized weights:

B_m · B_n = δ_mn,   w_k · w_k = 1  for all k,    (6)

with the Kronecker-Delta δ_mn = 1 if m = n and δ_mn = 0 else.

Throughout the following, the evaluation of the student network will be based on a simple quadratic error measure that compares student output and target value. Accordingly, the selection of student weights {w_k} in the training process is guided by a cost or loss function which is given by the corresponding sum over all available data in D:

E = (1/2) Σ_{μ=1}^{P} ( y(ξ^μ) − τ^μ )².    (7)

By choosing the parameters K and M, a variety of situations can be modelled. This includes the learning of unrealizable rules (M > K) and the training of over-sophisticated students with K > M. Here, we restrict ourselves to the idealized, yet non-trivial case of perfectly matching student and teacher complexity, i.e. K = M, which makes it possible to achieve y(ξ) = τ(ξ) for all input vectors.
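A minimal sketch of the matching (K = M) student teacher setup of Eqs. (4)-(7): an orthonormal teacher is drawn, a training set D is generated from i.i.d. Gaussian inputs, and the quadratic cost is evaluated for a randomly initialized, normalized student. The 1/√K output scaling and all numerical sizes (N, K, P) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
N, K, P = 200, 3, 1000          # input dim., hidden units, number of examples

def scm(W, Xi, activation):
    """Outputs of a soft committee machine for a batch Xi of shape (P, N)."""
    X = Xi @ W.T                              # local potentials, shape (P, K)
    G = erf(X / np.sqrt(2)) if activation == "erf" else np.maximum(0.0, X)
    return G.sum(axis=1) / np.sqrt(W.shape[0])   # assumed 1/sqrt(K) scaling

# teacher with orthonormal weight vectors B_m (Eq. 6), via QR decomposition
B = np.linalg.qr(rng.standard_normal((N, K)))[0].T

# training set D (Eq. 4): i.i.d. inputs, labels provided by the teacher (Eq. 5)
Xi = rng.standard_normal((P, N))
tau = scm(B, Xi, "relu")

# normalized student (Eq. 6) and quadratic cost function (Eq. 7)
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=1, keepdims=True)
E = 0.5 * np.sum((scm(W, Xi, "relu") - tau) ** 2)
print("cost E =", E)
```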

II-C Generalization error and order parameters

Throughout the following we consider feature vectors ξ^μ in the training set with uncorrelated i.i.d. random components of zero mean and unit variance. Likewise, arbitrary input vectors ξ are assumed to follow the same statistics: ⟨ξ_j⟩ = 0 and ⟨ξ_j ξ_l⟩ = δ_jl.

As a consequence of this assumption, the Central Limit Theorem applies to the local potentials x_k = w_k · ξ and y_m = B_m · ξ, which become correlated Gaussian random variables of order one. It is straightforward to work out the characteristic averages and (co-)variances:

⟨x_k⟩ = ⟨y_m⟩ = 0,   ⟨x_k x_l⟩ = w_k · w_l ≡ Q_kl,   ⟨x_k y_m⟩ = w_k · B_m ≡ R_km,   ⟨y_m y_n⟩ = B_m · B_n = δ_mn,    (8)

which fully specify the joint density of the local potentials. The so-called order parameters R_km and Q_kl for k, l, m = 1, 2, ..., K serve as macroscopic characteristics of the student configuration. The norms Q_kk = 1 are fixed according to Eq. (6), while the symmetric Q_kl = Q_lk (k ≠ l) quantify the pairwise alignments of student weight vectors. The similarity of the student weights to their counterparts in the teacher network is measured in terms of the quantities R_km. Due to the assumed normalizations, the relations |Q_kl| ≤ 1 and |R_km| ≤ 1 are obviously satisfied.
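The Gaussian statistics of the local potentials, Eq. (8), are easy to verify numerically by comparing empirical covariances of x_k and y_m over many random inputs with the order parameters Q_kl and R_km. A sketch under the same assumed conventions as above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, n_samples = 500, 3, 200_000

B = np.linalg.qr(rng.standard_normal((N, K)))[0].T        # orthonormal teacher
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=1, keepdims=True)              # normalized student

Q = W @ W.T        # student-student overlaps Q_kl (Q_kk = 1)
R = W @ B.T        # student-teacher overlaps R_km

Xi = rng.standard_normal((n_samples, N))   # i.i.d. inputs, zero mean, unit variance
x, y = Xi @ W.T, Xi @ B.T                  # local potentials of student and teacher

print("max |cov(x,x) - Q| :", np.abs(np.cov(x.T) - Q).max())
print("max |cov(x,y) - R| :", np.abs((x.T @ y) / n_samples - R).max())
```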

Now we can work out the generalization error, i.e. the expected deviation of student and teacher output for a random input vector, given specific weight configurations {w_k} and {B_m}. Note that SCM with sigmoidal activation have been treated in [saadsollashort, saadsollalong] for general K and M. Here, we resort to the special case of matching network sizes, K = M, with

ε_g = (1/2) ⟨ ( y(ξ) − τ(ξ) )² ⟩_ξ.    (9)

We note here that matching additive constants in the student and teacher activations would leave ε_g unaltered. As detailed in the Appendix, all averages in Eq. (9) can be computed analytically for both choices of the activation function in student and teacher network. Eventually, the generalization error is expressed in terms of very few macroscopic order parameters, instead of explicitly taking into account the K · N individual weights. The concept is characteristic for the statistical physics approach to systems with many degrees of freedom.
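Since the averages in Eq. (9) depend on the weights only through the order parameters, a direct Monte Carlo estimate over random inputs provides a convenient cross-check of the analytic expressions derived in the Appendix. A sketch, again with the assumed 1/√K output scaling and the representative sigmoidal g(x) = erf(x/√2):

```python
import numpy as np
from scipy.special import erf

def scm(W, Xi, activation):
    X = Xi @ W.T
    G = erf(X / np.sqrt(2)) if activation == "erf" else np.maximum(0.0, X)
    return G.sum(axis=1) / np.sqrt(W.shape[0])      # assumed 1/sqrt(K) scaling

def eps_g_mc(W, B, activation, n_samples=400_000, seed=3):
    """Monte Carlo estimate of the generalization error, Eq. (9)."""
    rng = np.random.default_rng(seed)
    Xi = rng.standard_normal((n_samples, W.shape[1]))
    return 0.5 * np.mean((scm(W, Xi, activation) - scm(B, Xi, activation)) ** 2)

rng = np.random.default_rng(4)
N, K = 300, 3
B = np.linalg.qr(rng.standard_normal((N, K)))[0].T
W = rng.standard_normal((K, N)); W /= np.linalg.norm(W, axis=1, keepdims=True)
print("erf :", eps_g_mc(W, B, "erf"))
print("ReLU:", eps_g_mc(W, B, "relu"))
```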

In the following, we restrict the analysis to student configurations which are site-symmetric with respect to the hidden units:

R_kk = R,   R_km = S (k ≠ m),   Q_kk = 1,   Q_kl = C (k ≠ l).    (10)

Obviously, the system is invariant under permutations of the hidden units, so we can restrict ourselves to one specific case with matching indices in Eq. (10). While this assumption reflects the symmetries of the student teacher scenario, it allows for the specialization of hidden units: For R = S, all student units display the same overlap with all teacher units. In specialized configurations with R ≠ S, however, each student weight vector has achieved a distinct overlap with exactly one of the teacher units. Our analysis shows that states with both positive (R > S) and negative specialization (R < S) can play a significant role in the training process.

Under the above assumption of site-symmetry (10) and applying the normalization (6), the generalization error (9), see also Appendix -B1 and -B2, becomes

a) for the sigmoidal activation (2) in student and teacher [saadsollashort]:

(11)

b) for ReLU student and teacher [straat2019]:

(12)

In both settings, perfect agreement of student and teacher with ε_g = 0 is achieved for R = 1 and S = C = 0. The scaling of the output with the fixed hidden-to-output weights in Eq. (1) results in a generalization error which is not explicitly K-dependent for uncorrelated random students: a configuration with R = S = C = 0 yields a finite, K-independent value of ε_g in the case of sigmoidal activations (a) as well as for ReLU student and teacher (b).

II-D Thermal equilibrium and the high-temperature limit

In order to analyse the expected outcome of training from a set of examples D, we follow the well-established statistical physics approach and analyse an ensemble of networks in a formal thermal equilibrium situation. In this framework, the cost function E is interpreted as the energy of the system and the density of observed network states is given by the so-called Gibbs-Boltzmann density

p({w_k}) = exp(−β E) / Z   with   Z = ∫ dμ({w_k}) exp(−β E),    (13)

where the measure dμ({w_k}) incorporates potential restrictions of the integration over all possible configurations of the student weights, for instance the normalization w_k · w_k = 1 for all k. This equilibrium density would, for example, result from a Langevin type of training dynamics

∂w/∂t = −∇_w E + η(t),

where ∇_w denotes the gradient with respect to all degrees of freedom in the student network. Here, the minimization of E is performed in the presence of a δ-correlated, zero mean noise term η(t) with

⟨η(t)⟩ = 0   and   ⟨η_i(t) η_j(t′)⟩ = 2 T δ_ij δ(t − t′),

where δ(·) denotes the Dirac delta-function. The parameter T = 1/β controls the strength of the thermal noise in the gradient-based minimization of E.
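As an illustration of such a stochastic training process, the following sketch performs noisy gradient descent (Langevin dynamics) on the cost function (7) for a ReLU student, projecting the weight vectors back to unit norm after every step to respect Eq. (6). The learning rate, the temperature, the projection step and the 1/√K output scaling are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, P = 100, 3, 2000
T, lr, steps = 0.01, 0.05, 3000          # illustrative temperature / step size

relu = lambda x: np.maximum(0.0, x)
out = lambda W, Xi: relu(Xi @ W.T).sum(axis=1) / np.sqrt(K)   # assumed scaling

B = np.linalg.qr(rng.standard_normal((N, K)))[0].T            # teacher
Xi = rng.standard_normal((P, N))
tau = out(B, Xi)

W = rng.standard_normal((K, N)); W /= np.linalg.norm(W, axis=1, keepdims=True)

for t in range(steps):
    X = Xi @ W.T                                   # local potentials (P, K)
    err = out(W, Xi) - tau                         # residuals (P,)
    # gradient of E = 0.5 * sum(err^2) w.r.t. W; ReLU derivative is a step function
    grad = ((err[:, None] * (X > 0)) / np.sqrt(K)).T @ Xi      # shape (K, N)
    # Langevin update: gradient step plus thermal noise of strength sqrt(2*lr*T)
    W += -lr * grad + np.sqrt(2 * lr * T) * rng.standard_normal(W.shape)
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # enforce normalization (6)

print("final cost:", 0.5 * np.sum((out(W, Xi) - tau) ** 2))
print("overlaps R:", np.round(W @ B.T, 2))
```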

According to the, by now, standard statistical physics approach to off-line learning [hertz, EngelvandenBroeck, revmodphys, SST], typical properties of the system are governed by the so-called quenched free energy

−β F = ⟨ ln Z ⟩_D,    (14)

where ⟨ · ⟩_D denotes the average over the random realization of the training set. In general, the evaluation of the quenched average is technically involved and requires, for instance, the application of the replica trick [hertz, EngelvandenBroeck, revmodphys]. Here, we resort to the simplifying limit of training at high temperature, β = 1/T → 0, which has proven useful in the qualitative investigation of various learning scenarios [SST]. In this limit the so-called annealed approximation ⟨ln Z⟩ ≈ ln ⟨Z⟩ [SST, revmodphys, EngelvandenBroeck] becomes exact. Moreover, we have

⟨ Z ⟩_D ∝ ∫ dμ({w_k}) exp[ −β P ε_g({w_k}) ].    (15)

Here, P is the number of statistically independent examples in D and ε_g is the generalization error of Eq. (9). As the exponent grows linearly with N, the integral is dominated by the maximum of the integrand. By means of a saddle-point integration for N → ∞ we obtain

β F / N = min { β P ε_g({R_km, Q_kl}) / N − s({R_km, Q_kl}) }.    (16)

Here, the right hand side has to be minimized with respect to its arguments, i.e. the order parameters R_km and Q_kl. In Eq. (16) we have introduced the entropy term

s = (1/N) ln V({R_km, Q_kl}).    (17)

The quantity V corresponds to the volume in weight space that is consistent with a given configuration of order parameters. Independent of the activation functions or other details of the learning problem, one obtains for large N [epl, epj]

s = (1/2) ln det Ω + const.,    (18)

where Ω is the (2K)-dimensional matrix of all pair-wise and self-overlaps of the vectors {w_k} and {B_m}, i.e. the matrix of all Q_kl, R_km and B_m · B_n, see also Eq. (24) in the Appendix. The constant term is independent of the order parameters and, hence, irrelevant for the minimization in Eq. (16). A compact derivation of (18) is provided in, e.g., [epj].
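The entropy term of Eq. (18) is straightforward to evaluate numerically from the (2K)-dimensional overlap matrix. The helper below builds this matrix from the site-symmetric order parameters R, S and C of Eq. (10) for an orthonormal teacher and returns one half of its log-determinant, dropping the constant; this is a sketch of the bookkeeping, not code from the paper.

```python
import numpy as np

def overlap_matrix(K, R, S, C):
    """(2K)-dim. matrix of all mutual overlaps, cf. Eq. (24):
    student-student block (ones on the diagonal, C off-diagonal),
    student-teacher block (R on the matching diagonal, S elsewhere),
    teacher-teacher block = identity for an orthonormal teacher."""
    Qmat = np.full((K, K), C); np.fill_diagonal(Qmat, 1.0)
    Rmat = np.full((K, K), S); np.fill_diagonal(Rmat, R)
    return np.block([[Qmat, Rmat], [Rmat.T, np.eye(K)]])

def entropy(K, R, S, C):
    """Entropy term (Eq. 18): half the log-determinant of the overlap matrix,
    up to an additive constant that does not depend on the order parameters."""
    sign, logdet = np.linalg.slogdet(overlap_matrix(K, R, S, C))
    return 0.5 * logdet if sign > 0 else -np.inf   # outside the physical region

print(entropy(K=3, R=0.9, S=0.05, C=0.1))
```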

Omitting additive constants and assuming the normalization (6) and site-symmetry (10), the entropy term reads [epl, epj]

(19)

In order to facilitate the successful adaptation of the weights in the student network, we have to assume that the number of examples P scales with the system size N. Training at high temperature additionally requires that P also grows proportionally to the temperature T = 1/β, which yields a free energy of the form

f = α ε_g(R, S, C) − s(R, S, C)   with   α ∝ β P / N.    (20)

The quantity α can be interpreted as an effective temperature parameter or, likewise, as the properly scaled training set size. The high temperature has to be compensated by a very large number of training examples in order to facilitate a non-trivial outcome. As a consequence, the energy of the system is proportional to the generalization error, which implies that training error and generalization error are effectively identical in this simplifying limit.

III Results and Discussion

In the following, we present and discuss our findings for the considered student teacher scenarios and activation functions.

In order to obtain the equilibrium states of the model for given values of K and α, we have minimized the scaled free energy (20) with respect to the site-symmetric order parameters. Potential (local) minima satisfy the necessary conditions

∂f/∂R = ∂f/∂S = ∂f/∂C = 0.    (21)

In addition, the corresponding Hesse matrix of second derivatives w.r.t. R, S and C has to be positive definite. This constitutes a sufficient condition for the presence of a local minimum in the site-symmetric order parameter space. Furthermore, we have confirmed the stability of the local minima against potential deviations from site-symmetry by inspecting the full matrix of second derivatives involving the individual quantities R_km and Q_kl.
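The minimization of Eqs. (20, 21) can be reproduced numerically once ε_g is available as a function of the site-symmetric order parameters. The sketch below estimates ε_g(R, S, C) by Monte Carlo sampling of the Gaussian local potentials (Eq. 8), which sidesteps the closed forms (11, 12), and minimizes f = α ε_g − s with scipy. The 1/√K output scaling and the precise normalization of α are assumptions made here, so the resulting numbers are only indicative of the qualitative behavior.

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import minimize

K = 2
rng = np.random.default_rng(6)

def overlap_matrix(K, R, S, C):
    Qmat = np.full((K, K), C); np.fill_diagonal(Qmat, 1.0)
    Rmat = np.full((K, K), S); np.fill_diagonal(Rmat, R)
    return np.block([[Qmat, Rmat], [Rmat.T, np.eye(K)]])

def eps_g(R, S, C, activation, n=100_000):
    """Monte Carlo estimate of eps_g from the joint Gaussian of (x_1..x_K, y_1..y_K)."""
    Cm = overlap_matrix(K, R, S, C)
    if np.linalg.slogdet(Cm)[0] <= 0:
        return np.inf                           # unphysical order parameters
    z = rng.multivariate_normal(np.zeros(2 * K), Cm, size=n)
    g = erf(z / np.sqrt(2)) if activation == "erf" else np.maximum(0.0, z)
    y_s, y_t = g[:, :K].sum(1) / np.sqrt(K), g[:, K:].sum(1) / np.sqrt(K)
    return 0.5 * np.mean((y_s - y_t) ** 2)

def free_energy(params, alpha, activation):
    R, S, C = params
    sign, logdet = np.linalg.slogdet(overlap_matrix(K, R, S, C))
    if sign <= 0:
        return np.inf
    return alpha * eps_g(R, S, C, activation) - 0.5 * logdet

# crude scan: minimize f for one value of the scaled training-set size alpha
res = minimize(free_energy, x0=[0.3, 0.2, 0.1], args=(20.0, "relu"),
               method="Nelder-Mead")
print("order parameters (R, S, C):", np.round(res.x, 3))
```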


Fig. 2: Sigmoidal activation: Learning curves of perfectly matching student teacher scenarios. The upper row of graphs shows the results for K = 2: The upper left panel displays the order parameters R and S as functions of α. At the critical value α_c, a continuous phase transition occurs, with R ≠ S for greater values of α. The upper right panel shows the corresponding learning curve, which displays a kink at the transition. In the lower row the corresponding results are shown for an example case with K ≥ 3. Order parameters are displayed in the lower left panel as functions of α. The transition is discontinuous; the characteristic values discussed in Sec. III-A are marked by the vertical dotted and dashed lines. In addition, the local minimum of the free energy with R = S is replaced by a configuration with R < S at the characteristic value α_u. The lower right panel displays the corresponding learning curve. Here, the solid line represents the specialized state, and the transition from the unspecialized configuration (dashed curve) to the state with R < S (dotted curve) is marked by the short vertical line. The dashed vertical line marks the critical value α_c at which the specialized solution becomes the global minimum of the free energy.


Fig. 3: ReLU activation: Learning curves of perfectly matching student teacher scenarios. The upper row of graphs shows the results for K = 2. The upper left panel displays the order parameters R and S as functions of α. A continuous transition occurs at a critical value α_c; the right panel shows the resulting learning curve. The lower row corresponds to an example ReLU network with K ≥ 3. The transition is also continuous. The specialized solution with R > S is represented by the solid and the dashed line in the lower left panel. The dotted line and the chain line represent the local minimum of the free energy with R < S. The corresponding generalization error is displayed in the lower right panel, where the dotted line represents the suboptimal configuration with R < S.

III-A Sigmoidal units revisited

The investigation of SCM with sigmoidal activation along the lines of the previous section has already been presented in [epl]. A corresponding model with discrete binary weights was studied in [Kang].

As argued above, the mathematical form of the generalization error, cf. Eq. (11) and Appendix -B1, and of the free energy is the same as for the sigmoidal activation considered in [epl], apart from trivial rescalings. Hence, the results of [epl] carry over without modification. The following summarizes the key findings of the previous study, which we reproduce here for comparison.

For K = 2 we observe that R = S in thermal equilibrium for small α, see the upper row of graphs in Fig. 2. Both hidden units perform essentially the same task and acquire equal overlap with both teacher vectors when trained from relatively small data sets. At a critical value α_c, the system undergoes a transition to a specialized state with R > S or R < S, in which each hidden unit aligns with one specific teacher unit. Both configurations are fully equivalent due to the invariance of the student output under the exchange of the student weights w_1 and w_2 for K = 2. The specialization process is continuous, with the degree of specialization |R − S| increasing proportionally to a power of (α − α_c) near the transition. This results in a kink in the continuous learning curve at α_c, as displayed in the upper right panel of Fig. 2.

Interestingly, a different behavior is found for all K ≥ 3. The following regimes can be distinguished:

  • α < α_s:  For small α, the only minimum of the free energy corresponds to unspecialized networks with R = S. Within this subspace, a rapid initial decrease of the generalization error with α is achieved.

  • α_s ≤ α ≤ α_c:  At the characteristic value α_s, a specialized configuration with R > S appears as a local minimum of the free energy. The unspecialized configuration with R = S corresponds to the global minimum up to α_c. At this K-dependent critical value, the free energies of the competing minima coincide.

  • α > α_c:  Above α_c, the configuration with R > S constitutes the global minimum of the free energy and, thus, the thermodynamically stable state of the system. Note that the transition from the unspecialized to the specialized configuration is associated with a discontinuous change of the order parameters and of the generalization error, cf. Fig. 2 (lower right panel). The specialized state facilitates perfect generalization in the limit α → ∞.

  • α > α_u: In addition, at another characteristic value α_u, the local minimum with R = S disappears and is replaced by a negatively specialized state with R < S. Note that the existence of this local minimum of the free energy was not reported in [epl]. The observed negative specialization increases linearly with α for α > α_u. This smooth transition does not yield a kink in the corresponding learning curve. A careful analysis of the associated Hesse matrix shows that this state of poor generalization persists for all α, indeed.

The limit K → ∞ has also been considered in [epl]: The discontinuous transition is found to occur at a finite critical value of α. Interestingly, the characteristic value α_u diverges for large K [epl]. Hence, the additional transition from R = S to R < S cannot be observed for training sets on this scale. There, the unspecialized configuration persists for all α. It displays site-symmetric order parameters with R = S, see [epl] for details. Asymptotically, for α → ∞, they approach finite limiting values, which yields a non-zero generalization error. On the contrary, the specialized configuration achieves ε_g → 0, i.e. perfect generalization, asymptotically.

The presence of a discontinuous specialization process for sigmoidal activations with K ≥ 3 suggests that – in practical training situations – the network will very likely be trapped in an unfavorable configuration unless prior knowledge about the target is available. The escape from the poorly generalizing metastable state with R = S or R < S requires considerable effort in high-dimensional weight space. Therefore, the success of training will be delayed significantly.

Fig. 4: ReLU activation: Learning curves of the perfectly matching student teacher scenario in the limit K → ∞. In this limit, the continuous transition occurs at a finite critical value α_c. In the left panel, the solid line represents the specialized solution with R > S, while the chain line marks the solution with R < S. In the former, R → 1 for large α, while in the latter the order parameters approach different limiting values. The learning curves of the competing minima of the free energy coincide, as displayed in the right panel. The generalization error approaches perfect generalization, i.e. ε_g → 0, for α → ∞.

III-B Rectified linear units

In comparison with the previously studied case of sigmoidal activations, we find a surprisingly different behavior in ReLU networks with the activation function (3).

For K = 2, our findings parallel the results for networks with sigmoidal units: The network configuration is characterized by R = S for small α and the specialization increases like

(22)

near the transition. This results in a kink in the learning curve at the critical value α_c, as displayed in Fig. 3 (upper row) for K = 2.

However, in ReLU networks the transition is also continuous for K ≥ 3. Figure 3 (lower row of graphs) displays the results for an example case with a larger number of hidden units.

The student output is invariant under the exchange of hidden unit weight vectors, consistent with an unspecialized state for small α. At a critical value α_c, the unspecialized configuration is replaced by two minima of the free energy: in the global minimum we have R > S, while the competing local minimum corresponds to configurations with R < S. The former facilitates perfect generalization with ε_g → 0 in the limit α → ∞. In both competing minima the emerging specialization follows Eq. (22) with the same critical exponent.

Fig. 5: Student cross overlap C in a perfectly matching student teacher scenario. Left panel: For sigmoidal activations, the order parameter C is negative in the specialized state and in the unspecialized configurations with R = S. The characteristic values of α are marked as in Fig. 2 (lower left panel). C remains negative in the specialized state (solid line) for all α, while it becomes positive where the configuration with R = S (dashed line) is replaced by a state with R < S (dotted line). For better visibility of the behavior near the transition, only a small range of values is shown. Right panel: In the ReLU system, C becomes positive before the continuous transition occurs, it reaches a maximum and approaches zero from above for large α in the specialized configuration with R > S (solid line). In the local minimum of the free energy with R < S, C becomes negative for large α, as marked by the dotted line.

In contrast to the case of sigmoidal activation, both competing configurations of the ReLU system display very similar generalization behavior. While, in general, only states with positive specialization R > S can perfectly reproduce the teacher output, the student configurations with R < S also achieve relatively low generalization error for large α, see Fig. 3 (lower row) for an example.

The limiting case of large networks with K → ∞ can be considered explicitly. We find for large ReLU networks that the continuous specialization transition occurs at a finite critical value α_c.

The generalization error decreases very rapidly, i.e. instantaneously on the α-scale, from its initial value to a constant plateau-like state in which the system remains unspecialized. For α > α_c, the diagonal order parameter R either increases or decreases with α in the two competing branches, approaching its respective limiting value asymptotically.

Surprisingly, both solutions display the exact same generalization error, see Fig. 4 (right panel). Consequently, the free energies of the competing minima also coincide in the limit K → ∞, since the entropy (17) also coincides for the two configurations in this limit.

In the configuration with R < S the order parameters display the scaling behavior

(23)

for large K. In Appendix -D we show how a single teacher ReLU unit can be approximated by K − 1 weakly aligned student units in combination with one anti-correlated student node. While the former effectively approximate a linear response proportional to the teacher's local potential, the anti-correlated unit implements the rectified response of the sign-inverted potential. Since max(0, y) = y + max(0, −y), the student can approximate the teacher output very well, see also the appendix for details. In the limit K → ∞, the correspondence becomes exact and facilitates perfect generalization for α → ∞.

Note that a similar argument does not hold for student teacher scenarios with sigmoidal activation functions which do not display the partial linearity of the ReLU.

III-C Student-student overlaps

It is also instructive to inspect the behavior of the order parameter C, which quantifies the mutual overlap of the student weight vectors. In the ReLU system with large but finite K, we observe C > 0 already before the transition. It reaches a maximum value at the phase transition and decreases with increasing α. In the positively specialized configuration it approaches the limiting value C = 0 from above, while it assumes small negative values in the configuration with R < S.

This is in contrast to networks of sigmoidal units, where C < 0 before the discontinuous transition and in the specialized state, see [epl, epj] for details. Interestingly, the characteristic value α_u coincides with the point where C becomes positive in the suboptimal local minimum of the free energy.

Figure 5 displays C for sigmoidal (left panel) and ReLU activation (right panel) for an example value of K. Apparently the ReLU system tends to favor correlated hidden units in most of the training process.

III-D Practical relevance

It is important to realize that a quantitative comparison of the two scenarios, for instance w.r.t. the critical values of α, is not sensible. The complexities of sigmoidal and ReLU networks with the same number K of hidden units do not necessarily correspond to each other. Moreover, the actual α-scale is trivially related to a potential re-scaling of the activation functions.

However, our results provide valuable qualitative insight: The continuous nature of the transition suggests that ReLU systems should display favorable training behavior in comparison to systems of sigmoidal units. In particular, the suboptimal competing state displays very good performance, comparable to that of the properly specialized configuration. Their generalization abilities even coincide in large networks of many hidden units.

On the contrary, the achievement of good generalization in networks of sigmoidal units will be delayed significantly due to the discontinuous specialization transition which involves a poorly generalizing metastable state.

IV Conclusion and Outlook

We have investigated the training of shallow, layered neural networks in student teacher scenarios of matching complexity. Large, adaptive networks have been studied by employing modelling concepts and analytical tools borrowed from the statistical physics of learning. Specifically, stochastic training processes at high formal temperature were studied and learning curves were obtained for two popular types of hidden unit activation.

To the best of our knowledge, this work constitutes the first theoretical, model-based comparison of sigmoidal hidden unit activations and rectified linear units in feed-forward neural networks.

Our results confirm that networks with sigmoidal hidden units undergo a discontinuous transition: A critical training set size is required to facilitate the differentiation, i.e. specialization of hidden units. However, a poorly performing state of the network persists as a locally stable configuration for all sizes of the training set. The presence of such an unfavorable local minimum will delay successful learning in practice, unless prior knowledge of the target rule allows for non-zero initial specialization.

On the contrary, the specialization transition is always continuous in ReLU networks. We show that above a weakly K-dependent critical value of the re-scaled training set size α, two competing specialized configurations can be assumed. Only one of them displays positive specialization and facilitates perfect generalization from large training sets for finite K. However, the competing configuration with negative specialization realizes similar performance, which is nearly identical for networks with many hidden units and coincides exactly in the limit K → ∞.

As a consequence, the problem of retarded learning associated with the existence of metastable configurations is expected to be much less pronounced in ReLU networks than in their counterparts with sigmoidal activation.

Clearly, our approach is subject to several limitations which will be addressed in future studies.

Probably the most straightforward, relevant extension of our work would be the consideration of further activation functions, for instance modifications of the ReLU such as the leaky or noisy ReLU or alternatives like swish and max-out [timetoswish, searching].

Within the site-symmetric space of configurations, cf. Eq. (10), only the specialization of single units with respect to one of the teacher units can be considered. In large networks, one would expect partially specialized states, where subsets of hidden units achieve different alignment with specific teacher units. Their study requires the extension of the analysis beyond the assumption of site-symmetry.

Training at low formal temperatures can be studied along the lines of [epj], where the replica formalism was already applied to networks with sigmoidal activation. Alternatively, the simpler annealed approximation could be used [EngelvandenBroeck, SST, revmodphys]. Both approaches make it possible to vary the temperature of the training process and the scaled example set size independently, as is the case in more realistic settings. Note that the findings reported in [epj] for sigmoidal activation displayed excellent qualitative agreement with the results of the much simpler high-temperature analysis in [epl].

The dynamics of non-equilibrium on-line training by gradient descent has been studied extensively for soft-committee-machines with sigmoidal activation, e.g. [saadsollashort, saadsollalong, woehlertransient, nestorplateaus]. There, quasi-stationary plateau states in the learning dynamics are the counterparts of the phase transitions observed in thermal equilibrium situations. First results for ReLU networks have been obtained recently [straat2019]. These studies should be extended in order to identify and understand the influence of the activation function on the training dynamics in greater detail.

Model scenarios with mismatched student and teacher complexity will provide further insight into the role of the activation function for the learnability of a given task. It should be interesting to investigate specialization transitions in practically relevant settings in which either the task is unlearnable (K < M) or the student architecture is over-sophisticated for the problem at hand (K > M). In addition, student and teacher systems with mismatched activation functions should constitute interesting model systems.

The complexity of the considered networks can be increased in various directions. If the simple shallow architecture of Eq. (1) is extended by local thresholds and hidden to output weights that are both adaptive, it parameterizes a universal approximator, see e.g. [cybenko, hornik, hanin]. Decoupling the selection of these few additional parameters from the training of the input to hidden weights should be possible following the ideas presented in [adiabatic].

Ultimately, deep layered architectures should be investigated along the same lines. As a starting point, simplifying tree-like architectures could be considered as in e.g. [retarded, relucapacity].

Our modelling approach and theoretical analysis go beyond the empirical investigation of data-set-specific performance. The suggested extensions bear the promise to contribute to a better, fundamental understanding of layered neural networks and their training behavior.

-A Co-variance matrix and order parameters

The (2K)-dim. matrix of order parameters reads

Ω = [ Q   R
      Rᵀ  T ],    (24)

with elements Q_kl = w_k · w_l, R_km = w_k · B_m and T_mn = B_m · B_n. Note that Eqs. (18) and (19) correspond to the special case K = M with orthonormal teacher vectors, T_mn = δ_mn, and exploit site-symmetry (10) and normalization (6).

-B Derivation of the generalization error

Here we give a derivation of the generalization error in terms of the order parameters for sigmoidal and ReLU student and teacher. For general K and M it reads

ε_g = (1/2) [ Σ_{k,l} v_k v_l ⟨ g(x_k) g(x_l) ⟩ + Σ_{m,n} u_m u_n ⟨ g(y_m) g(y_n) ⟩ − 2 Σ_{k,m} v_k u_m ⟨ g(x_k) g(y_m) ⟩ ],    (25)

where v_k and u_m denote the fixed hidden-to-output weights of student and teacher, respectively; this reduces to Eq. (9) for K = M. To obtain ε_g for a particular choice of the activation function g, expectation values of the form ⟨ g(z_1) g(z_2) ⟩ have to be evaluated over the joint normal density of two local potentials z_1 and z_2, i.e. a two-dimensional zero-mean Gaussian with the appropriate submatrix of Ω as its covariance matrix, cf. Eq. (24).

-B1 Sigmoidal

For student and teacher with sigmoidal activation functions of the form (2), the generalization error has been derived in [saadsollalong] and is given by:

(26)

-B2 ReLU

For student and teacher with ReLU activations g(x) = max(0, x), applying the elegant formulation used in [yoshida_statistical_2017] gives an analytic expression for the two-dimensional integrals:

⟨ g(z_1) g(z_2) ⟩ = [ √( ⟨z_1²⟩ ⟨z_2²⟩ − ⟨z_1 z_2⟩² ) + ⟨z_1 z_2⟩ ( π/2 + arcsin( ⟨z_1 z_2⟩ / √( ⟨z_1²⟩ ⟨z_2²⟩ ) ) ) ] / (2π).    (27)

Substituting the result from Eq. (27) in Eq. (25) for the corresponding covariance matrices gives the analytic expression for the generalization error in terms of the order parameters:

(28)

For K = M, orthonormal teacher vectors with B_m · B_n = δ_mn, fixed student norms w_k · w_k = 1, and assuming site symmetry, Eq. (10), we obtain Eqs. (11) and (12), respectively.
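The two-point Gaussian averages entering the generalization error can be checked numerically. The sketch below compares the closed form for the sigmoidal g(x) = erf(x/√2) used in [saadsollashort, saadsollalong], ⟨g(z_1)g(z_2)⟩ = (2/π) arcsin( ⟨z_1 z_2⟩ / √((1+⟨z_1²⟩)(1+⟨z_2²⟩)) ), and the ReLU expression of Eq. (27) against Monte Carlo estimates over correlated Gaussian local potentials; the paper's specific sigmoidal choice (2) may differ from this erf by trivial rescalings.

```python
import numpy as np
from scipy.special import erf

def avg_erf(c11, c22, c12):
    """< g(z1) g(z2) > for g(x) = erf(x/sqrt(2)), zero-mean Gaussian (z1, z2)."""
    return (2 / np.pi) * np.arcsin(c12 / np.sqrt((1 + c11) * (1 + c22)))

def avg_relu(c11, c22, c12):
    """< g(z1) g(z2) > for g(x) = max(0, x), cf. Eq. (27)."""
    rho = c12 / np.sqrt(c11 * c22)
    return (np.sqrt(c11 * c22 - c12 ** 2)
            + c12 * (np.pi / 2 + np.arcsin(rho))) / (2 * np.pi)

# Monte Carlo cross-check for an arbitrary 2x2 covariance matrix
rng = np.random.default_rng(7)
c11, c22, c12 = 1.0, 1.0, 0.4
z = rng.multivariate_normal([0, 0], [[c11, c12], [c12, c22]], size=2_000_000)

g_erf, g_relu = erf(z / np.sqrt(2)), np.maximum(0.0, z)
print("erf : analytic", avg_erf(c11, c22, c12),
      " MC", np.mean(g_erf[:, 0] * g_erf[:, 1]))
print("ReLU: analytic", avg_relu(c11, c22, c12),
      " MC", np.mean(g_relu[:, 0] * g_relu[:, 1]))
```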

-C Single unit student and teacher

In the simple case K = M = 1, with a single unit as student and teacher network, we have to consider only one order parameter, R = w · B. Assuming normalized weights, we obtain the free energy f = α ε_g(R) − s(R) with

(29)
(30)
(31)

The necessary condition ∂f/∂R = 0 becomes

(32)
(33)

In both cases, the student teacher overlap R increases smoothly from zero to one. A Taylor expansion of the free energy for R → 1 yields the asymptotic behavior 1 − R ∝ 1/α for α → ∞, for both types of activation. This basic large-α behavior with ε_g ∝ 1/α carries over to student teacher scenarios with general K in configurations with positive specialization.

-D Weak and negative alignment

Here we consider a particular teacher unit which realizes a ReLU response max(0, y) with the local potential y = B · ξ.

A set of K hidden units in the student network can obviously reproduce this response by aligning one of the units perfectly with the teacher, e.g. w_1 = B, and choosing w_k · B = 0 for all k > 1. Similarly, we obtain for an anti-aligned unit with w = −B that its response is max(0, −y).

Now consider the mean response of a student unit with small positive overlap R = w · B, given the teacher unit response, i.e. given the value of y. It corresponds to the average over the conditional density of the student local potential x given y. One obtains

⟨ max(0, x) | y ⟩ ≈ 1/√(2π) + R y / 2

by means of a Taylor expansion for small R. As a special case, the mean response of an orthogonal unit with R = 0 is independent of y.

It is straightforward to work out the conditional average of the total student response for a particular order parameter configuration with K − 1 weakly and positively aligned units and one anti-correlated unit. Apart from prefactors and an additive constant, it is given by

y + max(0, −y),

where the right hand side coincides with the teacher response, since max(0, y) = y + max(0, −y). Hence, the average response agrees with the teacher output for large K. Moreover, the correspondence becomes exact in the limit K → ∞, which facilitates perfect generalization in the negatively specialized state with R < S discussed in Sec. III.
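Both ingredients of this argument are easy to verify numerically: the identity max(0, y) = y + max(0, −y), and the approximately linear conditional mean response of a weakly aligned ReLU unit. A small sketch, with the correlation R and the binning chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)

# (1) partial linearity of the ReLU: max(0, y) = y + max(0, -y)
y = rng.standard_normal(10)
assert np.allclose(np.maximum(0, y), y + np.maximum(0, -y))

# (2) conditional mean of a weakly aligned unit:
#     (x, y) jointly Gaussian, unit variances, small correlation R;
#     E[ max(0, x) | y ] ~ 1/sqrt(2*pi) + R*y/2  for small R
R, n = 0.05, 2_000_000
y = rng.standard_normal(n)
x = R * y + np.sqrt(1 - R ** 2) * rng.standard_normal(n)   # correlation R with y

# compare the Monte Carlo conditional mean in narrow bins of y with the expansion
bins = np.linspace(-2, 2, 9)
idx = np.digitize(y, bins)
for b in range(1, len(bins)):
    yb = 0.5 * (bins[b - 1] + bins[b])
    mc = np.maximum(0, x[idx == b]).mean()
    approx = 1 / np.sqrt(2 * np.pi) + R * yb / 2
    print(f"y ~ {yb:+.2f}:  MC {mc:.4f}   small-R expansion {approx:.4f}")
```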

References