On the Optimal Memorization Power of ReLU Neural Networks

10/07/2021 ∙ by Gal Vardi, et al.

We study the memorization power of feedforward ReLU neural networks. We show that such networks can memorize any N points that satisfy a mild separability assumption using Õ(√(N)) parameters. Known VC-dimension upper bounds imply that memorizing N samples requires Ω(√(N)) parameters, and hence our construction is optimal up to logarithmic factors. We also give a generalized construction for networks with depth bounded by 1 ≤ L ≤√(N), for memorizing N samples using Õ(N/L) parameters. This bound is also optimal up to logarithmic factors. Our construction uses weights with large bit complexity. We prove that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.


1 Introduction

The expressive power of neural networks has been widely studied in many previous works. These works study different aspects of expressiveness, such as the universal approximation property [11, 21] and the benefits of depth in neural networks [33, 15, 28, 12, 10]. Another central and well-studied question concerns their memorization power.

The problem of memorization in neural networks can be viewed in the following way: For every dataset of N labeled samples {(x_i, y_i)}_{i=1}^N, construct a network F such that F(x_i) = y_i for every i ∈ [N]. Many works have shown results regarding the memorization power of neural networks, using different assumptions on the activation function and the data samples (see e.g. [19, 18, 4, 37, 13, 14, 8, 26, 17, 39, 40, 25, 27, 32]). The question of memorization also has practical implications for phenomena such as "double descent" [5, 24], which connects the memorization power of neural networks with their generalization capabilities.

A trivial lower bound on the required size of the network for memorizing N labeled points is implied by the VC dimension of the network (cf. [31]). That is, if a network of a certain size cannot shatter any specific set of N points, then it certainly cannot memorize all sets of N points. Known VC-dimension bounds for networks with W parameters are on the order of W² [16, 2, 3]. Hence, it follows that memorizing N samples requires at least Ω(√N) parameters. The best known upper bound is given in [26], where it is shown that memorizing N data samples can be done using a neural network with Õ(N^{2/3}) parameters. Thus, there is a clear gap between the lower and upper bounds, although we note that the upper bound is for memorization of any set of N data samples, while the lower bound is for shattering a single set of N data samples. In this paper we ask the following questions:

What is the minimal number of parameters that are required to memorize N labeled data samples? Is the task of memorizing any set of N data samples more difficult than shattering a single set of N samples?

We answer these questions by providing a construction of a ReLU feedforward neural network which achieves the lower bound up to logarithmic factors. In this construction we use a very deep neural network, but with a constant width of 12. In more detail, our main result is the following:

Theorem 1.1 (informal statement).

Let {(x_i, y_i)}_{i=1}^N ⊆ ℝ^d × [C] be a set of N labeled samples of a constant dimension d, with ‖x_i‖ ≤ r for every i ∈ [N] and ‖x_i − x_j‖ ≥ δ for every i ≠ j. Then, there exists a ReLU neural network F with width 12, depth Õ(√N), and Õ(√N) parameters, such that F(x_i) = y_i for every i ∈ [N], where the Õ notation hides logarithmic factors in N, r, 1/δ and C.

Comparing this result to the known VC bounds, we show that, up to logarithmic factors, our construction is optimal. This also shows, quite surprisingly, that up to logarithmic factors, the task of shattering a single set of N points is not more difficult than memorizing any set of N points, under the mild separability assumption on the data samples. We note that this result can also be extended to regression tasks (see Remark 3.3).

In our construction, the depth of the network is Õ(√N). We also give a generalized construction where the depth of the network is limited to some 1 ≤ L ≤ √N. In this case, the number of parameters in our construction is Õ(N/L) (see Thm. 5.1). We compare this result to the VC-dimension bound from [3], and show that our construction is optimal up to logarithmic factors.

Our construction uses a bit extraction technique, inspired by Telgarsky's triangle function [33] and by [28]. Using this technique, we are able to use weights with bit complexity Õ(√N), and deep neural networks to "extract" the bits of information from the specially crafted weights of the network. We also generalize our results to the case of having a bounded bit-complexity restriction on the weights. We show both lower (Thm. 6.1) and upper (Thm. 6.2) bounds, proving that memorizing N points using a network with Õ(N^ε) parameters, for some ε ∈ [1/2, 1], can be done if the bit complexity of each weight is Õ(N^{1−ε}). Hence, our construction is also optimal, up to logarithmic factors, w.r.t. the bit complexity of the network. We emphasize that also in previous works showing non-trivial VC bounds (e.g. [2, 3]), weights with large bit complexity are used. We note that increasing the bit complexity beyond Õ(√N) cannot be used to further reduce the number of parameters (see the discussion in Sec. 4).

Related work

Memorization – upper bounds.

The problem of memorizing arbitrary data points with neural networks has a rich history. [4] studied memorization in single-hidden-layer neural networks with the threshold activation, and showed that ⌈N/d⌉ neurons suffice to memorize N arbitrary points in general position in ℝ^d with binary labels. [8] extended the construction of [4] and showed that single-hidden-layer ReLU networks with O(N/d) hidden neurons can memorize N points in general position with arbitrary real labels. In [20] and [30] it is shown that single-hidden-layer networks with the threshold activation can memorize any arbitrary set of N points, even if they are not in general position, using O(N) neurons. [19] proved a similar result for any bounded non-linear activation function σ for which either lim_{z→−∞} σ(z) or lim_{z→∞} σ(z) exists. [40] proved that single-hidden-layer ReLU networks can memorize N arbitrary points in ℝ^d with arbitrary real labels using N neurons and O(N + d) parameters.

[18] showed that two-hidden-layer networks with the sigmoid activation can memorize N points with O(√N) neurons, but the number of parameters is still linear in N. [39] proved a similar result for ReLU (and hard-tanh) networks. [37] showed that threshold and ReLU networks can memorize N binary-labeled unit vectors in ℝ^d separated by a distance of δ, with a number of neurons and parameters that depends exponentially on the inverse of the separation δ. [27] improved the dependence on δ exponentially. This result holds only for threshold networks, but does not assume that the inputs are on the unit sphere.

The memorization power of more specific architectures was also studied. [17] proved that residual ReLU networks with Õ(N) neurons can memorize N points on the unit sphere separated by a constant distance. [25] considered convolutional neural networks and showed, under certain assumptions, memorization using Õ(N) neurons.

Note that in all the results mentioned above the number of parameters is at least linear in N. Our work is inspired by [26], which established a first memorization result with a sub-linear number of parameters. They showed that neural networks with sigmoidal or ReLU activations can memorize N points in ℝ^d separated by a normalized distance of δ, using Õ(N^{2/3}) parameters (where the dimension d is constant). Thus, in this work we improve the dependence on N from N^{2/3} to √N (up to logarithmic factors), which is optimal. We also note that the first stage in our construction is similar to the first stage in theirs.

Finally, optimization aspects of memorization were studied in [8, 13, 14].

Memorization – lower bounds.

An Ω(N) lower bound on the number of parameters required for memorizing arbitrary sets of N points using neural networks with standard activations (e.g., threshold, sigmoid and ReLU) is given in [32]. Thus, for networks with o(N) parameters, there is a set of size N that cannot be shattered. It implies that in order to obtain memorization with a sub-linear number of parameters, some assumptions on the data are required. Our positive result circumvents this lower bound by assuming that the data is separated.

Moreover, lower bounds on the number of parameters required for memorization are implied by bounds on the VC dimension of neural networks. Indeed, if W parameters are not sufficient for shattering even a single set of size N, then they are clearly not sufficient for memorizing all sets of size N. The VC dimension of neural networks has been extensively studied in recent decades (cf. [1, 3]). The most relevant results for our work are by [16] and [3]. We discuss these results and their implications in Sections 4 and 5.

Trade-offs between the number of parameters of the network and the Lipschitz parameter of the prediction function in memorizing a given dataset are studied in [9, 7].

The benefits of depth.

In this work we show that deep networks have significantly more memorization power than shallow ones. Quite a few theoretical works in recent years have explored the beneficial effect of depth on increasing the expressiveness of neural networks (e.g., [23, 15, 33, 22, 12, 28, 38, 29, 10, 34, 6, 36, 35]). The benefits of depth in the context of the VC dimension are implied by, e.g., [3]. Finally, [26] already demonstrated that deep networks have more memorization power than shallow ones, albeit with a weaker bound than ours.

2 Preliminaries

Notations

For x ∈ ℕ and i ≤ j we denote by bin(x)_{i→j} the string of bits in places i until j inclusive, in the binary representation of x (counting from the most significant bit), and treat it as an integer (in binary basis). For example, if x = 45, whose binary representation is 101101, then bin(x)_{2→4} = 011, which we treat as the integer 3. For x ∈ ℕ, we denote by bits(x) the number of bits in its binary representation. We denote by log the logarithm in base 2. For a function f and k ∈ ℕ we denote by f^k = f ∘ ⋯ ∘ f the composition of f with itself k times. We denote vectors in bold face. We use the Õ(·) notation to hide logarithmic factors, and use O(·) to hide constant factors. For n ∈ ℕ we denote [n] = {1, …, n}. We say that a hypothesis class H shatters the points x_1, …, x_N if for every (y_1, …, y_N) ∈ {0, 1}^N there is h ∈ H s.t. h(x_i) = y_i for every i ∈ [N].
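To make the bit-slice notation concrete, here is a minimal Python sketch mirroring bits(x) and bin(x)_{i→j}; the function names and the most-significant-bit-first indexing are illustrative conventions and not fixed by the paper:

  def bits_of(x):
      # number of bits in the binary representation of x
      return x.bit_length()

  def bin_slice(x, i, j):
      # the bits of x in places i..j (inclusive), counted from the most
      # significant bit starting at 1, read as an integer in base 2
      b = format(x, "b")
      return int(b[i - 1:j], 2)

  # example matching the text: x = 45 = 101101 in binary
  assert bits_of(45) == 6
  assert bin_slice(45, 2, 4) == 3   # bits "011"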

Neural Networks

We denote by σ(x) = max{0, x} the ReLU function. In this paper we only consider neural networks with the ReLU activation.

Let d be the data input dimension. We define a neural network F of depth T as F(x) = f_T(x), where f_t is computed recursively by

  • f_0(x) = x for x ∈ ℝ^d (so d_0 = d)

  • f_t(x) = σ(W_t · f_{t−1}(x) + b_t) for W_t ∈ ℝ^{d_t × d_{t−1}}, b_t ∈ ℝ^{d_t}, for 1 ≤ t ≤ T − 1

  • f_T(x) = W_T · f_{T−1}(x) + b_T for W_T ∈ ℝ^{d_T × d_{T−1}}, b_T ∈ ℝ^{d_T}

The width of the network is max_{1 ≤ t ≤ T} d_t. We define the number of parameters of the network as the number of weights of F which are non-zero. It corresponds to the number of edges when we view the neural network as a directed acyclic graph. We note that this definition is standard in the literature on VC dimension bounds for neural networks (cf. [3]).
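As a small sanity check of this parameter-counting convention, the following Python sketch evaluates a ReLU network given as a list of (weight matrix, bias) pairs and counts its non-zero weights; treating biases as edges from a constant neuron is an assumption made only for the illustration, and the layer sizes are arbitrary:

  import numpy as np

  def forward(x, layers):
      # layers: list of (W, b); ReLU after every layer except the last
      for t, (W, b) in enumerate(layers):
          x = W @ x + b
          if t < len(layers) - 1:
              x = np.maximum(x, 0.0)
      return x

  def num_parameters(layers):
      # number of non-zero weights and biases (edges of the computation graph)
      return sum(int(np.count_nonzero(W)) + int(np.count_nonzero(b))
                 for W, b in layers)

  rng = np.random.default_rng(0)
  layers = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
            (rng.standard_normal((1, 4)), rng.standard_normal(1))]
  print(forward(np.ones(3), layers), num_parameters(layers))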

Bit Complexity

We refer to the bit complexity of a weight as the number of bits required to represent the weight. Throughout the paper we use only weights which have a finite bit complexity. Specifically, for a weight of the form w = s · 2^{−t} with s ∈ ℤ and t ∈ ℕ, its bit complexity is the number of bits needed to represent s and t. The bit complexity of the network is the maximal bit complexity of its weights.
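A rough numerical illustration of this convention, under the dyadic-weight assumption above (the exact accounting, e.g. the extra sign bit, is an illustrative choice):

  def bit_complexity(s, t):
      # bits needed to write the weight w = s / 2**t: the bits of |s|,
      # the bits of t, and one sign bit
      return abs(s).bit_length() + t.bit_length() + 1

  # e.g. w = 13 / 2**5 = 0.40625 needs 4 + 3 + 1 = 8 bits
  print(bit_complexity(13, 5))  # 8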

Input Data Samples

In this work we assume that we are given N labeled data samples {(x_i, y_i)}_{i=1}^N ⊆ ℝ^d × [C], and our goal is to construct a network F such that F(x_i) = y_i for every i ∈ [N]. We will assume that there is a separation parameter δ > 0 such that for every i ≠ j we have ‖x_i − x_j‖ ≥ δ. We will also assume that the data samples have a bounded norm, i.e. there is r > 0 such that ‖x_i‖ ≤ r for every i ∈ [N]. We note that in [32] it is shown that sets of size N cannot be memorized using neural networks with standard activations (including ReLU) unless there are Ω(N) parameters. Hence, to show a memorization result with o(N) parameters we must have some assumption on the data samples.

Note that any neural network that does not ignore input neurons must have at least d parameters. Existing memorization results (e.g., [26]) assume that d is constant. As we discuss in Remark 3.2, in our work we assume that d is at most O(√N).

3 Memorization Using Õ(√N) parameters

In this section we prove that given a finite set of N labeled data points, there is a network with Õ(√N) parameters which memorizes them. Formally, we have the following:

Theorem 3.1.

Let r, δ > 0 and C ∈ ℕ, and let {(x_i, y_i)}_{i=1}^N ⊆ ℝ^d × [C] be a set of N labeled samples with ‖x_i‖ ≤ r for every i ∈ [N] and ‖x_i − x_j‖ ≥ δ for every i ≠ j. Denote R = r/δ. Then, there exists a ReLU neural network F with width 12 and depth O(√(N · log(NRC))) such that F(x_i) = y_i for every i ∈ [N].

From the above theorem, we get that the total number of parameters is O(√(N · log(NRC)) + d). For constant d (see Remark 3.2 below) we get the bound presented in Thm. 1.1. Note that except for the number of data points N and the dimension d, the number of parameters of the network depends only logarithmically on all the other parameters of the problem (namely, on r, 1/δ and C). We also note that the construction can be improved by some constant factors, but we preferred simplicity over minor improvements. Specifically, we hypothesize that it is possible to give a construction using a smaller constant width (instead of 12), similar to the result in [26]. To simplify the terms in the theorem, we assume that r ≥ 1, otherwise (i.e., if all data points have a norm smaller than 1) we just fix r = 1 and get the same result. In the same manner, we assume that δ ≤ 1, otherwise we just fix δ = 1.

Remark 3.2 (Dependence on d).

Note that any neural network that does not ignore input neurons must have at least d parameters. In our construction the first layer consists of a single neuron (i.e. width 1), which means that the number of parameters in the first layer is d + 1. Hence, this dependence on d is unavoidable. Previous works (e.g., [26]) assumed that d is constant. In our work, to achieve the bound of Õ(√N) parameters we can assume that either d is constant or it may depend on N with d = O(√N).

Remark 3.3 (From classification to regression).

Although the theorem considers multi-class classification, it is possible to use the method suggested in [26] to extend the result to regression. Namely, if the output is in some bounded interval, then we can partition it into smaller intervals of length ε each. We define a set of output classes with C = O(1/ε) classes, such that each class corresponds to an interval. Now, we can use Thm. 3.1 to get an approximation of the output up to accuracy ε, while the number of parameters depends at most logarithmically on 1/ε.
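A minimal sketch of this reduction, assuming targets in [0, 1] and accuracy ε; the helper names and the rounding convention are illustrative:

  import math

  def to_class(y, eps):
      # map a target y in [0, 1] to one of C = ceil(1/eps) classes,
      # where class c covers the interval [(c-1)*eps, c*eps)
      C = math.ceil(1.0 / eps)
      return min(C, int(y / eps) + 1)

  def to_value(c, eps):
      # map a class back to the midpoint of its interval
      return (c - 0.5) * eps

  eps = 0.01
  y = 0.4237
  c = to_class(y, eps)
  assert abs(to_value(c, eps) - y) <= eps   # approximation error at most eps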

3.1 Proof Intuition

Below we describe the main ideas of the proof. The full proof can be found in Appendix A. The proof is divided into three stages, where at each stage we construct a network F_i for i = 1, 2, 3, and the final network is F = F_3 ∘ F_2 ∘ F_1. Each subnetwork has constant width, but the depth varies for each such subnetwork.

Stage I: We project the data points from ℝ^d to ℝ. This stage is similar to the one used in the proof of the main result from [26]. We use a network with 2 layers for the projection. With the correct scaling, the output of this network on x_1, …, x_N are points z_1, …, z_N with z_i ∈ [0, R′] for every i ∈ [N], and |z_i − z_j| ≥ 2 for every i ≠ j, where R′ is polynomial in N, d and R.
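The following NumPy sketch illustrates the effect of Stage I; it is not the ReLU construction itself, and the random direction, the retry loop and the target separation of 2 are illustrative choices (Lemma A.2 in the appendix achieves this with a width-1, depth-2 network and bounded bit complexity):

  import numpy as np

  def project_1d(X, target_sep=2.0, seed=0):
      # X: (N, d) array of distinct points.  Draw a random unit direction,
      # project, then rescale so every pairwise gap is at least target_sep
      # and shift so the smallest value is 0.
      rng = np.random.default_rng(seed)
      N, d = X.shape
      while True:
          u = rng.standard_normal(d)
          u /= np.linalg.norm(u)
          z = X @ u
          gaps = np.abs(z[:, None] - z[None, :]) + np.eye(N)  # mask the diagonal
          if gaps.min() > 0:
              break
      z = z * (target_sep / gaps.min())
      return z - z.min()

  X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
  z = project_1d(X)   # pairwise distances in z are all at least 2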

Stage II: Following the previous stage, our goal is now to memorize {(z_i, y_i)}_{i=1}^N, where the z_i's are separated by a distance of at least 2. We can also reorder the indices to assume w.l.o.g. that z_1 < z_2 < ⋯ < z_N. We split the data points into K intervals, each containing N/K data points, where K is on the order of √N up to logarithmic factors. We also construct crafted integers w_1, …, w_K and u_1, …, u_K in the following way: Note that by the previous stage, if we round z_i to the nearest integer, it can be represented using b = O(log R′) bits. Also, each label y_i can be represented by at most c = ⌈log C⌉ bits. Suppose that z_i is the k-th data point in the j-th interval (where j ∈ [K] and k ∈ [N/K]), then we define w_j and u_j such that

bin(w_j)_{(k−1)·b + 1 → k·b} = round(z_i)
bin(u_j)_{(k−1)·c + 1 → k·c} = y_i

That is, for each interval j, the number w_j has b bits which represent the integral value of the k-th data point in this interval. In the same manner, u_j has c bits which represent the label of the k-th data point in this interval. This is true for all the data points in the j-th interval, hence w_j and u_j are represented with (N/K)·b and (N/K)·c bits respectively.

We construct a network F_2 such that F_2(z_i) = (z_i, w_j, u_j) for each i ∈ [N], where the i-th data point is in the j-th interval.

Stage III: In this stage we construct a network which uses a bit extraction technique, adapting Telgarsky's triangle function [33], to extract the relevant information out of the crafted integers from the previous stage. In more detail, given the input (z_i, w_j, u_j), we sequentially extract from w_j the bits in places (k−1)·b + 1 until k·b for k = 1, …, N/K, and check whether z_i is at distance at most 1 from the integer represented by those bits. If it is, then we extract the bits in places (k−1)·c + 1 until k·c from u_j, and output the integer represented by those bits. By the previous stage, for each i we know that w_j includes the encoding of the rounded z_i, and since |z_i − z_{i′}| ≥ 2 for every i′ ≠ i (by the first stage), there is exactly one such k. Hence, the output of this network is the correct label y_i for each x_i.
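The next plain-Python sketch shows the information flow of Stages II and III: pack the rounded positions and labels of one interval into two integers, then recover a label by scanning fixed-width bit blocks. It uses ordinary integer arithmetic rather than a ReLU implementation of bit extraction, and the block widths (playing the role of b and c above) and the MSB-first packing order are illustrative:

  def encode_interval(points, labels, pos_bits, lab_bits):
      # pack the rounded positions and the labels of one interval into two
      # integers w and u, one fixed-size bit block per data point
      w, u = 0, 0
      for z, y in zip(points, labels):
          w = (w << pos_bits) | round(z)
          u = (u << lab_bits) | y
      return w, u

  def lookup(z, w, u, n, pos_bits, lab_bits):
      # Stage III, arithmetically: scan the n position blocks of w; when a
      # stored position is within distance 1 of z, return the label stored
      # in the corresponding block of u
      for k in range(n):
          stored = (w >> ((n - 1 - k) * pos_bits)) & ((1 << pos_bits) - 1)
          if abs(z - stored) <= 1:
              return (u >> ((n - 1 - k) * lab_bits)) & ((1 << lab_bits) - 1)
      return 0

  points = [4.2, 9.8, 17.1]     # projected points, separated by at least 2
  labels = [3, 1, 2]
  w, u = encode_interval(points, labels, pos_bits=8, lab_bits=4)
  assert lookup(9.8, w, u, 3, 8, 4) == 1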

The construction is inspired by the works of [3, 26]. Specifically, the first stage uses similar results from [26] on projection onto a one-dimensional space. The main difference from previous constructions is in stages II and III, where in our construction we encode the input data points in the weights of the network. This allows us to reduce the required number of parameters for memorization of any dataset.

4 On the Optimal Number of Parameters

In Sec. 3 we showed that a network with width 12 and depth Õ(√N) can perfectly memorize N data points, hence only Õ(√N) parameters are required for memorization. In this section, we compare our results with the known lower bounds.

First, note that our problem contains additional parameters besides N. Namely, d, r, δ and C. As we discussed in Remark 3.2, we assume that d is either constant or depends on N with d = O(√N). In this comparison we also assume that r and 1/δ are either constants or depend polynomially on N. We note that for a constant d, either r or 1/δ must depend on N. For example, assume d = 1 and δ = 1, then to have N points in ℝ with norm at most r and distance at least 1 from one another we must have that r ≥ (N − 1)/2. Hence, we can bound R ≤ poly(N) (where R is defined in Thm. 3.1), which implies log R = O(log N). Moreover, we assume that C ≤ N, because it is reasonable to expect that the number of output classes is not larger than the number of data points. Hence, also log C = O(log N). In regression tasks, as discussed in Remark 3.3, we can choose the discretization to consist of C = O(1/ε) classes, where 1/ε is either a constant or depends at most polynomially on the other parameters.

Using that log R = O(log N) and log C = O(log N), and tracing back the bounds given in Thm. 3.1, we get that the number of parameters of the network is O(√(N log N)). Note that here the O(·) notation only hides constant factors, which can also be exactly calculated using Thm. 3.1.

By [16], the VC dimension of the class of ReLU networks with W parameters is O(W²). Thus, the maximal number of points that can be shattered is O(W²). Hence, if we can shatter N points then W = Ω(√N). In particular, it gives an Ω(√N) lower bound for the number of parameters required to memorize N inputs. Thus, the gap between our upper bound and the above lower bound is only O(√(log N)), which is sub-logarithmic.
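Written out, the comparison made here is the following, where H_W denotes the class of ReLU networks with W parameters and c is the constant hidden in the VC bound of [16]:

  N \le \mathrm{VCdim}(\mathcal{H}_W) \le c\,W^2
  \quad\Longrightarrow\quad
  W \ge \sqrt{N/c} = \Omega(\sqrt{N}),
  \qquad
  \frac{O(\sqrt{N\log N})}{\Omega(\sqrt{N})} = O\big(\sqrt{\log N}\big).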

Moreover, it implies that the number of parameters required to shatter one size-N set is roughly equal (up to a sub-logarithmic factor) to the number of parameters required to memorize all size-N sets (that satisfy some mild assumptions). Thus, perhaps surprisingly, the task of memorizing all size-N sets is not significantly harder than shattering a single size-N set.

If we further assume that C depends polynomially on N, i.e., the number of classes grows with the number of samples (or, in regression tasks, as we discussed in Remark 3.3, the accuracy depends on the number of samples), then we can show that our bound is tight up to constant terms. Thus, in this case the √(log N) factor is unavoidable. Formally, we have the following lemma:

Lemma 4.1.

Let C ∈ ℕ, and assume that C ≥ N^α for some constant α > 0. If we can express all the functions of the form f : {x_1, …, x_N} → [C] using neural networks with W parameters, then W = Ω(√(N log N)).

Proof.

Let H be the class of all the functions g : {x_1, …, x_N} × [⌈log C⌉] → {0, 1}. The VC-dimension bound from [16] implies that expressing all the functions from H with neural networks requires Ω(√(N · log C)) parameters, since it is equivalent to shattering a set of size N·⌈log C⌉. Assume that we can express all the functions of the form f : {x_1, …, x_N} → [C] using networks with W parameters. Given some function g ∈ H we can express it with a neural network as follows: define a function f : {x_1, …, x_N} → [C] such that for every i ∈ [N] and j ∈ [⌈log C⌉], the j-th bit of f(x_i) is g(x_i, j). We construct a neural network for g, such that for an input (x_i, j) it computes f(x_i) (using W parameters) and outputs its j-th bit. The extraction of the j-th bit can be implemented using the bit extraction technique from the proof of Thm. 3.1. Overall, this network requires O(W + log C) parameters, and in this manner we can express all the functions in H. This shows that W + O(log C) = Ω(√(N · log C)), but since we also have log C ≥ α·log N (and log C = o(√(N log C))), then W = Ω(√(N log N)). ∎

5 Limiting the Depth

In this section we generalize Thm. 3.1 to the case of having a bounded depth. We will then compare our upper bound on memorization to the VC-dimension bound from [3]. We have the following:

Theorem 5.1.

Assume the same setting as in Thm. 3.1, and denote R = r/δ. Let 1 ≤ L ≤ √N. Then, there exists a network with width Õ(N/L²), depth Õ(L) and a total of Õ(N/L) parameters, such that F(x_i) = y_i for every i ∈ [N], where Õ hides logarithmic factors in N, R and C.

The full proof can be found in Appendix B. Note that for L = √N we get a similar bound to Thm. 3.1. In the proof we partition the dataset into N/L² subsets, each containing L² data points. We then use Thm. 3.1 on each such subset to construct a subnetwork of depth Õ(L) and width 12 to memorize the data points in the subset. We construct these subnetworks such that their output on each data point outside of their corresponding subset is zero. Finally, we stack these subnetworks to construct a wide network, whose output is the sum of the outputs of the subnetworks.
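A back-of-the-envelope count for this construction, assuming the partition is into roughly N/L² subsets of L² points each as described above (the Õ absorbs the logarithmic factors of Thm. 3.1):

  \underbrace{\tfrac{N}{L^{2}}}_{\text{subnetworks}} \cdot \underbrace{\tilde{O}\big(\sqrt{L^{2}}\big)}_{\text{parameters each}}
  \;=\; \tilde{O}\!\Big(\frac{N}{L^{2}} \cdot L\Big) \;=\; \tilde{O}\!\Big(\frac{N}{L}\Big),

while each subnetwork has depth Õ(√(L²)) = Õ(L), so stacking them in parallel keeps the depth within the budget up to the same hidden logarithmic factors, at the cost of width Õ(N/L²).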

By [3], the VC dimension of the class of ReLU networks with W parameters and depth L is O(W·L·log W). It implies that if we can shatter N points with networks of depth L, then W·L·log W = Ω(N), i.e., up to logarithmic factors, W = Ω(N/L). As in Sec. 4, assume that d is either constant or at most O(√N), and that R and C are bounded by some poly(N). Thm. 5.1 implies that we can memorize any N points (under the mild separability assumption) using networks of depth Õ(L) and Õ(N/L) parameters. Therefore, the gap between our upper bound and the above lower bound is logarithmic. It also implies that, up to logarithmic factors, the task of memorizing any set of N points using depth-L neural networks is not more difficult than shattering a single set of N points with depth-L networks.

6 Bit Complexity - Lower and Upper Bounds

In the proof of Thm. 3.1 the bit complexity of the network is roughly √N, up to logarithmic factors (see Thm. A.1 in the appendix). On the one hand, having such a large bit complexity allows us to "store" and "extract" information from the weights of the network using bit extraction techniques. This enables us to memorize N data points using significantly less than N parameters. On the other hand, having a large bit complexity makes the network difficult to implement on finite-precision machines.

In this section we argue that having a large bit complexity is necessary and sufficient if we want to construct a network that memorizes N data points with significantly less than N parameters. We show both upper and lower bounds on the required bit complexity, and prove that, up to logarithmic factors, our construction is optimal w.r.t. the bit complexity.

First, we show a simple upper bound on the VC dimension of neural networks with bounded bit complexity:

Theorem 6.1.

Let H be the hypothesis class of ReLU neural networks with W parameters, where each parameter is represented by B bits. Then, the VC dimension of H is O(W·(B + log W)).

Proof.

We claim that H is a finite hypothesis class with at most 2^{O(W(B + log W))} different functions. Let F be a neural network in H, and suppose that it has at most W parameters and U neurons. Then, each weight of F can be represented by at most B + 2⌈log U⌉ bits, namely, 2⌈log U⌉ bits for the indices of the input and output neurons of the edge, and B bits for the weight magnitude. Hence, F is represented by at most W·(B + 2⌈log U⌉) bits. Since U = O(W), we get that every function in H can be represented by at most O(W(B + log W)) bits, which gives the bound on the size of H. An upper bound on the VC dimension of finite classes is the log of their size (cf. [31]), and hence the VC dimension of H is O(W·(B + log W)). ∎
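The counting behind this proof, written out (U denotes the number of neurons, and each of the W non-zero weights is described by its B-bit value together with the indices of its two endpoint neurons):

  |\mathcal{H}| \;\le\; 2^{\,W\,(B + 2\lceil\log_2 U\rceil)},
  \qquad
  \mathrm{VCdim}(\mathcal{H}) \;\le\; \log_2|\mathcal{H}| \;\le\; W\big(B + 2\lceil\log_2 U\rceil\big) \;=\; O\big(W\,(B + \log W)\big),

using U = O(W), since every neuron that affects the output is incident to at least one non-zero weight.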

This theorem can be interpreted in the following way: Assume that we want to shatter a single set of N points, and we use ReLU neural networks with O(N^ε) parameters for some ε ∈ [1/2, 1]. Then, the bit complexity of each weight in the network must be at least N^{1−ε}, up to logarithmic factors. Thus, to shatter N points with significantly less than N parameters, we must have a neural network with bit complexity that depends on N. Also, for ε = 1/2, our construction in Thm. 3.1 (which memorizes any set of N points) is optimal, up to logarithmic terms, since having bit complexity of roughly √N is unavoidable. We emphasize that existing works that show non-trivial VC bounds also use weights with large bit complexity (e.g. [2, 3]).

We now show an upper bound on the number of parameters of a network for memorizing N data points, assuming that the bit complexity of the network is bounded:

Theorem 6.2.

Assume the same setting as in Thm. 3.1, and denote R = r/δ. Let 1 ≤ B ≤ √N. Then, there exists a network with bit complexity Õ(B), depth Õ(N/B) and width O(1) such that F(x_i) = y_i for every i ∈ [N], where Õ hides logarithmic factors in N, R and C.

The full proof can be found in Appendix C. The proof idea is to partition the dataset into N/B² subsets, each containing B² data points. For each such subset we construct a subnetwork using Thm. 3.1 to memorize the points in this subset. We concatenate all these subnetworks to create one deep network, such that the output of each subnetwork is added to the output of the previous one. Using a specific projection from Lemma A.2 in the appendix, this enables each subnetwork to output 0 for each data point which is not in the corresponding subset. We get that the concatenated network successfully memorizes the given data points.

Assume, as in the previous sections, that R and C are bounded by some poly(N) and that d is bounded by O(√N). Thm. 6.2 implies that we can memorize any N points (under mild assumptions) using networks with bit complexity Õ(B) and Õ(N/B) parameters. More specifically, we can memorize N points using networks with bit complexity Õ(N^{1−ε}) and Õ(N^ε) parameters, for ε ∈ [1/2, 1]. Up to logarithmic factors in N, this matches the bound implied by Thm. 6.1.
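The matching of the two bounds can be seen by multiplying the parameter and bit budgets, which is exactly the quantity controlled by the counting argument of Thm. 6.1:

  W \cdot B \;\approx\; \tilde{O}(N^{\varepsilon}) \cdot \tilde{O}(N^{1-\varepsilon}) \;=\; \tilde{O}(N),
  \qquad
  \text{while memorizing all size-}N\text{ sets forces } \mathrm{VCdim} \ge N, \text{ i.e. } O\big(W(B + \log W)\big) \ge N.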

7 Conclusions

In this work we showed that memorization of N separated points can be done using a feedforward ReLU neural network with Õ(√N) parameters. We also showed that this construction is optimal up to logarithmic terms. This result is generalized to the cases of a network with bounded depth, and a network with bounded bit complexity. In both cases, our constructions are optimal up to logarithmic terms.

An interesting future direction is to understand the connection between our results and the optimization process of neural networks. In more detail, it would be interesting to study whether training neural networks with standard algorithms (e.g. GD or SGD) can converge to a solution which memorizes N data samples while the network has significantly fewer than N parameters.

Another future direction is to study the connection between the bounds from this paper and the generalization capacity of neural networks. The double descent phenomenon [5] suggests that after a network crosses the "interpolation threshold", it is able to generalize well. Our results suggest that this threshold may be much smaller than N for a dataset with N samples, and it would be interesting to study whether this is also true in practice.

Finally, it would be interesting to understand whether the logarithmic terms in our upper bounds are indeed necessary for the construction, or whether they are artifacts of our proof techniques. If these logarithmic terms are not necessary, then it would show that for neural networks, the tasks of shattering a single set of points and memorizing any set of (separated) points are exactly as difficult, maybe up to constant factors.

Acknowledgements

This research is supported in part by European Research Council (ERC) grant 754705.

References

Appendix A Proof from Sec. 3

We first give the proofs for each of the three stages of the construction, then we combine all the stages to prove Thm. 3.1. For convenience, we restate the theorem, and also state the bound on the bit complexity of the network:

Theorem A.1.

Let r, δ > 0 and C ∈ ℕ, and let {(x_i, y_i)}_{i=1}^N ⊆ ℝ^d × [C] be a set of N labeled samples with ‖x_i‖ ≤ r for every i ∈ [N] and ‖x_i − x_j‖ ≥ δ for every i ≠ j. Denote R = r/δ. Then, there exists a ReLU neural network F with width 12, depth O(√(N · log(NRC))) and bit complexity bounded by Õ(√N) such that F(x_i) = y_i for every i ∈ [N].

A.1 Stage I: Projecting onto a one-dimensional subspace

Lemma A.2.

Let x_1, …, x_N ∈ ℝ^d with ‖x_i‖ ≤ r for every i ∈ [N] and ‖x_i − x_j‖ ≥ δ for every i ≠ j. Then, there exists a neural network F_1 with width 1, depth 2, and bit complexity O(log(Ndr/δ)), such that F_1(x_i) ∈ [0, R′] for every i ∈ [N], where R′ is polynomial in N, d and r/δ, and |F_1(x_i) − F_1(x_j)| ≥ 2 for every i ≠ j.

To show this lemma we use a similar argument to the projection phase from [26], where the main difference is that we use weights with bounded bit complexity. Specifically, we use the following:

Lemma A.3 (Lemma 8 from [26]).

Let N, d ∈ ℕ, then for any N distinct points x_1, …, x_N ∈ ℝ^d there exists a unit vector u ∈ ℝ^d such that for all i ≠ j:

(1)
Proof of Lemma A.2.

We first use Lemma A.3 to find a unit vector u that satisfies Eq. (1). Note that ‖u‖ = 1, hence every coordinate of u is smaller than 1 in absolute value. We define ũ such that each of its coordinates is equal to the first k bits of the corresponding coordinate of u, for k = O(log(Ndr/δ)). Note that ‖u − ũ‖_∞ ≤ 2^{−k}. For every i ≠ j we have that:

(2)

We also have for every i ∈ [N]:

(3)

Let , and note that by Eq. (A.1) we have . We define the network in the following way:

We show the correctness of the construction. Let , then we have that:

where the second equality is because by the definition of we have that for every , and the last inequality is by Eq. (A.1) and the assumption that for every . Now, let , then we have:

where the second to last inequality is since and Eq. (A.1).

The network F_1 has depth 2 and width 1. The bit complexity of the network can be bounded by O(log(Ndr/δ)). This is because each coordinate of ũ can be represented using k = O(log(Ndr/δ)) bits, and the bias and the weight in the second layer can each be represented using O(log(Ndr/δ)) bits as well. ∎
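A small numerical illustration of the rounding step used above (truncating each coordinate of the unit vector to its first k bits after the binary point); the truncation rule is an illustrative choice:

  import numpy as np

  def truncate_bits(u, k):
      # keep the first k bits after the binary point of each coordinate
      return np.floor(u * 2.0**k) / 2.0**k

  rng = np.random.default_rng(1)
  u = rng.standard_normal(5)
  u /= np.linalg.norm(u)
  for k in (4, 8, 16):
      v = truncate_bits(u, k)
      assert np.max(np.abs(u - v)) <= 2.0**-k   # coordinate-wise error at most 2^-k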

A.2 Stage II: Finding the right subset

Lemma A.4.

Let with for every and for every . Let with and let where for every . Let . Then, there exists a neural network with width , depth and bit complexity such that for every we have that .

Proof.

Let . We define network blocks and in the following way: First, we use Lemma A.6 to construct such that for every , and for or . In particular, if , and otherwise. Note that since is defined with ceil, it is possible that , if this is the case we replace with . Next, we define:

Finally we define the network (we can use one extra layer to augment the input with an extra coordinate of ).

We show that this construction is correct. Note that for every , if we denote , then and for every we have . By the construction of we get that .

The width of at every layer is at most the width of , since we also keep copies of both and , hence the width is at most . The depth of is 2, and is a composition of the ’s with an extra layer for the input to get , and an extra layer in the output to extract the last coordinate. Hence, its depth is . The bit complexity is bounded by the sum of the bit complexity of and the weights , hence it is bounded by . ∎

A.3 Stage III: Bit Extraction from the Crafted Weights

Lemma A.5.

Let . Let with and let with . Assume that for any with we have that . Then, there exists a network with width , depth and bit complexity , such that for every , if there exist where , then:

Proof.

Let , and denote by φ the triangle function due to [33]. We construct the following neural network F_3:

where we define if , and if or .

The construction uses two basic building blocks: The first is using Lemma A.7 twice, for and . This way we construct two smaller networks , such that: