Learning Sparse Networks Using Targeted Dropout

05/31/2019 ∙ by Aidan N. Gomez, et al. ∙ Google ∙ University of Oxford

Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning. We introduce targeted dropout, a method for training a neural network so that it is robust to subsequent pruning. Before computing the gradients for each weight update, targeted dropout stochastically selects a set of units or weights to be dropped using a simple self-reinforcing sparsity criterion and then computes the gradients for the remaining weights. The resulting network is robust to post hoc pruning of weights or units that frequently occur in the dropped sets. The method improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.


1 Introduction

Neural networks are a powerful class of models that achieve the state-of-the-art on a wide range of tasks such as object recognition, speech recognition, and machine translation. One reason for their success is that they are extremely flexible models, because they have a large number of learnable parameters. However, this flexibility can lead to overfitting, and can unnecessarily increase the computational and storage requirements of the network.

There has been a large amount of work on developing strategies to compress neural networks. One intuitive strategy is sparsification: removing weights or entire units from the network. Sparsity can be encouraged during learning by the use of sparsity-inducing regularisers, such as $L_1$ or $L_0$ penalties. It can also be imposed by post hoc pruning, where a full-sized network is trained and then sparsified according to some pruning strategy. Ideally, given some measurement of task performance, we would prune the weights or units that provide the least benefit to the task. Finding the optimal set is, in general, a difficult combinatorial problem, and even a greedy strategy would require an unrealistic number of task evaluations, as there are often millions of parameters. Common pruning strategies therefore focus on fast approximations, such as removing weights with the smallest magnitude (Han et al., 2015b), or ranking the weights by the sensitivity of the task performance with respect to the weights and removing the least-sensitive ones (LeCun et al., 1990). The hope is that these approximations correlate well with task performance, so that pruning results in a highly compressed network while causing little negative impact to task performance; however, this may not always be the case.

Our approach is based on the observation that dropout regularisation (Hinton et al., 2012; Srivastava et al., 2014) itself enforces sparsity tolerance during training by sparsifying the network with each forward pass. This encourages the network to learn a representation that is robust to a particular form of post hoc sparsification: in this case, one where a random set of units is removed. Our hypothesis is that if we plan to do explicit post hoc sparsification, then we can do better by specifically applying dropout to the set of units that we a priori believe are the least useful. We call this approach targeted dropout. The idea is to rank weights or units according to some fast, approximate measure of importance (like magnitude), and then apply dropout primarily to those elements deemed unimportant. Similar to the observation with regular dropout, we show that this encourages the network to learn a representation in which the importance of weights or units more closely aligns with our approximation. In other words, the network learns to be robust to our choice of post hoc pruning strategy.

The advantage of targeted dropout compared to other approaches is that it makes networks extremely robust to the post hoc pruning strategy of choice, gives intimate control over the desired sparsity patterns, and is easy to implement, consisting of a two-line change in neural network frameworks such as TensorFlow (Abadi et al., 2015) or PyTorch (Paszke et al., 2017); code is available at github.com/for-ai/TD, as well as in Tensor2Tensor (Vaswani et al., 2018). The method achieves impressive sparsity rates on a wide range of architectures and datasets; notably, 99% sparsity on the ResNet-32 architecture for a less than 4% drop in test set accuracy on CIFAR-10.

2 Background

In order to present targeted dropout, we first briefly introduce some notation, and review the concepts of dropout and magnitude-based pruning.

2.1 Notation

Assume we are dealing with a particular network architecture. We will use $\theta \in \Theta$ to denote the vector of parameters of a neural network drawn from candidate set $\Theta$, with $|\theta|$ giving the number of parameters. $\mathcal{W}(\theta)$ denotes the set of weight matrices in a neural network parameterised by $\theta$; accordingly, we will denote $W \in \mathcal{W}(\theta)$ as a weight matrix that connects one layer to another in the network. We will only consider weights, ignoring biases for convenience, and note that biases are not removed during pruning. For brevity, we will use the notation $w_o$ (the $o$-th column of a weight matrix $W$) to denote the weights connecting the layer below to the $o$-th output unit, $N^{\mathrm{col}}(W)$ to denote the number of columns in $W$, and $N^{\mathrm{row}}(W)$ to denote the number of rows. Each column corresponds to a hidden unit, or a feature map in the case of convolutional layers. Note that flattening and concatenating all of the weight matrices in $\mathcal{W}(\theta)$ would recover $\theta$.

2.2 Dropout

Our work uses the two most popular Bernoulli dropout techniques: Hinton et al.'s unit dropout (Hinton et al., 2012; Srivastava et al., 2014) and Wan et al.'s weight dropout, also known as dropconnect (Wan et al., 2013). For a fully-connected layer with input tensor $X$, weight matrix $W$, output tensor $Y$, and a Bernoulli mask $M$ (drawn independently at each training step, with keep probability $1-\alpha$), we define both techniques below.

Unit dropout (Hinton et al., 2012; Srivastava et al., 2014), where $M$ masks the units (columns) of $X$:

$$Y = (X \odot M)\,W$$

Unit dropout randomly drops units (often referred to as neurons) at each training step to reduce dependence between units and prevent overfitting.

Weight dropout (Wan et al., 2013), where $M$ masks the individual entries of $W$:

$$Y = X\,(W \odot M)$$

Weight dropout randomly drops individual weights in the weight matrices at each training step. Intuitively, this drops connections between layers, forcing the network to adapt to a different connectivity at each training step.
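To make the distinction concrete, here is a minimal NumPy sketch of the two masking schemes for a single fully-connected layer. The function names are illustrative, and the inverted-dropout rescaling by $1/(1-\alpha)$ is our addition for train/test consistency, not necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_dropout(X, W, alpha=0.5, training=True):
    """Drop whole input units (columns of X) with probability alpha."""
    if not training:
        return X @ W
    # One Bernoulli mask entry per unit, shared across the batch.
    mask = rng.binomial(1, 1.0 - alpha, size=(1, X.shape[1]))
    return (X * mask / (1.0 - alpha)) @ W

def weight_dropout(X, W, alpha=0.5, training=True):
    """Drop individual entries of W with probability alpha."""
    if not training:
        return X @ W
    mask = rng.binomial(1, 1.0 - alpha, size=W.shape)
    return X @ (W * mask / (1.0 - alpha))

X = rng.normal(size=(4, 8))   # batch of 4 examples, 8 input units
W = rng.normal(size=(8, 3))   # fully-connected layer: 8 -> 3
print(unit_dropout(X, W).shape, weight_dropout(X, W).shape)
```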

2.3 Magnitude-based pruning

A popular class of pruning strategies are those characterised as magnitude-based pruning strategies. These strategies treat the top-$k$ largest-magnitude weights as important, and we use $\mathrm{top}\text{-}k(\cdot)$ to return the top-$k$ elements (units or weights) out of all elements being considered.

Unit pruning (Molchanov et al., 2016; Frankle and Carbin, 2018): considers the units (column-vectors) of weight matrices under the $L_2$-norm,

$$\mathrm{top}\text{-}k\big(\{\,\|w_o\|_2 \;:\; 1 \le o \le N^{\mathrm{col}}(W)\,\}\big). \qquad (1)$$

Weight pruning (Han et al., 2015b; Molchanov et al., 2016): considers the entries of each feature vector (column) under the $L_1$-norm, i.e. their absolute values. Note that the top-$k$ is taken with respect to the other weights within the same feature vector,

$$\mathrm{top}\text{-}k\big(\{\,|w_{io}| \;:\; 1 \le i \le N^{\mathrm{row}}(W)\,\}\big) \quad \text{for each column } o. \qquad (2)$$

While weight pruning tends to preserve more of the task performance under coarser prunings (Han et al., 2015a; Ullrich et al., 2017; Frankle and Carbin, 2018), unit pruning allows for considerably greater computational savings (Wen et al., 2016; Louizos et al., 2017). In particular, weight-pruned networks can be implemented using sparse linear algebra operations, which offer speedups only under sufficiently sparse conditions; unit-pruned networks, by contrast, execute standard linear algebra ops on lower-dimensional tensors, which tends to be a much faster option for a given sparsity rate.
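The following NumPy sketch illustrates the two criteria of Equations (1) and (2), zeroing pruned elements rather than physically removing them; the helper names are ours.

```python
import numpy as np

def unit_prune(W, k):
    """Keep the k columns (units) of W with largest L2 norm; zero the rest."""
    col_norms = np.linalg.norm(W, axis=0)      # per-column L2 norms
    keep = np.argsort(col_norms)[-k:]          # indices of the top-k columns
    mask = np.zeros(W.shape[1], dtype=W.dtype)
    mask[keep] = 1.0
    return W * mask                            # broadcasts across rows

def weight_prune(W, k):
    """Within each column, keep the k entries with largest magnitude."""
    W_pruned = np.zeros_like(W)
    for o in range(W.shape[1]):
        keep = np.argsort(np.abs(W[:, o]))[-k:]
        W_pruned[keep, o] = W[keep, o]
    return W_pruned

W = np.random.default_rng(0).normal(size=(6, 4))
print(unit_prune(W, k=2))
print(weight_prune(W, k=3))
```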

3 Targeted Dropout

Consider a neural network parameterised by $\theta$, and our importance criterion $\mathcal{W}(\theta)$ (defined above in Equations (1) and (2)). We hope to find optimal parameters $\theta^*$ such that our loss $\mathcal{L}(\mathcal{W}(\theta^*))$ is low and, at the same time, $|\mathcal{W}(\theta^*)| \le k$; i.e. we wish to keep only the $k$ weights of highest magnitude in the network. A deterministic pruning implementation would simply select the bottom $|\theta| - k$ elements and drop them out. However, we would like low-valued elements to be able to increase their value if they become important during training. Therefore, we introduce stochasticity into the process using a targeting proportion $\gamma$ and a drop probability $\alpha$. The targeting proportion means that we select the bottom $\gamma|\theta|$ weights as candidates for dropout, and of those we drop each element independently with drop rate $\alpha$. This implies that the expected number of units kept during each round of targeted dropout is $(1 - \gamma\alpha)|\theta|$. As we will see below, the result is a reduction in the important subnetwork's dependency on the unimportant subnetwork, thereby reducing the performance degradation caused by pruning at the conclusion of training.
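The procedure is simple enough to express in a few lines. Below is a minimal NumPy sketch of one training-time application of the weight-level variant: within each column, the bottom-$\gamma$ fraction of weights by magnitude are candidates, each zeroed independently with probability $\alpha$. Function and variable names are ours, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def targeted_weight_dropout(W, gamma=0.75, alpha=0.5, training=True):
    """Targeted dropout (weight variant): the gamma fraction of
    lowest-magnitude weights in each column are dropout candidates,
    each dropped independently with probability alpha."""
    if not training:
        return W
    n_targeted = int(gamma * W.shape[0])       # candidates per column
    W_out = W.copy()
    for o in range(W.shape[1]):
        order = np.argsort(np.abs(W[:, o]))    # ascending by magnitude
        candidates = order[:n_targeted]        # bottom-gamma weights
        drop = candidates[rng.random(n_targeted) < alpha]
        W_out[drop, o] = 0.0
    return W_out

W = rng.normal(size=(8, 4))
# Expected fraction of surviving weights per step: 1 - gamma * alpha.
print(targeted_weight_dropout(W, gamma=0.75, alpha=0.5))
```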

3.1 Dependence Between the Important and Unimportant Subnetworks

The goal of targeted dropout is to reduce the dependence of the important subnetwork on its complement. A commonly used intuition behind dropout is the prevention of co-adaptation between units; that is, when dropout is applied to a unit, the remaining network can no longer depend on that unit's contribution to the function and must learn to propagate that unit's information through a more reliable channel. An alternative description asserts that dropout maximises the mutual information between units in the same layer, thereby decreasing the impact of losing a unit (Srivastava et al., 2014). Similar to our approach, dropout can be used to guide properties of the representation. For example, nested dropout (Rippel et al., 2014) has been shown to impose a 'hierarchy' among units depending on the particular drop rate associated with each unit. Dropout itself can also be interpreted as a Bayesian approximation (Gal, 2016).

A more relevant intuition into the effect of targeted dropout in our specific pruning scenario can be obtained from an illustrative case where the important subnetwork is completely separated from the unimportant one. Suppose a network were composed of two non-overlapping subnetworks, each able to produce the correct output by itself, with the network output given as the average of both subnetwork outputs. If our importance criterion designated the first subnetwork as important and the second subnetwork as unimportant (more specifically, it has lower weight magnitude), then adding noise to the weights of the unimportant subnetwork (i.e. applying dropout) means that with non-zero probability we will corrupt the network output. Since the important subnetwork is already able to predict the output correctly, reducing the loss requires driving the weight magnitude of the unimportant subnetwork's output layer towards zero, in effect "killing" that subnetwork and reinforcing the separation between the important subnetwork and the unimportant one. These interpretations make clear why dropout should be considered a natural tool for application in pruning.

We can empirically confirm targeted dropout's effect on weight dependence by comparing networks trained with and without targeted dropout, inspecting the Hessian and gradient to determine the dependence of the network on the weights/units to be pruned. As in LeCun et al. (1990), we can estimate the effect of pruning weights by considering the second-order Taylor expansion of the change in loss, $\Delta\mathcal{L}$:

$$\Delta\mathcal{L} \;=\; \mathcal{L}(\theta - d \odot \theta) - \mathcal{L}(\theta) \;\approx\; -(d \odot \theta)^\top \nabla_\theta\mathcal{L} \;+\; \tfrac{1}{2}\,(d \odot \theta)^\top H\,(d \odot \theta) \qquad (3)$$

where $d_k = 1$ if $\theta_k$ is to be removed and $d_k = 0$ otherwise, $\nabla_\theta\mathcal{L}$ is the gradient of the loss, and $H = \nabla^2_\theta\mathcal{L}$ is the Hessian. Note that at the end of training, if we have found a critical point $\theta^*$, then $\nabla_\theta\mathcal{L}(\theta^*) \approx 0$, leaving only the Hessian term. In our experiments we empirically confirm that targeted dropout reduces the dependence between the important and unimportant subnetworks by an order of magnitude (see Fig. 1 and Section 5.1 for more details).
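To make Equation (3) concrete, here is a small NumPy sketch on a toy quadratic loss (our construction, not the paper's experiment), where the gradient and Hessian are available in closed form and the second-order estimate is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares loss L(theta) = 0.5 * ||A theta - b||^2, so the
# gradient and Hessian have closed forms.
A = rng.normal(size=(20, 6))
b = rng.normal(size=20)
theta = rng.normal(size=6)

grad = A.T @ (A @ theta - b)      # dL/dtheta
H = A.T @ A                       # Hessian (constant for this loss)

def loss(t):
    return 0.5 * np.sum((A @ t - b) ** 2)

# Prune the 3 smallest-magnitude parameters: d_k = 1 for pruned entries.
d = np.zeros_like(theta)
d[np.argsort(np.abs(theta))[:3]] = 1.0
delta = d * theta                 # the removed part of theta

# Second-order estimate of the change in loss, as in Eq. (3).
est = -grad @ delta + 0.5 * delta @ H @ delta
true = loss(theta - delta) - loss(theta)
print(est, true)                  # identical here, since the loss is quadratic
```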

4 Related Work

The pruning and sparsification of neural networks has been studied for nearly three decades and has seen a substantial increase in interest due to the deployment of neural networks on resource-limited devices such as mobile phones and ASICs. Early work such as optimal brain damage (LeCun et al., 1990) and optimal brain surgeon (Hassibi and Stork, 1993), as well as more recent efforts (Molchanov et al., 2016; Theis et al., 2018), use a second-order Taylor expansion of the loss around the weights trained to a local minimum to glean strategies for selecting the order in which to prune parameters. Han et al. (2015a) combine weight quantisation with pruning and achieve impressive network compression results, drastically reducing the spatial cost of networks. Dong et al. (2017) improve the efficiency of the optimal brain surgeon procedure by making an independence assumption between layers. Wen et al. (2016) propose using Group Lasso (Yuan and Lin, 2006) on convolutional filters and are able to remove up to 6 layers from a ResNet-20 network for a 1% increase in error. A great deal of effort has been put towards developing improved pruning heuristics and sparsifying regularisers (LeCun et al., 1990; Hassibi and Stork, 1993; Han et al., 2015a; Babaeizadeh et al., 2016; Molchanov et al., 2016; Dong et al., 2017; Louizos et al., 2017; Huang et al., 2018; Theis et al., 2018). These generally comprise two components: the first is a regularisation scheme incorporated into training to make the important subnetworks easily identifiable to a post hoc pruning strategy; the second is a particular post hoc pruning strategy which operates on a pre-trained network and strips away the unimportant subnetwork.

The two works most relevant to our own are $L_0$ regularisation (Louizos et al., 2017) and variational dropout (Molchanov et al., 2017). Louizos et al. (2017) use an adaptation of concrete dropout (Gal et al., 2017) on the weights of a network and regularise the drop rates in order to sparsify the network. Similarly, Molchanov et al. (2017) apply variational dropout (Kingma et al., 2015) to the weights of a network and note that the prior implicitly sparsifies the parameters by preferring large drop rates. In addition to being more effective at shrinking the size of the important subnetwork, targeted dropout uses two intuitive hyperparameters, the targeting proportion $\gamma$ and the drop rate $\alpha$, and directly controls sparsity throughout training (i.e., it attains a predetermined sparsity threshold). In comparison, Louizos et al. (2017) use the Hard-Concrete distribution, which adds three hyperparameters and doubles the number of trainable parameters by introducing a unique gating parameter for each model parameter that determines its Concrete dropout rate; while Molchanov et al. (2016) adds two hyperparameters and doubles the number of trainable parameters. In our experiments we also compare against $L_1$ regularisation (Han et al., 2015b), which is intended to drive unimportant weights towards zero.

The Lottery Ticket Hypothesis of Frankle and Carbin (2018) demonstrates the existence of a subnetwork that, in isolation, with the rest of the network pruned away, both dictates the function found by gradient descent and can be trained to the same level of task performance with or without the remaining network. In our notation, a prediction of this "winning lottery ticket" is given by the important subnetwork $\mathcal{W}(\theta)$; and the effectiveness of our method suggests that one can reduce the size of the winning lottery ticket by regularising the network.

5 Experiments

Our experiments were performed using the original ResNet (He et al., 2016b), Wide ResNet (Zagoruyko and Komodakis, 2016), and Transformer (Vaswani et al., 2017) architectures, applied to the CIFAR-10 (Krizhevsky and Hinton, 2009), ImageNet (Russakovsky et al., 2015), and WMT English-German translation datasets. For each baseline experiment we verify that our networks reach the reported accuracy on the appropriate test set; we report the test accuracy at differing prune percentages and compare different regularisation strategies. In addition, we compare targeted dropout to standard dropout where the expected number of dropped weights is matched between the two techniques (i.e. the drop rate of standard dropout runs is set to $\gamma\alpha$, the proportion of weights to target times the drop rate). For our pruning procedure, we apply the greedy layer-wise magnitude-based pruning described in Section 2.3 to all weight matrices except those leading to the logits. In our experiments we compare targeted dropout against the following competitive schemes:

$L_1$ regularisation (Han et al., 2015b): An $L_1$ complexity penalty is added to the cost function, the hope being that this term drives unimportant weights to zero. In our tables we denote this loss by $L_1(\lambda)$, where $\lambda$ is the cost-balancing coefficient applied to the complexity term.

$L_0$ regularisation (Louizos et al., 2017): Louizos et al. apply an augmentation of Concrete Dropout (Gal et al., 2017), called Hard-Concrete Dropout, to the parameters of a neural network. The mask applied to the weights follows a Hard-Concrete distribution in which each weight is associated with a gating parameter that determines its drop rate. The use of the Concrete distribution allows for a differentiable approximation to the $L_0$ cost, so it can be minimised directly alongside the task objective. When sparsifying these networks to a desired sparsity rate, we prune according to the learned keep probabilities (Equation 13 of Louizos et al. (2017)), dropping the weights with the lowest keep probabilities first.

Variational dropout (Kingma et al., 2015; Molchanov et al., 2017): Similar to the technique used for $L_0$ regularisation, Molchanov et al. (2017) apply Gaussian dropout with trainable drop rates to the weights of the network and interpret the model as a variational posterior with a particular prior. The authors note that the variational lower bound used in training favours higher drop probabilities and experimentally confirm that networks trained in this way do indeed sparsify.

Smallify (Leclerc et al., 2018): Leclerc et al. use trainable gates on weights/units and regularise the gates towards zero using $L_1$ regularisation. Crucial to the technique is the online pruning condition: Smallify keeps a moving variance of the sign of the gates, and a weight/unit's associated gate is set to zero (effectively pruning that weight/unit) when this variance exceeds a certain threshold. This technique has been shown to be extremely effective at reaching high prune rates on VGG networks (Simonyan and Zisserman, 2014).

Specifically, we compare the following techniques:

  • Standard weight or unit dropout applied at a rate of $\gamma\alpha$ (matching the expected drop rate of targeted dropout).

  • Targeted dropout (the weight variant in 'a)' tables, and the unit variant in 'b)' tables) applied to the $\gamma$ fraction of lowest-magnitude weights at a rate of $\alpha$.

  • Variational dropout (Kingma et al., 2015; Molchanov et al., 2017) applied with a tuned cost coefficient.

  • $L_0$ regularisation (Louizos et al., 2017) applied with a tuned cost coefficient.

  • $L_1$ regularisation (Han et al., 2015b) applied with a tuned cost coefficient.

  • Smallify SwitchLayers (Leclerc et al., 2018) applied with a tuned cost coefficient, an exponential moving average decay of 0.9, and a variance threshold of 0.5.

5.1 Analysing the Important Subnetwork

Figure 1: A comparison between a network trained without dropout (left) and with targeted dropout (right) of the matrix formed by the elements $\theta_i H_{ij} \theta_j$. The weights are ordered such that the last 75% are the weights with the lowest magnitude (those we intend to prune). The sum of the elements in the lower right-hand corner approximates the change in error after pruning (Eqn. (3)). Note the stark difference between the two networks, with targeted dropout concentrating its dependence in the top left corner, leading to a much smaller error change after pruning (given in Table 1).
Regularisation      $\Delta\mathcal{L}$   Unpruned Accuracy   Pruned Accuracy
None                0.120698              38.11%              26.13%
Targeted Dropout    0.0145907             40.09%              40.14%

Table 1: Comparison of the change in loss ($\Delta\mathcal{L}$ of Equation (3)) for dense networks.

Table 2: ResNet-32 model accuracies on CIFAR-10 at differing pruning percentages and under different regularisation schemes. The top table (Weight Dropout/Pruning) depicts results using the weight pruning strategy, while the bottom table (Unit Dropout/Pruning) depicts the results of unit pruning (see Sec. 2.3).

In order to analyse the effects of targeted dropout, we construct a toy experiment with small dense networks to study the network's dependence on its weights. The model we consider is a single-hidden-layer densely connected network with ten units and ReLU activations (Nair and Hinton, 2010). We train two of these networks on CIFAR-10: the first unregularised, and the second with targeted dropout applied to the 75% lowest-magnitude weights. Both networks are trained for 200 epochs at a learning rate of 0.001 using stochastic gradient descent without momentum. We then compute the gradient and Hessian over the test set in order to estimate the change in error from Equation (3) (see Table 1). In addition, we compute the Hessian-weight product matrix with typical element $\theta_i H_{ij} \theta_j$ as an estimate of weight correlations and network dependence (see Figure 1). This matrix is an important visualisation tool, since summing the entries associated with the weights one intends to delete corresponds to computing the second term in Equation (3); this becomes the dominant term towards the end of training, at which time the gradient is approximately zero.

Figure 1 makes clear the dramatic effect of targeted dropout regularisation on the network. In the figure, we reorder the rows and columns of the matrices so that the first 25% of the rows/columns correspond to the 25% of weights we identify as the important subnetwork (i.e. the highest-magnitude weights), and the latter 75% are the weights in the unimportant subnetwork (i.e. the lowest-magnitude weights). The network trained with targeted dropout relies almost exclusively on the 25% of weights with the largest magnitude at the end of training, whereas the network trained without regularisation relies on a much larger portion of the weights and has numerous dependencies among the parameters marked for pruning.

5.2 ResNet

We test the performance of targeted dropout on Residual Networks (ResNets) (He et al., 2016a) applied to the CIFAR-10 dataset, to which we apply basic input augmentation in the form of random crops, random horizontal flipping, and standardisation. This architectural structure has become ubiquitous in computer vision and is gaining popularity in the domains of language (Kalchbrenner et al., 2016) and audio (Van Den Oord et al., 2016). Our baseline model reaches over 93% final accuracy after 256 epochs, which matches previously reported results for ResNet-32 (He et al., 2016a).

Our weight pruning experiments demonstrate that standard dropout schemes are weak compared to their targeted counterparts; standard dropout performs worse than our no-regularisation baseline. We find that a higher targeted dropout rate applied to a larger portion of the weights results in the network matching unregularised performance with only 40% of the parameters. Variational dropout improves marginally over the unregularised baseline in both weight and unit pruning scenarios, but is still outperformed by targeted dropout.

$L_0$ regularisation was fairly insensitive to its complexity term coefficient; we searched over a range of coefficients and found that larger values failed to converge, while smaller values showed no signs of regularisation. Like variational dropout, $L_0$ regularisation does not prescribe a method for achieving a specific prune percentage in a network, and so an extensive hyperparameter search becomes a requirement in order to find values that result in the desired sparsity. As a compromise, we searched over the range mentioned above and selected the setting most competitive with targeted dropout; we then applied magnitude-based pruning to the estimates provided in Equation 13 of Louizos et al. (2017). Unfortunately, $L_0$ regularisation seems to force the model away from conforming to our assumption that importance is described by parameter magnitude.

Table 3: ResNet-102 model accuracies on ImageNet at differing prune percentages (weight and unit pruning). Accuracies are top-1, single crop on 224 by 224 pixel images.

In Table 3 we present the results of pruning ResNet-102 trained on ImageNet. We observe similar behaviour to ResNet applied to CIFAR-10, although it’s clear that the task utilises much more of the network’s capacity, rendering it far more sensitive to pruning relative to CIFAR-10.

5.3 Wide ResNet

In order to ensure a fair comparison against the $L_0$ regularisation baseline, we adapt the authors' own codebase (the original PyTorch code can be found at github.com/AMLab-Amsterdam/L0_regularization) to support targeted dropout, and compare the network's robustness to sparsification under the provided implementation and under targeted dropout. In Table 4 we observe that $L_0$ regularisation fails to truly sparsify the network, but has a strong regularising effect on its accuracy (confirming the claims of Louizos et al.). This further verifies the observations made above, showing that $L_0$ regularisation fails to sparsify the ResNet architecture.

Table 4: Wide ResNet (Zagoruyko and Komodakis, 2016) model classification accuracy on the CIFAR-10 test set at differing prune percentages (unit pruning).

5.4 Transformer

Table 5: Evaluation of the Transformer network under varying sparsity rates on the WMT newstest2014 EN-DE test set, with weight dropout/pruning at differing prune percentages: (a) uncased BLEU score, and (b) per-token accuracy.

The Transformer network architecture (Vaswani et al., 2017) represents the state-of-the-art on a variety of NLP tasks. In order to evaluate the general applicability of our method, we measure the Transformer's robustness to weight-level pruning without regularisation, and compare this against two settings of targeted dropout applied to the network. The Transformer architecture consists of stacked multi-head attention layers and feed-forward (densely connected) layers, both of which we target for sparsification; within the multi-head attention layers, each head of each input has a unique linear transformation applied to it, and these are the weight matrices we target for sparsification.

Table 5(a) details the results of pruning the Transformer architecture applied to WMT newstest2014 English-German (EN-DE). Free of any regularisation, the Transformer is fairly robust to pruning, but with targeted dropout we are able to increase the BLEU score by 15 at 70% sparsity and by 12 at 80% sparsity, further confirming targeted dropout's applicability to a range of architectures and datasets.

5.5 Scheduling the Targeting Proportion

Table 6: Comparing Smallify to targeted dropout and ramping targeted dropout at differing prune percentages (weight and unit pruning). Experiments on CIFAR-10 using ResNet-32.

Upon evaluation of weight-level Smallify (Leclerc et al., 2018) we found that, with tuning, Smallify was able to out-perform targeted dropout at very high pruning percentages (see Table 6). One might expect that a sparsification scheme like Smallify, which allows for differing prune rates between layers, would be more flexible and better suited to finding optimal pruning masks; however, we show that a variant of targeted dropout we call ramping targeted dropout is capable of similarly high-rate pruning. Moreover, ramping targeted dropout preserves the primary benefit of targeted dropout: fine control over sparsity rates. Ramping targeted dropout simply anneals the targeting proportion from zero to the specified final $\gamma$ over the course of training; a sketch of this schedule is given below. For our ResNet experiments, we anneal from zero to 95% of $\gamma$ over the first forty-nine epochs, and then from 95% of $\gamma$ to 100% of $\gamma$ over the subsequent forty-nine. In a similar fashion, we ramp from 0% to 100% of $\gamma$ linearly over the first ninety-eight steps. Using ramping targeted dropout we achieve 99% sparsity in a ResNet-32 with 87.03% accuracy on the CIFAR-10 dataset. The best Smallify run achieved intrinsic sparsity of 98.8% at convergence with accuracy 88.13%, but when we perform pruning to enforce equal prune rates in all weight matrices, the network degrades rapidly (see Table 6).
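The following is a minimal sketch of the ramping schedule as described above for the ResNet experiments; the epoch boundaries follow the text, while the helper name and the linear interpolation within each phase are our assumptions.

```python
def ramping_gamma(epoch, gamma_final, ramp1=49, ramp2=49):
    """Anneal the targeting proportion: 0 -> 0.95*gamma_final over the
    first `ramp1` epochs, then 0.95*gamma_final -> gamma_final over the
    next `ramp2` epochs, constant afterwards."""
    if epoch < ramp1:
        return 0.95 * gamma_final * epoch / ramp1
    if epoch < ramp1 + ramp2:
        frac = (epoch - ramp1) / ramp2
        return gamma_final * (0.95 + 0.05 * frac)
    return gamma_final

# Example: final targeting proportion of 0.99 (99% sparsity target).
for epoch in [0, 25, 49, 75, 98, 200]:
    print(epoch, round(ramping_gamma(epoch, gamma_final=0.99), 4))
```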

6 Future Work

Targeted dropout presents an extremely simple procedure for decoupling the dependence between subnetworks. While the focus of this work is on the implications for network sparsification, a generalised interpretation of targeted dropout suggests the potential for targeting strategies aimed at promoting properties such as interpretability and structured representation. A primary feature of nested dropout (Rippel et al., 2014) is creating a hierarchy across units; this comes at the cost of an expensive training procedure requiring a summation over a number of losses equal to the number of possible truncation masks (for a single layer of nested dropout this amounts to the number of units in the layer; for multiple layers, this grows as the product of the layers' unit counts). Targeted dropout could be trivially adapted to impose hierarchy among the units in each layer in a computationally cheap way. Targeted dropout could also be applied to the bits of weights' and activations' numeric representations in order to promote robustness to quantisation. Finally, targeted dropout applied to the parameters of the network suffers from the fact that only one mask sample can be applied per batch, and as such stands to benefit from incorporating flipout (Wen et al., 2018), which decorrelates within-batch gradients while leaving them unbiased.

7 Conclusion

We propose targeted dropout as a simple and effective regularisation tool for training neural networks that are robust to post hoc pruning. Among the primary benefits of targeted dropout are its simplicity of implementation, intuitive hyperparameters, and fine-grained control over sparsity, both during training and at inference. Targeted dropout performs well across a range of network architectures and tasks, demonstrating its broad applicability. Importantly, like Rippel et al. (2014), we show how dropout can be used as a tool to encode prior structural assumptions into neural networks. This perspective opens the door to many interesting applications and extensions.

Acknowledgements

We would like to thank Christos Louizos for his extensive review of our codebase and help with debugging our implementation; his feedback was immensely valuable to this work. Our thanks go to Nick Frosst, Jimmy Ba, and Mohammad Norouzi, who provided valuable feedback and support throughout. We would also like to thank Guillaume Leclerc for his assistance in verifying our implementation of Smallify. Lastly, we wish to thank the for.ai team for supporting the project and providing useful feedback; in particular Siddhartha Rao Kamalakara and Divyam Madaan, who encouraged this work and directly assisted in the late stages of its development.

References