In recent years, systems in every domain of our lives have been moving away from domain specific mathematical solutions towards using deep networks for prediction, classification, and other problems. Domains, such as image or video understanding, speech processing and recognition, recommendation systems, and many more, now rely heavily on deep models that are trained on large data-sets of legacy examples. Due to the substantial improvements driven by deep networks’ based techniques, using deep models became much more popular than prior art methods, and in many of these problems domain specific knowledge has also been integrated into the deep models.
Very recently, however, despite the performance accuracy superiority of deep models, results that question other aspects of deep models have started surfacing. Bias towards the training examples has obviously been one of these aspects. However, another aspect that has had very limited coverage is that of irreproducibility in deep models (also nondeterminism or underspecification due to over-parameterization). Normally, deep models have better average prediction accuracy (or objective loss) on validation data. However, their predictions on individual, yet unseen, examples may diverge largely between two separately trained instances of the same model, even if they were defined identically (architecture, parameters, hyper-parameters, optimizers, etc.) and were trained on the exact same training data-set. Observed Prediction Differences (PDs) can become substantial fractions of the actual predictions themselves (see, e.g., Chen et al. (2020); Dusenberry et al. (2020)).
While focus on irreproducibility has been limited, very recent papers began identifying the problem and its implications. Recent work by D’Amour et al. (2020) identified that while various models can appear to perform well on carefully crafted test data, their performance may be unacceptable on data slices of interest when models are deployed to real data. Because of overparameterization of deep networks, and underspecification of these parameters by the training data-set, a trained model converges to a solution it prefers on the training data, which may not be the solution one actually desires for the data slice of interest on the actual real data. Many applications can tolerate irreproducibility or deviation between performance on test data and real data. However, for some applications such model behavior can be detrimental. For example, two different medical diagnoses for a medical concern may have life-threatening implications. In reinforcement learning or online Click-Through-Rate (CTR) prediction systems (McMahan et al., 2013), decisions may determine subsequent training examples, and divergence may occur if two models generate different predictions leading to different decisions.
Irreproducibility in deep networks is different from related problems widely studied in the literature. It is not overfitting. In the latter, one would observe degradation in performance accuracy due to noise present in the training set mistakenly assumed to be signal. No such degradation is observed with irreproducibility. The relation to prediction uncertainty is more subtle, and harder to distinguish. Irreproducibility can be considered as a form of uncertainty, but different from the widely studied epistemic uncertainty. Randomness (or shuffling) of training examples does induce irreproducibility due to two reasons, different trajectories towards the optimum, or a different optimum altogether. The former is due to epistemic uncertainty, and may be observed as irreproducibility (as we show later), while the latter is caused by multiple identical optima. Epistemic uncertainty diminishes with more training examples, while irreproducibility due to multiple optima does not.
For convex (linear) models, randomness in training may make optimizers approach the optimum from different directions, thus generating prediction differences due to early stopping, where the model prediction is still far from the “true” optimum as a function of the number of training examples seen. The objective for deep models, on the other hand, has multiple optima, many of which have roughly equal average loss over all test examples but differ on the individual predictions they provide for an example. For such models, nondeterminism in training may lead optimizers to different optima (Summers and Dinneen, 2021; see also Nagarajan et al., 2018) that depend on the training randomness (Achille et al., 2017; Bengio et al., 2009).
Unfortunately, in modern large-scale systems, nondeterminism is unavoidable as highly distributed systems are required to train models with vast amounts of data. Multiple factors contribute to nondeterminism resulting in model irreproducibility: randomness in model initialization, desirable or undesirable shuffling of training examples, rounding errors in optimizers, the actual training, test, and real data-sets, the model complexity, and the choices of the optimizer, architecture, and hyper-parameters. While augmentation of data in training and stochastic regularization randomly applied in training (Summers and Dinneen, 2021) also influence irreproducibility, we do not consider these here, as they are in a different category of directly and intentionally adding randomness. Recently, Shamir et al. (2020) showed that the choice of activation in a deep network also plays a significant role in exacerbating irreproducibility, where specifically the celebrated Rectified Linear Unit (ReLU) (Nair and Hinton, 2010) can be a major contributor.
Our Contributions: We empirically study very simple models on data we synthesize to demonstrate that irreproducibility is indeed a concern even with the simplest possible models. Generating synthetic data and measuring the phenomenon allows us to show that one should expect this problem in deep models due to their non-convex objective surface. We demonstrate that different initializations of supposedly identical models do lead models to different optima, which elevates prediction differences. We demonstrate how randomly shuffling the training examples also increases prediction differences. We show, however, that even with minimal shuffling, non-linearity exacerbates prediction differences, starting even with non-linearity generated by hidden layers in deep networks with identity activation. Moving towards non-smooth ReLU substantially worsens the effect, whereas a smooth activation, such as SmeLU (Shamir et al., 2020) or Swish (Ramachandran et al., 2017), exhibits prediction differences larger than identity, but smaller than ReLU. We show that the PD benefits of a smooth activation are limited locally and diminish with more aggressive shuffling because they arise from fewer local optima. Too aggressive shuffling diminishes the effects of fewer optima. Interestingly, prediction differences, although negligible, are observed even with convex models with identical initialization and no shuffling. This is because modern training systems, such as TensorFlow, use mini-batches of examples often processed in random order. Rounding errors in computation accumulate to produce nonzero prediction differences. These, however, are no longer negligible with deep models with nonlinear activation, even without data shuffling and with identical initialization. With aggressive shuffling, even convex models can show a high level of irreproducibility stemming from early stopping with a different trajectory to the optimum, as mentioned above. Finally, our results show that PDs increase with model complexity, are affected by the amount of smoothness of a smooth activation, by the choice of optimizer, and by choices of other hyper-parameters, and can be mitigated by warm starting models even when aggressive data shuffling is present.
We build the paper starting from the simplest possible case, where our synthetic data is generated from a simple linear model with a few true binary features. Our baseline trained model is a simple linear model with those binary features. With aggressive shuffling, we observe elevated PDs driven by epistemic uncertainty (early stopping) even with this baseline, but without affecting the prediction loss. We first consider a model with the input binary features feeding into a single non-linearity (identity, and then ReLU). Such a model is misspecified relative to the true data model, and is unable to find the best solution. However, even with identity activation, it is no longer convex, and the model can find two different solutions due to its overparameterization, in turn increasing PDs. Accuracy (or prediction loss) is equal for the different models. A ReLU unit can lock the hidden unit to allow only one sign of inputs, again misspecifying the solution, and requiring more units to explain data with opposite signs. We next move to two units, and demonstrate that while accuracy can improve, PDs actually increase with more degrees of freedom: overparameterization (or underspecification of the model in training). We continue by looking at deeper models with more hidden units, showing that PDs increase. We then demonstrate similar behavior for models in which embeddings replace the binary inputs. We conclude by showing similar behavior for deep models where the data is generated by a non-linear (quadratic) model. Here, the deep models substantially outperform a linear baseline in terms of prediction loss, but still exhibit irreproducibility as seen for the other models.
Related Work: While there is ample literature about uncertainty in deep models, as mentioned, irreproducibility has been given very little attention. Ensembles (Dietterich, 2000) that reduce uncertainty (Lakshminarayanan et al., 2017) can also reduce PD. They do, however, impose more system complexity and technical debt, and they can trade model accuracy for better reproducibility, especially in large scale systems which operate in the underfitting regime. This is because if one is constrained by computation (number of flops per single example), in order to keep the constraint, the components of the ensemble must consist of narrower layers than those of the single network which is compared to. In the underfitting regime, where adding units still improves accuracy, this degrades model accuracy. Ensembling the components improves it by less than co-training all parameters in a single network.
Distillation (Hinton et al., 2015) is becoming a popular method to transfer information from a high complexity expensive teacher model to a simpler student model, which is trained to learn from the teacher. The simple model can be deployed, saving on deployment resources. Co-distillation, proposed by Anil et al. (2018) (see also Zhang et al. (2018)), addresses irreproducibility by symmetrically distilling among training models in a set, pushing them to agree on a solution. Only a single model from the set, which is more reproducible, must be deployed. The additional constraints, however, can impair the accuracy of the deployed model. Anti-Distillation (Shamir and Coviello, 2020) embraces ensembles, and adds a loss that forces components to diverge from one another, together capturing larger diversity of the solution space, giving more reproducible ensembles. Additional approaches to address irreproducibility anchor the trained model to some constraints, forcing it to prefer solutions that satisfy the constraints over others (Bhojanapalli et al., 2021; Shamir, 2018). D’Amour et al. (2020) and Summers and Dinneen (2021) recently studied the irreproducibility problem on benchmark data-sets.
2 Set Up and Evaluation Metrics
To train and evaluate models on synthetic data we built a simulation framework that consists of data generation, a model training pipeline and an evaluation component.
Data Generation: Experiments described are on data that was generated by a true linear model (Sections 3-5) and a true quadratic model (Section 6). The linear model is described here, and the quadratic in Section 6. The linear model is a sparse model with binary features. Each training and test example is a sparse vector whose components take value 1 for features present in the example, and 0 for features that are not present. We usually partition the components into two sets of , in each set, dimension takes value with i.i.d. probability. This guarantees nonuniform probability for different features, with some “long tail” of the higher indexed features, where low-indexed features appear more frequently (e.g., for , with probability ).
True log-odds weights for each of the features are drawn randomly once. To diversify values of , for each value, one of normal distributions is picked with uniform probability, and then is drawn from that distribution. The distributions are , with . The label for training example is drawn with probability computed by the Sigmoid of the cumulative log-odds, given by
where is the transpose operator. The label is drawn from a Bernoulli distribution with probabilities in (1).
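The generation procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the number of features, the decaying presence probabilities, and the means and standard deviations of the three normal components are hypothetical values chosen here, not the settings used in the experiments, and a single feature set is used instead of two.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def draw_true_weights(num_features, components, rng):
    """Draw each true log-odds weight from a uniformly chosen normal component."""
    weights = []
    for _ in range(num_features):
        mean, std = rng.choice(components)
        weights.append(rng.gauss(mean, std))
    return weights


def draw_example(feature_probs, weights, rng):
    """Draw a sparse binary example, its true positive-label probability, and a label."""
    x = [1 if rng.random() < p else 0 for p in feature_probs]
    log_odds = sum(w * xi for w, xi in zip(weights, x))  # cumulative log-odds
    p_true = sigmoid(log_odds)
    y = 1 if rng.random() < p_true else 0  # Bernoulli draw
    return x, p_true, y


# Hypothetical settings, chosen only for illustration.
rng = random.Random(0)
num_features = 8
feature_probs = [0.5 / (j + 1) for j in range(num_features)]  # long tail
components = [(-1.0, 0.3), (0.0, 0.3), (1.0, 0.3)]  # (mean, std) pairs
weights = draw_true_weights(num_features, components, rng)
x, p_true, y = draw_example(feature_probs, weights, rng)
```

Keeping `p_true` alongside each example is what later allows metrics to be normalized by the true label probability.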
Training: Each model trains with a single pass over examples generated as described, attempting to learn corresponding parameters to explain the observed labels. For deep models, the model inputs are the vectors . For the baseline linear model, the model learns weights corresponding to the dimensions of . Each experimental setting is trained and evaluated times, consisting of pairs of models. A pair of models trains on the same sequence of example/label pairs . If an experiment uses an identical initialization, the parameters of the models in the pair are initialized identically, as well. To remove dependence on a specific sequence (and specific initialization values in experiments with identical initialization), each pair of models has its own example/label pair sequence (and its own initialization values), which are equal for the two models in the pair, but not to those of other pairs.
Using the TensorFlow Keras framework, training is done in mini-batches of examples. Randomness in updates can occur within a mini-batch as in realistic systems, where one has no control of order of operations inside a mini-batch. To test effects of shuffling, we define a window size of the number of mini-batches over which shuffling can occur. For (or ), shuffling of training data occurs only inside the mini-batch. With , the training data of a pair of models is partitioned into windows of size , and training examples are randomly shuffled inside each window differently for each of the two models in the pair.
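A minimal sketch of the windowed shuffling, assuming the training data is a flat sequence of examples and that a window spans a whole number of mini-batches. The function name and arguments are hypothetical. Note that with a window of one mini-batch the sketch shuffles explicitly, whereas in the experiments within-batch randomness comes from the uncontrolled order of operations.

```python
import random


def shuffle_in_windows(examples, batch_size, window, rng):
    """Shuffle examples independently inside consecutive windows of `window` mini-batches."""
    span = batch_size * max(window, 1)  # number of examples per window
    shuffled = []
    for start in range(0, len(examples), span):
        chunk = list(examples[start:start + span])
        rng.shuffle(chunk)  # examples never leave their own window
        shuffled.extend(chunk)
    return shuffled


# The two models of a pair see the same data, shuffled differently per window.
data = list(range(24))
order_a = shuffle_in_windows(data, batch_size=4, window=2, rng=random.Random(1))
order_b = shuffle_in_windows(data, batch_size=4, window=2, rng=random.Random(2))
```

A larger window lets the two orderings drift further apart, which is the knob the experiments vary.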
To illustrate effects of the choice of optimizer and hyper-parameters, we considered two different optimizers: AdaGrad (Duchi et al., 2011) (per-coordinate learning schedule) with learning rates 1.0 and 0.1 and accumulator initialization of 0.1, and SGD with learning rates 0.1 and 0.01, decay rate 0.001, and momentum 0.9. Optimizers minimize binary logistic (cross-entropy) loss on the training data.
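To make the per-coordinate learning schedule concrete, the following is a simplified sketch of a single AdaGrad update. It is an illustration only, not the Keras implementation; library versions typically also add a small epsilon to the denominator. The function name is hypothetical.

```python
import math


def adagrad_step(weights, grads, accumulators, learning_rate):
    """Apply one per-coordinate AdaGrad update in place."""
    for i, g in enumerate(grads):
        accumulators[i] += g * g  # running sum of squared gradients
        weights[i] -= learning_rate * g / math.sqrt(accumulators[i])


weights = [0.0, 0.0]
accumulators = [0.1, 0.1]  # accumulator initialization of 0.1, as in the text
adagrad_step(weights, [1.0, 0.1], accumulators, learning_rate=0.1)
```

Each coordinate gets its own effective step size, so rarely seen features keep larger steps than frequently updated ones.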
Evaluation and Metrics: For each pair of models, an identical evaluation set of examples was drawn from the data model defined above. An accuracy metric and a reproducibility metric were generated and averaged over the pairs. The accuracy of a single model on the evaluation set was measured in terms of average excess label loss given for model by
where denotes the th evaluation example for the model, is the true probability of for this example (1), and is predicted for by model . The quantity in (2) has a notion of expected regret on the evaluation set. It measures the extra average loss a model incurs over an empirical entropy rate of the evaluation set. (Clearly, other forms of metrics could be used here, but they are expected to demonstrate similar qualitative behavior.)
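One natural reading of the excess label loss, sketched below, is the average logistic loss of the model predictions minus the average logistic loss incurred by the true label probabilities on the same labels. This is an assumption about the form of (2), and the function names are hypothetical.

```python
import math


def log_loss(y, p, eps=1e-12):
    """Binary logistic loss of predicting probability p for label y."""
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def excess_loss(labels, true_probs, predicted_probs):
    """Average loss of the predictions minus average loss of the true probabilities."""
    n = len(labels)
    model = sum(log_loss(y, q) for y, q in zip(labels, predicted_probs)) / n
    oracle = sum(log_loss(y, p) for y, p in zip(labels, true_probs)) / n
    return model - oracle


labels = [1, 0, 1, 1]
true_probs = [0.8, 0.3, 0.6, 0.9]
predicted_probs = [0.7, 0.4, 0.5, 0.95]
regret = excess_loss(labels, true_probs, predicted_probs)
```

A model that predicts the true probabilities exactly has zero excess loss under this reading.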
Various metrics can be used to measure prediction difference (PD) Chen et al. (2020); Shamir and Coviello (2020); Shamir et al. (2020). They all, however, demonstrate similar qualitative behaviors. Here, we choose a related, but slightly different metric of relative PD on the positive label
where in this definition we assumed that for any , the models are a pair of models. This metric deviates from the definition of PD in Shamir et al. (2020) by averaging over the pair PDs instead of over PDs relative to the expectation on the models. This is because in our experiments we have cases in which different pairs have different conditions, averaging out the effects of, for example, a specific initialization vector. Relative PD is measured as a fraction of the true probability of the positive label. Relative PD, on one hand, gives a metric which reduces dependence on the value of this probability, but on the other, does exacerbate the effect of small predictions. However, with our data model set up the latter does not appear significant. Since we know the true label probability, our ratio is normalized by its value, unlike other works, which had no knowledge of this value and used the expectation of predictions instead. This reduces the variance of our measurements.
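A sketch of a relative PD computation consistent with this description: the absolute difference between the two positive-label predictions of a pair, normalized by the true positive-label probability, averaged over evaluation examples and then over pairs. The exact form of (3) is assumed, and the function names are hypothetical.

```python
def relative_pd(preds_a, preds_b, true_probs):
    """Mean of |p_a - p_b| / p_true over the evaluation examples of one model pair."""
    ratios = [abs(a - b) / p for a, b, p in zip(preds_a, preds_b, true_probs)]
    return sum(ratios) / len(ratios)


def mean_relative_pd(pairs):
    """Average the per-pair relative PD over all model pairs."""
    values = [relative_pd(a, b, p) for a, b, p in pairs]
    return sum(values) / len(values)


# Each tuple holds (predictions of model A, predictions of model B, true probabilities).
pairs = [
    ([0.2, 0.5], [0.1, 0.5], [0.1, 0.5]),
    ([0.4, 0.4], [0.4, 0.4], [0.4, 0.5]),
]
pd_value = mean_relative_pd(pairs)
```

Dividing by the true probability is what makes small-probability examples weigh more heavily, the trade-off noted above.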
3 Linear Data Model with Single and Double Hidden Units
We start with the baseline model, and with simple models with a single hidden layer, with one or two units. Fig. 1 shows generic graphs of the three simplest models. Binary inputs enter at the bottom. Each link on the graphs is associated with a learned weight , connecting at layer between the th element in the layer below the link (input) and the th element in the layer above the link (output). Triangles with designate the Sigmoid function as in (1), and triangles with designate a generic activation that could be identity, ReLU, SmeLU, or any other non-linearity. Before applying the non-linearity, in both cases, a learned bias is added to the value forward propagating from the layer below. Thus the predictions for label of the three models in Fig. 1 are defined as
where for the double hidden units network, is a matrix of dimensions .
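The three predictors can be sketched as follows, assuming only the structure stated in the text: a learned bias is added before every non-linearity and the output passes through a Sigmoid. All names and numeric values are hypothetical. The usage lines also illustrate why a single identity unit is already overparameterized: flipping the signs of the hidden weights, the hidden bias, and the output weight leaves the prediction unchanged, while the same flip changes the prediction under ReLU.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def identity(z):
    return z


def relu(z):
    return max(0.0, z)


def predict_linear(x, w, b):
    """Baseline convex model: Sigmoid of a biased linear combination of the inputs."""
    return sigmoid(dot(w, x) + b)


def predict_hidden(x, W1, b1, w2, b2, activation):
    """One hidden layer with len(W1) units (one or two in this section)."""
    hidden = [activation(dot(row, x) + bias) for row, bias in zip(W1, b1)]
    return sigmoid(dot(w2, hidden) + b2)


x = [1, 0, 1]
W1, b1, w2, b2 = [[0.4, -0.2, 0.3]], [0.1], [1.5], -0.2
neg_W1 = [[-v for v in row] for row in W1]
neg_b1 = [-v for v in b1]
neg_w2 = [-v for v in w2]
p_original = predict_hidden(x, W1, b1, w2, b2, identity)
p_flipped = predict_hidden(x, neg_W1, neg_b1, neg_w2, b2, identity)
p_relu = predict_hidden(x, W1, b1, w2, b2, relu)
p_relu_flipped = predict_hidden(x, neg_W1, neg_b1, neg_w2, b2, relu)
```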
Fig. 2 shows expected relative PD over the model pairs as function of the shuffling window size (top) and as function of the excess loss (bottom) for values of shuffling window size . Curves are shown for identical initializations of the parameters in a pair of models (link weights and biases), and for distinct initialization of each model in the pair (marked as “diff” in the legend). Curves are shown for the baseline convex linear model with prediction in (4), models with a single hidden unit (5), with either as identity, or as ReLU, and for models with two hidden units (6), with both activations.
As we move left on the bottom graph, accuracy improves (loss decreases). As we move down, PD improves. Consistent with results on real and benchmark data-sets, e.g., Shamir et al. (2020), while PD changes with identical or distinct initialization and with the shuffling window, loss on evaluation data appears to be equal for a given model regardless of these factors. The baseline convex linear model, expectedly, exhibits the best PD curve. Because it is properly specified to the data generation model, where the other models are misspecified, the linear baseline also exhibits the best loss (as shown in Section 6, this is only an artifact of the data generation). With identical initialization and no shuffling, while the linear baseline has the best PD, other models are not far behind. As shuffling increases, there is a sustained gap between the linear baseline and the deep models. With aggressive shuffling, PD of the linear model is also high. Time series evaluations, though, show that, while for the other models there are limited improvements with more training examples, for the linear model, PD improves with more training examples. This suggests that PD for the linear model is dominated by early stopping, whereas with the deep models there are more degrees of freedom that yield prediction differences.
With different initializations of models in each pair, we observe the effect of the activation. While the PD of the convex baseline somewhat degrades with little shuffling (but not with more shuffling), PDs of the deep models degrade much more with little shuffling. PD for the smooth identity activation with a single mini-batch degrades by orders of magnitude relative to the same activation with identical initialization (from to ). ReLU degrades substantially to , which is retained with more shuffling. ReLU with two units, even with identical initialization already clearly demonstrates orders of magnitude higher PD than the baseline. Such degradation is not present with the smooth identity activation with 2 units. Enough shuffling with equal initialization leads to PD levels of equal or higher orders to those with different initializations. The benefits of the smooth identity diminish (and eventually disappear) with more shuffling.
To understand the effects of the different factors, we studied the weights that pairs of trained models converged to. Fig. 3 shows examples of weights learned by pairs of models as function of the parameter index. For each model a different color is used. With the baseline linear model, even with aggressive shuffling and different initialization, the points almost match and seem to deviate very little for all pairs. Note, however, that if we measure the cosine between the difference vectors , and , where the trained vectors are those of the th pair (both trained with different initial weights), we observe a cosine of almost with no shuffling, but one of with maximum shuffling, suggesting that the shuffling results in the model approaching the optimum from a different direction.
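The direction comparison can be sketched as a cosine between two difference vectors. The exact vectors are not recoverable from the text here, so the sketch assumes each one is a trained weight vector minus a common reference point (for example the true weights). That choice and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)


def approach_cosine(w_a, w_b, w_ref):
    """Cosine between the offsets of two trained weight vectors from a reference."""
    d_a = [a - r for a, r in zip(w_a, w_ref)]
    d_b = [b - r for b, r in zip(w_b, w_ref)]
    return cosine(d_a, d_b)


same_direction = approach_cosine([1.0, 1.0], [2.0, 2.0], [0.0, 0.0])
orthogonal = approach_cosine([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

A value near 1 means both models sit on the same side of the reference point, while smaller values indicate they approach it from different directions.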
For the single activation models, with different initialization, we observe two different phases. One is similar to the linear model, but the other has weights that converge to opposite signs (with little noise) as shown in the second image in Fig. 3. We were unable to observe the latter with identical initialization pairs, even with the most aggressive shuffling. Single ReLU models appeared slightly more noisy than those with identity activation. The bottom images in Fig. 3 for two ReLU units show clearer deviations between the models with minimum shuffling, and stronger deviations with maximum shuffling. These conclusions are even more obvious by observing Fig. 4, that shows only the weight differences between models in a pair for a single ReLU model. The scale of differences with minimum shuffling is and increases to with maximum shuffling.
4 Multi Layers
We next move to more complex models with the same input but deeper with more units as shown in Fig. 4(a). With the linear data model, these are still misspecified. As shown in Fig. 6, this yields worse PD for the ReLU model even with minimal shuffling and with identical initialization of the model pair. Some model instantiations (even with little shuffling) are stuck, unable to learn, degrading expected loss and PD substantially. Fig. 7 demonstrates the “wide correlation cloud” between the learned weights of two models in an identically initialized pair of ReLU activated models. This cloud suggests that the models converge to very different optima (although, despite that, when they manage to learn, excess loss on evaluation data is similar). PDs of both the ReLU and the identity activation are high and worse than a single ReLU also with more shuffling. With minimal shuffling, identity still retains low PDs of .
5 Wider Layers with Trained Embeddings
In this section, we study wider models where inputs are mapped to an embedding space, as shown in Fig. 4(b). Each feature in each of the two sets of features is mapped to a two dimensional vector . For each example, the vectors representing all active features in a set of (nonzero components of ) are summed into an embedding input, which is fed into hidden units, activated by .
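The embedding input described above amounts to summing the embedding vectors of the active features in a set. A minimal sketch, with a hypothetical table of two-dimensional embeddings for three features:

```python
def embed_and_sum(x, table):
    """Sum the embedding rows of all active (nonzero) features in one feature set."""
    dim = len(table[0])
    total = [0.0] * dim
    for active, row in zip(x, table):
        if active:
            for k in range(dim):
                total[k] += row[k]
    return total


# Hypothetical embedding table: one two-dimensional vector per feature.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding_input = embed_and_sum([1, 0, 1], table)
```

In the wider models, one such summed vector per feature set is what feeds the hidden units.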
Fig. 8 shows PDs as function of the shuffling window and the excess loss for different activation functions, optimizers, and learning rates for model pairs with identical initialization. Again, we observe mostly consistent excess loss for the same configurations while PDs change with shuffling. Similar behavior to previous models is observed as function of the activation, with the baseline linear having the best PD curve. However, loss and PD are also affected by the choice of optimizer and its hyper-parameters, where too large learning rates can degrade loss and PD. Interestingly, well tuned learning rates even lead to improved loss over the baseline linear with the identity activation. Again, ReLU exhibits worse PDs even with minimal shuffling. Fig. 9 shows correlation curves for two pairs of ReLU models for both embeddings and hidden weights. If embeddings are aligned, a narrow cloud is observed for the hidden weights. However, in pairs in which embedding weights are less correlated, hidden weights also deviate more substantially.
6 A Quadratic Data Model
A linear data model may be too simplified for many real cases. To demonstrate irreproducibility in other data models, we also studied a synthetic quadratic model with 32 input dimensions. Here, the 32 inputs were partitioned into 8 blocks of 4 units, , and the log-odds ratio for the label of each pattern is given by , where . Each is a lower-triangular matrix, whose 10 components are generated independently from a mixture of three normal distributions, as before. In each block of , the -th feature is selected with a prior probability, for , to obtain a comparable tail distribution. Models, containing two hidden layers, with sizes , were trained as before, with a variety of activation functions, including Swish ($x \cdot \sigma(\beta x)$), and SmeLU ($0$ for $x \le -\beta$, $(x+\beta)^2/(4\beta)$ for $|x| \le \beta$, and $x$ for $x \ge \beta$), with different parameters $\beta$.
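For reference, the two smooth activations can be written in a few lines. The sketch follows their standard definitions in the cited works, with a parameter `beta` that controls smoothness: Swish multiplies the input by a Sigmoid of the scaled input, and SmeLU joins the two linear pieces of ReLU with a quadratic segment on the interval from `-beta` to `beta`.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def smelu(x, beta=1.0):
    """SmeLU: zero on the left, identity on the right, quadratic in between."""
    if x <= -beta:
        return 0.0
    if x >= beta:
        return x
    return (x + beta) ** 2 / (4.0 * beta)


values = [smelu(v, beta=0.5) for v in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

As `beta` shrinks towards zero, SmeLU approaches ReLU, which is one way to view the smoothness trade-off discussed in the text.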
Fig. 10 shows PD and excess loss as function of shuffling for identically initialized model pairs. The graphs are very similar to those of the linear model, with the exception of superior loss with non-linear activations, and very poor performance for the baseline linear model. Again, excess loss seems unaffected by shuffling, whereas PD is affected. Fig. 10 also shows how properly tuned SmeLU and Swish can improve PD (with comparable loss) over ReLU as long as shuffling is limited. As shuffling becomes more aggressive, PD gains diminish and eventually disappear because aggressive shuffling can still find different optima even when there are fewer optima. (The well tuned SmeLU slightly outperforms the well tuned Swish.) We also observe trade-offs as function of for both activations consistent with those described for real and benchmark data-sets in Shamir et al. (2020). (Similar results for SmeLU and Swish were also observed with the narrow and wide models.)
Finally, Fig. 10 studies the effect of warm starting with transfer learning (TL) on the ReLU models. Here, we trained a ReLU model with the same data generation, and used the resulting model weights to initialize the model pair, which is now trained with a smaller learning rate. The lower learning rate reduces PD levels even with aggressive shuffling, maintaining competitive excess loss. We should emphasize that the loss comparison is not a fair one, as the TL models have effectively seen twice the amount of training examples.
We empirically studied irreproducibility in deep networks, showing that the phenomenon exists even for the simplest possible models with the simplest possible data generation. We demonstrated that irreproducibility can emerge from randomness in initialization of model parameters and from randomness stemming from intentional (or unintentional) shuffling of the training examples. It can then be exacerbated by rounding errors, non-smooth activations, model complexity, the data model, the choice of model hyper-parameters, and other related factors. Smooth activations were shown to mitigate the problem under limited data shuffling, but their benefits diminish with more aggressive shuffling. We observed how irreproducibility is manifested in the internal representations of a model, with parameters deviating from one another for comparable pairs of models.
References
- Achille et al. (2017). Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856.
- Anil et al. (2018). Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
- Bengio et al. (2009). Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
- Bhojanapalli et al. (2021). On the reproducibility of neural network predictions.
- Chen et al. (2020). arXiv preprint arXiv:2008.07032.
- D’Amour et al. (2020). Underspecification presents challenges for credibility in modern machine learning.
- Dietterich (2000). Ensemble methods in machine learning. Lecture Notes in Computer Science, pp. 1–15.
- Duchi et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
- Dusenberry et al. (2020). Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 204–213.
- Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413.
- McMahan et al. (2013). Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230.
- Nagarajan et al. (2018). Deterministic implementations for reproducibility in deep reinforcement learning. arXiv preprint arXiv:1809.05676.
- Nair and Hinton (2010). Rectified linear units improve restricted boltzmann machines. In ICML.
- Ramachandran et al. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.
- Shamir and Coviello (2020). Anti-distillation: improving reproducibility of deep networks. arXiv preprint arXiv:2010.09923.
- Shamir et al. (2020). Smooth activations and reproducibility in deep networks. arXiv preprint arXiv:2010.09931.
- Shamir (2018). Systems and methods for improved generalization, reproducibility, and stabilization of neural networks via error control code constraints.
- Summers and Dinneen (2021). On nondeterminism and instability in neural network optimization.
- Zhang et al. (2018). Deep mutual learning, pp. 4320–4328.