1 Introduction
Deep learning has been widely applied and has made great progress in recent years in many fields, such as image recognition and synthesis, video understanding, and natural language processing. However, the generalization of neural networks remains an open problem, affected by the unexplainable black-box structure of the algorithms, biased datasets, noisy labeling, and varied evaluation metrics. In particular, an overparameterized deep neural network has far more parameters than training samples. Although classical learning theory predicts that it will severely overfit, it generalizes remarkably well^{VallePerez et al. (2018)}. In this regard, Zhang^{Zhang et al. (2016)} found that a deep neural network trained using SGD can easily fit random labels. In this case, Rademacher complexity, which measures the ability to fit random noise in statistical learning theory, will be approximately equal to 1, resulting in loose and vacuous generalization error bounds. Similar reasoning applies to the VC dimension^{Zhang et al. (2016)}. This shows that traditional learning theory fails to explain why deep neural networks generalize well from the training set to new data.
Recently, some researchers have tried to develop a new generalization theory adapted to overparameterized deep neural networks by explaining the mechanism behind this phenomenon through the memorization-generalization theory of deep learning. Zhang^{Zhang et al. (2016)} argues that signal remains in correctly labeled training data, and that neural networks can not only capture it but also rely on memorization to forcibly fit the noisy part. Therefore, the ability to extract patterns while memorizing exceptional samples is the real reason for the high generalization performance of deep neural networks. Feldman^{Feldman (2020)} further pointed out that, owing to the long-tailed distribution of real-world images, the memorization of rare and atypical samples is the source of the unreasonable, beyond-expectation generalization performance^{Sejnowski (2020)} of deep neural networks. Both imply differences in the distribution of patterns contained in samples and the powerful memory of neural networks. A fundamental question then arises: how can the distribution of intrinsic patterns in samples be represented?
Differences in sample distribution, such as training set bias (i.e., inconsistency between the distributions of the training and testing sets), have become a common bottleneck in deep learning. Existing machine learning research shows that, because of differences in distribution, samples do not carry equal amounts of information or contribute equally to training^{Katharopoulos and Fleuret (2018)}. Reweighting training samples and training in order of difficulty can help alleviate sample bias. OHEM^{Shrivastava et al. (2016)} represents the distribution based on loss and selects difficult samples; Data Dropout^{Wang et al. (2018)} holds that the impact of samples on training can be quantified by the influence function^{Koh and Liang (2017)}; IoU-balanced sampling^{Pang et al. (2019)} explores the sample distribution in detection tasks and proposes an IoU-based representation space in which training data are sampled uniformly to improve generalization performance; GHM^{Li et al. (2019)} argues that sample differences can be reflected by the gradient norm; PISA^{Cao et al. (2020)} proposes an IoU hierarchical local rank strategy as a quantitative way to evaluate sample differences. More recently, a new line of research has emerged that empirically represents the sample distribution based on the memorization-generalization theory. Jiang^{Jiang et al. (2020)} shows that the cumulative binary training loss (CBTL) can be used as a lightweight proxy for a sample's consistency score; Toneva^{Toneva et al. (2018)} proposes a sample representation based on forgetting events.
Inspired by the above work, we explore how to represent the distribution of intrinsic patterns in samples through the memorization-generalization theory, empirically demonstrating the regularity exposed by the underlying patterns in the data. However, we find that a unidimensional sample representation using CBTL or the number of forgetting events alone can cause confusion when distinguishing samples. Therefore, we combine long-term stable information and short-term dynamic signals from training to propose unified regularity measures with a bidimensional representation. They represent the distribution of training samples in a CBTL-forgetting events space and of testing samples in a CBGL-malgeneralizing events space. This study will benefit few-shot learning, training acceleration, difficult sample selection, data distribution consistency, algorithm testing, and related fields. The major contributions of this paper are summarized as follows:

A bidimensional representation combining forgetting events and CBTL is proposed to measure the regularity of training samples.

Likewise, a bidimensional representation combining the newly defined CBGL and malgeneralizing events is proposed to measure the regularity of testing samples.

Our experimental findings suggest that samples with higher regularity contribute little in both training and testing tasks, which in turn validates the effectiveness of the proposed measures.
2 Preliminary Works
The general form^{Li (2012)} of a learning system can be described as follows.
Given a training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $(x_i, y_i)$, $i = 1, 2, \ldots, N$, denotes an input observation-label sample pair, the learning system is trained with the given training dataset to obtain a model, denoted as a conditional probability distribution $P(Y \mid X)$ or a decision function $Y = f(X)$, that describes the mapping between the input and output random variables. The optimal model is generally obtained by minimizing the empirical risk $R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$, where $L$ is the loss.
2.1 Forgetting Events and Cumulative Binary Training Loss
Continuous learning in the real world requires intelligent systems to learn successive tasks without performance degradation on preceding tasks, just as humans do. Researchers have found that this problem setting poses a great challenge for connectionist neural networks^{French (1999); McCloskey and Cohen (1989); Ratcliff (1990)}. The phenomenon, known as "catastrophic forgetting", manifests mainly as the tendency of neural networks to abruptly forget the knowledge acquired on a previous task (e.g., task A) after the current task (e.g., task B) is added. Even deep neural networks, which have achieved great success in recent years, cannot overcome this shortcoming well^{Kirkpatrick et al. (2017)}, which raises doubts about research on general artificial intelligence based on them. The main source of this phenomenon is that, when the task goal changes, the weights adapted to the previous task are adjusted to the needs of the new task, so the network no longer generalizes on the previous task.
Inspired by this, Toneva^{Toneva et al. (2018)} argues that optimizing a single learning task with minibatch stochastic gradient descent can be regarded as a process similar to continuous learning. In this process, each minibatch of training data can be considered a small task that is handed sequentially to the deep neural network. This leads to the following definition of sample forgetting events^{Toneva et al. (2018)}.
Forgetting Events: During the minibatch sample learning process, a training sample that has been acquired (i.e., correctly classified) at time $t$ is misclassified at a subsequent time $t'$ ($t' > t$).
Toneva^{Toneva et al. (2018)} explored the memory dynamics during training and classified the samples into forgettable and unforgettable samples based on forgetting events. They also verified experimentally that forgettable samples have atypical features and are visually hard to read. However, both simple samples that are learned extremely quickly and exceptional samples that are hard to learn have few forgetting events, so the statistics of forgetting events are apparently symmetric and cannot distinguish well between samples. In response, Jiang^{Jiang et al. (2020)} quantified the regularity of a sample by measuring its consistency with the overall distribution of all samples through a lightweight proxy, CBTL, defined as follows.
Cumulative Binary Training Loss (CBTL): During the minibatch sample learning process, the cumulative number of correct classifications of the training sample up to time $t$.
A ResNet110 network was trained on the Cifar10 dataset, and CBTL and the number of forgetting events were recorded. Figure 1 shows that samples in the Cifar10 training set with the same CBTL but different numbers of forgetting events are not equally difficult to classify. The images in the top row are clearly more regular and easier to recognize than those in the bottom row. Hence CBTL alone cannot distinguish the pattern differences between samples well.
Inspired by this, we represent differences in the pattern distribution of samples using forgetting events jointly with CBTL. We consider CBTL a long-term stability measure of the learning difficulty of a sample. The occurrence of a forgetting event, on the other hand, implies that the sample crossed the decision boundary in the wrong direction as the boundary changed, showing that the direction of the loss incurred by the sample at that moment is not consistent with the overall loss of the training set. Therefore, the statistics of forgetting events can be considered a measure of the uncertainty embedded in the sample when learning a decision boundary.
As a result, in this paper, we empirically propose a bidimensional sample regularity representation in the CBTL-forgetting events and CBGL-malgeneralizing events spaces, as shown in Figure 2. Some useful properties of this representation are explored, and the insight of dataset compressibility is validated, in later sections.
Because computing these statistics after each minibatch during training is computationally expensive, this paper updates them after each epoch, as detailed in the Experimental Setup section below. Accordingly, we formalize the above definitions of forgetting events and CBTL as follows.
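The predicted label for the training example $x_i$ obtained after $t$ epochs of optimization is denoted as $\hat{y}_i^{t}$. Let $acc_i^{t} = \mathbb{1}[\hat{y}_i^{t} = y_i]$, which is a binary variable indicating whether the sample is correctly classified at epoch $t$. The formalized definitions are as follows.
Forgetting Events: Taking the subsequent time to be the next epoch, $t' = t + 1$, the number of forgetting events of training sample $x_i$ at epoch $T$ is defined as follows:
$F_i^{T} = \sum_{t=1}^{T-1} \mathbb{1}[acc_i^{t} > acc_i^{t+1}]$ (1)
Cumulative Binary Training Loss (CBTL): For training sample $x_i$, CBTL at epoch $T$ is defined as follows:
$CBTL_i^{T} = \sum_{t=1}^{T} acc_i^{t}$ (2)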
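The two per-sample statistics can be illustrated with a minimal Python sketch. It assumes that the per-epoch correctness of one sample is stored as a list of 0/1 values; the function names `forgetting_events` and `cbtl` and the two toy histories are hypothetical and introduced only for illustration.

```python
def forgetting_events(correct):
    """Count epochs at which a sample goes from correct (1) to incorrect (0).

    `correct` holds one 0/1 flag per epoch for a single training sample.
    """
    return sum(1 for prev, cur in zip(correct, correct[1:]) if prev > cur)


def cbtl(correct):
    """Cumulative number of epochs at which the sample was classified correctly."""
    return sum(correct)


# Two toy samples with the same CBTL but different forgetting behaviour.
steady = [0, 0, 0, 1, 1, 1, 1, 1]       # learned once, never forgotten
oscillating = [0, 1, 1, 0, 1, 0, 1, 1]  # learned early, forgotten twice

print(cbtl(steady), forgetting_events(steady))            # 5 0
print(cbtl(oscillating), forgetting_events(oscillating))  # 5 2
```

The toy histories show why a single dimension can be ambiguous: both samples have a CBTL of 5, and only the forgetting-event count separates them.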
2.2 Malgeneralizing Events and Cumulative Binary Generalizing Loss
This paper is equally concerned with the dynamics of generalization on the testing set during the sequential learning of minibatch data. By analogy with the definitions of forgetting events and CBTL, we give the following definitions.
Malgeneralizing Events: During the minibatch sample learning process, a testing sample that is correctly classified at epoch $t$ is misclassified at a subsequent epoch $t'$ ($t' > t$).
Cumulative Binary Generalizing Loss (CBGL): During the minibatch sample learning process, the cumulative number of correct classifications of the testing sample up to epoch $t$.
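By analogy with the training-side quantities, the two testing-side definitions could be written per epoch as in the sketch below. The symbols $\tilde{x}_j$, $\tilde{y}_j$, $gen_j^{t}$, $M_j^{T}$, and $CBGL_j^{T}$ are notational assumptions introduced here for illustration, and the subsequent epoch is assumed to be the next one.

```latex
% Assumed notation: (\tilde{x}_j, \tilde{y}_j) is the j-th testing sample and
% \hat{y}_j^{t} is its predicted label after t epochs of training.
\begin{align}
  gen_j^{t}  &= \mathbb{1}\left[\hat{y}_j^{t} = \tilde{y}_j\right], \\
  M_j^{T}    &= \sum_{t=1}^{T-1} \mathbb{1}\left[gen_j^{t} > gen_j^{t+1}\right], \\
  CBGL_j^{T} &= \sum_{t=1}^{T} gen_j^{t}.
\end{align}
```

Here $M_j^{T}$ counts the malgeneralizing events of testing sample $j$ up to epoch $T$, and $CBGL_j^{T}$ counts the epochs at which it was classified correctly.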
3 Experimental Verification and Analysis
3.1 Experimental Setup
As described in Toneva's work^{Toneva et al. (2018)}, it would be very expensive in time and computation to check whether a forgetting event occurs for every training sample after each minibatch. They therefore update only the samples in the current minibatch after each minibatch update. The calculation of malgeneralizing events faces the same dilemma: it is not feasible to evaluate the generalization status of all testing samples after each minibatch. Considering the limited impact of a single minibatch on model performance, this paper adopts a different strategy, namely updating the inference states of all training and testing samples after each epoch. To validate this choice, the results of our strategy are compared with those of Toneva's strategy^{Toneva et al. (2018)}. We trained a ResNet110 network on the Cifar10 dataset. The average errors of CBTL and the number of forgetting events between the two strategies are 1.2614 and 0.6762, respectively. The Pearson correlation coefficients of the vectors of CBTL and of the number of forgetting events for the 50,000 samples in the Cifar10 training set are 0.9896 and 0.9835, respectively, both indicating very strong correlation. Therefore, the strategy of running one inference pass and updating the states after each epoch is used as a lightweight proxy. With 50,000 training samples, 10,000 testing samples, and a batch size of 128, the number of model inferences required is reduced to about 1/390 of the ideal case. Note that this approximation is likely to aggravate the randomness of model generalization.
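The reduction factor quoted above can be checked with a short calculation. The sketch assumes that the "ideal case" means running inference on every training and testing sample after each minibatch update, and that the last, partial minibatch of an epoch counts as one update; the variable names are illustrative.

```python
import math

n_train, n_test, batch_size = 50_000, 10_000, 128

# Minibatch updates per epoch (the final partial batch counts as one update).
updates_per_epoch = math.ceil(n_train / batch_size)

# Inferences per epoch if every sample is evaluated after every minibatch (assumed ideal case).
ideal_inferences = updates_per_epoch * (n_train + n_test)

# Inferences per epoch if every sample is evaluated once, at the end of the epoch.
proxy_inferences = n_train + n_test

ratio = proxy_inferences / ideal_inferences
print(updates_per_epoch)  # 391
print(ratio)              # roughly 1/391, i.e. about 1/390
```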
This paper mainly explores inference states based on the training process of ResNet110 with the Cifar10 dataset. The model is trained for a total of 200 epochs, and its average training performance is close to the highest accuracy of the architecture on the Cifar10 dataset, i.e., . In particular, the initial learning rate is 0.1, which decreases to 0.01 at the 81st epoch and to 0.001 at the 122nd epoch. In this paper, the same network is trained 10 times under the same hyperparameter settings and its mean value is taken to eliminate the effect of randomness of model inference on the empirical analysis.
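The step learning-rate schedule described above can be written as a small function. This is a sketch under the assumption that epochs are numbered from 1, so that "the 81st epoch" is the first epoch trained at 0.01; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch):
    """Step schedule for a 200-epoch run; `epoch` is 1-indexed (assumption)."""
    if epoch < 81:
        return 0.1
    if epoch < 122:
        return 0.01
    return 0.001


# Learning rate at the boundaries of each stage.
for epoch in (1, 80, 81, 121, 122, 200):
    print(epoch, learning_rate(epoch))
```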
3.2 Representation of Sample Distribution
The sample distribution is represented from the perspective of the memorization-generalization theory of neural networks. Inspired by the work of Jiang^{Jiang et al. (2020)} and Toneva^{Toneva et al. (2018)}, this paper empirically proposes a bidimensional sample regularity representation in the CBTL-forgetting events and CBGL-malgeneralizing events spaces. This section explores this representation.
3.2.1 Unidimensional Sample Distribution Representation
First, histograms of the sample distribution along each single dimension are shown in Figure 7.
The distributions in Figure 7 are all long-tailed and are dominated by simple samples with high CBTL/CBGL and a small number of forgetting/malgeneralizing events. The distributions of training and testing samples are very similar, but the testing samples are not involved in training, which results in a longer tail. Forgetting/malgeneralizing events have a narrower range and carry more detailed local information than CBTL/CBGL.
3.2.2 Bidimensional Sample Distribution Representation
Further, the regularity of the sample distribution is depicted in the CBTL/CBGL-forgetting/malgeneralizing events space, as shown in Figure 2 (a) and (b). The samples are distributed symmetrically in our bidimensional space with respect to the forgetting/malgeneralizing events, while the number of forgetting/malgeneralizing events varies widely among samples with the same cumulative binary training/generalizing loss, as shown in Figure 10. This again confirms that a single dimension does not distinguish the intrinsic patterns of the samples well.
Figure 2 (a) and (b) also show that the sample density is larger in the lower right corner, i.e., simple samples dominate, which is consistent with the previous findings. The distributions of the testing and training samples have similar symmetry, but the symmetry axis of the training samples lies further to the right and their distribution is tighter. This is consistent with the intuition that the empirical error tends to be smaller than the test error.
3.2.3 Visual Verification
We visualized some samples at different positions of the distribution, as shown in Figure 2 (b). The pattern distribution of the samples clearly differs. For the samples in the lower right corner, such as the brown horse on green grass and the green frog on a gray rock, the objects are highly recognizable and regular. The samples in the middle part, on the other hand, show some confusion between background and foreground objects and are more difficult to identify. The samples in the bottom left corner, such as the cat under green light and the triangular boat, have unconventional category features; they look more like a frog and a tree, respectively, and are easily misclassified.
3.3 Exploring the Measures’ Properties
3.3.1 Training Randomness
Due to the randomness of the optimization process and the uncertainty of model inference, the results of repetitive experiments with the same hyperparameter settings for the same network architecture vary, so the stability needs to be explored. In this paper, we explore the statistics of 10 repetitive experiments.
As shown in the first row of Figure 15, the epoch and the number of the malgeneralizing events’ occurrences and the CBGL for the same sample vary in repetitive experiments. For example, over 10 training sessions, the epoch of the malgeneralizing events’ occurrences for testing sample #1 varies between , the number varies between ; and the CBGL varies between . The forgetting events and CBTL also have similar properties. This means that CBTL/CBGL and forgetting/malgeneralizing events should be counted for multiple repetitions of the experiment.
Further, to quantify intuitively the effect of training randomness on the sample distribution representation, we measure the correlation between the distributions obtained from different training sessions. The correlation matrix, shown in Figure 18, is computed from the correlations between the sample density vectors of each representation. The sample density vectors are obtained by normalizing the vectors produced by the density calculation method described in the caption of Figure 2.
The average Pearson correlation coefficient over 10 training sessions is 0.8634 for training samples and 0.8733 for testing samples. This shows that our representation of the sample distribution has fairly stable statistical properties across repeated training sessions.
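The stability check above can be sketched as a pairwise Pearson correlation matrix over the density vectors of several runs. The density vectors below are made-up toy values, not data from the paper, and averaging the off-diagonal entries is an assumption about how the reported mean coefficient is obtained.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(vectors):
    """Pairwise Pearson correlations between all runs."""
    return [[pearson(u, v) for v in vectors] for u in vectors]


def mean_off_diagonal(matrix):
    """Average correlation over all pairs of distinct runs."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


# Toy normalized density vectors for three hypothetical training sessions.
density_runs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.55, 0.25, 0.15, 0.05],
    [0.45, 0.35, 0.10, 0.10],
]
matrix = correlation_matrix(density_runs)
average = mean_off_diagonal(matrix)
print(average)
```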
3.3.2 The Relationship Between Memorization and Generalization
Due to the similarities in definitions and the computational process, here we discuss the similarities between malgeneralizing events and forgetting events to understand the relationship between memorization and generalization through statistics of neural network’s inference performance dynamics on training and testing samples.
As shown in Figure 21, the histogram distribution of forgetting events is similar to that of malgeneralizing events for model , as well as for model , with Pearson correlation coefficients of 0.9985 and 0.9984, respectively. Likewise, the histogram distribution of CBTL is similar to that of CBGL in the experimental statistics of each model. According to the record^{Krizhevsky et al. (2009)} for producing Cifar10 dataset, the training and testing sets are independently and identically distributed. This means that ideally, the similarity between the distribution of forgetting events and malgeneralizing events of the model can be used to measure the similarity between the training and testing sets. In addition, the red arrows indicate the distribution variability, which to some extent reflects the difference between model learning and generalizing, i.e., generalization error.
Further, we investigate the synchronization between malgeneralizing events and forgetting events. A testing sample is synchronized with a training sample if the epoch at which its malgeneralizing event occurs during training coincides exactly with the epoch at which the forgetting event of that training sample occurs. Testing samples that are classified correctly in all training epochs, or incorrectly in all training epochs, are not synchronized with any training sample, because they have no malgeneralizing events. Thus, we first examine synchronization in different generalization cases. As shown in Figure 22, the number of synchronized training samples varies with the number of malgeneralizing events of the testing sample. Specifically, the more malgeneralizing events a testing sample has, the more synchronized training samples it has. For example, in a single training session, sample 90 (with 2 malgeneralizing events) has 13266 synchronized training samples, while sample 5695 (with 15 malgeneralizing events) has about twice as many.
Considering the aforementioned randomness in training and generalization, Figure 22 also records that the number of synchronized training samples for a particular testing sample decreases significantly, or even drops to 0, when extended to 10 training sessions. This indicates that, although such synchronization occurs hundreds of times or more, most forgetting events of synchronized training samples merely happen to occur at the same epoch as the malgeneralizing events of the target testing sample. This confirms that the neural network does not depend on specific training samples for extracting and generalizing the patterns embedded in the training data. Does the generalization of specific irregular samples then depend on memorizing certain training samples? We extend to 20 training sessions for samples 521, 5695, and 889 to observe the distribution of their synchronized training samples, as shown in Figure 23.
To further verify the effect of these synchronized training samples on the generalization of the corresponding testing samples, we train a binary SVM classifier to determine whether a particular testing sample belongs to a certain class. Its training inputs are 4096-dimensional features extracted from the training samples using a VGG19 network pretrained on ImageNet. Two experimental settings are used: (1) $N$ random training samples containing 46 synchronized samples (equal numbers of training samples belonging to the same category as the testing sample and to different categories; likewise below); (2) $N$ random training samples not containing these synchronized samples. The results are shown in Table 1, where, for the low-regularity sample 889, the classifier trained with synchronized samples generalizes better to the corresponding testing sample. For example, when the synchronized samples account for 46% of the training samples, the classification accuracy for sample 889 of the classifier trained with synchronized samples is about 14 percentage points higher than that of the classifier trained without them. The lead is still about 3 percentage points when the proportion of synchronized samples is reduced to 10%. We conjecture that these synchronized training samples tend to play the role of "support vectors". That is, stochastic gradient descent tends to converge implicitly to the solution that maximizes the separation (margin) of the dataset^{Soudry et al. (2018)}, and the synchronized training samples are more relevant to learning the classification boundaries of the corresponding testing samples than the other samples.
Table 1. Each setting uses 46 synchronized training samples.
Sample ID  N=100, w/ synchronized samples  N=100, w/o synchronized samples  N=460, w/ synchronized samples  N=460, w/o synchronized samples
521  0.9585  0.9737  0.9950  0.9768
889  0.3528  0.2147  0.1561  0.1238
5695  0.7448  0.7496  0.6470  0.5107
3.3.3 Robustness
The generalization of a neural network varies across different optimizers and architectures, so the statistics of the training depend on the given network architecture trained with the given optimizer. It is thus important to explore the effect of these control variables on these four event statistics, and we further explore the robustness of these statistics.
Different Architectures: Our statistics must be collected during training, and the deeper or more complex the network, the higher the time cost of collecting them. Therefore, we explore how this distribution representation is affected by the network architecture, to see whether the sample distribution can be represented reliably with simpler architectures at lower time cost. We trained each of four network architectures, ResNet20, ResNet32, ResNet110, and DenseNet, 10 times and averaged the statistics for analysis. As shown in Figure 28, the sample distribution does not change significantly when the number of layers is reduced from 110 to 32 or 20, or when the more densely connected DenseNet is used. This conclusion is confirmed quantitatively by the correlations of the distributions and statistics. As shown in Table 2, the Pearson correlation coefficients are all above 0.88, indicating very strong correlations. This indicates that our proposed sample distribution representation is robust to the network architecture and can be transferred between architectures. With the same computational power (a single GeForce RTX 2080 Ti GPU), ResNet32 takes only 80 minutes while DenseNet takes up to 1367 minutes, which opens up the prospect of computing the sample distribution statistics with simpler architectures. Thus, the similarity of the 2D sample distribution representations across architectures allows them to be computed quickly with proxy networks to reduce time and computational cost; e.g., estimating the sample distribution statistics for DenseNet with ResNet32 can save up to 21.5 hours of training time and reduce the number of parameters by 7.03 megabytes.
Pearson correlation coefficient  ResNet20 vs ResNet32  ResNet20 vs ResNet110  ResNet20 vs DenseNet  ResNet32 vs ResNet110  ResNet32 vs DenseNet  ResNet110 vs DenseNet
CBTL  0.9986  0.9935  0.9826  0.9977  0.9745  0.9585 
CBGL  0.9973  0.9901  0.9841  0.9962  0.9733  0.9533 
Forgetting events  0.9996  0.9978  0.9953  0.9992  0.9950  0.9925 
Malgeneralizing events  0.9995  0.9989  0.9894  0.9993  0.9924  0.9914 
Density for training samples  0.9603  0.9480  0.9107  0.9594  0.9057  0.9005 
Density for testing samples  0.9560  0.9327  0.8846  0.9524  0.8955  0.9025 
Different Optimizers: Most deep learning algorithms involve some form of optimization, generally with the goal of minimizing a loss function to solve for the parameters. Optimizers thus play a crucial role in deep learning, and in this section we explore the effect of different optimizers on the representation of the sample distribution. In addition to the common minibatch stochastic gradient descent (SGD) method, AdaGrad and AdaMax are chosen for comparison.
As shown in Figure 32, the sample distribution representations under different optimizers differ significantly. The Pearson correlation coefficients for SGD-AdaGrad, SGD-AdaMax, and AdaGrad-AdaMax are 0.8914, 0.6877, and 0.6962 for the training samples and 0.8877, 0.6628, and 0.7062 for the testing samples, respectively. This confirms that the choice of optimizer has a great impact on our proposed bidimensional representation. Essentially, the optimization algorithm determines the direction of optimization at each training step, which changes the learned parameters, leads to differences in the learning process, and thus affects the representation of the sample distribution.
4 Application
4.1 Training Acceleration
The Cifar10 training set is distributed extremely unevenly in our bidimensional representation space. The samples in the lower right corner are relatively simple and have little impact on the performance of the final model, yet they are densely distributed and dominate the training set. As a result, the training set contains a large number of redundant samples with similar patterns. We attempt to train efficiently on a smaller sample set by eliminating these high-density redundant samples. We compute the density value of each sample in our bidimensional representation space as described in the caption of Figure 2 and remove training samples in descending order of density to investigate the effect of the proportion of removed samples on generalization performance. Since the computed density value depends on the radius described in the caption of Figure 2, we first explore the effect of the radius on training performance, as shown in Figure 35 (a). Figure 35 (a) shows that the radius has a large effect on training performance and that the overall performance is relatively best for . Therefore, we use the density computed with this radius for the subsequent experiments. Then, for comparison with our method, we also remove samples according to the CBTL metric proposed in Jiang's work^{Jiang et al. (2020)} and the forgetting events proposed in Toneva's work^{Toneva et al. (2018)}, as shown in Figure 35 (b).
Figure 35 (b) shows that performance drops quickly when samples are removed randomly. However, when samples are removed according to one of these strategies, there is no significant effect on generalization performance, and the test accuracy remains above 0.91 even after removing of the training samples. In addition, the strategy proposed in this paper slightly outperforms the strategies based on forgetting events and CBTL. This validates to some extent the insight of dataset compressibility and contributes to accelerating training with a small sample set and low time and computational cost.
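The removal procedure can be sketched as follows. The exact density definition is given in the caption of Figure 2, which is not reproduced in this section, so the sketch assumes a simple stand-in: the density of a sample is the number of other samples within a given Euclidean radius in the (CBTL, forgetting events) plane. The function names, the toy points, and the radius value are all hypothetical.

```python
def local_density(points, radius):
    """For each 2D point, count the other points within `radius` (assumed density)."""
    r2 = radius * radius
    densities = []
    for i, (xi, yi) in enumerate(points):
        count = sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and (xi - xj) ** 2 + (yi - yj) ** 2 <= r2
        )
        densities.append(count)
    return densities


def keep_after_removal(points, radius, fraction):
    """Indices kept after removing the given fraction of highest-density samples."""
    densities = local_density(points, radius)
    n_remove = int(len(points) * fraction)
    by_density = sorted(range(len(points)), key=lambda i: densities[i], reverse=True)
    removed = set(by_density[:n_remove])
    return [i for i in range(len(points)) if i not in removed]


# Toy (CBTL, forgetting events) coordinates: four crowded easy samples, two isolated ones.
points = [(200, 0), (200, 0), (199, 1), (198, 0), (120, 9), (60, 4)]
densities = local_density(points, radius=3)
kept = keep_after_removal(points, radius=3, fraction=0.5)
print(densities)  # [3, 3, 3, 3, 0, 0]
print(kept)
```

In this toy case the removed samples all come from the crowded cluster, while the two isolated samples are always retained.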
4.2 Testing Acceleration
Unlike training, which is a pattern extraction process in which sample complexity can be reduced by removing samples with similar pattern representations, testing often aims to maximize discrimination between testees' generalization levels using minimal sample complexity. This requires reducing the number of samples involved in evaluation while ensuring adequate discrimination of the testees' generalization performance. In practice, this means paying more attention to difficult samples, which is often achieved through difficulty-based sampling. Our proposed bidimensional sample representation can be used as a measure of difficulty, obtained by dividing the bidimensional space at uniform angles based on the sample distribution.
As shown by the red dashed lines in Figure 2 (b), we divide the samples uniformly by difficulty using rays emanating at uniform angles from a center located at the midpoint of the range of the horizontal coordinates of the 10,000 testing sample representations. The smaller the angle, the finer the difficulty division. (Note that the figure is for illustration only; the horizontal and vertical axes are not drawn to the same scale.)
We first use the original 10,000 testing samples to evaluate 17 algorithms, including deep learning methods and traditional machine learning methods. Then, using the balanced difficulty sampling method described above, we divide the representation into 4 bins and 10 bins at angles of 45 degrees and 18 degrees, respectively. We randomly sample from each bin to obtain a smaller dataset containing 400 testing samples and evaluate the 17 algorithms again. The results are presented in Table 3.
Algorithm  Original (10000 samples)  400 samples (10 bins)  400 samples (4 bins)
LeNet  0.7505  0.3875  0.5325 
NetinNet  0.8859  0.5125  0.6400 
VGG19  0.9338  0.6725  0.7475 
ResNet20  0.9138  0.5850  0.7050 
ResNet32  0.9219  0.5925  0.6850 
ResNet110  0.9335  0.6175  0.7275 
WideRes16  0.9504  0.7450  0.7775 
WideRes28  0.9542  0.8150  0.8425 
EfficientNet(B8)  0.7353  0.4025  0.4925 
DenseNet100  0.9465  0.7825  0.8225 
DenseNet160  0.9528  0.7800  0.7925 
SVM  0.5441  0.3075  0.4225 
Decision Tree  0.2691  0.1575  0.2475 
Logistic Regression  0.3983  0.2800  0.3625 
KNN  0.3398  0.2800  0.3075 
MLP  0.4440  0.3050  0.3925 
Random Forest  0.4749  0.3400  0.3875 
It is clear that algorithm performance drops significantly when tested with samples obtained by balanced difficulty sampling, and this holds for traditional machine learning methods as well. Because the original dataset is long-tailed, difficult samples make up a very small proportion of it and are overwhelmed by the large number of simple samples. Balanced difficulty sampling increases the proportion of difficult samples. The algorithms perform worse on the samples selected with 10 bins than on those selected with 4 bins, because the distribution of the original dataset is uneven and finer-grained sampling therefore further increases the proportion of difficult samples. We use the test results on the original 10,000 testing samples as the benchmark. To measure how well the sampled small datasets discriminate algorithm performance, we calculate the Spearman correlation coefficients between the benchmark and the evaluation results on the small datasets sampled with 10 bins and 4 bins. They are 0.9871 and 0.9877, respectively; both indicate an extremely strong correlation and sufficient discriminative ability.
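The rank-correlation check can be reproduced from the accuracies listed in Table 3. The sketch below is one way to do so with the standard library only: it computes the Spearman coefficient as the Pearson correlation of average ranks, so that the tie in the 10-bin column (0.2800 for both Logistic Regression and KNN) is handled. The helper names are hypothetical, and the tie handling is an assumption about how the reported coefficients were computed.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0           # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Accuracies from Table 3, in the row order of the table.
original = [0.7505, 0.8859, 0.9338, 0.9138, 0.9219, 0.9335, 0.9504, 0.9542,
            0.7353, 0.9465, 0.9528, 0.5441, 0.2691, 0.3983, 0.3398, 0.4440,
            0.4749]
bins_10 = [0.3875, 0.5125, 0.6725, 0.5850, 0.5925, 0.6175, 0.7450, 0.8150,
           0.4025, 0.7825, 0.7800, 0.3075, 0.1575, 0.2800, 0.2800, 0.3050,
           0.3400]
bins_4 = [0.5325, 0.6400, 0.7475, 0.7050, 0.6850, 0.7275, 0.7775, 0.8425,
          0.4925, 0.8225, 0.7925, 0.4225, 0.2475, 0.3625, 0.3075, 0.3925,
          0.3875]

rho_10 = spearman(original, bins_10)
rho_4 = spearman(original, bins_4)
print(f"10 bins: {rho_10:.4f}, 4 bins: {rho_4:.4f}")
```

The two coefficients should agree with the reported 0.9871 and 0.9877 to four decimal places.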
Further, we explore how many samples should be drawn from each bin, building on the 10-bin division, to maintain the discriminative ability for algorithm performance most efficiently. We consider the samples on the horizontal axis, i.e., the samples whose number of forgetting events is 0.
There are only three samples located on the left half-axis, and we sample all of them, followed by for the second bin, for the third bin, , for the 11th bin, and the for the last bin. Except for the first bin having only three samples which are all sampled, in each of the remaining bins, we select samples.
Since the smallest number of samples in any of the latter 11 bins is 49, we explore the effect of , where , on the test results. As we are concerned with the discriminative ability of the testing set for algorithm performance, i.e., with the stability of the algorithm ranking, we calculate the Spearman correlation coefficient between the test results on the sampled samples and those on the original 10,000 testing samples, as well as the mean average precision of the algorithm ranking. The results are plotted as a line graph in Figure 36. We can see that sampling 30 samples in each bin, i.e., 333 samples in total (the 3 samples of the first bin plus 30 from each of the remaining 11 bins), is the best; we therefore construct a small testing set, Cifar10-333, for fast testing of algorithm performance.
Experiments show that uniform regularity sampling of the testing set can significantly reduce the sample complexity while maintaining the discriminative ability for the generalization performance of the tested algorithms.
5 Conclusion
In this paper, we propose to measure the sample-wise regularity that a sample exhibits while being learned or generalized. We show that for i.i.d. training and testing sets, the sample distributions are alike under statistics with similar definitions, i.e., CBTL/CBGL and forgetting/malgeneralizing events. Inspired by this result, we propose a pair of bidimensional representations for measuring sample regularity in the network learning and generalization processes, respectively. In the property investigation, we find that the proposed measures appear fairly stable with respect to the different characteristics of the training and testing stages. Further applications to training and testing acceleration show that samples with higher regularity seem to contribute little in both processes, which in turn validates the effectiveness of the proposed measures.
Declarations
Funding
This work was supported by the National Natural Science Foundation of China under Grant 61973245.
Conflicts of interest/Competing interests
The authors declare that they have no conflict of interest/competing interests.
Availability of data and material
Not available.
Authors’ contributions
Chi Zhang conceived the idea and the method of the study; Chi Zhang and Yu Liu jointly designed the experiments and analyzed the experimental results (each contributing 42.5% of the whole research); Yu Liu performed the experiments; Le Wang, Jingxue Hu and Yuehu Liu helped perform the analysis with constructive discussions (each contributing 5% of the whole research); Chi Zhang and Yu Liu prepared the manuscript, and all authors provided feedback during the manuscript revisions.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent to publication
Not applicable.
References

Prime sample attention in object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11583–11591.
Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959.
Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135.
Exploring the memorization-generalization continuum in deep learning. arXiv preprint arXiv:2002.03206.
Not all samples are created equal: deep learning with importance sampling. In International conference on machine learning, pp. 2525–2534.
Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526.
Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894.
Learning multiple layers of features from tiny images.
Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8577–8584.
Statistical learning methods. Tsinghua University Press.
Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165.
Libra R-CNN: towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 821–830.
Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review 97 (2), pp. 285.
The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences 117 (48), pp. 30033–30038.
Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 761–769.
The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878.
An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159.
Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522.
Data dropout: optimizing training data for convolutional neural networks. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 39–46.
Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.