A synthetic dataset for deep learning

06/01/2019 · by Xinjie Lan, et al.

In this paper, we propose a novel method for generating a synthetic dataset obeying a Gaussian distribution. In contrast to the commonly used benchmark datasets, whose distributions are unknown, the synthetic dataset has an explicit distribution, i.e., a Gaussian distribution. Meanwhile, it has the same characteristics as the benchmark dataset MNIST. As a result, we can easily apply Deep Neural Networks (DNNs) to the synthetic dataset. The synthetic dataset thus provides a novel experimental tool for verifying proposed theories of deep learning.


1 Introduction

Deep learning is a subset of machine learning algorithms that construct Deep Neural Networks (DNNs) to solve complex problems [1]. Although it has achieved great success in various fields, such as speech recognition [2] and image classification [3], the internal logic of deep learning is still not convincingly explained, and DNNs have been regarded as "black boxes" [4].

Based on the underlying premise that DNNs establish a complex probabilistic model [5, 6, 7, 8], numerous theories, such as representation learning [9, 10, 11] and the Information Bottleneck (IB) theory [12, 13, 14, 15], have been proposed to explore the working mechanism of deep learning. Though these theories reveal some important properties of deep learning, such as hierarchy [9, 10] and sufficiency [12, 15], a fundamental problem is that they cannot be directly validated by empirical experiments, because the distributions of the benchmark datasets, e.g., MNIST, are unknown. For example, hierarchy is an important property of DNNs, but we still cannot explicitly formulate the hierarchy property and validate it directly by empirical experiments.

Figure 1: The first row shows three synthetic images of handwritten digits, the second row shows their respective histograms, and the red curve indicates the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.

To solve this problem, we propose a novel algorithm for generating a synthetic dataset obeying a Gaussian distribution, based on the NIST dataset of handwritten digits by class (https://www.nist.gov/srd/nist-special-database-19). In particular, the synthetic dataset has the same characteristics as the benchmark dataset MNIST [16]: it consists of 70,000 grayscale images in 10 classes (digits 0 to 9), and each class has 6,000 training images and 1,000 testing images. Fig. 1 shows three synthetic images. Therefore, we can easily apply various DNNs to the synthetic dataset, just as with MNIST. Since all the grayscale images are sampled from a known distribution, the synthetic dataset obeys a known Gaussian distribution.
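Because every pixel value is drawn directly from a Gaussian, the known-distribution claim is easy to verify empirically. A minimal sketch, mirroring the red curve in Fig. 1; the parameters $\mu = 0.5$, $\sigma = 0.1$ are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.5, 0.1                          # illustrative parameters
pixels = np.random.default_rng(0).normal(mu, sigma, 28 * 28)

# Empirical histogram (as a density) vs. the true Gaussian pdf.
hist, edges = np.histogram(pixels, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.abs(hist - norm.pdf(centers, mu, sigma)).max())  # deviation shrinks as the sample grows
```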

This paper is organized as follows. Section 2 describes the method for generating the synthetic dataset obeying a known Gaussian distribution, and Section 3 shows that the synthetic dataset can be easily applied to the most commonly used DNNs. Section 4 demonstrates that, given the synthetic dataset, we can verify some important properties of deep learning, e.g., hierarchy, based on the recently proposed probabilistic explanation of hidden layers of DNNs [17, 18].

2 The method for generating the synthetic dataset

An underlying assumption of deep learning is that the given training dataset $\mathcal{D} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$ is composed of i.i.d. samples from a joint distribution $p(\boldsymbol{x}, y; \boldsymbol{\theta}) = p(\boldsymbol{x})\,p(y|\boldsymbol{x}; \boldsymbol{\theta})$, where $p(\boldsymbol{x})$ describes the prior knowledge of $\boldsymbol{x}$, $p(y|\boldsymbol{x}; \boldsymbol{\theta})$ describes the connection between $\boldsymbol{x}$ and $y$, and $\boldsymbol{\theta}$ indicates the parameters of the model. Since we can easily formulate $p(y|\boldsymbol{x}; \boldsymbol{\theta})$ given $p(\boldsymbol{x})$, $p(\boldsymbol{x})$ is the key to explicitly formulating $p(\boldsymbol{x}, y; \boldsymbol{\theta})$.

Unlike previous works that use a complex probabilistic model to formulate $p(\boldsymbol{x})$ for a given dataset [19, 20], we first generate a random dataset obeying a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ and then use the generated random dataset to construct a synthetic image based on a mask derived from a benchmark dataset. Since each datum in the random dataset obeys $\mathcal{N}(\mu, \sigma^2)$, we can conclude that the synthetic image also obeys $\mathcal{N}(\mu, \sigma^2)$ based on the spatial stationarity property, i.e., $p(x_{ij}) = \mathcal{N}(\mu, \sigma^2)$ for every pixel position $(i, j)$.
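The stationarity argument can be checked numerically: placing the sampled values into mask positions is just a permutation of the random vector, which changes where each value sits but not the set of values, so the empirical distribution is unchanged. A minimal sketch, with illustrative parameters $\mu = 0.5$, $\sigma = 0.1$ (not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.1                 # illustrative parameters
v = rng.normal(mu, sigma, 28 * 28)   # random vector drawn from N(mu, sigma^2)
x = rng.permutation(v)               # any mask-based placement is a permutation of v

# Same multiset of values, hence the same empirical distribution.
assert np.allclose(np.sort(v), np.sort(x))
```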

More specifically, the method comprises seven steps: (i) generating a random vector $\boldsymbol{v}$ by sampling the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, with one sample per pixel of the synthetic image to be constructed; (ii) converting an image of the NIST dataset into a binary image; (iii) extracting the central part of the binary image; (iv) downsampling the image derived in the previous step to obtain a binary image with dimension $28 \times 28$, matching MNIST; (v) generating the mask of the binary digit image with the Canny edge detection algorithm [21], where the mask indicates four parts of the binary image: outside, outside boundary, inside boundary, and inside; (vi) deriving an ordered vector $\boldsymbol{v}'$ by sorting $\boldsymbol{v}$ in descending order and decomposing $\boldsymbol{v}'$ into four sub-vectors, i.e., $\boldsymbol{v}' = [\boldsymbol{v}'_1, \boldsymbol{v}'_2, \boldsymbol{v}'_3, \boldsymbol{v}'_4]$, which correspond to the outside, the inside boundary, the outside boundary, and the inside, respectively; (vii) generating a synthetic image by randomly placing each pixel of the four sub-vectors into a random position within the corresponding mask part.
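To make the seven steps concrete, here is a minimal Python sketch of one pass of the pipeline. The binarization threshold, the bounding-box-style central crop, the Gaussian parameters, and the exact definition of the four mask regions around the Canny edge are our assumptions; the paper does not specify them, so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np
from skimage.feature import canny
from skimage.transform import resize

def generate_synthetic_digit(nist_img, mu=0.5, sigma=0.1, size=28, rng=None):
    """One pass of the seven-step method (a sketch under stated assumptions)."""
    rng = np.random.default_rng() if rng is None else rng

    # (i) random vector v ~ N(mu, sigma^2), one sample per output pixel
    v = rng.normal(mu, sigma, size * size)

    # (ii) binarize the NIST image (global mean threshold is an assumption)
    binary = nist_img > nist_img.mean()

    # (iii) central part: crop to the digit's bounding box, padded to a square
    ys, xs = np.nonzero(binary)
    crop = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    side = max(crop.shape)
    sq = np.zeros((side, side), dtype=bool)
    oy, ox = (side - crop.shape[0]) // 2, (side - crop.shape[1]) // 2
    sq[oy:oy + crop.shape[0], ox:ox + crop.shape[1]] = crop

    # (iv) downsample to size x size and re-binarize
    small = resize(sq.astype(float), (size, size), anti_aliasing=True) > 0.5

    # (v) Canny edges split the image into four regions (region definitions assumed)
    edge = canny(small.astype(float))
    masks = [
        ~small & ~edge,   # outside
        small & edge,     # inside boundary
        ~small & edge,    # outside boundary
        small & ~edge,    # inside
    ]

    # (vi) sort v in descending order and split it by region size, in the
    # paper's order: outside, inside boundary, outside boundary, inside
    v_sorted = np.sort(v)[::-1]
    counts = [m.sum() for m in masks]
    parts = np.split(v_sorted, np.cumsum(counts)[:-1])

    # (vii) scatter each sub-vector randomly within its own region
    img = np.empty(size * size)
    for mask, part in zip(masks, parts):
        idx = np.flatnonzero(mask.ravel())
        img[rng.permutation(idx)] = part
    return img.reshape(size, size)
```

Because the four masks partition the $28 \times 28$ grid, the synthetic image is exactly a permutation of the sampled vector $\boldsymbol{v}$, which is what makes the stationarity argument above go through.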

Figure 2: The performance of CNN1 on the synthetic dataset

The method for generating a synthetic image is summarized in Algorithm 1, and Fig. 3 visualizes the relationship between the four sub-vectors of $\boldsymbol{v}'$ and their corresponding masks.

Input: NIST dataset of handwritten digits by class
1: repeat
2:     sample $\mathcal{N}(\mu, \sigma^2)$ to derive a random vector $\boldsymbol{v}$
3:     binarize an image of NIST to obtain $\boldsymbol{B}$
4:     extract the central part of $\boldsymbol{B}$ to obtain $\boldsymbol{B}_c$
5:     downsample $\boldsymbol{B}_c$ to obtain $\boldsymbol{B}_d$ with dimension $28 \times 28$
6:     extract the edge of $\boldsymbol{B}_d$ to obtain the mask image $\boldsymbol{M}$
7:     decompose $\boldsymbol{M}$ into four parts: outside, outside boundary, inside boundary, and inside
8:     sort $\boldsymbol{v}$ in descending order to derive $\boldsymbol{v}'$
9:     decompose $\boldsymbol{v}'$ into four sub-vectors, i.e., $\boldsymbol{v}' = [\boldsymbol{v}'_1, \boldsymbol{v}'_2, \boldsymbol{v}'_3, \boldsymbol{v}'_4]$
10:    place each pixel of $\boldsymbol{v}'_1, \ldots, \boldsymbol{v}'_4$ into a random position within the corresponding mask part to generate a synthetic image
11: until 20,000 synthetic images are generated
Output: the synthetic dataset
Algorithm 1: The algorithm for generating the synthetic dataset
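The outer loop of Algorithm 1 simply repeats the per-image procedure. A hedged sketch of that loop, reusing the `generate_synthetic_digit` function above; `nist_pairs`, an iterable of (image, label) pairs from NIST SD-19, is a hypothetical input the caller must supply:

```python
import numpy as np

def build_dataset(nist_pairs, n_target=20000, rng=None):
    """Repeat per-image generation until n_target synthetic images exist."""
    rng = np.random.default_rng() if rng is None else rng
    images, labels = [], []
    for img, label in nist_pairs:   # (image, label) pairs from NIST SD-19
        images.append(generate_synthetic_digit(img, rng=rng))
        labels.append(label)
        if len(images) == n_target:  # until n_target images are generated
            break
    return np.stack(images), np.array(labels)
```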
Figure 3: The first row shows an original image, its edge, and the corresponding synthetic image. The second row uses white pixels to show the four parts of the mask image $\boldsymbol{M}$. The third row shows the synthetic image corresponding to each part of $\boldsymbol{M}$.

3 Experiments

In this section, we demonstrate that the synthetic dataset can be easily applied to DNNs. First, we design a simple but representative Convolutional Neural Network (abbr. CNN1) for classifying the synthetic dataset. CNN1 has five hidden layers: two convolutional layers, two max pooling layers (each followed by a ReLU), and one fully connected layer. Table 1 summarizes the architecture of CNN1.

We train CNN1 for 30 epochs with a learning rate of 0.008. Fig. 2 shows the performance of CNN1 on the synthetic dataset: CNN1 achieves zero training error after 20 training epochs, and the testing error is also very small. Overall, we conclude that the synthetic dataset can be applied to DNNs.


Layer   Description
1       Input
2       Conv
3       Maxpool + ReLU
4       Conv
5       Maxpool + ReLU
6       Fully connected
7       Output (softmax)
Table 1: The architecture of CNN1 for the experiments
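For reproduction, a PyTorch sketch of a CNN1-style network is given below. Table 1 does not list kernel sizes or channel counts, so those numbers (and the padding) are illustrative assumptions; only the layer sequence, the 30-epoch budget, and the 0.008 learning rate come from the text.

```python
import torch
import torch.nn as nn

class CNN1(nn.Module):
    """Layer sequence from Table 1; widths and kernels are assumed, not given."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # Conv
            nn.MaxPool2d(2), nn.ReLU(),                   # Maxpool + ReLU
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # Conv
            nn.MaxPool2d(2), nn.ReLU(),                   # Maxpool + ReLU
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # Fully connected

    def forward(self, x):                                  # x: (N, 1, 28, 28)
        return self.classifier(self.features(x).flatten(1))

model = CNN1()
optimizer = torch.optim.SGD(model.parameters(), lr=0.008)  # lr from the paper
criterion = nn.CrossEntropyLoss()  # applies log-softmax, i.e. the softmax output
# Training then runs for 30 epochs over the 60,000 synthetic training images.
```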

4 Conclusion

In this work, we propose a novel method for generating a synthetic dataset. In contrast to the commonly used benchmark datasets with unknown distributions, the synthetic dataset has an explicit distribution, i.e., a Gaussian distribution. In particular, it has the same characteristics as the benchmark dataset MNIST. As a result, we can easily apply Deep Neural Networks (DNNs) to the synthetic dataset.

References