## I Introduction

High-quality document digitization offers a compelling way to preserve precious ancient manuscripts, and it gives historians and researchers easy access to them. Retrieving information from these knowledge resources helps in interpreting and understanding history in various domains and in learning about our cultural and societal heritage. However, digitization alone is of limited help until these manuscript collections can be indexed and made searchable. The characters in the documents must be recognized, and several machine learning based approaches exist in the literature for this purpose. A primary requirement of such training-based systems is the availability of a labeled training dataset. Labeling datasets is not only costly but also laborious and error prone. Therefore, in this work, we propose to generate character images automatically with the help of a labeled dataset. The generated images, which are labeled automatically as a result, can then be used for training, which should in turn improve the performance of the classification system.

There has been considerable work on Deep Generative Models (DGMs) for generating various kinds of images, e.g. object images and scene images. In this work, we evaluate some of the popular DGMs and test their performance on 2 datasets, which are described in Section IV.

Principally, two main categories of DGM exist in the literature: the auto-encoder (AE) and the Generative Adversarial Network (GAN). In this paper, we focus on both types of generative models.

## II Auto-Encoder

An auto-encoder [Ng] is an unsupervised machine learning technique: an artificial neural network trained to recreate its input. It takes a set of unlabeled inputs, encodes them, and tries to extract the most valuable information from them. Traditionally, dimensionality reduction has relied on linear methods such as Principal Component Analysis (PCA), which finds the directions of maximum variance in high-dimensional data. However, the linearity of PCA significantly limits the kinds of features that can be extracted. The AE overcomes these limitations by exploiting the inherent non-linearity of neural networks. An auto-encoder consists of three components: the encoder, the code and the decoder. The encoder creates one or more hidden layers that contain a code describing the input. The decoder then reconstructs the input using this code only. The code plays an important role during training: it forces the auto-encoder to select the most important features for the compressed representation.

### II-A Vanilla auto-encoder and Multilayer auto-encoder

To build an auto-encoder [GaleoneAutoencoder] [Ng], we need three things: an encoding method, a decoding method and a loss function to compare the output with the target. We explore them in the following subsections.

#### II-A1 Architecture

The encoder is a function $f$ that maps an input $x$ to the hidden representation $h$. It has the form $h = f(x) = s_f(Wx + b_h)$, where $s_f$ is a nonlinear activation function, typically the logistic sigmoid $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$. The encoder is parameterized by a weight matrix $W$ and a bias vector $b_h$. The decoder is a function $g$ that maps the hidden representation $h$ back to a reconstruction $y$: $y = g(h) = s_g(W'h + b_y)$, where $s_g$ is an activation function. The decoder parameters are the weight matrix $W'$ and a bias vector $b_y$. In its simplest form, an auto-encoder is a two-layer network, i.e. a fully connected feed-forward neural network with a single hidden layer. The architecture of the vanilla auto-encoder is shown in Figure 1: the input and output layers have the same number of neurons, and the hidden layer is smaller than both. The hidden layer is a compressed representation, and we learn two sets of weights and biases, one that encodes the input data into the compressed representation and one that decodes the compressed representation back into the input space.

A natural extension is to go beyond a single hidden layer. This multilayer auto-encoder is obtained by keeping the dimensionality of the input and output layers the same while increasing the number of hidden layers.
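The encoder and decoder mappings of the vanilla (single hidden layer) auto-encoder can be sketched in a few lines of plain Python. This is only an illustrative forward pass with untrained random weights; the names `W`, `b_h`, `W_prime`, `b_y`, the layer sizes and the sigmoid decoder activation are assumptions made for the example, not values taken from the experiments.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix W (list of rows) and vectors v, b."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def encode(x, W, b_h):
    """Encoder f: input x -> hidden code h = sigmoid(W x + b_h)."""
    return [sigmoid(a) for a in affine(W, x, b_h)]


def decode(h, W_prime, b_y):
    """Decoder g: hidden code h -> reconstruction y = sigmoid(W' h + b_y)."""
    return [sigmoid(a) for a in affine(W_prime, h, b_y)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for weights learned by training."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_hidden = 8, 3                          # hidden layer smaller than input
W = random_matrix(d_hidden, d_in, rng)         # encoder weights (untrained)
b_h = [0.0] * d_hidden
W_prime = random_matrix(d_in, d_hidden, rng)   # decoder weights (untrained)
b_y = [0.0] * d_in

x = [rng.random() for _ in range(d_in)]        # toy input in [0, 1]
h = encode(x, W, b_h)                          # compressed representation
y = decode(h, W_prime, b_y)                    # reconstruction, same size as x
print(len(x), len(h), len(y))
```

A multilayer variant would chain several such `encode` steps before the code and several `decode` steps after it, keeping the input and output sizes equal.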

##### II-A1a Loss function

An auto-encoder tries to learn an approximation of the identity function, so that the reconstruction $y$ is similar to the input $x$. The loss function is either the mean squared error (*MSE*) or the binary cross entropy.
If the input values are in the range $[0, 1]$, we typically use the cross entropy loss:

$$L(x, y) = -\sum_{i=1}^{d} \left[ x_i \log y_i + (1 - x_i) \log (1 - y_i) \right] \tag{1}$$

Otherwise, we use a simple mean squared loss:

$$L(x, y) = \frac{1}{d} \sum_{i=1}^{d} (x_i - y_i)^2 \tag{2}$$

where $d$ is the number of input neurons.

Our goal is to minimize this loss function. This error represents how close our reconstruction is to the true input data. We don’t expect a perfect reconstruction because the number of hidden neurons is less than the number of input neurons, but we want the parameters to give us the best possible reconstruction.
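As a small worked illustration of the two reconstruction losses, the sketch below evaluates both on a toy input and on two candidate reconstructions. The function names and the clipping constant `eps` are assumptions introduced for the example.

```python
import math


def cross_entropy_loss(x, y, eps=1e-12):
    """Binary cross entropy between input x and reconstruction y, both in [0, 1]."""
    total = 0.0
    for x_i, y_i in zip(x, y):
        y_i = min(max(y_i, eps), 1.0 - eps)   # keep log() away from 0
        total -= x_i * math.log(y_i) + (1.0 - x_i) * math.log(1.0 - y_i)
    return total


def mean_squared_error(x, y):
    """Mean squared error between input x and reconstruction y."""
    return sum((x_i - y_i) ** 2 for x_i, y_i in zip(x, y)) / len(x)


x = [0.0, 1.0, 1.0, 0.0]          # toy input
good = [0.1, 0.9, 0.8, 0.2]       # reconstruction close to x
bad = [0.6, 0.4, 0.5, 0.7]        # reconstruction far from x

print(cross_entropy_loss(x, good), cross_entropy_loss(x, bad))
print(mean_squared_error(x, good), mean_squared_error(x, bad))
```

Both losses are smaller for the closer reconstruction, which is the property that training exploits.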

### II-B Sparse auto-encoder

In the vanilla auto-encoder, we assume that the number of hidden units is small. But even when the number of hidden units is greater than the number of input units, we can still discover an interesting representation of the input data. To achieve this, most of the hidden neurons should produce only a weak activation for a given input [Ng]. In other words, the average activation of each hidden unit should be small (close to 0 for a sigmoid activation function, or close to -1 for a tanh activation function). Under this sparsity constraint, the auto-encoder still discovers interesting structure in the data: for a given instance, only a small, informative set of units is activated, so that a more discriminative representation can be captured.

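The average activation of hidden unit $j$ over the $m$ training examples is $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j(x^{(i)})$, where $a_j(x^{(i)})$ denotes the activation of this hidden unit for the input $x^{(i)}$. The constraint imposed is $\hat{\rho}_j = \rho$, where $\rho$ is the *"sparsity parameter"*, typically a small value close to zero.
To achieve this, we add a penalty term to the optimization objective that penalizes $\hat{\rho}_j$ for deviating significantly from $\rho$: $\sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $s$ is the number of units in the hidden layer, $j$ is the index of the hidden unit in the network, and the KL divergence is a standard function for measuring the difference between two distributions: $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$.
If $\hat{\rho}_j = \rho$, this penalty has the property that $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$. Otherwise, it increases monotonically as $\hat{\rho}_j$ diverges from $\rho$.

The overall cost function now becomes:

$$J_{\text{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \tag{3}$$

where $J(W, b)$ is the reconstruction cost defined for the vanilla auto-encoder and $\beta$ controls the weight of the sparsity penalty term.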
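The sparsity penalty can be computed directly from a batch of hidden activations. The following is a minimal sketch that assumes sigmoid activations in (0, 1); the activation values, the sparsity parameter `rho = 0.05` and the weight `beta = 3.0` are hypothetical numbers chosen only for illustration.

```python
import math


def average_activations(activations):
    """Average activation of each hidden unit over all training examples.

    activations[i][j] is the activation of hidden unit j for example i.
    """
    m = len(activations)
    n_hidden = len(activations[0])
    return [sum(row[j] for row in activations) / m for j in range(n_hidden)]


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(activations, rho):
    """Sum of KL(rho || rho_hat_j) over all hidden units j."""
    return sum(kl_bernoulli(rho, rho_hat)
               for rho_hat in average_activations(activations))


rho = 0.05                                    # hypothetical sparsity parameter
beta = 3.0                                    # hypothetical penalty weight
sparse_acts = [[0.05, 0.04], [0.05, 0.06]]    # unit averages close to rho
dense_acts = [[0.6, 0.7], [0.8, 0.5]]         # unit averages far from rho

print(beta * sparsity_penalty(sparse_acts, rho))
print(beta * sparsity_penalty(dense_acts, rho))
```

The penalty is essentially zero when the average activations match `rho` and grows as they move away from it, which is what pushes most hidden units towards weak activation.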

### II-C Convolutional auto-encoder

So far, the inputs of the auto-encoders have been images. It is therefore natural to ask whether a convolutional architecture can work better than the fully connected auto-encoder architectures discussed previously. Instead of fully connected layers, we use convolution and pooling layers to reduce the input to a coded representation [convolutionalAE-Galeone]. Recall that the auto-encoder consists of two parts: encoding and decoding. For encoding, we use a traditional convolutional neural network, in which the main mechanism for reducing information is the *max-pooling* layer. To bring the encoded representation back to the shape of the input, a simple operation is used to increase the spatial size of the representation: the *un-pooling* layer, the counterpart of *max-pooling*. This layer corresponds to the inverse of the max-pooling operation under certain simplifying conditions. *Un-pooling* is performed by simply replacing each entry of a feature map with a block that has the input value in the top left corner and zeros elsewhere.
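The un-pooling rule described above is simple enough to write out explicitly. The sketch below pairs it with a non-overlapping max-pooling on a toy feature map; the block size of 2 and the assumption that the map dimensions are multiples of the block size are choices made for the example.

```python
def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling over size x size blocks.

    Assumes both dimensions of feature_map are multiples of size.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, cols, size)]
            for r in range(0, rows, size)]


def un_pool(feature_map, size=2):
    """Replace each entry by a size x size block: value top left, zeros elsewhere."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = [[0] * (cols * size) for _ in range(rows * size)]
    for r in range(rows):
        for c in range(cols):
            out[r * size][c * size] = feature_map[r][c]
    return out


image = [[1, 3, 2, 0],
         [4, 2, 1, 1],
         [0, 1, 5, 6],
         [2, 2, 7, 8]]
pooled = max_pool(image)      # [[4, 2], [2, 8]]
restored = un_pool(pooled)    # 4 x 4 map with the same shape as the input
for row in restored:
    print(row)
```

The restored map has the spatial size of the input but not its content, so the decoder's convolutions still have to learn how to fill in the zeros.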

### II-D De-Noising auto-encoder

A de-noising auto-encoder is an extension of the convolutional auto-encoder. Suppose we have an input image with noise (such noisy images are common in real-world scenarios). For a de-noising auto-encoder [denoisingAE_2], the model is identical to the convolutional auto-encoder. However, the training inputs and the reconstruction targets differ: random Gaussian noise is added to the training inputs, while the targets are the original clean images. The input data is $\tilde{x} = x + \eta \, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\eta$ is the fraction of noise applied to the input images and $\mathcal{N}(0, I)$ is the distribution used to generate the Gaussian noise. This trains the de-noising auto-encoder to produce clean images from the noisy images given as input to the system.
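Building the noisy inputs can be sketched as follows. The function name, the noise level of 0.3 and the clipping of pixel values to [0, 1] are assumptions made for the example; the source does not state whether corrupted pixels are clipped.

```python
import random


def add_gaussian_noise(image, noise_level, rng):
    """Return a noisy copy of image: each pixel plus noise_level * N(0, 1).

    Pixel values are clipped to [0, 1] (an assumption of this sketch).
    """
    return [min(1.0, max(0.0, p + noise_level * rng.gauss(0.0, 1.0)))
            for p in image]


rng = random.Random(42)
clean = [0.0, 0.2, 0.5, 0.8, 1.0]          # flattened toy "image"
noisy = add_gaussian_noise(clean, noise_level=0.3, rng=rng)

# Training pair for a de-noising auto-encoder: noisy input, clean target.
training_input, training_target = noisy, clean
print(training_input)
print(training_target)
```

The reconstruction loss is then computed between the network output for `training_input` and the clean `training_target`.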

### II-E Contractive auto-encoder - *CAE*

The aim of a contractive auto-encoder is to make the learned representation robust to small changes around its training examples. The contractive auto-encoder [Rifai2011] is a special form of regularized auto-encoder that is trained to minimize the following regularized reconstruction error:

$$\mathcal{J}_{CAE}(\theta) = \sum_{x \in D} \left( L(x, g(f(x))) + \lambda \, \| J_f(x) \|_F^2 \right) \tag{4}$$

where $J_f(x) = \frac{\partial f(x)}{\partial x}$ is the Jacobian of the encoder, $D$ represents the complete training dataset, $\| \cdot \|_F$ is the Frobenius norm and $\lambda$ is a positive parameter that controls the regularization. Note that the success of the minimization of the CAE criterion strongly depends on the parameter sharing between $f$ and $g$, and in particular on the tied weight constraint used, with $h = f(x) = \mathrm{sigmoid}(Wx + b_h)$ and $y = g(h) = \mathrm{sigmoid}(W^T h + b_y)$. Here $h$ represents the hidden/encoded representation obtained from the given input $x$, and $y$ represents the generated output obtained from $h$. The above regularization term forces $f$ (as well as $g$, because of the tied weights) to be contractive, that is, to have singular values lower than $1$. Higher values of $\lambda$ give more contraction (smaller singular values), chiefly in the local directions where there is little or no variation in the data, since contraction there hurts the reconstruction error the least.

### II-F Variational auto-encoder - *VAE*

In the language of neural networks, a variational auto-encoder [Doersch2016] consists of a probabilistic encoder $q_\theta(z \mid x)$, a generative decoder $p_\phi(x \mid z)$ and a loss function. Here $z$ represents the hidden/encoded representation obtained from the given input $x$, and $\tilde{x}$ represents the generated output obtained from $z$. The weights and biases of the encoder are denoted by $\theta$ and those of the decoder by $\phi$.
In the decoding process, information is lost because the decoder passes from a smaller to a larger dimension. The amount of information lost is measured using the reconstruction *log-likelihood* $\log p_\phi(x \mid z)$. This measure signifies how effectively the decoder has learned to reconstruct an input image $x$ given its latent representation $z$.
The *loss function* of the variational auto-encoder is the negative log-likelihood with a regularizer. It decomposes into terms that each depend on a single data point $x_i$, so the total loss is $\sum_{i=1}^{N} l_i$ for $N$ data points. The loss for data point $x_i$ is

$$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)} \left[ \log p_\phi(x_i \mid z) \right] + \mathrm{KL}\left( q_\theta(z \mid x_i) \,\|\, p(z) \right) \tag{5}$$

The first term in Equation 5 is the reconstruction loss, or expected negative log-likelihood, of the data point $x_i$, and the second term is the *Kullback-Leibler* divergence between the encoder’s distribution $q_\theta(z \mid x_i)$ and the prior $p(z)$, which measures the information lost (in units of nats) when using $q$ to represent $p$.
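To make the two terms of the per-data-point loss concrete, the sketch below evaluates them under common modelling assumptions that the text above does not spell out: a Gaussian encoder with diagonal covariance, a standard normal prior (for which the KL term has a closed form) and a Bernoulli decoder. The encoder and decoder outputs are hypothetical numbers, not the result of a trained network.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def bernoulli_nll(x, x_hat, eps=1e-12):
    """Negative log-likelihood of x under Bernoulli decoder outputs x_hat."""
    total = 0.0
    for x_i, p_i in zip(x, x_hat):
        p_i = min(max(p_i, eps), 1.0 - eps)   # keep log() away from 0
        total -= x_i * math.log(p_i) + (1.0 - x_i) * math.log(1.0 - p_i)
    return total


def vae_loss(x, x_hat, mu, log_var):
    """Single-sample estimate of the loss: reconstruction term + KL term."""
    return bernoulli_nll(x, x_hat) + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5, -0.3], [0.0, -1.0]        # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)           # latent code fed to the decoder
x, x_hat = [1.0, 0.0, 1.0], [0.9, 0.2, 0.7]   # input and hypothetical decoder output
print(z)
print(vae_loss(x, x_hat, mu, log_var))
```

The KL term vanishes only when the encoder's distribution equals the prior, so it acts as the regularizer while the first term rewards accurate reconstruction.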

### II-G Conditional Variational auto-encoder - *CVAE*

The conditional variational auto-encoder is an extension of the variational auto-encoder. The VAE formulates the problem of data generation as a Bayesian model, which is learned by optimizing its variational lower bound. However, we have no control over the data generation process of the VAE. This is problematic if we want to generate specific data, which is why the CVAE was developed. While the VAE models the latent variables and the data directly, the CVAE models the latent variables and the data both conditioned on some random variables [conditionalVariationalAE]. For the CVAE, the encoder is now conditioned on two variables, $x$ and $c$: the encoder becomes $q_\theta(z \mid x, c)$ and the decoder becomes $p_\phi(x \mid z, c)$. The variational lower bound objective therefore becomes:

$$\log p(x \mid c) - \mathrm{KL}\left( q_\theta(z \mid x, c) \,\|\, p(z \mid x, c) \right) = \mathbb{E}_{z \sim q_\theta(z \mid x, c)} \left[ \log p_\phi(x \mid z, c) \right] - \mathrm{KL}\left( q_\theta(z \mid x, c) \,\|\, p(z \mid c) \right) \tag{6}$$

We have simply conditioned all the distributions on a variable $c$. The latent variable is now distributed under $p(z \mid c)$.
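One common way to condition the encoder and the decoder on $c$ is to concatenate a one-hot encoding of the class label to their inputs. The source does not say how the conditioning is implemented, so the sketch below is only an assumption; the function names and the toy dimensions are likewise illustrative.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector c."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def encoder_input(x, c):
    """Input of the conditional encoder: image x concatenated with c."""
    return list(x) + list(c)


def decoder_input(z, c):
    """Input of the conditional decoder: latent code z concatenated with c."""
    return list(z) + list(c)


x = [0.0, 0.5, 1.0, 0.5]          # flattened toy image
z = [0.3, -1.2]                   # toy latent code
c = one_hot(7, num_classes=10)    # request an image of class 7

print(len(encoder_input(x, c)))   # 4 + 10 = 14
print(len(decoder_input(z, c)))   # 2 + 10 = 12
```

At generation time only `decoder_input` is needed: a latent code is sampled and paired with the label of the class to be generated.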

## III Variations of *GANs*

### III-A Generative Adversarial Nets

#### III-A1 Architecture

The GAN [Goodfellow] estimates a generative model via an adversarial process in which two models are trained simultaneously. The generator *G* creates samples that are intended to come from the same distribution as the training data. The discriminator *D* is trained with traditional supervised learning techniques to divide its inputs into two classes (real or fake). The architecture of GANs is shown in Figure 2.

#### III-A2 The learning process

For the generator, we start by sampling a vector $z$ from the prior distribution $p_z$. The generator function $G(z; \theta^{(G)})$ is applied to the input vector $z$. It is a differentiable function whose parameters $\theta^{(G)}$ can be learned with gradient descent. The discriminator, also a differentiable function, plays the opposite role: it is fed generated images and training images at the same time, and it is likewise learned by gradient descent. The goal of the generator is to generate images that look like the real ones, whereas the goal of the discriminator is to distinguish the real images from the generated ones.

#### III-A3 The loss function

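*D* takes $x$ as input and uses $\theta^{(D)}$ as parameters, whereas *G* takes $z$ as input and uses $\theta^{(G)}$ as parameters. Both players have loss functions that are defined in terms of both players’ parameters.
The discriminator wants to minimize $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ and must do so while controlling only $\theta^{(D)}$.
The generator wants to minimize $J^{(G)}(\theta^{(D)}, \theta^{(G)})$ and must do so while controlling only $\theta^{(G)}$.
The loss function of the discriminator is:

$$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2} \mathbb{E}_{x \sim p_{data}} \left[ \log D(x) \right] - \frac{1}{2} \mathbb{E}_{z \sim p_z} \left[ \log \left( 1 - D(G(z)) \right) \right] \tag{7}$$

This is just the standard cross entropy cost that is minimized when training a standard binary classifier with a sigmoid output.

In the zero-sum version of the game, $J^{(G)} = -J^{(D)}$. Because $J^{(G)}$ is directly tied to $J^{(D)}$, we can summarize the whole game with a value function specifying the payoff of the discriminator:

$$V(\theta^{(D)}, \theta^{(G)}) = -J^{(D)}(\theta^{(D)}, \theta^{(G)}) \tag{8}$$

Zero-sum games are also called minimax games because their solution involves minimization in an outer loop and maximization in an inner loop:

$$\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)}) \tag{9}$$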
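The discriminator loss and the value function of the zero-sum game can be estimated from a batch of discriminator outputs. The sketch below assumes the standard cross entropy form with real samples labeled 1 and generated samples labeled 0; the function names and the toy output values are illustrative only.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross entropy loss of the discriminator.

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term


def value_function(d_real, d_fake):
    """Payoff of the discriminator: the negative of its loss."""
    return -discriminator_loss(d_real, d_fake)


def generator_loss(d_real, d_fake):
    """Generator loss in the zero-sum game: the negative of the discriminator loss."""
    return -discriminator_loss(d_real, d_fake)


confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])   # discriminator is right
fooled = discriminator_loss([0.5, 0.5], [0.5, 0.5])      # cannot tell them apart
print(confident, fooled, math.log(2.0))
```

When the discriminator outputs 0.5 everywhere, its loss equals $\log 2$, the value reached when generated and real samples are indistinguishable.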

### III-B *Conditional Generative Adversarial Networks - CGANs*

M. Mirza and S. Osindero [Mirza2014a] extend the GAN model by conditioning both networks $G$ and $D$ on an additional parameter $y$, which can be any type of auxiliary information, such as class labels or data from other modalities. In this context, the value function becomes:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x \mid y) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z \mid y)) \right) \right] \tag{10}$$

CGANs are interesting for two reasons. First, they learn how to use the additional information and are therefore able to generate better samples. Second, they give us a way of controlling the image representation. For example, in the case of face generation with GANs, all the information is encoded by $z$. With CGANs, once the conditional information $y$ is added, $z$ and $y$ encode different information, and $y$ can describe attributes such as hair color, skin color or gender.

### III-C *Deep Convolutional Generative Adversarial Networks - DCGAN*

A. Radford et al. [Radford2015b] present an architecturally constrained variant of the convolutional GAN, the DCGAN. To build a DCGAN, two deep convolutional neural networks are used. The first network has a deep architecture that looks at an image and processes it through several layers to recognize increasingly complex features in the image, whereas the second network learns to create fake images. The authors propose the following modifications to the GAN architecture. All *pooling* layers are replaced with strided convolutions (in the discriminator) and fractional-strided convolutions (in the generator). Batch normalization is used in the generator (all layers except the output layer) and in the discriminator (all layers except the input layer). The Leaky ReLU activation is used in all layers of the discriminator, and the ReLU activation is used in all layers of the generator (except the output layer, which uses the tanh activation function).

### III-D *Wasserstein Generative Adversarial Networks - WGAN*

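In the GAN training procedure, two models, each updating its own cost independently, are trained simultaneously to find an equilibrium of a two-player non-cooperative game. It is therefore unclear when to stop training, because convergence is not guaranteed.
Classical GANs minimize the *Jensen-Shannon* divergence, which is constant, and thus provides no useful gradient, when the real and generated distributions do not overlap (which is the usual case). Thus, instead of minimizing the *Jensen-Shannon* divergence, we can use the *Wasserstein* distance ($W$).
WGAN [Arjovsky2017] adds some tricks that allow the discriminator to approximate the *Wasserstein* distance between the real distribution and the model distribution. The authors propose to approximate $W$ with a set of *K-Lipschitz* functions $\{f_w\}_{w \in \mathcal{W}}$ by solving the following problem:

$$\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r} \left[ f_w(x) \right] - \mathbb{E}_{z \sim p(z)} \left[ f_w(g_\theta(z)) \right] \tag{11}$$

The Wasserstein distance is also called the Earth Mover’s distance (EMD), which is defined by Equation 12:

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \| x - y \| \right] \tag{12}$$

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ is the set of all joint distributions whose marginals are the real distribution $\mathbb{P}_r$ and the model distribution $\mathbb{P}_g$.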

The authors argue that, compared to the vanilla GAN, WGAN has the following advantages. Meaningful loss metric: the loss of D correlates well with the quality of the generated samples, so the training process needs less monitoring. Improved stability: when D is trained to optimality, it provides a useful loss for training G. This means that the training of D and G does not have to be balanced in the number of samples (as it must be in the vanilla GAN approach).
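The behaviour of the Earth Mover's distance on non-overlapping distributions can be checked on a one-dimensional toy case. For two empirical distributions with the same number of samples on the real line, the optimal transport plan matches the sorted samples, so the distance is the mean absolute difference between the sorted lists. The function name and the sample values below are assumptions made for the example; WGAN itself estimates the distance with a trained critic rather than by sorting.

```python
def wasserstein_1d(samples_a, samples_b):
    """Earth Mover's distance between two equal-size 1-D empirical distributions.

    In one dimension the optimal transport plan pairs the sorted samples,
    so the distance is the mean absolute difference between the sorted lists.
    """
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equally many samples")
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)


real = [0.0, 0.1, 0.2, 0.3]
for shift in (5.0, 2.0, 0.5, 0.0):
    generated = [v + shift for v in real]   # supports are disjoint for large shifts
    print(shift, wasserstein_1d(real, generated))
```

The distance shrinks steadily as the generated samples move towards the real ones, even while the two supports are still disjoint, which is why it gives the generator a usable training signal.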

### III-E Adversarial Auto-encoder - *AAE*

One of the main disadvantages of variational auto-encoders is that the integral of the *KL-divergence* term has no closed-form analytical solution except for a handful of distributions. Moreover, it is not easy to use discrete distributions for the latent code, because back-propagation through discrete variables is generally not possible, which makes the model difficult to train effectively. The AAE [Makhzani2015a] was introduced to address these issues in the context of the VAE.
AAE avoids the *KL-divergence* altogether by using adversarial learning. In this architecture, an additional network is trained to discriminate whether a sample comes from the hidden code of the auto-encoder or from a prior distribution chosen by the user.

Figure 3 shows schematically how AAE works when a Gaussian prior is used for the latent code (although the approach is generic and can use any distribution). The top row is equivalent to a VAE. First, a sample $z$ is drawn from the generator network $q(z \mid x)$; this sample is then sent to the decoder, which generates $x'$ from $z$. The reconstruction loss is computed between $x$ and $x'$, and the gradient is back-propagated through $p$ and $q$ accordingly.

## IV Dataset and Experimental Protocol

### IV-A Dataset

In this work, we use the following 2 datasets.

#### IV-A1 MNIST Dataset

This dataset consists of handwritten digit (0-9) images: 60,000 images for training and 10,000 images for testing. The handwritten digits are size-normalized and centered in a fixed-size image so as to fit into a $28 \times 28$ pixel space in binary format.

#### IV-A2 BALI Dataset

This dataset of Balinese palm leaf manuscript images comes from Bali, Indonesia. The sample images are randomly selected from 23 different collections (contents), with a total of 393 pages. The isolated character dataset is formed by segmenting all the patch images and annotating them at the character level. Only the character classes that contain a sufficient number of samples are retained. From each retained class, a fixed number of images is used for training and the remaining images are used for testing.

### IV-B Experimental Protocol

Each of the aforementioned models is tested several times with different parameters, such as the learning rate, batch size and number of epochs, and the best results obtained for each model are presented here. The batch size and the learning rate are the same for all the models, but the number of epochs varies: one setting is used for AE, CAE, convolutional AE, de-noising AE, AAE, GAN, cGAN and dcGAN, and another for SPAE, VAE, CVAE and wGAN. For models such as AE, CAE, SPAE, convolutional AE and de-noising AE, we use the testing images as input to test the model performance. For the remaining models, we do not use the testing database because these models take a sample from the prior distribution as input. However, for models such as cGAN and CVAE, image labels are also provided as input in order to generate the desired output class. The images generated by the different models are shown in Table II.

To evaluate the performance of the system, the generated images are recognized with a character recognition algorithm. Recognition is applied only to the models that can generate images of a predefined class, i.e. AE, CAE, SPAE, convolutional AE, de-noising AE, CVAE and cGAN. However, simple visual inspection shows that the quality of the images generated by the convolutional auto-encoder and de-noising auto-encoder models is not good. We therefore apply the character recognition technique (defined below) only to the following models: AE, CAE, SPAE, CVAE and cGAN.

#### IV-B1 Brief Description of Recognition Technique

We use a *Convolutional Neural Network* (CNN) based recognition system to recognize the generated images. For the MNIST database, the network consists of a first convolutional layer followed by a *Max-Pooling* layer, and a second convolutional layer followed by another *Max-Pooling* layer. The resulting reduced feature map is then fed into a fully connected neural network that classifies the images into 10 classes.

For the BALI database, a deeper CNN architecture is used because of the poor image quality of this database. It consists of three convolutional layers, each followed by a *Max-Pooling* layer. The reduced feature vector is then fed into a fully connected neural network that classifies the images into the character classes.

Model name | MNIST | BALI |
---|---|---|
AE | 89.98/97.25 | 40.01/60.01 |
CAE | 96.79/97.25 | 47.5/60.01 |
SPAE | 95.92/97.25 | 54.99/60.01 |
CVAE | 92.40/97.25 | 35.01/60.01 |
cGAN | 87.52/97.25 | 29.97/60.01 |

[Table II: images generated by the different models, panels (a) to (m)]

## V Results and Discussion

The recognition results are shown in Table I. CAE, SPAE and CVAE perform better on the MNIST dataset, whereas SPAE performs best on the BALI dataset. On the BALI dataset, most of the auto-encoder based models perform better than the GAN based models. This is because auto-encoder based models reconstruct the original images from their hidden representations, while GAN based models try to generate the images from the prior distribution. This is why GANs need more training images than auto-encoders. Moreover, the images in the BALI database are degraded and noisy, so it is difficult for generative models to reconstruct them. Among all the auto-encoders [Rifai2011], the SPAE model gives the best results on the BALI dataset because this model adds an additional penalty term to the optimization function. This term allows the SPAE to learn a representation that is robust to small changes around its training examples. The convolutional neural network needs many samples for each class, as are available in the MNIST dataset, and therefore does not work well on the BALI dataset. Among the GAN based models, the wGAN works better than the others because it resolves the convergence problem of classical GANs during the learning process by using the Wasserstein distance.
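The ranking discussed above can be read off the results table with a few lines of code. The numbers below are copied from that table; reading each entry "a/b" as the accuracy obtained on generated images followed by a reference accuracy is an assumption of this sketch, since the table itself does not label the two values.

```python
# Recognition accuracies (%) copied from the results table.
# Assumption: in "a/b", a is the accuracy on generated images and
# b is a reference accuracy shared by all models on that dataset.
results = {
    "AE":   {"MNIST": (89.98, 97.25), "BALI": (40.01, 60.01)},
    "CAE":  {"MNIST": (96.79, 97.25), "BALI": (47.5, 60.01)},
    "SPAE": {"MNIST": (95.92, 97.25), "BALI": (54.99, 60.01)},
    "CVAE": {"MNIST": (92.40, 97.25), "BALI": (35.01, 60.01)},
    "cGAN": {"MNIST": (87.52, 97.25), "BALI": (29.97, 60.01)},
}


def ranking(dataset):
    """Models sorted by the first accuracy value, best first."""
    return sorted(results, key=lambda m: results[m][dataset][0], reverse=True)


def gap(model, dataset):
    """Second value minus first value, in percentage points."""
    first, second = results[model][dataset]
    return round(second - first, 2)


for dataset in ("MNIST", "BALI"):
    print(dataset, [(m, gap(m, dataset)) for m in ranking(dataset)])
```

On both datasets the cGAN entry has the largest gap, and the auto-encoder variants with an added regularization term (CAE and SPAE) have the smallest.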

## VI Conclusion and Future Work

In this article, we tested various models of two kinds of generative model (GAN and auto-encoder) on two datasets: MNIST and BALI. The experimental evaluation shows that certain models work well on the BALI and MNIST datasets while others do not. In the future, we plan to work on GAN based models to improve performance on the BALI dataset by proposing techniques to better define the prior distribution used as input for generating any particular class of images (i.e. improving the cGAN model).
