
Offline Text-Independent Writer Identification based on word level data

by   Vineet Kumar, et al.
iit guwahati

This paper proposes a novel scheme to identify the authorship of a document based on handwritten input word images of an individual. Our approach is text-independent and does not place any restrictions on the size of the input word images under consideration. To begin with, we employ the SIFT algorithm to extract multiple key points at various levels of abstraction (comprising allograph, character, or combination of characters). These key points are then passed through a trained CNN to generate feature maps corresponding to a convolution layer. However, owing to the scale associated with each SIFT key point, the size of the generated feature maps may differ. To alleviate this issue, the histogram of oriented gradients is applied to the feature map to produce a fixed-size representation. Typically, in a CNN, the number of filters in each convolution block increases with the depth of the network. Thus, extracting histogram features for each convolution feature map increases the dimension as well as the computational load. To address this, we use an entropy-based method to learn the weights of the feature maps of a particular CNN layer during the training phase of our algorithm. The efficacy of the proposed system is demonstrated on two publicly available databases, namely CVL and IAM. We empirically show that the results obtained are promising when compared with previous works.



I Introduction

Biometrics refers to the automatic identification or verification of a person using their biological [Xu:2018] or behavioral [Chi:2018] features. Within behavioral biometrics, writer identification and verification have gained popularity as a research topic in recent times owing to their application in forensic analysis [Fernandez:2010], historical document analysis [Bulacu:2007], [He:2014], and security [Faundez:2020].

Writer identification relies on finding the author of a given unknown document from a set of reference documents stored in a database, based on a set of designed features. Based on the relationship between the query and reference documents, writer identification techniques can be divided into text-dependent and text-independent approaches. In the text-dependent approach, the writer needs to reproduce a sample of a given text, based on which the identity of the writer is verified. Signature verification falls in the category of text-dependent systems [Sharma:2017]. The text-independent approach, on the other hand, identifies the writer regardless of the written text; as such, it requires a sufficient number of samples to extract robust features. Thus, the availability of a certain minimum amount of text is crucial for the effectiveness of a text-independent system. On the basis of the handwriting acquisition modality, text-dependent and text-independent approaches can be further divided into offline and online methods. In the online method, spatio-temporal information (in terms of $x$ and $y$ coordinates) along with other features (such as pressure, azimuth, and inclination) is recorded by the data acquisition system and subsequently used to characterize a writer. In the offline method, handwritten data is presented in the form of a scanned image, and the writer identification technique extracts various allographic features of the image to identify the writer. In this paper, we focus on offline text-independent writer identification. On the basis of the available literature [Xiong:2017], offline text-independent writer identification can be divided into the following categories: texture-based approaches, shape-based approaches, and deep learning-based approaches.

Texture-based approaches treat handwriting as a texture and, depending on the features used to extract the texture properties, can be subdivided into frequency-domain features and spatial features. Frequency-domain techniques treat the whole handwriting image as a texture and use its frequency characteristics to extract features. Said et al. [SAID:2000] proposed a multi-channel 2-D Gabor filter-based approach for texture analysis, extracting features from handwritten samples at different frequencies and orientations. Bertolini et al. [Bertolini:2013] used local phase quantization (LPQ) [Heikkila:2009] features extracted from the 2-D short-term Fourier transform (STFT) for writer identification. Spatial feature-based methods treat handwriting as a combination of edges and contours and use their distribution to describe the handwriting characteristics. Djeddi et al. [Djeddi:2013] used run-length features based on the gray level run matrix (GLRM) to describe the handwriting characteristics. Bulacu et al. [Bulacu:2003] proposed edge-based features, namely edge-hinge and edge-direction features, to describe the individuality of handwriting. Hannad et al. [Hannad:2016] divided the handwriting image into small fragments from which a set of features, namely Local Binary Pattern (LBP), Local Ternary Pattern (LTP), and Local Phase Quantization (LPQ), is extracted. These features, represented in the form of a histogram, are used for writer identification. Wu et al. [Wu:2014] used SIFT features together with their scale and orientation information, extracted from the handwriting samples, to characterize a writer. Faraz et al. [Khan:2018] used a combination of SIFT and RootSIFT to construct a set of Gaussian mixture models, namely a similarity GMM (SGMM) and a dissimilarity GMM (DGMM), to capture the intra-class similarity and inter-class dissimilarity between same and different writers of the enrolled database. In addition, other texture features such as oriented basic image features [Newell:2014] have also been employed for writer identification.

In contrast to the texture-based approach, the shape-based approach divides the handwriting into a group of segmented shapes and uses the distribution of the letters/characters in the closed region to characterize the handwriting. Schomaker et al. [Schomaker:2004] constructed a codebook based on connected-component contours to describe the shapes of commonly occurring allographs for a group of enrolled writers, based on which a histogram is constructed for each enrolled writer. Siddiqi et al. [Siddiqi:2010] used a combination of shape and texture features in the form of a codebook and curvature to represent the writer's features. He et al. [Schomaker:2015] proposed a junction detection-based approach for writer identification. In their approach, they used information about the various types of junctions created by a writer to differentiate one set of writers from another. Bennour et al. [Akram:2019] used Harris corner detection [Harris:1988] to extract keypoints in the form of junctions and corners from the writer image. These keypoints are used to construct a codebook, based on which writer identification is performed.

Unlike the texture-based and shape-based approaches discussed so far, which rely on hand-crafted, application-specific features, deep learning-based methods learn features directly from the training data. As such, these methods show better performance by learning data-adaptive features. Christlein et al. [Christlein:2015] used a convolutional neural network (CNN) to learn the hidden layer (the layer before the classification layer) features and encoded them using a Gaussian mixture model (GMM) to classify writers. Nguyen et al. [Nguyen:2018] built an end-to-end CNN to extract handwriting features at the character and sub-character level. These features were then aggregated to form a global representation, based on which classification is performed. Rehman et al. [Rehman:2019], instead of using the fully connected layer of a convolution network to represent writer features, used the output of each individual convolution layer of the trained network as a feature representation; the convolution layer yielding the maximum classification accuracy is then used to represent the writer characteristics. He et al. [He:2020] proposed a word-based writer identification system. They constructed a CNN-based network (called FragNet) to learn writer-specific features by extracting small fragments from input word images.

II Research Framework

Since the introduction of affordable electronic gadgets (such as mobiles, tablets, etc.), people have switched from writing on paper to typing on a keyboard. As such, handwriting is becoming a rare activity. In such a scenario, there is a need for a real-world writer identification system that can recognize writers based on a limited amount of handwritten text. Keeping this in mind, we propose a HOG-based writer identification system that uses a CNN to identify a writer from a word image. Writer identification based on word images is a challenging task because the writer-related information in a word image is limited; to make it feasible, we need to extract multiple informative regions from an input image. Our work is inspired by Discriminatively Trained Part-Based Models [Felzenszwalb:2009], which treat an object as a collection of multiple segments that are trained separately and combined to recognize the object. In a similar way, we treat a word image as a combination of multiple small image segments containing information in the form of patterns, which are trained separately and combined to obtain the identity of a writer.

To extract multiple informative segments from a word image we use the SIFT [Lowe:2004] algorithm, which uses multi-scale analysis to extract keypoints at various levels of abstraction (comprising allograph, character, or combination of characters). The SIFT keypoint detector provides the following advantages over other keypoint detection algorithms:

  • The SIFT algorithm works on variable-size images, so the size of the image need not be fixed beforehand.

  • The size of a keypoint is adjusted based on the scale at which it is detected, removing the need to adjust the window size around a keypoint manually.

  • The SIFT algorithm can extract a large number of keypoints that approximately cover the whole image over a full range of scales and sizes. Having such a large quantity of keypoints makes the job of identification easier.

The detected SIFT keypoints are then passed through a CNN trained on English alphabets to extract features from the keypoints. Based on experimentation, a convolution block is selected and the feature maps corresponding to that block are extracted. The size of the generated feature maps varies with the scale at which the keypoints are extracted. To transform the variable-sized feature maps into a fixed-size representation while retaining the overall information and the spatial relationship between the pixels of the feature map, the HOG [Dalal:2005] feature descriptor is used. A convolution network contains multiple filters in each convolution layer, each of which generates a feature map. Extracting HOG features for every feature map of a convolution layer and training a classifier on each of them increases the dimensionality and adds to the computational load. To overcome this issue, we assign a saliency value to each feature map of a given convolution layer based on its information content and use these values to generate a combined feature map. This reduces the computational cost without compromising the information generated by the convolution layer. To the best of our knowledge, this is the first work that uses an entropy-based approach to weigh the feature maps of a convolution layer. The HOG features extracted from the weighted feature map are then fed to an SVM classifier. The SVM classifier generates a score for each SIFT fragment extracted from the word image; these scores are then combined using average pooling to arrive at a final classification score, based on which the writer identity is established.

II-A Block schematic of our proposal

Fig. 1: Pictorial overview of our proposed system.
Fig. 2: Block diagram representation of the entropy-based weighting block. The weights are learned using entropy during the training phase and correspond to the feature maps of the selected layer in the convolution block.

Fig. 1 presents the overview of our proposed algorithm. The segmented word image is passed as an input to the SIFT keypoint extractor block, which generates a number of fragments of different sizes. These fragments are then passed through the CNN weighting block, which extracts the feature maps of a convolution block and combines them using the weights generated using entropy during the CNN training phase. The weighted feature map is converted into a one-dimensional feature vector using the HOG representation. The HOG features are then passed through the SVM block, which is trained on the enrolled word images of the writers during the training phase. The SVM scores generated by the fragments of a testing word image are sent to the last block, which performs average pooling over the fragment scores to arrive at a final score that establishes the authorship of the word image.

The rest of the paper is organized as follows. In section III we describe in detail the process of keypoint extraction using the SIFT algorithm. This is followed by a detailed explanation of the CNN block used for extracting features from SIFT keypoints in section IV. In section V we explain the HOG operation used to convert a variable-size feature map into a fixed-size representation. In section VI we discuss in detail the methodology used for assigning saliency values to each filter of a selected convolution layer for feature weighting, which forms an important aspect of this paper. In section VII the process of incorporating the weights of the convolution filters to generate a combined feature is discussed along with the classification strategy. This is followed by a brief discussion of the datasets used for evaluating our algorithm in section VIII. In section IX we evaluate the performance of our proposed writer identification algorithm and compare it with prior works. Finally, we summarize our paper in section X.

III SIFT Keypoint generation

Locating important features across images is a common problem in various domains of computer vision. When the images under consideration are similar (in terms of size, scale, orientation, etc.), a simple keypoint detector such as the Harris corner detector can work effectively by extracting a fixed-size region around a detected keypoint. But in cases where the images differ in shape and size, as in the case of handwritten words, we need a scale-invariant feature algorithm to locate keypoints of the different shapes, sizes and orientations generated by a writer. Since its introduction, the SIFT algorithm [Lowe:2004] has gained widespread recognition as a preferred method of keypoint localization and feature extraction in areas such as object detection, object retrieval, image stitching, etc. One of the principal features of this algorithm is its ability to extract distinct and scale-invariant features from an input image. The SIFT algorithm has been widely used in the field of writer identification [Wu:2014, Khan:2018, Christlein:2017] and is divided into four stages. First, an input image is decomposed into a Gaussian pyramid; each level of the pyramid (called an octave) is successively convolved with Gaussian filters of different variance to form sub-levels of Gaussian-blurred images. In the second stage, the blurred images in an octave are used to generate Difference of Gaussians (DoG) images, which are used to find interest points in the image. In the third stage, stable keypoints are selected from the interest points detected in stage two, and the location, scale and orientation of these stable keypoints are computed. In the last stage, the SIFT descriptor is computed based on the location of the stable keypoints.

In this work, we use the SIFT algorithm to obtain the location of keypoints at each level of an octave and extract a region around each keypoint based on the size of the kernel used for blurring the image in the selected octave. In this way, we locate keypoints of variable size and shape at different levels of an octave. Some of the keypoints extracted using the SIFT algorithm on an input word image are shown in Fig. 3.

Fig. 3: Example of keypoint detection using SIFT. (a) Segmented input word image, (b) detected keypoints.
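The region extraction described above can be illustrated with a minimal sketch. It assumes keypoints are already available as hypothetical `(x, y, size)` triples, where `size` stands in for the extent of the blurring kernel at the detected scale, and it simply crops a square window clipped to the image borders. The function name `extract_patch` and the toy image are illustrative and not taken from the paper.

```python
def extract_patch(image, x, y, size):
    """Crop a square window of side ~`size` centred on (x, y), clipped to the image."""
    height, width = len(image), len(image[0])
    half = max(1, int(round(size / 2)))
    top, bottom = max(0, int(y) - half), min(height, int(y) + half)
    left, right = max(0, int(x) - half), min(width, int(x) + half)
    return [row[left:right] for row in image[top:bottom]]


# Toy 6x8 "word image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(8)] for r in range(6)]

# Hypothetical keypoints (x, y, size): a larger size means a coarser scale.
keypoints = [(2, 2, 2), (4, 3, 4), (7, 5, 6)]
patches = [extract_patch(image, x, y, s) for x, y, s in keypoints]

for (x, y, s), patch in zip(keypoints, patches):
    print((x, y, s), "->", len(patch), "x", len(patch[0]))
```

Because the window follows the keypoint scale, the resulting fragments differ in size, which is why a fixed-size descriptor is needed later in the pipeline.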

IV CNN feature extraction

Feature extraction plays an important part in a writer identification system. It helps in extracting local and discriminative information from an image which increases the accuracy of the model. In recent times the use of CNN for extracting features has gained popularity, as it can learn data-adaptive features from training data thus increasing the overall accuracy of the system.

A general CNN architecture consists of several convolution blocks, each having a certain number of filters, with the number of filters increasing with the depth of the network. In a convolution block, each filter extracts a feature from the output of the preceding block. These features are extracted by convolving the learned filter weights over the preceding layer; as such, a particular filter shares its weights across the input layer. The filter weights are learned in the training phase, during which they are tuned continuously to optimize the loss function. The convolution operation of a CNN can be mathematically represented as:

$$x^{l}_{j} = f\Big(\sum_{i \in M_{j}} x^{l-1}_{i} * k^{l}_{ij} + b^{l}_{j}\Big) \quad (1)$$

Here, $x^{l}_{j}$ represents the $j$-th feature map of layer $l$, $f(\cdot)$ represents the non-linear activation function, $M_{j}$ represents a section of the input feature map, $k^{l}_{ij}$ represents the convolution filter kernel, $l$ represents the number of the corresponding convolution layer, and $b^{l}_{j}$ represents the bias term.
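As a small illustration of the convolution operation for a single input map and a single filter, the following sketch slides a kernel over a 2-D list, adds a bias, and applies a ReLU. The choice of ReLU as the activation, the `stride` parameter, and the function name `conv2d_relu` are assumptions made for the example. As in most CNN libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_relu(feature_map, kernel, bias=0.0, stride=1):
    """Valid (no padding) 2-D convolution of one map with one kernel, then ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(feature_map) - kh) // stride + 1
    out_w = (len(feature_map[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias
            for u in range(kh):
                for v in range(kw):
                    acc += feature_map[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(max(0.0, acc))  # ReLU as the non-linearity (assumption)
        out.append(row)
    return out


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
k = [[-1, 0],
     [0, 1]]  # responds to increases along the main diagonal

print(conv2d_relu(x, k))            # 3x3 output
print(conv2d_relu(x, k, stride=2))  # 2x2 output: the stride downsamples the map
```

With `stride=2` the output is spatially downsampled by the convolution itself, which is the role a pooling layer would otherwise play.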

Most CNN-based algorithms [Fiel:2015, Tang:2016, Xing:2016, Sulaiman:2019] use image patches of a fixed predefined size, extracted at the text-line or document level, to train a writer-dependent convolution network for classifying writers. These images contain a combination of several alphabets, which makes it possible to train an end-to-end CNN. Designing such an end-to-end convolution network may not be feasible for images extracted using SIFT keypoints, for the following reasons:

  • SIFT keypoints yield many small image features in the form of small segments extracted from a word sample. These features may share similarities across different writers in a database, so a deep convolution network may not be able to discriminate between them effectively, since the size of the feature map in a CNN decreases with increasing network depth, leading to a loss of information.

  • In the case of a limited number of training samples, it may not be possible to train a convolution network effectively.

  • A convolution network trained on word samples from one database may not be effective in classifying writers from another database [He:2020]; therefore the network may need to be retrained.

Because of the above-mentioned issues, we use a writer-independent auxiliary dataset to train the CNN. Our CNN is trained on the EMNIST dataset [Cohen:2017], which contains handwritten English alphabets and digits, of which we use only the handwritten English alphabets for training. An alphabet contains a mixture of allographs of various shapes and sizes which combine to form a word; training a network on alphabets therefore helps us to extract important writer-independent features from SIFT keypoint fragments. The block diagram of our trained CNN is shown in Fig. 4. In our convolution network implementation, we have replaced the max-pooling operation with a strided convolution operation. In our algorithm, we pass the image of a SIFT keypoint as an input to the trained convolution network and extract the feature maps corresponding to a convolution layer for further processing. The convolution layer used for feature extraction is selected based on experimentation.

Fig. 4: CNN used for training on English characters from the EMNIST dataset [Cohen:2017].

V HOG feature representation

The size of the feature map obtained by passing a SIFT keypoint through a selected convolution block varies across the samples of a writer, due to the variable size of the input SIFT keypoints. As a result, we need to represent the output feature map using a fixed-size feature vector without losing important information. To project the feature map into a fixed-size representation we use the Histogram of Oriented Gradients (HOG) [Dalal:2005] feature representation. In this section we provide the details of our modified HOG representation, which handles variable-sized input feature maps.

Histogram of Oriented Gradients (HOG) [Dalal:2005], is a feature descriptor algorithm widely used in the field of object detection to extract features from an image. This method takes into account the gradient orientation in a localized portion of an image. It differs from other gradient-based feature representations as it uses gradient magnitude as well as orientation information to represent the image feature. This algorithm preserves the spatial information besides being invariant to small geometric transformations which may occur between samples of the same writer.

The process of HOG feature extraction as implemented by us consists of the following steps in sequence:

  • The input image is first divided into cells, with each cell containing a certain number of pixels. The number of pixels per cell varies dynamically with the size of the input image:

    $$c_{r} = \frac{R}{n_{r}}, \qquad c_{c} = \frac{C}{n_{c}} \quad (2)$$

    where $c_{r}$ and $c_{c}$ are the sizes of a cell along the $R$ rows and $C$ columns of the feature map, and $n_{r}$ and $n_{c}$ are the numbers of image grids along the rows and columns respectively.

  • The input image is thus divided into a matrix of size $n_{r} \times n_{c}$, with each entry containing a sub-image of size $c_{r} \times c_{c}$. These sub-images are further grouped into blocks, with each block containing a fixed number of sub-images of size $c_{r} \times c_{c}$.

  • The gradients $G_{x}$ and $G_{y}$ along the $x$ and $y$ directions are calculated for the whole image and then used to extract the gradient magnitude and orientation using equation (3).

    $$|G| = \sqrt{G_{x}^{2} + G_{y}^{2}}, \qquad \theta = \tan^{-1}\left(\frac{G_{y}}{G_{x}}\right) \quad (3)$$

  • The gradient magnitude and orientation calculated in the above step are used to construct a histogram. Its horizontal axis represents the orientation, discretized into a fixed set of $B$ equally spaced bins, where $B$ is the number of bins in the histogram. The bin to be voted for is selected based on the gradient orientation $\theta$, and the vote (the value that goes into the bin) is selected based on the gradient magnitude.

  • A histogram is constructed for each sub-image in a block as explained in the above step. The histograms of a given block are concatenated to form a combined histogram representing that block; these block histograms are then concatenated horizontally and normalized to construct a fixed-size one-dimensional HOG representation of each image.
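The key property of the steps above is that the grid of cells is fixed while the cell size adapts to the input, so maps of any size yield a vector of the same length. The sketch below illustrates only that property. It is a simplified stand-in, not the authors' implementation: it omits the grouping of cells into blocks, assumes unsigned orientations in the range 0 to 180 degrees, uses central differences for the gradients, and applies a single L2 normalization. The function name `hog_fixed_length` and the default grid and bin counts are hypothetical.

```python
import math


def hog_fixed_length(fmap, grid=(2, 2), bins=9):
    """Adaptive-cell HOG sketch: always returns grid[0] * grid[1] * bins values."""
    rows, cols = len(fmap), len(fmap[0])
    n_r, n_c = grid
    hist = [[[0.0] * bins for _ in range(n_c)] for _ in range(n_r)]
    for r in range(rows):
        for c in range(cols):
            # Central differences, with border pixels replicated.
            gx = fmap[r][min(c + 1, cols - 1)] - fmap[r][max(c - 1, 0)]
            gy = fmap[min(r + 1, rows - 1)][c] - fmap[max(r - 1, 0)][c]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees (assumption).
            theta = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(theta / (180.0 / bins)), bins - 1)
            # The cell index scales with the map size, so the grid stays fixed.
            i = min(r * n_r // rows, n_r - 1)
            j = min(c * n_c // cols, n_c - 1)
            hist[i][j][b] += magnitude  # vote weighted by gradient magnitude
    vec = [v for cell_row in hist for cell in cell_row for v in cell]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


small = [[(r * c) % 7 for c in range(7)] for r in range(5)]    # 5 x 7 map
large = [[(r + 2 * c) % 5 for c in range(9)] for r in range(12)]  # 12 x 9 map
print(len(hog_fixed_length(small)), len(hog_fixed_length(large)))
```

Both calls return vectors of the same length even though the two maps differ in size.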


VI Determination of saliency values for convolution filters

The HOG feature representation computed for the SIFT keypoint fragments at the output of a convolution network needs to be stored for processing in subsequent stages. Since a convolution layer contains multiple filters, storing the data for all these filters individually for the purpose of classification is computationally expensive. A solution to this problem consists of either combining the output feature maps obtained from the filters of a convolution layer or combining the HOG feature representations extracted from those feature maps. In either case an average pooling strategy can be used, which assigns equal weight to all the filters of a convolution layer. However, not all filters of a convolution layer are equally informative; some may contribute little to the extracted features and may not improve the overall accuracy. To address this issue, we assign a saliency score to each filter of the convolution layer based on its information content. In this section, we explain in detail our proposed entropy-based method for assigning saliency scores to the filters of the selected convolution layer.

To calculate the saliency score, we select a set of writers $W$ from the database listed in Table I. For each writer in $W$, the HOG feature representation is computed on the feature map of each filter of a convolution layer. The following operations are then performed on the HOG features obtained from the filters of the convolution layer:

  1. Extraction of principal components from HOG feature representation and using it to construct a histogram.

  2. Computation of entropy using histogram generated in the above step to assign a saliency value to the convolution layer filters.

In the following subsections we explain each of the above-mentioned steps in detail.

VI-A Histogram generation using Sparse Principal Component Analysis

The first step in determining the saliency of a filter involves projecting the writers onto a common subspace (as done in the traditional $k$-means clustering algorithm). In our work, we use Sparse PCA [Zou04] to construct an individual subspace for each filter of a convolution layer. This subspace is constructed using a data matrix $X$ containing the HOG entries corresponding to the samples of each writer in $W$. This matrix contains the global information captured by a particular convolution layer filter, independent of the individual writers in $W$. It is used to construct a set of dictionary atoms (representing a subspace), with each atom having a dimension equal to the HOG feature dimension. The process of obtaining the dictionary atoms using Sparse PCA (SPCA) is explained below.

Sparse PCA can be framed as a regression problem involving PCA, with an elastic net penalty used to impose sparsity. In this method, PCA is computed by performing SVD on the data matrix $X$ as

$$X = U D V^{T} \quad (4)$$

$$Z = U D = X V \quad (5)$$

Here, $Z$ represents the principal components (PC) of each observation, and $V$ represents the loadings of the principal components. $Z$ can also be obtained by projecting $X$ on $V$ (i.e. $Z = XV$) using equation (5). This observation helps us in formulating PCA as an optimization problem involving $Z$ and $X$ as:

$$\hat{\beta} = \arg\min_{\beta} \lVert Z_{i} - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{2}^{2} + \lambda_{1} \lVert \beta \rVert_{1}, \qquad \hat{V}_{i} = \frac{\hat{\beta}}{\lVert \hat{\beta} \rVert_{2}} \quad (6)$$

Here, $Z_{i}$ is the target vector and $\beta$ is the corresponding regression coefficient. $\hat{V}_{i}$ is a sparse approximation of the original loading vector $V_{i}$.

We use Sparse PCA, as discussed above, to project the HOG feature vectors of a set of $N$ samples from each writer in $W$ onto the approximated sparse principal components. The coefficients generated by this projection are used to construct a $D$-dimensional histogram. Let $c^{p}_{w,n}$, $n = 1, \dots, N$, be the coefficients corresponding to the $w$-th writer for the $p$-th principal component. These coefficients are used to construct a histogram corresponding to the $p$-th principal component as:

$$h^{p}_{w}(d) = \sum_{n=1}^{N} \delta\big(q(c^{p}_{w,n}) - d\big), \qquad d = 1, \dots, D \quad (7)$$

Here, $q(c^{p}_{w,n})$ is the quantized value of the principal component coefficient, $d$ is the bin index, and $\delta(\cdot)$ equals 1 when its argument is zero and 0 otherwise. The histogram of each principal component corresponding to the samples of each writer in $W$ is obtained using equation (7). Following the histogram generation, each histogram is normalized to the range between 0 and 1 as follows:

$$\bar{h}^{p}_{w}(d) = \frac{h^{p}_{w}(d)}{\sum_{d'=1}^{D} h^{p}_{w}(d')} \quad (8)$$

Here, $D$ is the number of quantized histogram bins.
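The quantization and normalization steps can be sketched as follows. The sketch assumes uniform bins over a fixed range `[lo, hi]` and normalization by the total count, so that the bin values sum to 1; the text does not specify the quantizer, so the range, the clamping of out-of-range values, and the function name `normalized_histogram` are illustrative assumptions.

```python
def normalized_histogram(coeffs, num_bins, lo, hi):
    """Quantize coefficients into `num_bins` uniform bins and normalize to sum to 1."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for c in coeffs:
        d = int((c - lo) / width)
        d = min(max(d, 0), num_bins - 1)  # clamp out-of-range values (assumption)
        counts[d] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_bins
    return [n / total for n in counts]


# Hypothetical projection coefficients of one writer on one principal component.
coeffs = [-0.9, -0.2, 0.1, 0.15, 0.8]
hist = normalized_histogram(coeffs, num_bins=4, lo=-1.0, hi=1.0)
print(hist)
```

Each entry of the result lies between 0 and 1, which is what the subsequent entropy computation requires.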

VI-B Computation of filter saliency based on entropy

The normalized histogram generated for each principal component from the enrolled samples in $W$ is used to compute an entropy value for each individual principal component of a writer as:

$$E^{p}_{w} = -\sum_{d=1}^{D} \bar{h}^{p}_{w}(d) \log \bar{h}^{p}_{w}(d) \quad (9)$$

Here, $\bar{h}^{p}_{w}(d)$ is the normalized histogram value in bin $d$, $D$ is the number of bins, and $E^{p}_{w}$ is the entropy value of the $p$-th principal component corresponding to the samples of the $w$-th writer in $W$. This entropy value gives us information about the probability distribution of the writers relative to each of the principal components. The individual entropy values are summed over the principal components and across all writers to give an overall entropy value for a particular convolution filter as follows:

$$E_{k} = \sum_{w \in W} \sum_{p} E^{p}_{w} \quad (10)$$

Here, $E_{k}$ is the overall entropy value for the $k$-th filter of a convolution layer, with the histograms computed from the feature maps of that filter. This overall entropy value indicates the amount of information contained in a particular filter of a convolution layer. A low entropy value signifies that the information content is concentrated in only a few principal components of the filter, making the other components insignificant. Such a filter should be assigned a low saliency value. On the contrary, a higher entropy value signifies that the information content is evenly distributed across a larger number of principal components, so the corresponding filter should be assigned a higher saliency value. Based on these observations, we assign a saliency value to each filter using its entropy value, with higher entropy giving higher saliency.
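A minimal sketch of the entropy-to-saliency step is given below. It computes the Shannon entropy of each normalized histogram, sums the entropies belonging to one filter, and converts the totals into weights. Two choices here are assumptions rather than details taken from the text: the logarithm is taken to base 2, and the saliency is obtained by dividing each filter's total entropy by the sum over all filters. The function names are hypothetical.

```python
import math


def entropy(hist):
    """Shannon entropy (base 2, an assumption) of a normalized histogram."""
    return -sum(p * math.log2(p) for p in hist if p > 0)


def filter_saliency(filter_histograms):
    """filter_histograms[k] holds all normalized histograms of filter k
    (one per writer and principal component). Returns one weight per filter,
    normalized to sum to 1 (this normalization is an assumption)."""
    totals = [sum(entropy(h) for h in hists) for hists in filter_histograms]
    norm = sum(totals)
    if norm == 0:
        return [1.0 / len(totals)] * len(totals)
    return [t / norm for t in totals]


# Filter A: information spread evenly over the bins (high entropy).
filter_a = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
# Filter B: information concentrated in one or two bins (low entropy).
filter_b = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

weights = filter_saliency([filter_a, filter_b])
print(weights)
```

In this toy case the evenly spread filter receives the larger weight, matching the intuition that higher entropy should map to higher saliency.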


As a visual interpretation of the above explanation, Fig. 5 shows the histograms corresponding to some of the filters of the conv1 and conv2 layers of our convolution network. These histograms correspond to the filters having the maximum and minimum saliency values. To construct the histograms we used 10 word samples from each of 50 writers of the IAM database [IAM].

Fig. 5: (a) and (b) are histograms corresponding to the two filters of conv1 having the highest and second highest saliency values; (c) and (d) are histograms corresponding to the two filters of conv1 having the minimum and second minimum saliency values. Similarly, (e) and (f) are histograms corresponding to the two filters of conv2 having the highest and second highest saliency values; (g) and (h) are histograms corresponding to the two filters of conv2 having the minimum and second minimum saliency values.

VII Writer feature generation and classification

Consider a writer sample containing a number of word samples, each generating a number of SIFT keypoints. When passed through a convolution layer, each SIFT keypoint generates multiple feature maps, depending on the number of filters in the layer. To save computational cost, instead of extracting HOG features from each feature map and concatenating them into a large feature vector, we use a weighting strategy to generate a low-dimensional HOG feature representation. The weighting strategies we consider are conventional average pooling, which assigns equal weight to all feature maps, and entropy-based saliency weighting, which assigns a weight to each feature map of a convolution layer based on its information content, measured using entropy as discussed in section VI. Further, to analyze the effect of the weighting strategy on the overall accuracy of the proposed algorithm, we incorporate feature map weighting either at an early stage (pre-weighting) or at a later stage (post-weighting).

In the pre-weighting approach, the feature maps generated by a selected convolution layer are weighted and combined into a single feature map, and the HOG feature representation is then computed on this combined feature map as explained in Section V. The combined weighted feature map is generated for each SIFT keypoint of a writer from the feature maps of the individual filters of the convolution layer.

In the post-weighting approach, the HOG feature vector generated from each feature map of a convolution layer is weighted individually, and the weighted vectors are combined into a single HOG feature representation for a SIFT keypoint.
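The difference between the two approaches can be sketched in a few lines of Python. The sketch assumes that the combination is a weighted sum and replaces the modified HOG descriptor of Section V with a simplified single-cell gradient-orientation histogram. The function names and the toy feature maps are hypothetical.

```python
import math

def orientation_histogram(fmap, bins=8):
    """Single-cell gradient-orientation histogram (a simplified stand-in for HOG)."""
    h, w = len(fmap), len(fmap[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = fmap[y][x + 1] - fmap[y][x - 1]
            gy = fmap[y + 1][x] - fmap[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            idx = min(int(angle / math.pi * bins), bins - 1)
            hist[idx] += mag
    return hist

def pre_weighting(fmaps, weights, bins=8):
    """Weight and sum the feature maps first, then describe the combined map."""
    h, w = len(fmaps[0]), len(fmaps[0][0])
    combined = [[sum(wk * fm[y][x] for wk, fm in zip(weights, fmaps))
                 for x in range(w)] for y in range(h)]
    return orientation_histogram(combined, bins)

def post_weighting(fmaps, weights, bins=8):
    """Describe each feature map separately, then weight and sum the histograms."""
    hists = [orientation_histogram(fm, bins) for fm in fmaps]
    return [sum(wk * hst[b] for wk, hst in zip(weights, hists)) for b in range(bins)]

# Two toy 4x4 feature maps: a horizontal ramp and a vertical ramp.
horizontal = [[float(x) for x in range(4)] for _ in range(4)]
vertical = [[float(y) for _ in range(4)] for y in range(4)]
weights = [0.5, 0.5]
pre = pre_weighting([horizontal, vertical], weights)
post = post_weighting([horizontal, vertical], weights)
```

In this toy case, pre-weighting merges the two ramps into one diagonal gradient and fills a single histogram bin, whereas post-weighting keeps both orientations in separate bins.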


Following the generation of the HOG feature vector, a one-vs-all SVM with a radial basis function (RBF) kernel [Burges:1998] is trained on the writing samples. The optimal values of the RBF parameters are obtained by grid search.

During the testing phase, the HOG feature vectors generated from the SIFT keypoint fragments of a writer's word image are passed through a set of SVM classifiers to obtain a score for each word fragment. An SVM assigns a positive or negative value to an input sample, measured with respect to the hyperplane separating negative and positive samples. These SVM scores are normalized to the range [0, 1] by passing them through a sigmoid function. The normalized SVM scores are then average pooled for each writer to obtain an overall response over all the segmented fragments of a word image.


Here, the pooling runs over the total number of SIFT fragments generated from an input word image, and each term is the normalized SVM score of one fragment for a given writer. The final prediction for a word sample is the writer label for which the classifier reports the highest confidence score.
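The normalization, pooling, and decision steps described above can be sketched as follows. The raw decision values, the writer names, and the function names are hypothetical. The sketch assumes a plain logistic sigmoid and an arithmetic mean over fragments.

```python
import math

def sigmoid(score):
    """Map a raw SVM decision value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_writer(fragment_scores):
    """fragment_scores[writer] holds one raw SVM score per SIFT fragment."""
    pooled = {
        writer: sum(sigmoid(s) for s in scores) / len(scores)
        for writer, scores in fragment_scores.items()
    }
    best = max(pooled, key=pooled.get)  # writer with the highest pooled score
    return best, pooled

# Hypothetical decision values for three fragments of one word image.
scores = {"writer_A": [1.2, 0.4, -0.1], "writer_B": [-0.8, -1.5, 0.2]}
label, pooled = predict_writer(scores)
```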


VIII Dataset description

The proposed method is evaluated on two datasets: IAM [IAM] and CVL [CVL].

The IAM [IAM] dataset is the most widely used English-language dataset for writer identification. It contains 1593 images of English handwritten documents collected from 657 writers. Each writer contributed a variable number of handwritten documents. Out of the 657 writers, 301 contributed two or more handwritten documents, while the rest contributed only a single document. We modified the IAM dataset as described in [Wu:2014] by randomly selecting two documents for each writer who contributed two or more documents. For a writer who contributed only one document, the document is divided in half: one half is used for training and the other half for testing. As bounding boxes for the word images are provided with the dataset, we collect the word images from the resulting training and testing sets for our experiments.
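A minimal sketch of this splitting protocol is given below. It assumes that each document is available as a list of word-image identifiers, that one of the two randomly selected documents is used for training and the other for testing, and that a single document is halved by word order. None of these details are stated explicitly in the text, and the function name is hypothetical.

```python
import random

def split_writer(documents, rng):
    """documents: list of documents, each a list of word-image identifiers."""
    if len(documents) >= 2:
        # Assumption: one randomly chosen document for training, another for testing.
        train_doc, test_doc = rng.sample(documents, 2)
        return list(train_doc), list(test_doc)
    # Single document: first half for training, second half for testing.
    words = documents[0]
    half = len(words) // 2
    return words[:half], words[half:]

rng = random.Random(0)
train, test = split_writer([["w1", "w2", "w3", "w4"]], rng)
multi_train, multi_test = split_writer([["a1"], ["b1"], ["c1"]], rng)
```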

The CVL [CVL] dataset contains handwritten text documents from 310 writers, of whom 27 contributed 7 text documents each, while the rest contributed 5 text documents each. Each writer contributed one text document in German and the rest in English. In our experiments, we used only the documents written in English. We follow the same methodology as [He:2020] for dividing the dataset into training and testing sets. As with the IAM dataset, segmented words are made available in this dataset.

Table I gives a detailed overview of all the datasets used in our experiments.

| Dataset | Number of writers | Language |
|---|---|---|
| IAM | 657 | English |
| CVL | 310 | English |

TABLE I: Overview of the datasets used in experiments

IX Experiment and Discussion

In this section, we carry out a set of experiments with our proposed algorithm on two benchmark datasets for writer identification, based on word and page images. These experiments help us choose a proper set of parameters for increasing the efficacy of our proposed writer identification system.

IX-A Implementation details

The proposed convolution network is built on the TensorFlow framework. The Adam optimizer is used to optimize the network, with weight decay applied every ten epochs. The model is trained for 50 epochs. Word fragments are normalized before being fed to the convolution network for feature extraction.

To extract the HOG feature vector from a SIFT keypoint as discussed in Section V, we select the HOG parameters for the conv1 and conv2 layers as is done for the SIFT descriptor. Since the size of the feature map in a convolution network decreases as the depth of the network increases, these parameter values are adjusted accordingly in order to keep the spatial consistency of the feature map. In particular, the value used for extracting HOG features from conv3 is adjusted because the size of the conv3 layer is half the size of the conv2 layer.

IX-B Performance with varying bins B and convolution layers

In this subsection, we evaluate the effect of the bin size (B) on the overall performance of our writer identification system on one third of each dataset (105 writers for CVL and 219 for IAM). We vary the bin size (B) from 6 to 12 in steps of two. Table II tabulates the results of our strategy for the two databases. We observe that the best average identification rates for the IAM database with the conv1 and conv2 layers are 91.15 and 90.25, respectively, both achieved for bin size 10. For the CVL database, we achieve the best average accuracies of 82.97 and 80.95, again for bin size 10.

Based on Table II, it can be inferred that the initial layers of the convolution neural network (namely conv1 and conv2) capture most of the important information that helps differentiate one set of writers from another. As we move deeper into the convolution network, the resolution of the image decreases; as a result, the information becomes more abstract and less visually interpretable, resulting in a loss of discriminability among writers (as shown by the conv3 results). In the subsequent experiments, we mainly focus on the conv1 and conv2 layers with the optimal bin size (as reported in Table II).


| Convolution layer | Bin size | IAM Top1 | IAM Top5 | CVL Top1 | CVL Top5 |
|---|---|---|---|---|---|
| Conv1 | 6 | 89.8 | 95.45 | 81 | 92.45 |
| Conv1 | 8 | 90.15 | 95.85 | 82 | 93.68 |
| Conv1 | 10 | 91.15 | 97.15 | 82.97 | 94.59 |
| Conv1 | 12 | 89.85 | 95.65 | 81.21 | 93.25 |
| Conv2 | 6 | 89.4 | 95.5 | 79 | 92.35 |
| Conv2 | 8 | 89.8 | 95.3 | 79.95 | 93.25 |
| Conv2 | 10 | 90.25 | 96.75 | 80.95 | 93.58 |
| Conv2 | 12 | 88.55 | 95.15 | 80.80 | 93.38 |
| Conv3 | 6 | 70 | 88.15 | 47 | 77.85 |
| Conv3 | 8 | 73.35 | 89.81 | 57.8 | 83.25 |
| Conv3 | 10 | 79.97 | 91.77 | 64.45 | 87.9 |
| Conv3 | 12 | 81.88 | 92.07 | 67.15 | 88.55 |

TABLE II: Comparison of average identification rates (in %, for one third of the dataset samples) at word level for the proposed algorithm with different bin sizes. The best identification rate is marked in bold.
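The Top1 and Top5 columns of Table II are identification rates: a sample counts as correct at Top-k when its true writer is among the k highest-scoring candidates. A minimal sketch of this computation is shown below. The ranked candidate lists and writer names are hypothetical, and the function name is our own.

```python
def top_k_rate(ranked_predictions, true_labels, k):
    """Percentage of samples whose true writer is among the k best-scoring candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return 100.0 * hits / len(true_labels)

# Each inner list ranks candidate writers from most to least likely for one sample.
ranked = [["w3", "w1", "w2"], ["w2", "w3", "w1"], ["w1", "w2", "w3"]]
truth = ["w3", "w1", "w2"]
top1 = top_k_rate(ranked, truth, k=1)
top2 = top_k_rate(ranked, truth, k=2)
```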

IX-C Influence of weighting strategy

In this experiment, we consider the effect of different weighting strategies on the overall accuracy of the proposed writer identification system. The weighting strategies considered are mean pooling, pre-weighting (Eq. 12), and post-weighting (Eq. 13).

Based on Table III, we observe that mean pooling performs worst among the three weighting strategies. This is because mean pooling assigns equal weight to all the filters of a convolution layer, so the additional information contained in some of the convolution filters is not utilized properly. The other two weighting strategies (pre and post), on the other hand, use a weighted average that determines in advance the relative importance of each convolution filter output. This provides additional information about the filter outputs, which helps increase the overall accuracy of the system. The experimental results also reveal that, of the two strategies, post-weighting performs better than pre-weighting. This can be attributed to the fact that the outputs of the various filters of a convolution layer are complementary to each other; performing pre-weighting therefore sometimes masks important features in the combined feature map, resulting in a loss of hidden information. Post-weighting, in contrast, weights the HOG features extracted from each feature map of a convolution layer separately, so the hidden information in each feature map is retained and sometimes amplified, which helps improve the overall accuracy of the system.

| Database | Conv1: average weighting | Conv1: pre-weighting | Conv1: post-weighting | Conv2: average weighting | Conv2: pre-weighting | Conv2: post-weighting |
|---|---|---|---|---|---|---|
| IAM | 91.15 | 91.8 | 92.3 | 90.25 | 90.55 | 92 |
| CVL | 82.97 | 83.73 | 85.25 | 80.95 | 81.31 | 84.19 |

TABLE III: Accuracy achieved (in %) on the IAM and CVL datasets (for one third of the samples) using various weighting strategies

IX-D Score Fusion

Let an input word image generate a number of SIFT fragments. When passed through the HOG feature classifiers corresponding to the conv1 and conv2 layers, these SIFT fragments generate two scores, one per layer, each in the interval [0, 1]. These scores are then fused to generate a final classification score.

Here, the fusion weight balances the individual conv1 and conv2 classifiers by taking into account the effectiveness of each classifier in classifying a writer. The weight is data dependent and is determined by performing cross-validation on training data across different datasets.

Once the weight is selected, the final prediction score is calculated for each writer individually, after which the system makes its prediction for the candidate writer.
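The fusion and decision steps can be sketched as follows. The sketch assumes that the fusion is a convex combination controlled by a single weight (called alpha here) and that the writer with the highest fused score is selected. Both the form of the combination and the example scores are illustrative assumptions.

```python
def fuse_scores(conv1_scores, conv2_scores, alpha):
    """Assumed convex combination of the per-writer conv1 and conv2 scores."""
    return {w: alpha * conv1_scores[w] + (1.0 - alpha) * conv2_scores[w]
            for w in conv1_scores}

# Hypothetical pooled scores in [0, 1] for two enrolled writers.
conv1 = {"writer_A": 0.8, "writer_B": 0.5}
conv2 = {"writer_A": 0.4, "writer_B": 0.6}
fused = fuse_scores(conv1, conv2, alpha=0.6)
predicted = max(fused, key=fused.get)  # writer with the highest fused score
```

In practice the weight would be chosen by cross-validation on the training data, as described above. The value 0.6 is only a placeholder.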


IX-E Performance of Writer Identification with varying number of word images

To show the robustness of our proposed system with respect to the amount of handwritten text available, we calculate the Top-1 performance of our system for different numbers of handwritten words on the two databases. In this experiment, we randomly select a fixed number of word images for each enrolled writer. The SIFT fragments extracted from these word samples are passed through the HOG classifiers corresponding to the conv1 and conv2 layers, and the resulting scores are combined to obtain an overall accuracy score. This procedure is repeated 10 times, and the average result for different numbers of words is shown in Fig. 6.
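The repeated random-sampling protocol can be sketched as below. The classifier is a hypothetical stand-in that reads the writer from a label prefix. In the actual system it would be the fused conv1 and conv2 pipeline. The function names and the data layout are assumptions.

```python
import random

def mean_top1(words_by_writer, classify, n_words, repeats=10, seed=0):
    """Average Top-1 rate when only n_words randomly chosen words per writer are used."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        correct = 0
        for writer, words in words_by_writer.items():
            subset = rng.sample(words, min(n_words, len(words)))
            if classify(subset) == writer:
                correct += 1
        rates.append(100.0 * correct / len(words_by_writer))
    return sum(rates) / len(rates)

def toy_classify(subset):
    """Hypothetical stand-in classifier: majority vote over writer prefixes."""
    votes = [w.split(":")[0] for w in subset]
    return max(set(votes), key=votes.count)

data = {"A": ["A:1", "A:2", "A:3"], "B": ["B:1", "B:2", "B:3"]}
rate = mean_top1(data, toy_classify, n_words=2)
```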

From Fig. 6, it can be observed that the performance of the proposed writer identification system increases significantly as the number of available word samples per writer increases from one to two, and it stabilizes once four or more words are available. This result shows the efficacy of our proposed algorithm in recognizing a writer in situations where only a limited amount of writer data is available for testing.

Fig. 6: Performance (Top1) of writer identification using different numbers of words on the IAM and CVL data set.

IX-F Performance comparison of proposed system with different features on word image

In this section, we compare the performance of our proposed system with that of systems based on traditional handcrafted features on word image data. Table IV shows the results of different writer identification systems that take a word image as input. From Table IV, we observe that the performance of systems based on traditional handcrafted features is low on word images. This is because these algorithms make their writer prediction based on statistical information collected from the input image. To generate a stable representation from statistical information, these algorithms require a certain minimum amount of text. In the case of a word image, the statistical information captured is insufficient to generate a stable representation of the input text sample. In contrast, neural network based systems can capture simple as well as complex information from the training data and therefore provide much better results than methods based on traditional handcrafted features. Table IV also indicates that our convolution-based neural network, trained on writer-independent features, performs better than the other end-to-end trained neural networks at identifying general writer characteristics.

| Method | IAM Top1 | IAM Top5 | CVL Top1 | CVL Top5 |
|---|---|---|---|---|
| Hinge [Bulacu:2007] | 13.8 | 28.3 | 13.6 | 29.7 |
| Quill [Brink:2012] | 23.8 | 44.0 | 23.8 | 46.7 |
| CoHinge [Sheng:2017] | 19.4 | 34.1 | 18.2 | 34.2 |
| QuadHinge [Sheng:2017] | 20.9 | 37.4 | 17.8 | 35.5 |
| COLD [sheng:2017C] | 12.3 | 28.3 | 12.4 | 29.0 |
| Chain Code Pairs [Siddiqi:2010] | 12.4 | 27.1 | 13.5 | 30.3 |
| Chain Code Triplets [Siddiqi:2010] | 16.9 | 33.0 | 17.2 | 35.4 |
| WordImgNet [He:2020] | 52.4 | 70.9 | 62.5 | 82.0 |
| FragNet-64 [He:2020] | 72.2 | 88.0 | 79.2 | 93.3 |
| Vertical GR-RNN (FGRR) [He:2021] | 83.3 | 94.0 | 83.5 | 94.6 |
| Horizontal GR-RNN (FGRR) [He:2021] | 82.4 | 93.8 | 82.9 | 94.6 |
| Proposed methodology with post-weighting (conv1+conv2) | 93.55 | 98.15 | 86.26 | 95.65 |

TABLE IV: Comparison of writer identification performance (in %) on the full database for word image samples

IX-G Performance comparison of proposed system with other methods on page image

In this section, we evaluate the performance of our proposed algorithm on page images. To compute the writer identification performance on page images, we calculate the SVM score for each word image contained in a writer's page sample individually, and then aggregate these scores into an overall score for each writer page sample. In this aggregation, the overall writer probability for a page image is computed from the writer probabilities of the individual words contained in the page, taking into account the total number of words in the page image. Tables V and VI show the overall performance of our algorithm on page data compared with other state-of-the-art algorithms for the IAM and CVL datasets, respectively.
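The page-level aggregation can be sketched as follows, assuming that the word-level writer probabilities are combined by an arithmetic mean over the words of the page. The aggregation rule, the probabilities, and the function name are illustrative assumptions.

```python
def page_probabilities(word_probabilities):
    """word_probabilities: one dict per word, mapping writer -> probability."""
    writers = word_probabilities[0].keys()
    n_words = len(word_probabilities)
    return {w: sum(p[w] for p in word_probabilities) / n_words for w in writers}

# Hypothetical word-level writer probabilities for a three-word page.
words = [
    {"writer_A": 0.6, "writer_B": 0.4},
    {"writer_A": 0.3, "writer_B": 0.7},
    {"writer_A": 0.9, "writer_B": 0.1},
]
page = page_probabilities(words)
page_writer = max(page, key=page.get)  # writer with the highest page-level score
```

The second word alone would favor the wrong writer. This illustrates why a page containing many words is easier to attribute than a single word.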

Based on Tables V and VI, it can be observed that the performance of the proposed writer identification algorithm on page images is much better than on word images, owing to the large number of word samples contained in a page. In addition, our word-based writer identification algorithm achieves a higher identification accuracy than other word-based writer identification algorithms (such as WordImagenet, FragNet [He:2020], GR-RNN [He:2021], and [Nguyen:2018]) for page-level data on the IAM dataset. For the CVL [CVL] dataset, our proposed algorithm achieves a Top-1 accuracy of 98.70%, which is comparable to other state-of-the-art algorithms for page-level data. The lower classification accuracy of our proposed system on page-level data for the CVL dataset, compared with other algorithms, arises because the written content in CVL is the same for every writer. As a result, the model is in some cases unable to differentiate between writers with a similar style (as our proposed model is not trained to capture writer-specific features), leading to false writer identification in some cases, as depicted in Fig. 7. On the other hand, the IAM dataset contains written content that mostly differs across writers, as a result of the random selection of writer samples during the training phase described in Section VIII. This results in better generalization of features across different writers, leading to higher accuracy than other algorithms. On the basis of the classification results on the CVL and IAM datasets, it can be inferred that our algorithm learns general writer characteristics better than other writer identification algorithms.

| Reference | Number of writers | Top-1 accuracy |
|---|---|---|
| Siddiqi and Vincent [Siddiqi:2010] | 650 | 91.0 |
| He and Schomaker [Sheng:2017] | 650 | 93.2 |
| Khalifa [Khalifa:2015] | 650 | 92.0 |
| Hadjadji and Chibani [Hadjadji:2018] | 657 | 94.5 |
| Wu [Wu:2014] | 657 | 98.5 |
| Khan [Khan:2018] | 650 | 97.8 |
| Nguyen [Nguyen:2018] | 650 | 93.1 |
| WordImagenet | 657 | 95.8 |
| FragNet-64 [He:2020] | 657 | 96.3 |
| GR-RNN [He:2021] | 657 | 96.4 |
| Proposed methodology with post-weighting | 657 | 98.63 |

TABLE V: Comparison of state-of-the-art methods on the IAM database

| Reference | Number of writers | Top-1 accuracy |
|---|---|---|
| Fiel and Sablatnig [Fiel:2015] | 309 | 97.8 |
| Tang and Wu [Tang:2016] | 310 | 99.7 |
| Christlein [Christlein:2017] | 310 | 99.2 |
| Khan [Khan:2018] | 310 | 99.0 |
| WordImagenet | 310 | 98.8 |
| FragNet-64 [He:2020] | 310 | 99.1 |
| GR-RNN [He:2021] | 310 | 99.3 |
| Proposed methodology with post-weighting | 309 | 98.7 |

TABLE VI: Comparison of state-of-the-art methods on the CVL database

Fig. 7: Example of false acceptance. The samples in (a) and (b) are from different writers but are identified as the same writer by the proposed algorithm.

X Conclusion

In this work, we have proposed an offline text-independent writer identification system based on word images. Our algorithm uses SIFT keypoints to detect various types of features in a word image; these features vary from word to allograph. Such diverse features help make our system more robust than a system trained on whole word images. The major contributions of our proposed work are as follows: (i) the use of a convolution network trained on an auxiliary dataset instead of a network trained end to end on writer-dependent data, which makes the algorithm better suited to extracting writer features without the need to retrain the network; (ii) the use of a modified HOG feature representation to capture the spatial relation between the convolution layer outputs of the SIFT keypoints corresponding to different writers; (iii) an entropy-based weighting strategy to assign importance to each filter of a convolution layer. As a result of incorporating these modifications, the accuracy of our proposed algorithm increases, and it is found to be on par with, or even better than, several prior works.

One limitation of our proposed algorithm is that it requires good-quality segmented word images, which are very challenging to obtain from a highly cursive handwritten document. In future work, we would like to address this shortcoming and develop an algorithm that can perform writer identification on any handwritten document.