Semi-Supervised Feature Learning for Off-Line Writer Identification

07/15/2018 ∙ Shiming Chen et al. ∙ University of Technology Sydney

Conventional approaches to off-line writer identification rely on supervised learning. In this study, we improve off-line writer identification with a semi-supervised feature learning pipeline that trains on extra unlabeled data and the original labeled data simultaneously. Specifically, we propose a weighted label smoothing regularization (WLSR) method that assigns a weighted uniform label distribution to the extra unlabeled data. It regularizes the convolutional neural network (CNN) baseline, allowing the model to learn more discriminative features that represent the properties of different writing styles. In experiments on the ICDAR2013, CVL and IAM benchmark datasets, our results show that semi-supervised feature learning improves on the baseline and achieves better performance than existing writer identification approaches.


1 Introduction

Handwritten text, speech, fingerprints, and faces are commonly used as physiological biometric identifiers. Handwritten text in particular plays an important role in forensics and security for proving a person's authenticity. Research into writer identification has received renewed interest in recent years, for example in the analysis of historical documents during mass-digitization processes Kleber2013CVL ; Louloudis2013ICDAR ; Xing2016DeepWriter . Performing this identification manually, unfortunately, requires considerable time and detection costs. Therefore, many researchers have proposed state-of-the-art pattern recognition approaches that automatically recognize writing styles through machine learning Abdi2015A ; Christlein2017Writer ; Mallikarjunaswamy2013Writer ; Plamondona1989Automatic ; Yin2009Handwritten .

Writer identification aims to search for and recognize texts written by the same writer in a query database. It has been investigated on different handwritten scripts, such as English Bulacu2007Text ; Schomaker2004Automatic , Chinese He2015Junction ; He2008Writer ; Wu2014Offline , Arabic Abdi2015A , Indic Mallikarjunaswamy2013Writer , Persian Helli2010A and Latin scripts Christlein2017Unsupervised . This task generally presents substantial challenges because it requires the documents to be sorted according to high similarity (e.g., the distance between feature vectors). Writer identification can be classified as online or offline according to the handwritten document acquisition method. The latter can be further categorized into allograph-based and textural-based methods. Textural-based methods compute global statistics directly from handwritten documents (pages) Brink2012Writer ; Hannad2016Writer ; He2014Delta ; Newell2014 ; Nicolaou2015Sparse . For example, the angles of stroke directions, the width of the ink trace, and histograms of local binary patterns (LBP) and local ternary patterns (LTP) have been used for writer identification purposes. Allograph-based methods rely on local descriptors computed from small patches (allographs), from which a global document descriptor is statistically computed Christlein2017Writer ; Christlein2015Offline ; He2015Junction . These two families can also be combined to form a discriminative global feature Bulacu2007Text ; He2017Writer ; Wu2014Offline . The semi-supervised feature learning pipeline proposed in this work is allograph-based and targets offline writer identification.

Although writer identification has achieved excellent performance on some benchmark datasets, considerable challenges remain in real-world applications. First, the use of different pens, the physical condition of the writer, the presence of distractions (such as multitasking and noise), and changes in writing style with age are key factors behind the unsatisfactory performance of writer identification. Second, in the typically used benchmark datasets, the writers of the training set are different from those of the test set, and every writer contributes only a few handwritten text images. Third, the number of handwritten documents in benchmark datasets is far too small for convolutional neural network (CNN) model training; training a reliable CNN model on such limited data is therefore a challenge. Moreover, almost all published methods are based on supervised learning, which cannot achieve landmark results given the limited amount of labeled data in the benchmarks. Some researchers have used data augmentation to address these problems; however, the data augmentation methods used in writer identification easily lead to model overfitting and require a considerable amount of extra data. To overcome these challenges and better match the conditions of writer identification in practice, we propose a new approach.

CNNs are a well-known deep learning architecture inspired by the natural visual perception mechanisms of living creatures. Owing to their powerful ability to learn deep features, CNNs have been widely used and have achieved exciting performance in image classification, object recognition, and object detection and tracking He2016Deep ; Krizhevsky2012ImageNet ; Simonyan2015Very ; Szegedy2015Going . The recent progress in writer identification is mainly attributed to advancements in CNNs based on supervised Christlein2014Writer ; Christlein2017Writer ; Christlein2015Offline ; Fiel2015Writer ; He2017Writer ; Tang2017Text ; Xing2016DeepWriter and unsupervised feature learning Christlein2017Unsupervised . Features extracted from CNNs are more discriminative than handcrafted features. For example, Xing and Qiao Xing2016DeepWriter designed a multistream CNN structure for writer identification and achieved high identification accuracy on the IAM Marti2002The and HWDB Liu2013Online datasets using a small number of handwritten documents. In Christlein2015Offline , Christlein et al. proposed using activation features from CNNs as local descriptors for writer identification and improved identification performance on the ICDAR2013 dataset. Eldan and Shamir Eldan2015The showed that a deeper network learns a more discriminative representation but needs more resources to train. As a tradeoff, we apply a deep residual neural network with 50 layers (ResNet-50) in our work.

In contrast to supervised approaches, semi-supervised learning significantly surpasses supervised learning when annotated data are limited in the training set, e.g., with weakly labeled or unlabeled data Huang2017Semi ; Weston2012Deep ; Zhu2002Learning . In particular, semi-supervised learning saves the time and budget needed for annotating data when the volume of cleanly labeled data is limited. Some recent studies investigated semi-supervised learning pipelines that combine unsupervised with supervised learning Rasmus2015Semi ; Varior2016A or that assign an original label or a new label to unlabeled data Lee2013Pseudo ; Odena2016Semi ; Papandreou2016Weakly . Motivated by these studies, we use a modified semi-supervised learning method that assigns a weighted uniform label distribution to extra unlabeled data (extra data) according to the original labeled data (real data). We believe that the proposed approach has the potential to regularize the baseline and improve identification performance.

Therefore, we propose a semi-supervised method that leverages a deep CNN and the weighted label smoothing regularization (WLSR) to form a powerful model that learns discriminative representations for offline writer identification. Specifically, we first preprocess the original labeled data and the extra unlabeled data. Both are then fed into a deep residual neural network (ResNet) He2016Deep simultaneously. The WLSR method regularizes the learning process by integrating the unlabeled data, which reduces the risk of overfitting and directs the model to learn more effective and discriminative features. Finally, the local features of every test handwritten document are extracted and encoded into a global feature vector for identification.

To summarize, this study makes the following contributions:

A. This study is a pioneering work that uses a semi-supervised feature learning pipeline to integrate extra unlabeled images and original labeled images into the ResNet model for writer identification.

B. The WLSR method of semi-supervised learning is used to regularize the identification model with unlabeled data. We thoroughly evaluate its effectiveness on public datasets.

C. Our results show that the proposed semi-supervised learning model yields a consistent improvement over the deep residual network baseline and achieves better performance than existing approaches on benchmark datasets.

The remainder of this paper is organized as follows. Sec. 2 provides an overview of the related works in the field of writer identification. The process of the semi-supervised learning pipeline is presented in Sec. 3. The performance and evaluation are given in Sec. 4. Sec. 5 presents the discussion. Sec. 6 provides a summary and the outlook for future research.

2 Related Work

In this section, we review related work on writer identification that uses different data augmentation approaches to address the above challenges. Some researchers applied data augmentation within the original dataset Christlein2015Offline ; Fiel2015Writer ; Tang2017Text ; Xing2016DeepWriter , but this easily led to model overfitting. Two recent studies added extra labeled data to the original data to enlarge the training set, which in turn required a vast amount of extra data to improve the identification results Christlein2014Writer ; Christlein2017Writer .

S. Fiel et al. Fiel2015Writer used a series of image preprocessing methods (binarization, text-line segmentation, and a sliding window) and then generated a discriminative feature with CaffeNet for each image patch. Because CNNs must be trained on a large amount of data to achieve good results, they cut the line images into patches using a sliding window with a step size of 20 pixels and rotated each patch in 5-degree steps. The new training set thus consisted of more than 2,300,000 image patches, which artificially enlarged the original training set. The proposed algorithm achieved good performance on the ICDAR2011 Louloudis2011ICDAR and CVL Kleber2013CVL datasets but failed to improve performance on the ICDAR2013 Louloudis2013ICDAR dataset. Furthermore, the CNN was trained on word images of the IAM dataset, and the features of the CVL dataset were extracted from the pretrained CNN, suggesting that the IAM and CVL datasets share a similar sample space. In Tang2017Text , Tang introduced a new method for offline writer identification using a CNN and a joint Bayesian approach to cope with benchmark datasets that are insufficient for CNN model training. Tang used words segmented from handwritten documents as elements to permute the texts and generate a significant number of images, which were subsequently recombined into handwritten pages; all the reconstructed pages were then split into nonoverlapping patches for training. In Xing2016DeepWriter , Xing introduced a data augmentation method to enhance the performance of the proposed DeepWriter. However, these data augmentation methods only enlarge the dataset within its own sample space, and the existing models do not account for the generated data, leading to overfitting and limiting the feature learning ability of the CNNs.

In Christlein2014Writer , Christlein et al. created a combined dataset (MERGED) consisting of 559 scribes with four documents per writer, resulting in 2,236 documents from the ICDAR2013 and CVL datasets. The training set was thereby enlarged, and the outcomes on the MERGED dataset differ only slightly from those obtained with image vocabularies computed from the ICDAR2013 experimental set or the CVL dataset. Furthermore, Christlein et al. Christlein2017Writer showed that the identification rate on the CVL test set could be improved by adding additional datasets (ICDAR2011 and IAM Marti2002The ) to the CVL training set. Although existing data augmentation approaches can improve identification performance using extra data, they require a large amount of extra labeled data. In practice, however, we cannot collect a large number of labeled samples for writer identification.

In contrast to the aforementioned works, we employ a semi-supervised feature learning pipeline that allows data to be added without labels. We assume that this approach can effectively avoid overfitting and requires less extra data to improve the feature learning ability of the baseline.

Figure 1: The pipeline of semi-supervised feature learning, which consists of three parts: preprocessing (green dotted box), semi-supervised learning (blue dotted box) and encoding (purple dotted box). During training, the original labeled data and extra unlabeled data are shuffled and fed into the semi-supervised learning network for training. For testing, the local features (red rectangles with solid edge in encoding part) of testing handwritten documents are extracted from the fully connected layer of the pretrained model, and then all the local features of one handwritten test document are encoded into a global feature vector (blue rectangles with solid edge in encoding part).

3 Semi-supervised Feature Learning Pipeline

As shown in Fig. 1, our proposed semi-supervised feature learning pipeline consists of three parts. A. Preprocessing: For the ICDAR2013 dataset, the handwritten documents are segmented into line images by a line segmentation method Srinivasan2007A , and the line images are then split up with a sliding window approach without overlap. For the IAM and CVL datasets, we normalize the word images that are already provided. B. Semi-supervised learning: During training, the original labeled data (real data) and extra unlabeled data (extra data) are shuffled and simultaneously fed into the ResNet-50 baseline, which is regularized by WLSR. The trained model is then used to extract local features of the test handwritten documents; specifically, all local features are extracted from the fully connected layer, so all layers after it can be discarded. C. Encoding: We reduce the dimensionality of the local features with PCA whitening J2012Negative , and the vector of locally aggregated descriptors (VLAD) Jegou2012Aggregating is then used to encode the local features of every test document into a global feature vector, which is used for writer identification with the nearest neighbor approach. Each part is described in detail below.

Figure 2: Part of the line images of the ICDAR2013 dataset, segmented by the adopted line segmentation approach and normalized while maintaining their original aspect ratio.
Figure 3: Part of the patches extracted from the ICDAR2013 dataset (top row), word images provided by the CVL dataset (middle row) and word images provided by the IAM dataset (bottom row), all of which have been preprocessed. The patches of the ICDAR2013 dataset are normalized to $256 \times 256$. Each word image in the CVL and IAM datasets is resized so that its width or height becomes 256 pixels while the original aspect ratio is maintained.

3.1 Preprocessing

First, binarization is applied to all handwritten pages with Otsu's method Otsu2007A . Second, the handwritten pages have to be segmented. Because the CVL Kleber2013CVL and IAM Marti2002The datasets already provide a segmentation of the words, these images are directly used for training and evaluation after normalization, as shown in Fig. 3. For the ICDAR2013 Competition on Writer Identification dataset Louloudis2013ICDAR , the handwritten documents are segmented into lines with the method proposed by Arivazhagan Srinivasan2007A , a statistical approach that segments the text lines exactly. In addition, we normalize the line images to a height of 256 pixels and maintain their aspect ratio. Finally, all text lines are cut into patches of size $256 \times 256$ without overlap using the sliding window approach. Some line images and patches of the ICDAR2013 dataset are shown in Fig. 2 and Fig. 3, respectively. Furthermore, we remove noise patches (e.g., blank patches) to avoid adverse effects.
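To make this stage concrete, the sketch below shows one plausible implementation; the OpenCV calls, the 1% ink-ratio threshold used to discard blank patches, and the helper name are our assumptions rather than the authors' code.

import cv2

def preprocess_line_image(line_img, height=256, patch_w=256, min_ink=0.01):
    """Minimal preprocessing sketch: Otsu binarization, height normalization
    to 256 px, non-overlapping 256x256 patches. line_img: grayscale array."""
    # Otsu's method selects the binarization threshold automatically.
    _, binary = cv2.threshold(line_img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Normalize to a height of 256 px, maintaining the aspect ratio.
    h, w = binary.shape
    new_w = max(patch_w, round(w * height / h))
    binary = cv2.resize(binary, (new_w, height))
    # Sliding window without overlap: the step size equals the patch width.
    patches = [binary[:, x:x + patch_w]
               for x in range(0, new_w - patch_w + 1, patch_w)]
    # Discard near-blank (noise) patches; text pixels are 0 after binarization.
    # The 1% ink-ratio cutoff is an assumed value, not taken from the paper.
    return [p for p in patches if (p == 0).mean() >= min_ink]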

3.2 Semi-supervised Learning

In this section, we thoroughly introduce the process of the proposed semi-supervised learning. Semi-supervised learning is based on a baseline (ResNet-50) and WLSR method. The baseline serves as an identification model, and the local features of testing handwritten pages are extracted from the fully connected layer of the baseline during testing. WLSR regularizes the baseline and directs the model to learn more discriminative features.

3.2.1 CNN baseline

K. He et al. He2016Deep first proposed ResNet for image classification and object recognition and achieved exciting results; ResNet subsequently became widely used in other tasks due to its strong feature learning ability. In this work, ResNet-50 is used as the baseline because it learns discriminative representations without consuming too much time or computational budget for writer identification. A ResNet consists of residual units with two branches: one branch contains several convolutional layers and learns features of the input, and the other bypasses them and forwards the result of the previous layer. These units help the CNN model preserve identity mappings and support a deeper structure. Following the conventional fine-tuning strategy, we use a model pretrained on ImageNet. To avoid model overfitting and to learn more discriminative features, we add a rectified linear unit (ReLU) layer Glorot2011Deep and replace the original pooling layer with a global average pooling layer Lin2013Network before the fully connected layer. In addition, we modify the last layer to have K neurons to predict the K classes, where K is the number of classes in the original training data. The extra data are mixed with the original data as input to the CNN; that is, the original labeled training data and the extra unlabeled data are shuffled and trained simultaneously. After training, the local features of all test handwritten documents are extracted from the fully connected layer. Additional implementation details are provided in Sec. 4.3.

3.2.2 Weighted Label Smoothing Regularization Method

Label smoothing regularization (LSR) was first used for fully supervised learning in the 1980s and was recently revived to regularize the classifier layer by estimating the marginalized effect of label dropout during training Szegedy2016Rethinking . In the person re-identification task, Zheng et al. Zheng2017Unlabeled extended LSR to label smoothing regularization for outliers (LSRO), which leverages unlabeled data generated by a GAN and sets the virtual label distribution to be uniform over all classes, effectively regularizing the baseline model and achieving better retrieval performance than the baseline. In this work, we propose the WLSR method to regularize the CNN baseline with extra unlabeled data for offline writer identification. WLSR sets the virtual label distribution to a weighted uniform distribution over all classes, which effectively regularizes the baseline according to the original training data distribution. For instance, if the original training set contains many common features that do not benefit writer identification (e.g., certain ink traces and stroke widths), the identification model may be misled into taking these common features as discriminative representations, which limits the discriminative ability of the model. If, however, extra unlabeled data exhibiting these common features are added to training, the classifier will make incorrect predictions toward the labeled classes and will therefore be penalized. Moreover, the regularization ability of WLSR is determined by the similarity of the sample space between the original labeled data and the extra unlabeled data: the closer the extra unlabeled data lie to the original training data in the sample space, the more effective the regularization of WLSR; otherwise, the performance of WLSR will be undesirable.

WLSR is designed to work with the cross-entropy loss. Formally, let $K$ be the number of classes in the original training data and $N$ be the number of original training examples. The cross-entropy loss is shown in Eq. (1):

$$l = -\sum_{k=1}^{K} q(k)\,\log p(k), \qquad (1)$$

where $p(k)$ is the predicted probability of a training sample belonging to class $k$, derived from the softmax function that normalizes the output of the previous CNN layer, and $q(k)$ is the ground-truth distribution. Let $y$ be the ground-truth class label. A pair $(x_i, y_i)$, with $i \in \{1, \dots, N\}$, is called an original training example.

(a) Label distribution of real data
(b) Label distribution of extra data
Figure 4: The label distributions of real data and extra data used in our proposed semi-supervised feature learning pipeline. The cross-entropy loss combines them and will be simultaneously optimized (Eq. (8)). (a) The label distribution of real data (Eq. (2)) is a one-hot distribution, which shows that the original cross-entropy loss only takes the ground-truth term into account (Eq. (3)). (b) We propose the virtual weighted uniform label distribution for the extra data (Eq. (6)), which is assumed to not belong to any predefined training classes. All extra data will result in an incorrect prediction, and thus, the network will be penalized.

For the original labeled data of the training set, its ground-truth distribution is shown in Fig. 4(a). It can be formulated as:

$$q(k) = \begin{cases} 1, & k = y \\ 0, & k \neq y \end{cases} \qquad (2)$$

Combining Eq. (1) and Eq. (2), the cross-entropy loss of real data can be rewritten as:

$$l_{\mathrm{real}} = -\log p(y). \qquad (3)$$

From Eq. (3), it is clear that minimizing $l_{\mathrm{real}}$ is equivalent to maximizing the predicted probability of the ground-truth class.

However, LSR takes the distribution of the non-ground-truth classes into consideration Szegedy2016Rethinking and discourages the network from being overconfident in its predictions. Formally, its label distribution is formulated as:

$$q_{\mathrm{LSR}}(k) = (1-\varepsilon)\,\delta(k,y) + \frac{\varepsilon}{K}, \qquad (4)$$

where $\varepsilon \in [0,1]$ is a smoothing parameter and $\delta(k,y)$ equals 1 for $k = y$ and 0 otherwise. Intuitively, if $\varepsilon$ is too large, the network may fail to predict the ground-truth label. Combining Eq. (1) and Eq. (4), the cross-entropy loss is written as:

$$l_{\mathrm{LSR}} = -(1-\varepsilon)\log p(y) - \frac{\varepsilon}{K}\sum_{k=1}^{K}\log p(k). \qquad (5)$$

Thus, $l_{\mathrm{LSR}}$ not only takes the ground-truth class into account but also pays attention to the other classes, which effectively counteracts network overfitting.

We extend LSR from the supervised domain to the semi-supervised domain and propose weighted label smoothing regularization (WLSR) to train the extra unlabeled data. Specifically, we set the virtual label distribution of the extra unlabeled data to a weighted uniform distribution over all classes, weighted according to the real data distribution, as shown in Fig. 4(b). The label distribution of the extra data can thus be formulated as:

$$q_{\mathrm{WLSR}}(k) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(y_i = k). \qquad (6)$$

Combining Eq. (1) and Eq. (6), the cross-entropy loss of the extra data can be written as:

$$l_{\mathrm{extra}} = -\sum_{k=1}^{K}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(y_i = k)\right)\log p(k), \qquad (7)$$

where $\mathbb{1}(\cdot)$ is an indicator function that equals 1 when its argument holds and 0 otherwise. The proposed semi-supervised feature learning pipeline shuffles and simultaneously trains the real data and the extra data. Combining Eq. (3) and Eq. (7), we write the cross-entropy loss of semi-supervised feature learning as:

$$l_{\mathrm{WLSR}} = -(1 - Z)\log p(y) - Z\sum_{k=1}^{K} q_{\mathrm{WLSR}}(k)\log p(k), \qquad (8)$$

where $Z$ is an indicator: $Z = 1$ for the extra data and $Z = 0$ for the original training data. The proposed semi-supervised feature learning method therefore has two types of losses: one for real images and one for extra images.
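A compact PyTorch sketch of the loss in Eq. (8) follows. It assumes our reading of Eq. (6), namely that the weighted uniform distribution is the class-frequency prior of the real training data; the function and argument names are illustrative.

import torch
import torch.nn.functional as F

def wlsr_loss(logits, targets, is_extra, class_prior):
    """Sketch of the WLSR loss (Eq. (8)), under our assumptions.
    logits: (B, K); targets: (B,) class indices (a dummy valid index,
    e.g. 0, for extra samples, since it is masked out below);
    is_extra: (B,) float in {0, 1} (the indicator Z); class_prior: (K,)."""
    log_p = F.log_softmax(logits, dim=1)
    # Real data: standard cross-entropy with the one-hot label (Eq. (3)).
    loss_real = F.nll_loss(log_p, targets, reduction='none')
    # Extra data: cross-entropy against the weighted uniform label (Eq. (7)).
    loss_extra = -(class_prior.unsqueeze(0) * log_p).sum(dim=1)
    # Z selects the appropriate term per sample (Eq. (8)).
    return ((1.0 - is_extra) * loss_real + is_extra * loss_extra).mean()

Under this reading, class_prior can be computed once from the real labels, e.g. torch.bincount(all_labels, minlength=K).float() / N.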

Figure 5: Visualization of the activation maps of the test patches of the ICDAR2013 test set in the baseline (ResNet-50) and the proposed semi-supervised learning model (baseline + WLSR). The baseline and the proposed semi-supervised learning network activate different patterns to the content of the patches. We can observe that the activation maps of the semi-supervised learning network more correctly and clearly show the contents of the test patches than the activation maps extracted from the baseline.

To find the differences between the baseline ResNet-50 and the proposed semi-supervised learning pipeline (baseline+WLSR), we visualize the intermediate feature maps of the two pretrained models. We take some patches of the ICDAR2013 test set for testing; the selected patches belong to various handwritten documents that perform poorly with the baseline but achieve the desired results with the semi-supervised learning model. For each patch, its activation is obtained from the intermediate layer "res4fx" of the network, whose size is 14 × 14, and we visualize the sum of several activation maps. As shown in Fig. 5, the baseline network and the proposed semi-supervised learning network activate different patterns in the content of the patches. In particular, the activation maps of the semi-supervised learning model exhibit the contents of the test patches more correctly and clearly than those extracted from the baseline. That is, the representations of the semi-supervised learning model are more discriminative, which is why the proposed method produces better results than the baseline.

3.3 Encoding

All local descriptors are extracted from the pretrained model during testing and must be aggregated into a global feature vector for each test document. First, we reduce the dimensionality of the local descriptors with PCA whitening, which has been shown to effectively reduce identification time and improve identification performance Christlein2017Writer ; Christlein2017Unsupervised . Then, we encode all local descriptors of each test page into a global feature vector with VLAD, which captures first-order statistics by aggregating the residuals of local features relative to their nearest cluster centroids. VLAD is a standard encoding method that has been widely used in writer identification Christlein2015Writer ; Christlein2017Unsupervised and other information retrieval tasks Chattopadhyay2016Supervised ; Paulin2016Convolutional . Formally, a codebook $D = \{\mu_1, \dots, \mu_k\}$ is first computed by k-means with $k$ centroids, and each local feature of a test handwritten image is assigned to its nearest cluster centroid. Then, the residuals between each cluster centroid and its assigned local features are accumulated per cluster:

$$v_j = \sum_{x_i :\, \mathrm{NN}(x_i) = \mu_j} (x_i - \mu_j), \qquad (9)$$

where $\mathrm{NN}(x_i)$ refers to the nearest neighbor of $x_i$ in the dictionary $D$. All $v_j$ are concatenated into a global feature vector of one handwritten page:

$$V = \left[v_1^{\top}, v_2^{\top}, \dots, v_k^{\top}\right]. \qquad (10)$$

Thus, the global feature of each test document is $k \times d$-dimensional, where $d$ is the dimensionality of the (reduced) local descriptors.
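The encoding stage can be sketched with NumPy and scikit-learn as below. The number of PCA components, the final L2 normalization (a common VLAD convention that the text does not spell out) and the helper names are our assumptions; k=1 is the setting that performs best in Sec. 4.3.2.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def fit_encoder(train_feats, n_components=256, k=1):
    """Fit PCA whitening and the k-means codebook on training descriptors."""
    pca = PCA(n_components=n_components, whiten=True).fit(train_feats)
    kmeans = KMeans(n_clusters=k).fit(pca.transform(train_feats))
    return pca, kmeans

def vlad_encode(local_feats, pca, kmeans):
    """Encode the (n, d) local descriptors of one document as one vector."""
    x = pca.transform(local_feats)
    centers = kmeans.cluster_centers_                 # codebook D
    assign = kmeans.predict(x)                        # NN(x_i), Eq. (9)
    v = np.zeros_like(centers)
    for i, c in enumerate(assign):
        v[c] += x[i] - centers[c]                     # accumulate residuals
    v = v.reshape(-1)                                 # concatenate, Eq. (10)
    return v / (np.linalg.norm(v) + 1e-12)            # assumed L2 normalization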

4 Evaluation

In the following sections, we describe the datasets and evaluation metrics that we used for evaluating our proposed method. Then, we verify that WLSR has the potential to regularize the baseline for improving identification performance. Furthermore, we show the impacts of using various dimensions of local features, different numbers of extra unlabeled data during training and different centroids of k-means during encoding. Finally, we compare our method to other methods for writer identification.

4.1 Datasets

Three different benchmark datasets are used for evaluation: the ICDAR2013 dataset (http://rrc.cvc.uab.es/) Louloudis2013ICDAR , the CVL dataset (https://cvl.tuwien.ac.at/research/cvl-databases/) Kleber2013CVL and the IAM dataset (http://www.fki.inf.unibe.ch/databases/iam-handwriting-database/) Marti2002The . All of these datasets are public and have been used in many recent publications Christlein2014Writer ; Christlein2017Writer ; Fiel2015Writer ; Nicolaou2015Sparse ; Tang2017Text ; Xing2016DeepWriter . Of note, Fiel Fiel2015Writer trained a network on the IAM dataset and evaluated it on the CVL dataset, achieving good performance; this suggests that the word images of the IAM and CVL datasets share a similar sample space. Tang Tang2017Text trained his model on the ICDAR2013 dataset, tested it on the CVL dataset and obtained an impressive identification effect, which reveals that the patches of the CVL and ICDAR2013 datasets have a highly similar sample space. Therefore, we take IAM word images and CVL patches as the extra unlabeled data when evaluating on CVL word images and ICDAR2013 patches, respectively.

ICDAR2013 Louloudis2013ICDAR : The ICDAR2013 benchmark dataset is divided into a training set with documents written by 100 writers and a test set with documents written by 250 writers. Every writer contributed four documents, including two Greek documents and two English documents.

CVL Kleber2013CVL : There are 310 writers who contributed documents to the CVL dataset. The 27 writers of the training set contributed seven documents each, and the 283 writers of the test set contributed five documents each. Each writer contributed one German document; the remaining documents are in English.

IAM Marti2002The : The IAM dataset contains 1,066 forms contributed by approximately 400 writers. In the collection, 82,227 word examples are built from a vocabulary of 10,841 words. All documents are written in English.

4.2 Evaluation Metrics

The mean average precision (mAP) and hard TOP-k, which are common evaluation metrics in image and information retrieval tasks, are used for our experimental evaluation.

A ranked list of all documents in the query library is generated according to the similarity to each query document. Suppose there are $Q$ query documents; the average precision of the $q$-th ($q \in \{1, \dots, Q\}$) query document is given by Eq. (11):

$$AP(q) = \frac{\sum_{i=1}^{n} P(i)\,\mathrm{rel}(i)}{R}, \qquad (11)$$

where $n$ is the number of documents in the query library and $R$ is the number of relevant documents for the query document in the query library. $P(i)$ is the precision at rank $i$, given by the number of documents from the same writer among the top $i$ retrieved documents divided by $i$. $\mathrm{rel}(i)$ is an indicator function: $\mathrm{rel}(i) = 1$ when the document retrieved at rank $i$ is from the same writer, and $\mathrm{rel}(i) = 0$ otherwise.

The mAP is the mean of the average precision over all query documents:

$$\mathrm{mAP} = \frac{1}{Q}\sum_{q=1}^{Q} AP(q). \qquad (12)$$

The hard TOP-k is the percentage of queries for which all k highest-ranked documents are from the same writer.
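Both metrics can be computed as in the following sketch; the binary relevance lists are assumed to be precomputed from the ranked retrieval results.

import numpy as np

def average_precision(rel):
    """rel: binary relevance vector of one query's ranked list (Eq. (11))."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    p_at_i = np.cumsum(rel) / (np.arange(rel.size) + 1)   # P(i)
    return (p_at_i * rel).sum() / rel.sum()

def mean_ap(rel_lists):
    """Mean average precision over all queries (Eq. (12))."""
    return float(np.mean([average_precision(r) for r in rel_lists]))

def hard_top_k(rel_lists, k):
    """Percentage of queries whose k top-ranked documents all match."""
    return 100.0 * np.mean([np.all(np.asarray(r)[:k] == 1) for r in rel_lists])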

4.3 Experiments

The proposed method was evaluated on the ICDAR2013, CVL and IAM benchmark datasets. We present the implementation details and analysis of the experimental results in the following.

4.3.1 Implementation Details

In this work, we adopt the ResNet-50 model as the baseline. To gather more abstract features, we replace the original pooling layer with a global average pooling layer and add a ReLU activation layer. Furthermore, the last fully connected layer is modified to have 100 and 27 neurons for ICDAR2013 and CVL, respectively. We add a dropout layer before the last convolutional layer and set the dropout rate to 0.5 for training. The momentum of stochastic gradient descent is set to 0.9. We set the learning rate of the convolutional layers to 0.1 and decay it to 0.01 after 45 epochs. To evaluate on ICDAR2013, we take the ICDAR2013 training image patches as the original labeled data and the CVL training image patches as the extra unlabeled data. The CVL and IAM datasets already provide a segmentation of words, so we directly take the CVL training words as the original labeled data and the IAM words as the extra unlabeled data to evaluate on the CVL dataset. The size of the segmented image patches is set to $256 \times 256$, while the width or height of the word images is set to 256 pixels with the original aspect ratio maintained. We extract the local features of the test images from the first fully connected layer. The similarity between two handwritten documents is calculated with the Euclidean distance for ranking.
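Under the stated hyperparameters, the optimization setup could be configured as in this sketch. The total epoch count is an assumption, and Baseline, wlsr_loss, loader and class_prior refer to the earlier sketches or are placeholders for the data pipeline.

import torch

def train(model, loader, class_prior, epochs=60):
    """Sketch of the stated training setup: SGD, momentum 0.9, lr 0.1
    decayed to 0.01 after 45 epochs; epochs=60 is an assumed total."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[45], gamma=0.1)   # 0.1 -> 0.01 after epoch 45
    for _ in range(epochs):
        for images, labels, is_extra in loader:  # shuffled real + extra data
            optimizer.zero_grad()
            loss = wlsr_loss(model(images), labels, is_extra, class_prior)
            loss.backward()
            optimizer.step()
        scheduler.step()

# Usage sketch: model = Baseline(num_classes=100) for the 100 ICDAR2013
# training writers; train(model, loader, class_prior).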

Figure 6: The influence of the number of centroids during encoding with VLAD. The mAP on the CVL dataset (red solid line) and the ICDAR2013 dataset (blue dotted line) changes with the number of k-means centroids.
Features TOP-1 TOP-2 TOP-3 TOP-4 mAP
Fc-512 97.9 97.0 93.6 85.0 96.4
Fc-1024 98.4 97.4 94.9 87.9 97.0
Fc-2048 99.2 98.2 96.0 90.2 98.0
Fc-4096 98.5 97.6 94.7 88.0 97.3
Table 1: The influence of the number of neurons of the fully connected layer on the CVL test set evaluated with the hard TOP-k and mAP metrics (%).

4.3.2 Experimental Results

First, we evaluate how the number of neurons in the fully connected layer affects writer identification. The number of neurons is set to 512, 1024, 2048, and 4096 and assessed on the CVL dataset, as shown in Table 1. It is evident that the semi-supervised feature learning pipeline achieves the best performance on the hard TOP-k and mAP metrics when the number of neurons of the first fully connected layer is set to 2048. Thus, all the following experiments use this configuration.

Dataset Extra data TOP-1 TOP-2 TOP-3 TOP-4 mAP
CVL 0 (baseline) 98.3 97.0 92.5 87.0 95.7
CVL 12000 (baseline) 98.4 97.0 94.0 87.2 96.8
CVL 12000 (baseline+WLSR) 99.2 97.9 96.0 90.2 97.8
ICDAR2013 0 (baseline) 94.9 74.6 55.1 N/A 88.0
ICDAR2013 1000 (baseline) 95.1 74.3 57.3 N/A 88.1
ICDAR2013 1000 (baseline+WLSR) 96.6 79.0 61.1 N/A 90.1
Table 2: Comparison of the proposed semi-supervised feature learning pipeline vs. the baseline on the CVL and ICDAR2013 test sets, evaluated with the hard TOP-k and mAP metrics (%).
Extra unlabeled images TOP-1 TOP-2 TOP-3 TOP-4 mAP
0 (baseline+WLSR) 98.3 97.0 92.5 87.0 95.7
1000 (baseline+WLSR) 98.8 97.9 95.0 88.5 97.3
5000 (baseline+WLSR) 98.9 97.9 95.4 88.9 97.5
12000 (baseline+WLSR) 99.2 97.9 96.0 90.2 97.8
24000 (baseline+WLSR) 99.0 97.9 95.2 89.9 97.6
Table 3: Comparison of the effect of various numbers of extra unlabeled images on the CVL test set evaluated with the hard TOP-k and mAP metrics (%).
Extra unlabeled images TOP-1 TOP-2 TOP-3 mAP
0 (baseline+WLSR) 94.9 74.6 55.1 88.0
500 (baseline+WLSR) 94.8 75.5 56.3 88.1
1000 (baseline+WLSR) 96.6 79.0 61.1 90.1
2000 (baseline+WLSR) 96.5 78.6 59.6 90.0
5000 (baseline+WLSR) 94.9 74.3 56.5 88.0
Table 4: Comparison of the effects of the numbers of extra unlabeled images on the ICDAR2013 test set evaluated with the hard TOP-k and mAP metrics (%).
Method TOP-1 TOP-2 TOP-3 TOP-4 mAP
CS-UMD Kleber2013CVL 97.9 90.0 71.2 48.3 N/A
QUQA A Kleber2013CVL 30.5 5.7 0.5 0.1 N/A
QUQA B Kleber2013CVL 92.9 84.9 71.5 50.6 N/A
TEBESSA-c Kleber2013CVL 97.6 94.3 88.2 73.9 N/A
TSINGHUA Kleber2013CVL 97.7 95.3 94.5 7.30 N/A
Fiel et al. Fiel2015Writer 98.9 97.6 93.3 79.9 N/A
Christlein et al. Christlein2014Writer 99.2 98.1 95.8 88.7 97.1
Nicolaou et al. Nicolaou2015Sparse 99.0 97.7 95.2 86.0 N/A
Christlein et al. Christlein2017Writer 98.8 97.8 95.3 88.8 96.4
Ours (single) 99.2 97.9 96.0 90.2 97.8
Ours (2-streams) 99.2 98.4 96.1 91.5 98.0
Table 5: Comparison of the performance with other methods on the CVL test set. Hard TOP-k and mAP metrics are listed (%).

Second, we analyze the influence of the number of centroids used when encoding with VLAD. In general, a larger k yields better retrieval performance on a large dataset. The experimental results on the ICDAR2013 and CVL datasets are shown in Fig. 6. When the number of centroids is set to 1, we achieve the largest mAP (98.0% and 90.1% on the CVL and ICDAR2013 datasets, respectively). Moreover, the mAP on both benchmarks consistently decreases as the number of centroids increases. Three reasons may explain these results: A. The ICDAR2013 and CVL datasets are small and therefore do not need a larger image vocabulary to represent them. B. Every writer in a given dataset wrote documents with the same content, so the diversity of the dataset is limited. C. The dimensionality of the local features is large (2048 in this work, compared to 64 in Jegou2012Aggregating ), so the local features are already discriminative on their own.

Third, we verify the regularization ability of the WLSR method in the semi-supervised feature learning pipeline. The same extra data were added to the supervised baseline (with labels) and to the proposed semi-supervised pipeline (without labels) for training. As shown in Table 2, the extra labeled data added to the baseline have only a marginal effect on writer identification, while the semi-supervised learning pipeline uses the same data, unlabeled, to improve the identification rate on both the CVL and ICDAR2013 datasets, which shows that the regularization of WLSR improves the performance of the baseline.

Moreover, we compare the proposed semi-supervised learning pipeline with the baseline. As shown in Table 2, when we add 12000 extra unlabeled IAM words to the CNN for training, our method significantly improves writer identification performance on the CVL test set: the WLSR method achieves improvements of 0.9% (from 98.3% to 99.2%), 0.9% (from 97.0% to 97.9%), 3.5% (from 92.5% to 96.0%), 3.2% (from 87.0% to 90.2%) and 2.1% (from 95.7% to 97.8%) in hard TOP-1, hard TOP-2, hard TOP-3, hard TOP-4, and mAP, respectively. On ICDAR2013, we observe improvements of 1.7%, 4.4%, 6.0% and 2.1% in hard TOP-1, hard TOP-2, hard TOP-3, and mAP, respectively, when 1000 extra unlabeled CVL patches are added to the ICDAR2013 training data, as shown in Table 2. It is thus evident that the proposed semi-supervised feature learning pipeline effectively improves the performance of the baseline.

Method TOP-1 TOP-2 TOP-3 mAP
CS-UMD-b Louloudis2013ICDAR 95.0 20.2 8.4 N/A
HIT-ICG Louloudis2013ICDAR 94.8 63.2 36.5 N/A
TEBESSA-c Louloudis2013ICDAR 93.4 62.6 36.5 N/A
CVL-IPK Louloudis2013ICDAR 90.9 44.8 24.5 N/A
Fiel et al. Fiel2015Writer 88.5 40.5 15.8 N/A
Christlein et al. Christlein2014Writer 97.1 42.8 23.8 67.1
Nicolaou et al. Nicolaou2015Sparse 97.2 52.9 29.2 N/A
Christlein et al. Christlein2017Writer 98.2 71.2 47.7 81.4
Ours (single) 96.6 79.0 61.1 90.1
Ours (2-streams) 97.7 83.3 63.7 91.8
Table 6: Comparison of the performance with the other methods on the ICDAR2013 test set. Hard TOP-k and mAP metrics are shown (%).

In addition, we find that the amount of extra unlabeled data profoundly affects the regularization ability of WLSR. If too little extra unlabeled data are incorporated into the pipeline, the regularization of the WLSR is insufficient. In contrast, if too much extra unlabeled data are added, the pipeline tends to assign weighted uniform prediction probabilities to all training data, as shown in Table 3 and Table 4. Therefore, the appropriate amount of extra unlabeled data that should be added to the system varies by dataset to avoid poor regularization and pipeline overfitting.

Finally, we combined the two models generated by our method into an ensemble (2-stream) to further enhance identification performance and compared our proposed method with other published methods on the ICDAR2013 and CVL datasets, as listed in Table 5 and Table 6, respectively. The semi-supervised learning pipeline achieves better results than most other supervised approaches. On the CVL dataset, we achieve hard TOP-1=99.2%, hard TOP-2=98.4%, hard TOP-3=96.1%, hard TOP-4=91.5%, and mAP=98.0%, which are better results than those of the other supervised methods. On ICDAR2013, we achieve hard TOP-1=97.7%, hard TOP-2=83.3%, hard TOP-3=63.7%, and mAP=91.8%, which are also very competitive results compared to those of the other methods. In particular, the proposed semi-supervised learning method produces the desired performance on the ICDAR2013 test set with only a few extra unlabeled patches from the CVL training set, while Christlein et al. Christlein2014Writer added the entire CVL training set to ICDAR2013 for training and achieved ordinary results. The results in Table 5 and Table 6 show that the semi-supervised feature learning method takes full advantage of the extra data and is more convenient to use in practice than other supervised methods Christlein2014Writer ; Christlein2017Writer ; Christlein2015Offline ; Fiel2015Writer ; Kleber2013CVL ; Louloudis2013ICDAR ; Nicolaou2015Sparse . Fig. 7 presents some identification results achieved by the proposed semi-supervised feature learning method (single) on the ICDAR2013 dataset (samples 1-2, 22-4, 24-3, and 248-1). The images with a gray border are the query images. The retrieved images are sorted by similarity score from top to bottom (Rank-1 to Rank-5); images with a green border are correct candidates, and images with a red border are incorrect candidates. Most ground-truth candidate images are correctly identified.

Figure 7: Writer identification results of the proposed semi-supervised feature learning method (single) on the ICDAR2013 dataset (sample 1-2, sample 22-4, sample 24-3, and sample 248-1). The images (gray border) are the query images. The identification images are sorted according to the similarity scores from top to bottom (from Rank-1 to Rank-5). We maintain the original aspect ratio of the images.

5 Discussion

In this study, we visualized the intermediate feature maps of the baseline and the semi-supervised feature learning pipeline (Sec. 3.2.2), showing that the activation maps of the semi-supervised learning model exhibit the contents of test patches more correctly than those extracted from the baseline. We then analyzed the impact of the dimensionality of the local features, the number of centroids in VLAD encoding and the amount of extra unlabeled data (Sec. 4.3.2). Moreover, we experimentally showed that the proposed method significantly improves on the baseline and performs competitively with existing writer identification approaches, which benefits from the regularization potential of WLSR: WLSR takes full advantage of extra unlabeled data to regularize the baseline, so the CNN learns effective and discriminative features.

Because extracted features contain some common, non-discriminative representations, some researchers have combined multiple handcrafted descriptors to derive a more reliable discriminative feature while restraining the impact of the common features. For example, Helli extracted features using Gabor and XGabor filters and then developed a feature relation graph Helli2010A . Brink et al. combined the width of the ink trace, a powerful source of information for offline writer identification, with stroke directions to form the Quill feature Brink2012Writer . In He2015Junction , a novel junction detection method was proposed for writer identification using the stroke-length distribution and the ink direction of texts. Motivated by the above methods, we proposed the WLSR method to regularize and penalize the common features automatically learned by the CNN, reducing their negative influence.

Admittedly, our proposed semi-supervised feature learning pipeline has a limitation: WLSR depends on the similarity of the sample space between the original labeled data and the extra unlabeled data. In the future, generative adversarial networks (GANs), in which two neural networks compete in a zero-sum game, may be a promising way to overcome this limitation. Because data generated by a GAN share the same sample space as the original data, no extra data from other datasets would be required.

6 Conclusion

In this paper, we proposed a semi-supervised feature learning pipeline for offline writer identification. To the best of our knowledge, this is the first attempt to apply semi-supervised feature learning in the field of writer identification. Of note, the WLSR method is introduced to train the extra unlabeled data and the original labeled data simultaneously, giving the semi-supervised learning pipeline its regularization ability; it improves the identification results of the baseline model and achieves better performance than other popular methods on the CVL and ICDAR2013 datasets.

In the future, we will choose a better encoding method that is suitable for small datasets of writer identification tasks to replace VLAD. Furthermore, we will adopt the unlabeled data generated by GANs to train the semi-supervised learning network because the generated data share a similar sample space with the original labeled data.

References

  • [1] M. N. Abdi and M. Khemakhem. A model-based approach to offline text-independent arabic writer identification and verification. Pattern Recognition, 48(5):1890–1903, 2015.
  • [2] A. A. Brink, J. Smit, M. L. Bulacu, and L. R. B. Schomaker. Writer identification using directional ink-trace width measurements. Pattern Recognition, 45(1):162–171, 2012.
  • [3] M. Bulacu and L. Schomaker. Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell., 29(4):701–717, 2007.
  • [4] C. Chattopadhyay and S. Das. Supervised framework for automatic recognition and retrieval of interaction: a framework for classification and retrieving videos with similar human interactions. IET Computer Vision, 10(3):220–227, 2016.
  • [5] V. Christlein, D. Bernecker, and E. Angelopoulou. Writer identification using vlad encoded contour-zernike moments. In Proceedings of the International Conference on Document Analysis and Recognition, pages 906–910, 2015.
  • [6] V. Christlein, D. Bernecker, F. Honig, and E. Angelopoulou. Writer identification and verification using gmm supervectors. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pages 998–1005, 2014.
  • [7] V. Christlein, D. Bernecker, F. Hönig, A. Maier, and E. Angelopoulou. Writer identification using gmm supervectors and exemplar-svms. Pattern Recognition, 63:258–267, 2017.
  • [8] V. Christlein, D. Bernecker, A. Maier, and E. Angelopoulou. Offline writer identification using convolutional neural network activation features. In Proceedings of the German Conference on Pattern Recognition, pages 540–552, 2015.
  • [9] V. Christlein, M. Gropp, S. Fiel, and A. Maier. Unsupervised feature learning for writer identification and writer retrieval. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 991–997, 2018.
  • [10] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. arXiv:1512.03965, 2015.
  • [11] S. Fiel and R. Sablatnig. Writer identification and retrieval using a convolutional neural network. In Proceedings of the International Conference in Computer Analysis of Images and Patterns, pages 26–37, 2015.
  • [12] G. Louloudis, N. Stamatopoulos, and B. Gatos. Icdar 2011 writer identification contest. In Proceedings of the International Conference on Document Analysis and Recognition, pages 1475–1479, 2011.
  • [13] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
  • [14] Y. Hannad, I. Siddiqi, and M. E. Y. E. Kettani. Writer identification using texture descriptors of handwritten fragments. Expert System With Application, 47:14–22, 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [16] S. He and L. Schomaker. Delta-n hinge: Rotation-invariant features for writer identification. In Proceedings of the International Conference on Pattern Recognition, pages 2023–2028, 2014.
  • [17] S. He and L. Schomaker. Writer identification using curvature-free features. Pattern Recognition, 63:451–464, 2017.
  • [18] S. He, M. Wiering, and L. Schomaker. Junction detection in handwritten documents and its application to writer identification. Pattern Recognition, 48(12):4036–4048, 2015.
  • [19] Z. He, X. You, and Y. Y. Tang. Writer identification of chinese handwriting documents using hidden markov tree model. Pattern Recognition, 41(4):1295–1307, 2008.
  • [20] M. M. Helli B. A text-independent persian writer identification based on feature relation graph (frg). Pattern Recognition, 43(6):2199–2209, 2010.
  • [21] G. Huang, S. Song, J. N. Gupta, and C. Wu. Semi-supervised and unsupervised extreme learning machines. IEEE Trans. on Cybernetics, 44(12):2405–2417, 2014.
  • [22] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell., 34(9):1704–1716, 2012.
  • [23] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: the benefit of pca and whitening. In Proceedings of the European Conference on Computer Vision, pages 774–787, 2012.
  • [24] F. Kleber, S. Fiel, M. Diem, and R. Sablatnig. Cvl-database: An off-line database for writer retrieval, writer identification and word spotting. In Proceedings of the International Conference on Document Analysis and Recognition, pages 560–564, 2013.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
  • [26] D. H. Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the ICML 2013 Workshop: Challenges in Representation Learning, pages 1–6, 2013.
  • [27] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
  • [28] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang. Online and offline handwritten chinese character recognition: Benchmarking on new databases. Pattern Recognition, 46(1):155–162, 2013.
  • [29] G. Louloudis, B. Gatos, N. Stamatopoulos, and A. Papandreou. Icdar 2013 competition on writer identification. In Proceedings of the International Conference on Document Analysis and Recognition, pages 1397–1401, 2013.
  • [30] B. P. Mallikarjunaswamy. Writer identification based on offline handwritten document images in kannada language using empirical mode decomposition method. International Journal of Computer Applications, 30(6):31–36, 2013.
  • [31] U. V. Marti and H. Bunke. The iam-database: an english sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
  • [32] A. J. Newell and L. D. Griffin. Writer identification using oriented basic image features and the delta encoding. Pattern Recognition, 47(6):2255–2265, 2014.
  • [33] A. Nicolaou, A. D. Bagdanov, M. Liwicki, and D. Karatzas. Sparse radial sampling lbp for writer identification. In Proceedings of the International Conference on Document Analysis and Recognition, pages 716–720, 2015.
  • [34] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv:1606.01583, 2016.
  • [35] N. Otsu. A threshold selection method from gray-level histograms. IEEE Trans. on Systems Man and Cybernetics, 9(1):62–66, 1979.
  • [36] G. Papandreou, L. C. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1742–1750, 2016.
  • [37] M. Paulin, J. Mairal, M. Douze, Z. Harchaoui, F. Perronnin, and C. Schmid. Convolutional patch representations for image retrieval: An unsupervised approach. Int. J. Comput. Vis., 121(1):1–20, 2016.
  • [38] R. Plamondona and G. Loretteb. Automatic signature verification and writer identification — the state of the art. Pattern Recognition, 22(2):107–131, 1989.
  • [39] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In Proceedings of the International Conference on Neural Information Processing Systems, 2015.
  • [40] L. Schomaker and M. Bulacu. Automatic writer identification using connected-component contours and edge-based features of uppercase western script. IEEE Trans. Pattern Anal. Mach.Intell., 26(6):787–98, 2004.
  • [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.
  • [42] H. Srinivasan and S. Srihari. A statistical approach to line segmentation in handwritten documents. In Proceedings of the SPIE Document Recognition and Retrieval XIV, pages 6500T–1–11, 2007.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • [44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  • [45] Y. Tang and X. Wu. Text-independent writer identification via cnn features and joint bayesian. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, pages 566–571, 2017.
  • [46] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In Proceedings of the European Conference on Computer Vision, pages 135–153, 2016.
  • [47] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the International Conference on Machine Learning, pages 1168–1175, 2012.
  • [48] X. Wu, Y. Tang, and W. Bu. Offline text-independent writer identification based on scale invariant feature transform. IEEE Trans on Information Forensics and Security, 9(3):526–536, 2014.
  • [49] L. Xing and Y. Qiao. Deepwriter: A multi-stream deep cnn for text-independent writer identification. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, pages 584–589, 2016.
  • [50] F. Yin and C. L. Liu. Handwritten chinese text line segmentation by clustering with distance metric learning. Pattern Recognition, 42(12):3146–3157, 2009.
  • [51] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pages 3774–3782, 2017.
  • [52] X. Zhu. Learning from labeled and unlabeled data with label propagation. In Proceedings of the International Joint Conference on Neural Networks, pages 2803–2808, 2002.