Handwritten signatures are one of the oldest and most widely used biometric authentication techniques in administrative and financial institutions due to their simplicity and uniqueness. As technology has progressed, authentication methods have also evolved. Handwritten signatures are now categorized as online or offline signatures. Online signatures have many more distinctive features than offline signatures; therefore, they are easier to verify. However, capturing online signatures is expensive, and digital systems prefer other authentication methods, such as passwords or personal authentication questions. Offline signatures, on the other hand, are easy to capture but hard to verify because they contain a limited number of features and are acquired under uncontrolled environmental conditions.
The offline signature verification task has long been a challenge for computer vision research, and many different approaches have been proposed to perform it more accurately. These approaches have been evaluated on publicly available datasets such as GPDS-960 [19], GPDS-4000 [5], MCYT [15], and CEDAR [9]. All of these datasets contain genuine signatures of the users together with random and skilled forgeries that try to imitate the genuine signature. The signatures are collected in either a single session or multiple sessions. People tend to produce very similar signatures when they sign one after another, as in single-session acquisition; however, signatures differ considerably when they are collected over time. In real-world applications, the signatures of a person can vary considerably, because people sign many documents in their daily lives and are unlikely to sign in exactly the same way every time. Therefore, datasets that acquire signatures over a short period do not capture the high intra-class variation of a person’s signature.
In the literature, both writer dependent and writer independent signature verification methods have been proposed. However, in a real-world signature verification setting, user enrollment is very frequent, which makes writer dependent methods infeasible. In writer independent methods, the subjects used for training and testing are different, so no person-specific features can be utilized. Writer independent methods try to learn efficient representations of the signatures to distinguish each person, but creating a universal discriminative representation of a signature is challenging, and no particular feature extraction method has been found to solve this problem.
In this paper, we focus on offline signature verification in the banking process as one of the real-world application scenarios. In banks, customers from the enterprise and commercial segments send their banking transaction orders mainly as petition-based documents. These documents are received by the central operation unit of the bank through its fax, scanner, and e-mail channels. The operators are responsible for checking whether the signature matches the one on the signature declaration document of the same customer. This task is illustrated in Figure 1. Because it requires a significant manual workforce, we aim to automate this process for the documents of companies that have exactly one employee authorized to sign the documents. Measurements show that such customers send around 90,000 pages of banking order documents per month in medium-size banks. Verifying the signatures on these documents manually requires approximately 233 person-hours. Hence, employing an automatic offline signature verification system provides significant resource efficiency to the central operation unit of banks.
In this work, we collect bank order and signature declaration documents of the customers (see Footnote 1). The locations of the signatures on these documents are annotated manually. This way, we create a real-world signature dataset. Signatures on order documents can be rubber-stamped or unstamped. Therefore, we need a stamp cleaning method to obtain cleaner signatures before the verification process. Inspired by image-to-image translation works in the literature, we utilize CycleGAN [24] for stamp cleaning. We generate two subsets from the created signature dataset, one for representation learning and the other for running verification tests. These two subsets contain signatures from different individuals. Thus, we train a deep feature extraction network on a completely different set of users than the ones in the test set to obtain a writer independent feature extractor.
Footnote 1: Please note that due to data confidentiality, we cannot publish samples from our real-world dataset. Therefore, to visualize our real-world signature verification problem, imitations of signatures, rubber-stamps, and document images are provided from our dataset.
Please note that we cannot make the bank's confidential customer signature dataset publicly available due to the General Data Protection Regulation (GDPR). Furthermore, we cannot use publicly available signature verification datasets, such as GPDS-960 [19], GPDS-4000 [5], MCYT [15], and CEDAR [9], because our problem differs from the one they present in terms of data collection and application purpose. Therefore, we also prepare another real-world signature verification setup using the publicly available Tobacco-800 dataset [10, 14] and conduct experiments on it. In order to promote signature verification research on real-world documents, we publish the generated training, validation, and verification splits that we use in this benchmark (https://github.com/Alpkant/Offline-Signature-Verification-on-Real-World-Documents).
Our main contributions can be summarised as follows:
We present a comprehensive study on offline signature verification on real-world documents. For this purpose, we create both a custom offline signature verification dataset and a real-world signature verification setup using the publicly available Tobacco-800 dataset.
We extensively analyze different verification setups, fine-tuning strategies, and signature representation approaches. Moreover, we conduct a human evaluation to show the challenging nature of the problem.
We formulate the stamp removal task as an unpaired image-to-image translation problem and propose a CycleGAN-based stamp removal method. With the proposed framework, we achieve a significant reduction in the equal error rate.
The remainder of the paper is organized as follows. In Section 2, we review the related work. The proposed method is explained in Section 3. Experimental setups and the corresponding results are presented and discussed in Section 4. Finally, Section 5 concludes the paper.
2 Related Work
Noise Cleaning. Signatures on complex documents often overlap with other parts of the documents, such as stamps, ruling lines, and printed and handwritten texts, which are generally called noise. Removal of these parts can be seen as a segmentation problem, since segmented parts can be removed to extract a clean signature. In [23], a fully convolutional stamp segmentation network was proposed to detect different kinds of stamps in documents. Stamps vary considerably between companies and countries; therefore, training the network on the specific dataset is essential. The network in [23] was trained with pixel-level stamp annotations; however, creating pixel-level stamp annotations for real-world documents is not feasible. For this reason, we utilize a noise cleaning method that does not require pixel-level annotations and is trained in an unpaired manner.
In [17], a CycleGAN [24] based deep network for scanning artifact removal is proposed to clean documents of a variety of noise types, e.g., watermarks, background noise, and blur. The network is trained on four different datasets for four different noise types; however, these datasets are synthetically created. Our proposed noise-cleaning network, in contrast, is trained on real-world documents. Moreover, we do not constrain our network to a limited number of noise types or degradations. For example, printed and handwritten texts and stamps are treated as noise by our network along with the other noise types.
Signature Verification. As in many other computer vision problems, handcrafted features have been widely used in signature verification. In [21], a support vector machine (SVM) classifier was built on top of combined local binary patterns (LBP) and histogram of oriented gradients (HOG) features. This approach achieved the highest score in the ICDAR SigWiComp challenge in both 2013 and 2015 [13, 12]. Instead of searching for good handcrafted features, deep convolutional neural networks have been utilized to learn feature representations from raw data [4, 22, 11]. In [7], the authors investigated the feature representations of deep learning models specifically for signatures. Analysis of the features showed that deep learning models can create good representations of signatures and are able to discriminate genuine signatures. In addition, a writer independent deep convolutional neural network was created in [6] to show that the learned feature space generalizes not only to unseen users in a dataset but also to users from other datasets. This is a good indicator of the applicability of deep convolutional neural networks to the real-world signature verification task. In [20], a multiple-stream verification network that uses original and inverse signatures was proposed. The authors claim that their network focuses more on the signature strokes when original and inverse signatures are used together with inverse streams and multi-path attention modules.
3 Proposed Method
Our proposed system consists of two main steps: stamp cleaning and representation learning. After stamp cleaning, signature representations are extracted. Then, the similarity between two signature representations is measured and compared to a global threshold to determine whether the signatures belong to the same person. In the following subsections, we explain these processes.
3.1 Stamp Cleaning
Signatures on real-world documents might be stamped, which degrades verification performance. In our dataset, the target signatures generally include a stamp. Thus, converting stamped signatures into unstamped ones is a critical step for signature verification, and a stamp cleaning method is necessary. The primary constraint on the stamp cleaning method is that it must be unsupervised, because it is difficult to collect a large number of stamped and unstamped pairs of signatures from the same users in real-world documents. This limitation motivates us to utilize CycleGAN [24] to perform unpaired image-to-image translation.
We collect a dataset by using the signatures extracted from the documents. The 1287 signatures extracted from signature declaration documents are clean, whereas the 3607 signatures extracted from the order documents contain stamps. Our aim is to learn the conversion between stamped signatures, X, and unstamped ones, Y. For this purpose, two mapping functions, G: X → Y and F: Y → X, are defined.
The adversarial loss for the mapping function G is given in Equation 1. The adversarial loss for the mapping function F is analogous.
As an improvement to adversarial loss, cycle consistency loss has been proposed in CycleGAN to compare generated images with input images using the cyclic process. In cycle consistency loss described in Equation 2, the L1 norm is employed to calculate the loss between generated inputs and original inputs.
The full objective of CycleGAN, which consists of adversarial losses in two ways and cycle consistency loss, is given in Equation 3.
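For reference, the three objectives referred to as Equations 1 to 3 can be written as follows. This is a sketch that assumes the standard CycleGAN formulation [24]: G: X → Y and F: Y → X are the two generators, D_X and D_Y are the corresponding discriminators, and λ weights the cycle consistency term. The symbol names and the weight λ are assumptions taken from that formulation rather than values stated in this text.

```latex
\begin{align}
\mathcal{L}_{GAN}(G, D_Y, X, Y) &= \mathbb{E}_{y \sim p_{data}(y)}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big] \tag{1} \\
\mathcal{L}_{cyc}(G, F) &= \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \tag{2} \\
\mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{GAN}(G, D_Y, X, Y)
  + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{cyc}(G, F) \tag{3}
\end{align}
```

In this reading, G removes the stamp from a stamped signature x, and the cycle term requires that F(G(x)) reproduces x, which is what allows training without paired stamped and unstamped samples.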
The sample inputs and outputs of our cleaning process can be seen in Figure 2. Our trained model is able to remove texts successfully on images in both datasets.
3.2 Representation Learning
Writer dependent signature verification models are not feasible for real-world signature verification scenarios where user enrollment is very frequent. Therefore, we should learn writer independent signature representations to verify signatures. For this purpose, we benefit from well-known, successful architectures, namely VGG-16 [18] and ResNet-50 [8], and their models pre-trained on ImageNet [3]. For each dataset, we fine-tune these models with the signatures of the users in the training set. In the verification test set, we have signatures of users that our networks have never seen before. For each network architecture, we fine-tune three models with different inputs: raw signature images, cleaned signature images, and inverse signature images. By changing the input image type, we explore the effect of the cleaned and inverse signature images.
Figure 3 illustrates the feature extraction and verification process. Two signatures are fed into the model, and their features are extracted. Cosine similarity is calculated between the extracted features. Finally, the obtained similarity score is thresholded to determine whether the signatures belong to the same person.
More specifically, the first fully-connected layer of VGG-16 and the second-to-last convolutional layer of ResNet-50 are chosen for feature extraction. Accordingly, we obtain a feature vector of size 4096 from VGG-16 and a feature vector of size 25088 from ResNet-50. Then, we employ cosine similarity to measure the similarity between the extracted feature vectors. After calculating the similarity for a pair, a label is assigned according to a specified threshold.
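The similarity and thresholding step can be sketched as follows. This is a minimal illustration with NumPy, not our implementation: the function names, the random stand-in feature vectors, and the threshold value 0.8 are assumptions chosen only for the example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D feature vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention for an all-zero vector (an assumption of this sketch).
        return 0.0
    return float(np.dot(a, b) / denom)


def same_signer(ref_feat: np.ndarray, tgt_feat: np.ndarray, threshold: float) -> bool:
    """Accept the pair when the similarity reaches the global threshold."""
    return cosine_similarity(ref_feat, tgt_feat) >= threshold


# Random stand-ins for two 4096-dimensional VGG-16 feature vectors.
rng = np.random.default_rng(0)
reference = rng.random(4096)
target = reference + 0.05 * rng.random(4096)  # a slightly perturbed copy
print(same_signer(reference, target, threshold=0.8))
```

Because cosine similarity ignores vector length, the same function applies unchanged to the 25088-dimensional ResNet-50 features.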
In this paper, we present the results in terms of global equal error rate (EER), based on a global threshold value, and ROC curves. Defining a threshold value for each user is not feasible for a real-world signature verification system, where new users enroll frequently with a few samples provided in a single session.
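A global EER can be estimated from a list of similarity scores and ground-truth labels as sketched below. The EER is the error rate at the threshold where the false acceptance rate and the false rejection rate are equal; since the two rates rarely coincide exactly on finite data, this sketch picks the candidate threshold where they are closest and reports their mean. The function name and this tie-handling choice are assumptions of the example.

```python
import numpy as np


def global_eer(scores, labels):
    """Return (eer, threshold) for similarity scores and 0/1 labels.

    A pair is accepted when score >= threshold. Label 1 marks a matched
    (same signer) pair and label 0 a mismatched pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    impostor = scores[labels == 0]
    if genuine.size == 0 or impostor.size == 0:
        raise ValueError("need at least one matched and one mismatched pair")

    best = None
    for t in np.unique(scores):  # every observed score is a candidate threshold
        far = np.mean(impostor >= t)  # false acceptance rate
        frr = np.mean(genuine < t)    # false rejection rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return float(best[1]), float(best[2])


eer, threshold = global_eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(eer, threshold)
```

A single threshold is shared by all users here, which matches the global setting; a per-user EER would instead repeat the search on each user's pairs.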
4 Experimental Results
In this section, we first present the datasets and the experimental setups. Then, we give the implementation details. Finally, the objective and subjective evaluation results are provided and discussed.
We collect signatures from two sources: order documents and signature declaration documents. A sample signature declaration document and an order document can be seen in Figure 1. Order documents contain the transaction orders of the customers and must be signed by them. Customers must also declare their signatures on signature declaration documents. According to the regulations, each person signs three times on a signature declaration document. Signatures extracted from the signature declaration documents are called the reference signatures of the customers, and they are unstamped. Signatures extracted from the transaction order documents, on the other hand, are called target signatures. These can be rubber-stamped or not, and are referred to as stamped and unstamped signatures, respectively.
Our dataset is categorized into two sub-datasets: (i) representation learning dataset, (ii) verification test dataset. The representation learning dataset is utilized for training a model to learn signature representations. The verification test dataset includes signature pairs (reference signatures and target signatures) to evaluate the signature verification performance. In these datasets, we selected the individuals from whom the bank has received a high number of orders. These two subsets contain different sets of customers, that is, a customer’s signatures are included in only one of these two subsets leading to a person independent setup.
Representation Learning Dataset: This dataset consists of 109 individuals’ signatures. After applying data augmentation, such as thickening, rotation, and random distortion, each individual has at least 80 signatures. In total, we have approximately 9K signatures. Finally, we split this dataset randomly into training, validation, and test sets with a proportion of 70%, 15%, and 15%, respectively.
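The random 70%/15%/15% split can be sketched in a few lines of plain Python. The function name, the fixed seed, and the use of integer ids as stand-ins for signature images are assumptions made for illustration.

```python
import random


def split_dataset(items, seed=0, train=0.70, val=0.15):
    """Shuffle the items and cut them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(round(train * len(items)))
    n_val = int(round(val * len(items)))
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # the remainder becomes the test set
    )


# 9000 integer ids standing in for roughly 9K signature images.
train_set, val_set, test_set = split_dataset(range(9000))
print(len(train_set), len(val_set), len(test_set))
```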
Verification Test Dataset: We have two sets of test pairs of signatures from 178 individuals: unstamped pairs and stamped pairs, each consisting of reference and target signatures. The unstamped set contains 2609 pairs, of which 1001 are matched and 1608 are mismatched. The stamped set contains 2630 pairs, of which 1022 are matched and 1608 are mismatched. Five different experimental setups are prepared to assess the effects of different cases, as listed in Table 1. Corresponding sample pairs for these setups can be seen in Figure 4. Please note that the signature images in this figure are resized for visualization purposes. In the first setup, we compare a reference signature with an unstamped target signature. In the second setup, we apply our stamp cleaning method to both the reference and the unstamped target signature. This evaluates the effect of the stamp cleaning process when neither the reference nor the target signature contains a stamp. This case can occur because, at the moment, we do not employ a stamp detection method and apply stamp cleaning to all signatures extracted from the order documents. In the third setup, a stamped target signature is compared with the reference signature, in order to observe the degree of performance loss when the target signature contains a stamp. In the fourth setup, only the stamped target signature is cleaned, in order to assess the effect of stamp removal on verification performance. Finally, in the fifth setup, both the reference and the stamped target signatures are cleaned, in order to observe the effect of slight artifacts from the cleaning process on verification performance. We could have defined another setup consisting of stamped reference signatures and stamped target signatures. In that case, we would have had to add a randomly generated stamp to the reference signature; however, the generated stamp cannot be identical to the stamp on the target signature. Since different stamps on the reference and target signatures decrease their similarity, this is not an appropriate setup for our problem.
The Tobacco-800 dataset [10, 14] is a publicly available subset of 42 million pages of documents that were scanned with various equipment. It contains real-world documents and, unlike most publicly available signature datasets, it contains noise and artifacts, such as stamps, handwritten texts, and ruling lines, on the signatures. Figure 5 shows example signatures of different users from the Tobacco-800 dataset. The resolution of the documents varies between 150 and 300 DPI. All signatures in this dataset are manually annotated. The users have also been identified manually by considering the signers’ names in the documents. There are some mislabeled or unidentified signatures. These mislabeled signatures and the signatures without user identities have been removed from the dataset. In the end, 746 signatures of 130 users remain. The number of signatures per user varies; for example, some users have just one signature. We use 60 randomly selected users to perform representation learning. After applying the same data augmentation strategies as for our dataset, we obtain approximately 4200 signatures in total for training.
To perform writer independent signature verification, we use the remaining 70 users for the test set. Of these users, 41 have only one signature; therefore, they are only used to generate negative pairs. The remaining 29 users have a minimum of two and a maximum of seven signatures. From the signatures of these users, we generate all possible positive pairs, which yields 166 positive pairs in the test set. We randomly create the same number of negative pairs by using all the test users. In total, we form 332 signature pairs.
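The pair construction described above can be sketched as follows. The function name, the toy user dictionary, and the seed are hypothetical; the sketch assumes that negative pairs are sampled without replacement from all cross-user pairs, which is one reasonable reading of "randomly create".

```python
import itertools
import random


def build_pairs(signatures_by_user, seed=0):
    """Build all positive pairs and an equal number of random negative pairs.

    signatures_by_user maps a user id to a list of signature ids.
    Users with a single signature contribute only to negative pairs.
    """
    positives = []
    for sigs in signatures_by_user.values():
        positives.extend(itertools.combinations(sigs, 2))

    tagged = [(user, sig) for user, sigs in signatures_by_user.items() for sig in sigs]
    cross_user = [
        (a[1], b[1])
        for a, b in itertools.combinations(tagged, 2)
        if a[0] != b[0]  # keep only pairs from two different users
    ]
    # random.sample raises ValueError if there are fewer cross-user pairs
    # than positive pairs.
    negatives = random.Random(seed).sample(cross_user, len(positives))
    return positives, negatives


users = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
positive_pairs, negative_pairs = build_pairs(users)
print(len(positive_pairs), len(negative_pairs))
```

Balancing the two classes this way keeps the accuracy at a given threshold from being dominated by the much larger pool of possible negative pairs.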
4.2 Implementation Details
We implement our models in the TensorFlow [1] and Keras [2] frameworks. We train our models on an NVIDIA GTX 1080Ti graphics card. We fine-tune the ResNet-50 and VGG-16 models with batch sizes of 32 and 64, respectively. We utilize the SGD optimizer with momentum. The initial learning rate is chosen between 0.0001 and 0.001. Early stopping is employed by monitoring the validation loss over a specified number of consecutive epochs.
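The early stopping rule can be sketched in plain Python, independent of any framework. The class name `EarlyStopping`, the `patience` parameter, and the example loss values are illustrative assumptions; the number of consecutive epochs used in the experiments is not specified above.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            # New best validation loss: reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses from successive epochs.
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.7]):
    if stopper.should_stop(loss):
        print("stopping after epoch", epoch)
        break
```

Deep learning frameworks typically ship an equivalent callback, so a hand-written class like this is mainly useful for making the stopping criterion explicit.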
4.3 Objective Evaluation
We run experiments using five different test setups, three different uses of fine-tuning data, and three different representations of signature images (original, cleaned, and inverse) as input.
Effects of cleaned input images. We investigate the effectiveness of the stamp cleaning process on signature verification. We train VGG-16 and ResNet-50 on raw input images and on cleaned input images, separately, and report the models trained on the cleaned input images as separate variants of VGG-16 and ResNet-50. We then test these models on the five test setups and compare the results. The experimental results in Table 2 indicate that the stamps lead to a significant degradation of the performance. For example, the EER obtained with the VGG-16 model is 0.18 when there are no stamps in the target signatures. The EER increases dramatically to 0.33 when the target signatures contain stamps. However, the cleaning process compensates for this performance loss to a large extent and brings the EER down to 0.23. This observation is consistent across all the experiments and is therefore independent of the network model, the fine-tuning data, and the input image representation. The VGG-16 model is found to be better than the others in almost all test setups on our dataset.
ROC curves for all the models are plotted in Figure 6. Each ROC plot includes the results of the five test setups to compare the effects of the cleaning process. As can also be observed from the ROC curves, the performance increases when stamped target signatures are cleaned. When unstamped target signatures are cleaned unnecessarily, the performance is not affected much. Due to the slight artifacts caused by the cleaning process, applying stamp removal to the clean reference signature as well either leads to a slight performance improvement or leaves the performance unchanged, depending on the experimental setup.
We then evaluate our best performing models on the cleaned test pairs of the Tobacco-800 dataset. That is, we train two VGG-16 models, one on the Tobacco-800 training set and one on the cleaned Tobacco-800 training set. As can be seen from Table 2, since the Tobacco-800 dataset also consists of real-world documents, the results are similar to the ones that we have obtained on our custom dataset, which confirms the difficulty of the problem.
Effects of inverse input images. For offline signature verification, signature images are digitized by scanners. Original images contain black or blue signatures on a white background when scanned. In the signature verification literature, we notice that most works use binarized signature images with white signatures on a black background instead of directly using binarized signature images with black signatures on a white background. Therefore, we train our models with both original and inverse images to see the effect of image representation on the performance. From Table 2, it can be observed that image representation does not affect the verification accuracy significantly.
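Producing the inverse representation of a grayscale scan amounts to flipping the intensity scale. The sketch below assumes 8-bit grayscale images stored as NumPy arrays; the function name and the tiny example image are illustrative, and any binarization step is left out.

```python
import numpy as np


def invert_image(img: np.ndarray) -> np.ndarray:
    """Invert an 8-bit grayscale image so that a white background becomes
    black and dark strokes become light."""
    if img.dtype != np.uint8:
        raise TypeError("expected an 8-bit (uint8) image")
    return 255 - img


# A 2x2 stand-in for a scanned signature: 255 is background, 0 is ink.
scan = np.array([[255, 0], [200, 55]], dtype=np.uint8)
inverse = invert_image(scan)
print(inverse)
```

Applying the function twice returns the original image, so the two representations carry exactly the same information.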
To investigate the effect of image representation further, we visualize the five most activated convolution filters of the last convolutional layer of the VGG-16 model. Figure 7 shows that both models, whether trained with original or inverse images, learn similar features from the signatures. The visualizations indicate that the five most activated convolutional filters concentrate on the same regions of the signatures.
4.4 Subjective Evaluation
To assess the difficulty of the problem, we also perform a subjective evaluation with 18 volunteers. We randomly select 360 pairs from our dataset. The subjective evaluation test set includes 180 reference-stamped pairs of signatures and 180 reference-unstamped pairs of signatures. These 360 pairs are divided equally into six subsets. Each participant is shown 60 pairs and is asked to decide whether the signatures in each pair belong to the same individual. This way, each pair is evaluated by three individuals. We report the human evaluation results both via majority voting and individually. For majority voting, we assign to each pair the prediction made by the majority of its three evaluators. For the individual results, on the other hand, we treat the 1080 individual judgments as separate pairs and evaluate the prediction of each individual separately.
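The two ways of scoring the human judgments can be sketched as follows. The function names and the three toy pairs are hypothetical; each inner list holds the votes of the evaluators of one pair, with `True` meaning "same individual".

```python
def majority_accuracy(votes_per_pair, labels):
    """Accuracy when each pair takes the label chosen by most evaluators."""
    predictions = [sum(votes) * 2 > len(votes) for votes in votes_per_pair]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def individual_accuracy(votes_per_pair, labels):
    """Accuracy when every single judgment counts as its own prediction."""
    total = sum(len(votes) for votes in votes_per_pair)
    correct = sum(
        vote == y for votes, y in zip(votes_per_pair, labels) for vote in votes
    )
    return correct / total


# Three toy pairs, each judged by three evaluators.
votes = [[True, True, False], [False, False, True], [True, False, False]]
truth = [True, False, True]
print(majority_accuracy(votes, truth), individual_accuracy(votes, truth))
```

With three evaluators per pair a tie cannot occur, so the majority label is always defined.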
In order to compare human and machine performance, we also run signature verification experiments with the proposed system on the 360 pairs selected for the subjective evaluation. The VGG-16 and ResNet-50 models fine-tuned on the cleaned signatures are chosen to extract features. The EER is calculated on these pairs, and the threshold corresponding to this EER is used to calculate the accuracy of the models.
Results of human evaluation, along with the accuracies obtained by the models, are given in Table 3. The results show the challenging nature of the task as even humans cannot predict all the pairs correctly. Model accuracies on this subset are lower than the ones obtained on the overall test set in Table 2, which indicates that the chosen subset includes harder pairs. Comparing human and model performances, it is clear that we still need further improvements in the system to match human performance.
5 Conclusion
In this paper, we have presented a comprehensive study on writer independent offline signature verification in a real-world scenario, where occluded signatures of a bank’s customers are verified against their clean reference signatures. We have proposed a CycleGAN-based stamp removal method to clean signatures before feeding them to a CNN model to extract the signature representation. We have compared different verification setups, fine-tuning strategies, and signature representation approaches and analyzed their effects. In order to show the difficulty of the problem, we have also conducted a human evaluation. We have shown the challenging nature of the problem and the effectiveness of our proposed stamp cleaning method in our experiments, both on our custom dataset and on the publicly available Tobacco-800 dataset.
Acknowledgements. We would like to thank our colleagues from Applied AI and R&D Department of Yapı Kredi Technology, especially Ali Yeşilkanat and Mehmet Yasin Akpınar for their support and valuable comments.
References
[1] TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org.
[2] (2015) Keras. Note: https://keras.io.
[3] ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition (CVPR).
[4] (2017) SigNet: convolutional siamese network for writer independent offline signature verification. arXiv preprint arXiv:1707.02131.
[5] (2017) A behavioral handwriting model for static and dynamic signature synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1041–1053.
[6] (2017) Learning features for offline handwritten signature verification using deep convolutional neural networks. Pattern Recognition 70, pp. 163–176.
[7] (2016) Analyzing features learned for offline signature verification using deep CNNs. In International Conference on Pattern Recognition (ICPR), pp. 2989–2994.
[8] (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[9] Offline signature verification and identification using distance statistics. International Journal of Pattern Recognition and Artificial Intelligence 18 (07), pp. 1339–1360.
[10] (2006) Building a test collection for complex document information processing. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666.
[11] (2019) DeepHSV: user-independent offline signature verification using two-channel CNN. In ICDAR.
[12] (2015) ICDAR 2015 competition on signature verification and writer identification for on- and off-line skilled forgeries (SigWiComp 2015). In ICDAR, pp. 1186–1190.
[13] (2013) ICDAR 2013 competitions on signature verification and writer identification for on- and offline skilled forgeries (SigWiComp 2013). In ICDAR.
[14] (2006) Tobacco-800 signatures and logos dataset. Note: http://lamp.cfar.umd.edu.
[15] (2003) MCYT baseline corpus: a bimodal biometric database. IEEE Proceedings-Vision, Image and Signal Processing 150 (6), pp. 395–401.
[16] (2000) Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1), pp. 63–84.
[17] (2018) Learning to clean: a GAN perspective. In ACCV, pp. 174–185.
[18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556.
[19] (2007) Off-line handwritten signature GPDS-960 corpus. ICDAR 2, pp. 764–768.
[20] (2019) Inverse discriminative networks for handwritten signature verification. In Computer Vision and Pattern Recognition (CVPR), pp. 5764–5772.
[21] (2011) Offline signature verification using classifier combination of HOG and LBP features. In International Joint Conference on Biometrics, pp. 1–7.
[22] (2018) Hybrid user-independent and user-dependent offline signature verification with a two-channel CNN. In CVPRW, pp. 639–6398.
[23] (2017) D-StaR: a generic method for stamp segmentation from document images. In ICDAR, Vol. 1, pp. 248–253.
[24] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232.