CoinNet: Deep Ancient Roman Republican Coin Classification via Feature Fusion and Attention

08/26/2019 ∙ by Hafeez Anwar, et al. ∙ FAU 7

We perform classification of ancient Roman Republican coins by recognizing their reverse motifs, where various objects, faces, scenes, animals, and buildings are minted along with legends. Most of these coins are eroded due to their age and varying degrees of preservation, which degrades the attributes that are informative for visual recognition. Changes in the positions of principal symbols on the reverse motifs also cause huge variations among the coin types. Lastly, in-plane orientations, uneven illumination, and moderate background clutter make the classification task non-trivial and challenging. To this end, we present a novel network model, CoinNet, that employs compact bilinear pooling, residual groups, and feature attention layers. Furthermore, we gathered the largest and most diverse image dataset of the Roman Republican coins, which contains more than 18,000 images belonging to 228 different reverse motifs. On this dataset, our model achieves a classification accuracy of more than 98% and outperforms the conventional bag-of-visual-words based approaches and more recent state-of-the-art deep learning methods. We also provide a detailed ablation study of our network and its generalization capability.




1 Introduction

Coins have been the dominant type of currency in human history, and for this reason, their recognition is of significant interest in both the academic and economic worlds. In archaeology, history, and art, ancient coins provide an enriched understanding of cultural and historical events. In commerce, they are valuable trading items due to their antiquity. In contrast to present-day coins, recognizing and understanding ancient coins requires in-depth and highly specialized domain expertise. This challenge is partly attributed to severe abrasions due to their age, yet the main complexity stems from their fine-grained class structure. For instance, the Roman Republican coins have over 1900 classes and subclasses defined in standard reference books crawford1974roman . With such a large number of categories, ancient coins face the additional challenge of “rarity”, where the worldwide count of specimens for some classes is considerably low. Consequently, there is clear interest in automatically extracting information about an unknown coin, and several works in the past anwar2013bag ; anwar2015ancient ; arandjelovic2010automatic ; arandjelovic2012reading ; kavelar2012word ; kim2014ancient ; kim2014improving ; zambanini2012coarse ; zambanini2013improving ; zambanini2014classifying have attempted to address this problem. Since this is a visual recognition task, recent state-of-the-art convolutional neural network (CNN) based models cooper2019understanding ; kim2016discovering ; schlag2017ancient have also been applied, despite their strong dependency on comprehensive and annotated image datasets.

In this paper, we strive to facilitate ancient coin recognition by introducing one of the largest and most diverse datasets presented so far. The categorization of our Roman Republican Coin Dataset, which we call RRCD, is based on the main object shown on the reverse side of each coin. The object represented on a coin can take many forms, such as a person, an instrument, an animal, or a building, to give a few examples. The object is the main element for coin classification, in addition to the coin legend and smaller auxiliary symbols. We call these visible marks motifs. Since the positions of these motifs vary greatly on the Roman Republican coins, the task of image-based coin classification is very challenging and non-trivial. Exemplar images are shown in Figure 1, where the variations in the anatomy of the coins are evident. On top of such inter-class differences, severe intra-class inconsistencies are commonly found in ancient coins due to manual minting, abrasions, missing parts, intentional deformations, usage, rust, and patina.

Figure 1: Dataset Challenge: Variations in the anatomy of the reverse motifs due to the positions of the symbol, main object, and legend.

Our dataset consists of 228 classes of objects minted on the reverse side, such as quadriga, griffin, elephant, and many more. We show that a domain-adapted neural network model trained with a comparably small dataset can retrieve the object class with high reliability. We believe that this work is a meaningful step towards a better semantic understanding of ancient coins, as the recognition of their essential elements is key to the ancient coin classification task. Training on coin images labeled with a coin ID from a reference classification scheme such as crawford1974roman is impractical for large-scale systems due to the vast number of classes and the burdensome effort needed to collect training samples, especially for rare coin classes. Therefore, a system that can recognize the essential elements, like the main object or legend, would provide a semantically meaningful output that can also easily be mapped to possible coin classes if it is linked to a respective ontology.

To reiterate, the classification problem we tackle has these inherent challenges:

  • Huge appearance variations induced by the anatomy of the reverse motifs, abrasions, and the number of coin classes,

  • Absence of a large-scale dataset established under strict numismatics guidelines, and

  • Lack of sufficient exemplary images for a greater proportion of coin classes to train or test the classification model.

Our contributions towards this fine-granular coin recognition task can be summarized as:

  • We develop a novel and domain-specialized neural network model called the “CoinNet”.

  • We introduce the largest and most diverse dataset of the Roman Republican coins, called RRCD, collected from three specialized numismatic resources under strict and coherent guidelines. Our dataset is publicly available.

  • We demonstrate that the generalization ability of our solution outperforms existing CNN models on completely disjoint test sets that accommodate coin classes having fewer exemplary images.

The remainder of this paper is structured as follows. Section 2 provides a literature overview. Section 3 introduces our novel large-scale image dataset of the Roman Republican coins. Section 4 explains the architecture of the CoinNet. Section 5 reports the results and ablation study.

2 Related Work

Earlier work on coin classification targeted modern-day coins, which is a comparably straightforward problem since the use of modern technology for coin manufacturing ensures a uniform visual appearance in terms of shape, depictions, and legend on the obverse and reverse sides. As a result, relatively simple image analysis schemes based on traditional approaches nolle2003dagobert ; reisert2006efficient ; huber2011identification achieved notable classification rates on modern coin datasets with as many as 2270 coin classes. Despite their success on modern coins, these approaches were shown to perform poorly on the task of ancient coin classification zaharieva2007image . As a remedy, attempts at ancient coin classification incorporated additional analysis of visual depictions, such as portrait recognition arandjelovic2010automatic , object recognition anwar2015ancient ; anwar2013bag , and legend recognition kavelar2012word .

Generally, strategies for ancient coin classification constitute two main groups. The first one uses local feature matching techniques. Local features capture image variations in a local neighborhood, and the set of such features calculated over an entire image provide a discriminating representation of that particular image. Local features allow calculation of the similarity of two images by measuring, for instance, Euclidean distance between the corresponding local features. The second group of methods for ancient coin classification uses supervised learning algorithms. The parameters of these algorithms are derived in an offline training process with the help of training image datasets. The learned model is then used to predict the class for a test image.
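As a minimal sketch of the first strategy, the snippet below scores the similarity of two images by matching each local descriptor of one image to its nearest neighbour (Euclidean distance) in the other. The toy descriptor values, the function names `euclidean` and `match_score`, and the averaging rule are illustrative assumptions, not the procedure of any of the cited methods.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_score(desc_a, desc_b):
    """Mean nearest-neighbour distance from the descriptors of image A to image B.

    Lower values mean the two images are more similar.
    """
    nearest = [min(euclidean(d, e) for e in desc_b) for d in desc_a]
    return sum(nearest) / len(nearest)

# Toy 2-D descriptors (hypothetical values; real descriptors such as SIFT are 128-D).
query = [[0.0, 0.0], [1.0, 1.0]]
exemplars = {
    "same_class": [[0.1, 0.0], [1.0, 0.9]],
    "other_class": [[5.0, 5.0], [6.0, 4.0]],
}

# Exemplar-based decision: the exemplar with the smallest matching cost wins.
scores = {name: match_score(query, desc) for name, desc in exemplars.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

No training step is involved, but every query is compared against every exemplar, which is why the cost grows with the number of classes.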

The success of supervised learning mainly comes from the availability of abundant data for the offline training phase. However, in the case of ancient coins, the prevalent problem is the absence of training data due to their rareness and diversity, thereby leading to low recognition rates zambanini2014classifying . In comparison, the feature matching based techniques neither involve an offline training process nor do they require a large number of exemplary images. Even with three or four samples per class, the feature matching substantially outperforms the supervised learning methods zambanini2014classifying . Nevertheless, the online feature matching, as well as the search process it involves, make the feature matching methods computationally intensive. Besides, the complexity increases proportionally with the number of classes in the dataset zambanini2014classifying .

The first method designed exclusively for ancient coins kampel2008recognizing uses a combination of local feature descriptors lowe1999object ; bay2008speeded ; mikolajczyk2005performance to perform an exemplar-based classification. Zambanini and Kampel zambanini2012coarse apply dense correspondence-based feature matching called SIFT flow liu2011sift . To improve the quality of local feature matching, Zambanini et al. zambanini2014classifying exploit the geometric consistency of the matched features. Similarly, a more customized descriptor for ancient coin classification called Local Image Descriptor Robust to Illumination Changes (LIDRIC) zambanini2013local is proposed to alleviate the sensitivity to illumination changes. To sum up, in the absence of training data, the feature matching-based methods achieve acceptable classification rates. However, they are not easily scalable to more extensive datasets. They also disregard the inherent domain-specific knowledge, which is extremely important from a numismatic perspective to make the classification task compliant with the standard reference books on this subject.

Leveraging numismatic knowledge, the second group of methods uses machine learning algorithms for ancient coin classification. For instance, recognition of legends on the obverse sides of Roman Imperial coins is used for their classification arandjelovic2012reading . The legend is assumed to be located along the coin border and is curvature-normalized by a log-polar transformation. Due to this assumption, the method performs poorly on Roman Republican coins, where the location of the legend is not fixed KavelarZK14 . Consequently, Kavelar et al. kavelar2012word perform legend recognition of Roman Republican coins using SIFT features with a Support Vector Machine (SVM). The legends carry rich information in terms of letters and numbers, thus making them an excellent cue for coin classification. However, due to their detailed nature, they suffer more wear and tear, which makes them less attractive and often impractical for coin classification schlag2017ancient . Another visual cue used for ancient coin classification is the obverse side portrait arandjelovic2010automatic ; kim2014improving . However, the semi-frontal portraits on the obverse side have less inter-class variation schlag2017ancient . Also, like legends, the portraits are more likely to lose their details with erosion. Anwar et al. anwar2013bag ; anwar2015ancient utilize reverse motif recognition for ancient coin classification, where a spatially enriched Bag-of-Visual-Words (BoVWs) csurka2004visual model represents the images.

Unlike legends and portraits, the reverse motif is a discriminative visual cue that is less affected by wear and tear. Besides, a given reverse motif can be shared by coins of multiple classes. Therefore, recognizing the reverse motif of a query coin image substantially reduces the search space for its class. This makes reverse motif-based coin classification a coarse-grained step whose output can be further refined by fine-grained classification methods zambanini2014classifying .

A comprehensive review of recent deep learning methods is out of the scope of this paper. Still, we would like to mention that convolutional neural networks have already been applied in the field of digital humanities. Kim and Pavlovic kim2016discovering have used CNNs to recognize the prominent visual cues on ancient coins and later utilized them for classification. Similarly, Schlag and Arandjelović schlag2017ancient have used CNNs for portrait recognition on the obverse side to classify ancient Roman Imperial coins. However, the details of large-scale data collection, such as the sources, methods, and guiding principles, are of extreme importance, especially when it comes to ancient objects such as coins. Such a procedure is not outlined in existing CNN-based methods, which makes them unreliable. We, on the other hand, explicitly elaborate on the process of collecting the largest dataset of the Roman Republican coins. Our sources of coin images are among the most reliable ones in this field. Lastly, our guiding source in the data collection is the standard reference book by Crawford crawford1974roman , which numismatists consider the foremost authority on the Roman Republican coins. The data collection is described in detail in the next section.

3 RRCD - Roman Republican Coin Dataset

The primary motivation of research on ancient coin recognition is to support the manual coin classification efforts by reducing the labor time involved in the process. To make the best use of the invaluable domain expertise, the recognition task should be steered by the standard reference books of numismatics. However, this critical aspect is often overlooked in most of the published work on coin classification except for a few anwar2015ancient . Similarly, the use of smaller image datasets for the evaluation of the proposed coin classification methodologies leads to unrealistic and ungeneralizable approaches. Even in the recently introduced and relatively larger datasets schlag2017ancient , the coin images are categorized into different grades without involving domain-specific knowledge, thus creating an ambiguity about the feasibility of the solutions.

Our focus is on the gold and silver coinage of the Roman Republican era (BC 280/225-27) since there is a comprehensive standard reference work by Crawford crawford1974roman , which is still accurate today, with only minor modifications hollstein1993stadtromische ; woytek . Crawford’s work assigns 550 distinct reference numbers, many comprising different denominations or typological variations. By consolidating all the variants, the actual number of possible combinations might exceed 2000.

Based on Crawford’s work, we collect the most diverse and extensive image dataset of the reverse sides. For most of the Roman Republican coin classes, the reverse side depicts more discriminative information than the obverse side CrawfordWebsite . Our dataset has 228 motif classes. Of these, 100 classes are used for training and testing and form the main dataset, RRCD-Main. The images of the additional 128 classes constitute the disjoint test set, RRCD-Disjoint, which we allocate to assess the generalization ability of our models. Training and testing can thus be carried out on completely disjoint sets of classes. The number of images per class in the RRCD-Main is shown in Figure 2, while a comparison with the existing available reverse side datasets in the literature is given in Table 1. To the best of our knowledge, RRCD is both the most diverse coin dataset proposed so far and the largest dataset of the Roman Republican coins.

Figure 2: Number of Classes: Per-class image counts in the dataset.
| Datasets | Images | Image Size | Visual Cues | Classes | Era |
|---|---|---|---|---|---|
| schlag2017ancient | 49,571 | - | O | 83 | RI |
| kim2016discovering | 4,500 | 350×350 | O,R | 96 | RI |
| anwar2015ancient | 2,224 | 480×480 | R | 29 | RR |
| kim2014improving | 2,815 | 256×256 | O | 15 | RI |
| zambanini2014classifying | 600 | 150×150 | R | 60 | RR |
| zambanini2013improving | 464 | 384×384 | O,R,L | 60 | RR |
| zambanini2012coarse | 180 | 150×150 | R | 60 | RR |
| kavelar2012word | 180 | 384×384 | L | 35 | RR |
| arandjelovic2010automatic | 2,326 | 250×250 | O | 65 | RI |
| kampel2008recognizing | 350 | - | O,R | 3 | RR |
| Ours: RRCD | 18,285 | 448×448 | R | 228 | RR |

Table 1: Datasets comparison: Image datasets of the ancient Roman coins. Roman Imperial is denoted by RI and Roman Republican by RR. The datasets are characterized by the visual cues they cover: obverse side (O), reverse side (R), or legends (L).

During the image search, we use the reference number given to each coin class by Crawford. The retrieved coin images and the textual descriptions of their obverse and reverse sides are then cross-matched with the standard ones. This allows for a coherent and unambiguous collection process of image data driven by domain-specific knowledge. We do not perform an explicit categorization of the collected coin images based on their grades. However, images of deteriorated coins whose reverse motif is difficult for domain experts to distinguish are discarded.

3.1 Composition of RRCD-Main

The images in the main dataset RRCD-Main are collected from three different reliable sources. The images from the Vienna Museum of Fine Arts and the British Museum London are captured in a controlled environment, due to which both the image resolution and imaging conditions are of high quality. Furthermore, the coin specimens are in fair condition and do not exhibit extreme visual deterioration. In contrast, the third source is an online ancient coin auction website where both the image quality and the condition of the coin specimens vary. Nonetheless, images from all the sources face extra variations due to irregular illumination caused by the nonrigid nature of the coins. Following is a brief description of the image data from each source.

The Vienna Museum of Fine Arts: The stock of material from the Roman Republic in the Coin-Cabinet in Vienna is among the biggest in the world. It comprises about 3900 coins. The ILAC project kavelar2013ilac collected the image dataset of these coins with a uniform background. However, there exist orientation differences between the coin images as they are not photographed under their canonical orientations based on their central reverse motifs. We acquired 1416 images from this source.

The British Museum: An extensive collection of ancient coins is owned by the department of medals and coins at the British Museum. In our dataset, we use 2376 images of the Roman Republican coins of the British Museum.

The ACSearch: This is an online auction website of ancient coins including those of the Roman Republican era. For any coin at auction, the images of both the reverse and obverse sides are provided with their respective descriptions. The information contains the type of the coin given by Crawford, the issuer, the date of issuance, and explanations of the objects, scenes, or persons depicted on each side. A snapshot of the website is shown in Figure 3, where various parts of the page showing different information are highlighted.

Figure 3: Search Process: A snapshot of the acsearch image search process

A coin can be searched via the search bar of the website using the keywords from the description, such as the type number or the object displayed on the reverse side. For a search coherent with the standard reference book, we used the type numbers given by Crawford e.g. “Cra. 422/1a”. This results in a list of coin image retrievals along with descriptions. For a uniform collection, we match the images with their descriptions and download only those images that are in complete agreement with their records. We also cross-check the retrieved information with the descriptions given in the reference book. Exemplar reverse side images of the RRCD-Main classes are shown in Figure 4.

Figure 4: Representative images: Sample images of the 100 classes that constitute the RRCD-Main.

3.2 Composition of RRCD-Disjoint

In many cases, the main object on the reverse sides of the Roman Republican coins is shared by multiple coin classes. However, the object style and the additional information on the reverse motifs such as the symbols and legends make the coin classes different from one another. To include images of all those styles in the RRCD-Main is impractical, mainly due to the lack of their images or the rarity of the specimen themselves zambanini2012coarse . Due to such constraints, an image-based coin classification solution should be robust to variations in object styles. If trained on one set of object styles, the framework should be generalizable enough to recognize other styles.

To investigate the performance of our proposed CoinNet and assess its generalization, we select the predominant objects found on the reverse motifs of the Roman Republican coins, namely, “biga” (two-horse chariot), “quadriga” (four-horse chariot), and “curule chair”. Out of the 100 coin classes of RRCD-Main that we collected from three different sources, 12 coin classes have biga as the main object, four coin classes show quadriga, and only one coin class shows the curule chair. The depictions of the main objects differ from each other in their styles, additional symbols, and legends.

Figure 5: Disjoint image set: The first row in each partition (separated by double lines) shows images of the coin classes included in the RRCD-Main while the second row shows exemplar images of some of the coin classes included in the RRCD-Disjoint test set. Each partition depicts a separate main object; biga, quadriga, and curule chair, respectively. Since the same main object is minted in different styles with different additional symbols and legends, we treat each column as a separate class.

We collect another 700 images of 81 coin classes where biga is minted in styles different from those of the RRCD-Main. Similar sets of 111 images for 12 curule chair classes and 344 images of 35 quadriga classes are collected too (128 classes in total). We call the combination of all these image sets the “disjoint test set”, RRCD-Disjoint, because they are only used to test the coin classification framework that is trained on the main dataset RRCD-Main. The exemplar images of the coin classes from RRCD-Main along with the representative images of some of the classes of RRCD-Disjoint are shown in Figure 5. The differences in styles between the coin classes in RRCD-Main and those in RRCD-Disjoint can clearly be noted. For instance, in the case of biga and quadriga, the following are the main differences:

  1. Bigas are drawn by different animals, such as horses, stags, lions, snakes, goats, seahorses, and centaurs.

  2. The animals also differ in their gait, i.e. they are either walking, running, or galloping.

  3. The chariots are moving either towards the right or towards the left.

  4. The persons driving the chariots are depicted differently. For instance, they differ in the objects they hold in their hands.

  5. There are additional symbols associated with the chariots.

Similar differences exist for quadriga and curule chair, where the object is either minted in a different style or has different symbols and legends.
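The class and image counts quoted in this section can be tallied with a few lines of Python. This is only a bookkeeping sketch; the dictionary layout and variable names are illustrative assumptions, while the numbers are those stated in the text.

```python
# (images, classes) per main object in RRCD-Disjoint, as reported in the text.
disjoint = {
    "biga": (700, 81),
    "curule chair": (111, 12),
    "quadriga": (344, 35),
}

disjoint_images = sum(images for images, _ in disjoint.values())
disjoint_classes = sum(classes for _, classes in disjoint.values())

main_classes = 100  # classes in RRCD-Main
total_classes = main_classes + disjoint_classes

print(disjoint_images, disjoint_classes, total_classes)
```

The totals agree with the 128 disjoint classes and the 228 motif classes reported for the full dataset.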

4 CoinNet: Proposed Coin Recognition Network

In the coin recognition task, we aim to predict the most likely class $\hat{c}$ for any given image $I$, which can be expressed as:

$$\hat{c} = \arg\max_{c \in C} \; p(c \mid I; \theta) \qquad (1)$$

where $\theta$ are the network parameters and $C$ is the set of classes. Note that Eq. 1 takes the image $I$ to predict the class label, whereas we extract image embeddings (feature maps) $x$ and $y$ from the input image using off-the-shelf convolutional neural networks. Therefore, Eq. 1 can be rewritten as

$$\hat{c} = \arg\max_{c \in C} \; p(c \mid x, y; \theta) \qquad (2)$$

Our purpose is to exploit a joint representation by employing an appropriate pooling operator which can encode the relationship of Eq. 2 between the feature maps $x$ and $y$.

4.1 Compact Bilinear Pooling

Bilinear models were introduced by tenenbaum2000separating and have yielded remarkable performance improvements in many computer vision and image processing tasks. However, full bilinear representations are impractical for several reasons: 1) the number of parameters becomes very high, 2) the features stored in memory for retrieval or deployment require terabytes of storage, 3) the processing for matching and domain adaptation requires feature concatenation, which stresses memory and storage, and 4) scenarios such as few-shot Sun_2019_CVPR and zero-shot learning Xie_2019_CVPR become challenging. Here, we first provide the formulation of the bilinear representation and then introduce its compact form.

In our case, the bilinear model is obtained by taking the outer product of the two feature vectors $x \in \mathbb{R}^{n_1}$ and $y \in \mathbb{R}^{n_2}$ as

$$z = W\,[x \otimes y] \qquad (3)$$

where $\otimes$ denotes the outer product, $[\cdot]$ converts the matrix into a vector form, i.e. vectorizes the product, and $W$ is the learned linear model. The bilinear model is effective as it computes each interaction between the encoded vectors; however, it is computationally expensive, as mentioned earlier. Consider an example where $n_1 = 2048$ and $n_2 = 2048$ with $|C| = 100$ output classes (i.e. $W \in \mathbb{R}^{2048 \times 2048 \times 100}$): this results in a high-dimensional representation with a model composed of more than 400 million parameters ($2048 \times 2048 \times 100 \approx 4.2 \times 10^{8}$).
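The size of the full bilinear model can be worked out directly. The sketch below vectorizes a tiny outer product and counts the weights of a linear classifier on top of it for the dimensions of the example; the function names are illustrative assumptions.

```python
def outer_product_vectorized(x, y):
    """Flatten the outer product x y^T row by row into one long vector."""
    return [xi * yj for xi in x for yj in y]

def bilinear_param_count(n1, n2, num_classes):
    """Weights of a linear classifier applied to the vectorized outer product."""
    return n1 * n2 * num_classes

# Tiny illustration: every element of x interacts with every element of y.
z = outer_product_vectorized([1.0, 2.0], [3.0, 4.0, 5.0])
print(z)  # 2 * 3 = 6 interaction terms

# Two 2048-D feature vectors and 100 output classes.
params = bilinear_param_count(2048, 2048, 100)
print(f"{params:,} weights")
```

The vectorized outer product alone already has 2048 × 2048 ≈ 4.2 million entries per image, which is what motivates projecting it to a lower dimension.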

To avoid the outer product in bilinear models and project the representation onto a lower-dimensional space, we employ the Compact Bilinear Pooling (CBP) of gao2016compact and fukui2016multimodal , both of which utilize the Count Sketch projection function $\Psi$ charikar2002finding , which projects a vector $v \in \mathbb{R}^{n}$ to $w \in \mathbb{R}^{d}$. Count Sketch randomly draws two vectors $s \in \{-1, 1\}^{n}$ and $h \in \{1, \dots, d\}^{n}$ from a uniform distribution, and these drawn vectors remain constant for future invocations. The mapping function $h$ maps each index $i$ of $v$ to an index $j$ of $w$, which is initialized as zero. For every element $v[i]$, its destination index $j = h[i]$ is looked up using $h$, and then $s[i] \cdot v[i]$ is added to $w[j]$. This technique helps to reduce the number of parameters in the model due to the projection of the outer product (bilinear representation) to a low-dimensional space.
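The projection described above can be sketched in a few lines of pure Python. The function names, the seed handling, and the 0-based bucket indices are assumptions made for illustration; this is not the authors' implementation.

```python
import random

def make_count_sketch(n, d, seed=0):
    """Draw the two fixed random vectors once: a bucket h[i] and a sign s[i] per input index."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]     # 0-based destination buckets
    s = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
    return h, s

def count_sketch(v, h, s, d):
    """Project v (length n) to w (length d): add s[i] * v[i] to bucket h[i]."""
    w = [0.0] * d                                # output starts at zero
    for i, value in enumerate(v):
        w[h[i]] += s[i] * value
    return w

# Hand-picked h and s so the result can be followed by eye.
w_small = count_sketch([2.0, 3.0, 5.0], h=[0, 1, 0], s=[1, -1, -1], d=2)
print(w_small)  # bucket 0 gets 2 - 5, bucket 1 gets -3

# Random projection of an 8-D vector to 4-D; h and s are reused for every input.
h, s = make_count_sketch(n=8, d=4, seed=0)
w = count_sketch([float(i) for i in range(1, 9)], h, s, d=4)
print(w)
```

Because `h` and `s` are drawn once and then reused, the projection is a fixed linear map and adds no trainable parameters.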

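According to pham2013fast , the explicit computation of the outer product can be circumvented by taking the convolution of the count sketches as

$$\Psi(x \otimes y, h, s) = \Psi(x, h, s) * \Psi(y, h, s) \qquad (4)$$

where $\Psi(\cdot, h, s)$ denotes the count sketch of a vector, $x$ and $y$ are the two feature vectors, and $*$ is the convolution operator. Furthermore, according to the convolution theorem, element-wise multiplication in one (frequency) domain is equal to convolution in the other (spatial) domain. Therefore, Eq. 4 can be rewritten as

$$\Psi(x \otimes y, h, s) = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(x, h, s)) \odot \mathrm{FFT}(\Psi(y, h, s))\big) \qquad (5)$$

where $\odot$ is the element-wise multiplication operator, and $\mathrm{FFT}$ is the Fourier transform function. In the next section, we describe the convolutional neural network segment of our algorithm.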
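The equivalence between sketching the outer product and convolving the two sketches can be checked numerically. The sketch below assumes that each vector has its own hash and sign vectors, that the outer product is sketched with the combined bucket `(hx[i] + hy[j]) mod d` and sign `sx[i] * sy[j]`, and that the convolution is circular; a naive DFT stands in for an FFT routine so that the example needs only the standard library. All names are illustrative.

```python
import cmath
import random

def count_sketch(v, h, s, d):
    """Add s[i] * v[i] to bucket h[i] of a zero-initialized length-d vector."""
    w = [0.0] * d
    for i, value in enumerate(v):
        w[h[i]] += s[i] * value
    return w

def sketch_of_outer_product(x, y, hx, sx, hy, sy, d):
    """Count sketch of the explicit outer product (O(n^2) work)."""
    w = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            w[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * xi * yj
    return w

def dft(a, inverse=False):
    """Naive O(d^2) discrete Fourier transform; an FFT computes the same values faster."""
    d = len(a)
    sign = 1 if inverse else -1
    out = [sum(a[k] * cmath.exp(sign * 2j * cmath.pi * k * m / d) for k in range(d))
           for m in range(d)]
    return [value / d for value in out] if inverse else out

def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Frequency-domain route: multiply the spectra of the two sketches element-wise."""
    fx = dft(count_sketch(x, hx, sx, d))
    fy = dft(count_sketch(y, hy, sy, d))
    product = [a * b for a, b in zip(fx, fy)]
    return [value.real for value in dft(product, inverse=True)]

rng = random.Random(0)
n, d = 6, 8
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(n)]
hx = [rng.randrange(d) for _ in range(n)]
hy = [rng.randrange(d) for _ in range(n)]
sx = [rng.choice((-1, 1)) for _ in range(n)]
sy = [rng.choice((-1, 1)) for _ in range(n)]

direct = sketch_of_outer_product(x, y, hx, sx, hy, sy, d)
fast = compact_bilinear(x, y, hx, sx, hy, sy, d)
max_error = max(abs(a - b) for a, b in zip(direct, fast))
print(max_error)
```

The two routes agree up to floating-point error, while the frequency-domain route never forms the n × n outer product.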

4.2 Proposed Architecture

Our model encodes the input image to extract feature maps and then merges them via the compact bilinear pooling algorithm. The problem is treated as a multi-class classification task with 100 possible outcomes. As a first step, the images are resized to 448×448 and encoded using two popular CNNs, i.e. DenseNet161 huang2017densely and ResNet50 he2016deep , having 161 and 50 convolutional layers, respectively, and trained on the ImageNet dataset deng2009imagenet . The features are collected from each network before the final fully-connected layer without applying the average pooling, resulting in a 14×14×2048 feature map. Suppose the feature maps produced by DenseNet161 huang2017densely and ResNet50 he2016deep are denoted by $x$ and $y$, respectively.

We fuse these feature maps $x$ and $y$ using CBP to obtain a better representation. Then we apply a group of residual blocks to the fused features to learn the joint representation. We also perform normalization on the 2048-D vector obtained from the residual group.

Attention: Recently, attention has been investigated in many computer vision applications, e.g. image captioning xu2015show , super-resolution zhang2018image , and visual question answering yang2016stacked . In our model, we also incorporate soft attention to integrate spatial information. As presented in Figure 6, we employ one convolutional layer to emphasize the salient features. Moreover, we apply softmax to predict the attention weights for each grid location and thus generate normalized soft attention maps. To get the visual representation, the attention map is summed with the spatial feature vectors of the two feature maps. As a final step, a fully-connected layer is employed to obtain a number of outputs equal to the number of coin categories.
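One common reading of soft attention, used for instance in multimodal compact bilinear pooling, is that softmax-normalized weights over the grid locations are used to pool the spatial feature vectors. The sketch below follows that reading; the pooling rule, the hand-set scores standing in for the output of the convolutional layer, and all names are assumptions, since the exact way the attention map is combined with the features is not specified here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(value - peak) for value in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """features: one vector per grid location; scores: one raw score per location.

    Returns the normalized attention map and the attention-weighted feature vector.
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return weights, pooled

# A 2x2 grid flattened to 4 locations with 3 channels each (hypothetical values).
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
scores = [0.0, 0.0, 0.0, 5.0]  # stands in for the one-channel convolution output
weights, pooled = attend(features, scores)
print(weights, pooled)
```

The location with the highest score dominates the pooled vector, which is the sense in which attention emphasizes salient regions.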

Figure 6: CoinNet: Our model highlighting the Compact Bilinear Pooling, residual blocks, skip connections, and feature attention. The green and yellow cubes indicate the embedded features via CNN networks.

Network loss:

The output features of the fully-connected layer are passed through the softmax function to normalize the feature values. As the loss function, we compute the difference between the predicted probabilities and the actual distribution of the class through cross-entropy as

$$\mathcal{L} = -\sum_{c \in C} p(c) \log q(c) \qquad (6)$$

Here, $p$ and $q$ stand for the true and the estimated probabilities, respectively. Furthermore, the loss only captures the error on the target class, where the value of $p$ is non-zero, because $p$ uses one-hot encoded vectors.
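A minimal sketch of the softmax followed by cross-entropy, written in plain Python with hypothetical logits for a three-class toy problem, illustrates why only the target class contributes to the loss when the true distribution is one-hot.

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_true, q_pred):
    """-sum_c p(c) log q(c); terms with p(c) = 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

logits = [2.0, 0.5, -1.0]   # hypothetical fully-connected outputs for 3 classes
q = softmax(logits)
p = [1.0, 0.0, 0.0]         # one-hot target: the first class is correct
loss = cross_entropy(p, q)
print(q, loss)
```

With a one-hot target the sum collapses to the negative log-probability assigned to the correct class.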

| Algorithms | BoVWs | RT | VGG | NASNet | Ours |
|---|---|---|---|---|---|
| Acc. | 70.81% | 84.4% | 97.4% | 97.8% | 98.5% |

Table 2: Quantitative comparison: Comparison of our method with state-of-the-art methods on a train-test split of 30%-70%. All results are reported as top-1 mean accuracy on the test set.

5 Experiments

In this section of the paper, we present a quantitative and qualitative performance evaluation of our method against state-of-the-art traditional and convolutional neural network algorithms. Firstly, we show the focus of our network on different objects for classification. Then, we investigate the influence of various feature inputs on performance accuracy. Subsequently, we report on the generalization capability of our method.

5.1 Experimental Setup

In this section, we provide implementation details of our model. We set the filter size of all the convolutional layers as 3×3. We use four residual blocks as a single residual group. The initial learning rate is fixed at 10

which is reduced after 50 epochs by a factor of 10

. To train the model, we use SGD bottou2010large with a weight decay of 10

. We use 30% of the data for training, utilizing data augmentation, which includes random rotations and flipping. The model is implemented using PyTorch on a Tesla P100 GPU with 16GB memory.
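A 30%-70% train-test split can be sketched as follows. Whether the split was drawn per class is not stated in the text, so the stratified sampling here, the function name `split_per_class`, and the seed are assumptions for illustration only.

```python
import random
from collections import defaultdict

def split_per_class(labels, train_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) with train_fraction of each class in train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = max(1, round(train_fraction * len(indices)))  # at least one sample
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)

# Toy label list: 10 images of one class and 20 of another.
labels = ["biga"] * 10 + ["quadriga"] * 20
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

Sampling per class keeps rare classes represented in the training set, which matters when some classes have only a handful of images.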

Panels, first row: Youth and soldiers; Father and son; Charging bull. Second row: Italia and Roma; Wild boar and dog; Soldiers and women.
Figure 7: Visual comparison: The correctly classified images are marked with green circles, while the wrongly classified ones are marked with red circles. In the first row, the confidence of NasNet zoph2018learning is always low although the model classifies correctly. The second row shows that the confidence of VGG simonyan2014vgg is consistently high even for wrongly classified images.

5.2 Quantitative Evaluation

Until now, Anwar et al. anwar2015ancient used the largest and most diverse dataset of the reverse side images of the Roman Republican coins. Their algorithm uses a linear SVM on spatial extensions of the standard bag-of-visual-words (BoVWs) representation for image classification. To this end, we compare our results with the simple BoVWs representation anwar2015ancient and its variant with a rectangular tiling (RT) anwar2013bag of 2×2 for empirically selected vocabulary sizes, as shown in Table 2.
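The difference between a plain BoVWs histogram and its rectangular-tiling variant can be illustrated with a short sketch. It assumes that local descriptors have already been quantized into visual-word indices; the keypoint values, vocabulary size, and function names are hypothetical.

```python
def bovw_histogram(words, vocab_size):
    """Plain bag of visual words: count each word, ignoring where it occurs."""
    hist = [0] * vocab_size
    for word in words:
        hist[word] += 1
    return hist

def tiled_bovw(keypoints, vocab_size, width, height, grid=2):
    """Rectangular tiling: one histogram per grid cell, concatenated row by row.

    keypoints: (x, y, visual_word) triples in pixel coordinates.
    """
    tiles = [[0] * vocab_size for _ in range(grid * grid)]
    for x, y, word in keypoints:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        tiles[row * grid + col][word] += 1
    return [count for tile in tiles for count in tile]

# Five quantized keypoints on a 100x100 image with a 3-word vocabulary.
keypoints = [(10, 10, 0), (90, 10, 1), (10, 90, 2), (90, 90, 0), (95, 95, 0)]
plain = bovw_histogram([word for _, _, word in keypoints], vocab_size=3)
tiled = tiled_bovw(keypoints, vocab_size=3, width=100, height=100, grid=2)
print(plain)
print(tiled)
```

The tiled vector is four times longer than the plain histogram and records in which quadrant each visual word occurred, which is the spatial information the plain representation discards.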

Our method performs better than the classical algorithms, with an improvement of 27.7 and 14.1 percentage points over BoVWs anwar2015ancient and RT anwar2013bag , respectively (Table 2). Furthermore, to compare with the current state-of-the-art convolutional neural networks, i.e. VGG simonyan2014vgg and NasNet zoph2018learning , we fine-tune these networks from ImageNet deng2009imagenet weights using our coin training set. The improvement over VGG simonyan2014vgg and NasNet zoph2018learning is 1.1 and 0.7 percentage points, respectively. The margin over the CNNs is smaller than over the traditional classifiers, as the CNN methods may be benefiting from the ImageNet deng2009imagenet weights.
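The margins can be recomputed directly from the accuracies listed in Table 2. The dictionary below simply restates those table values; the variable names are illustrative.

```python
# Top-1 accuracies (%) as listed in Table 2.
accuracy = {"BoVWs": 70.81, "RT": 84.4, "VGG": 97.4, "NASNet": 97.8, "Ours": 98.5}

# Gain of CoinNet over each baseline, in percentage points.
gains = {
    name: round(accuracy["Ours"] - value, 2)
    for name, value in accuracy.items()
    if name != "Ours"
}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gains are absolute differences in accuracy (percentage points), not relative improvements.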

5.3 Qualitative Comparison

In Figure 7, we show correct and incorrect classification results on randomly selected images from the original test dataset. The results are furnished only for the CNN algorithms, i.e. VGG simonyan2014vgg , NasNet zoph2018learning and our CoinNet.

In the top row of Figure 7, both our method and NasNet zoph2018learning classify the input images correctly; these are marked with different shades of green circles, with a confidence score at the top of each image. It can be observed that the confidence of NasNet zoph2018learning is always lower than that of our CoinNet, even when its prediction is correct. Likewise, the bottom row of Figure 7 presents coin types misclassified by our method and VGG simonyan2014vgg , marked with red circles and again with the confidence score at the top of each image. In this case, VGG simonyan2014vgg is always more confident, i.e. has a higher score, than our network. In summary, our model is more confident about its correct predictions and less confident about its incorrect ones.
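The paper does not state how the confidence scores in Figure 7 are computed. A common choice, assumed here purely for illustration, is the largest softmax probability over the class logits; the helper names `softmax` and `predict_with_confidence` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits):
    """Return (predicted class index, probability assigned to that class)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# Made-up logits for a three-class toy problem.
label, confidence = predict_with_confidence([2.0, 0.5, -1.0])
```

Under this reading, a well-behaved model would yield a high `confidence` for correct predictions and a lower one for mistakes, which is the behaviour the comparison above attributes to CoinNet.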

5.4 Network Visualization For Attention

To visualize the importance of the feature attention, we employ a recently introduced method called Grad-CAM selvaraju2017grad . By computing the gradients with respect to an individual class, Grad-CAM selvaraju2017grad gives an insight into the essential regions on which the network focuses. In Figure 8, we provide a visual comparison of VGG simonyan2014vgg , NasNet zoph2018learning and CoinNet.
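As a rough illustration of the Grad-CAM computation, the sketch below weights each feature map by the spatial average of its gradient, sums the weighted maps, and applies a ReLU. It assumes the activations of the last convolutional layer and the gradients of the target class score with respect to them are already available as nested lists; the function name `grad_cam` and the toy inputs are hypothetical, and this is not the authors' code.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations: K feature maps, each an H x W nested list.
    gradients:   same shape; gradients[k][i][j] is the derivative of the
                 target class score with respect to activations[k][i][j].
    """
    num_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the feature maps, followed by a ReLU.
    return [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy example with two 2x2 feature maps.
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
heatmap = grad_cam(acts, grads)
```

The ReLU keeps only the locations that push the target class score up, which is why the resulting masks highlight the regions the network relies on.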

In the first image in Figure 8, we can observe that the Grad-CAM selvaraju2017grad masks of our CoinNet cover the “dolphin” object regions better than those of the other methods: VGG simonyan2014vgg focuses only on a subpart of the object, while NasNet zoph2018learning attends to non-essential regions. Similarly, in the “Biga” image, our method focuses on the number of horse legs, while the other CNNs attend to either the human head or the horse abdomen, which can be found in other coin images as well, hence resulting in lower accuracy. As the last example, we present attention on the “Minerva” coin image. As usual, VGG simonyan2014vgg focuses on the middle part of the coin while NasNet zoph2018learning attends to text regions; however, our CoinNet model learns from more holistic feature regions, as shown in the last row of Figure 8. These examples show that our CoinNet exploits and learns the essential information in the target objects and aggregates features for classification from these regions, which helps in increasing accuracy.




Input VGG NASNet CoinNet
Figure 8: Visualization results from Grad-CAM selvaraju2017grad : The visualization is computed for the last convolutional outputs, and the ground-truth labels are shown to the left of the input images.

5.5 Influence of Feature Maps

We test the robustness of our network to the input image embeddings required for classification of the coins. For this purpose, we use combinations of VGG simonyan2014vgg , ResNet he2016deep and DenseNet huang2017densely . Table 3 shows that the classification rate differs only marginally when other input embeddings are employed. The leading cause for this phenomenon is that the network does not rely mainly on the input embeddings, as MCB, residual blocks, and attention play the primary role in learning the subtle variations among the coins.

Nets r50-r50 d161-d161 r50-d161 vgg19-r50 vgg19-d161
Acc. 98.4% 98.5% 98.5% 98.4% 98.5%
Table 3: Input features effect: Comparison of different input feature combinations for our CoinNet. Our network is robust to changes in the input features, such as those generated via ResNet50 (r50), DenseNet161 (d161) and VGG19 (vgg19).

Vocabulary size BoVWs RT
1k 65.80% 84.44%
5k 70.81% 83.80%
10k 69.45% 81.53%
15k 69.15% 79.81%
Table 4: Influence of vocabulary: The effect of the vocabulary size on the classification performance for BoVWs and rectangular tiling.

We also perform ablation studies to get the best vocabulary size of the BoVWs representation where the vocabulary sizes are empirically selected, as shown in Table 4. An overfitting effect can be observed with an increase in vocabulary size. This effect is more noticeable in rectangular tiling, where the feature vector size that represents the image is four times the vocabulary size.
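The factor of four follows from concatenating one histogram per tile. The sketch below is a minimal illustration, assuming a 2×2 grid and keypoints that are already quantized to visual-word indices; the function names and the `(x, y, word_id)` input format are hypothetical and not the authors' implementation.

```python
def bovw_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs."""
    hist = [0] * vocab_size
    for word in word_ids:
        hist[word] += 1
    return hist


def rectangular_tiling(keypoints, width, height, vocab_size, grid=2):
    """Concatenate one BoVW histogram per tile of a grid x grid layout.

    keypoints: iterable of (x, y, word_id) with 0 <= x < width, 0 <= y < height.
    """
    tiles = [[] for _ in range(grid * grid)]
    for x, y, word in keypoints:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        tiles[row * grid + col].append(word)

    feature = []
    for tile_words in tiles:
        feature.extend(bovw_histogram(tile_words, vocab_size))
    return feature


# Toy image of 100 x 100 pixels with a three-word vocabulary.
points = [(10, 10, 0), (90, 10, 1), (10, 90, 2), (90, 90, 0), (95, 95, 0)]
feature = rectangular_tiling(points, width=100, height=100, vocab_size=3)
```

With a 10k vocabulary the same construction yields a 40k-dimensional vector per image, which is consistent with the stronger overfitting reported for rectangular tiling.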

5.6 Generalization Capability

Here, we assess the generalization capability of our CoinNet. To this end, the model is trained with images of the 100 classes included in the original dataset. The models are then tested using the images of the disjoint test set. Since the test set is disjoint, and there is no class label for the disjoint test images in the original dataset, we use a workaround where a test image of the object “Biga” is considered correctly classified if and only if it falls into any one of the 12 coin classes with “Biga” as the main object.
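The workaround amounts to scoring a prediction as correct whenever it lands in the set of training classes that depict the same motif. A minimal sketch, assuming predictions are integer class indices; the function name `motif_accuracy` and the class IDs below are made up for illustration.

```python
def motif_accuracy(predicted_classes, motif_classes):
    """Fraction of test images predicted as any class showing the motif.

    predicted_classes: predicted class index for each test image of one motif.
    motif_classes:     set of training class indices whose main object is that motif.
    """
    if not predicted_classes:
        return 0.0
    hits = sum(1 for pred in predicted_classes if pred in motif_classes)
    return hits / len(predicted_classes)


# Hypothetical IDs: three classes depict the motif, four test images are scored.
biga_classes = {3, 7, 11}
accuracy = motif_accuracy([3, 7, 5, 11], biga_classes)
```

The same function would be applied once per motif, each with its own set of matching classes.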

Table 5 presents the quantitative results, where our CoinNet leads the other competitive state-of-the-art methods by a significant margin on all three motifs, thus demonstrating a far superior generalization performance of CoinNet on disjoint coin types. The performance increase can be partially attributed to the ResNet blocks, followed by the attention mechanism.

Datasets VGG NASNet Ours
Biga 69.15% 48.64% 96.56%
Quadriga 4.37% 16.33% 68.15%
Curule 71.17% 8.11% 79.28%
Table 5: Performance on disjoint set: Accuracy on the unseen coin types for competing CNNs

5.7 Limitations

Although the performance of CoinNet surpasses the classical and CNN methods, like the competing methods it still struggles to recognize objects in low-resolution images. A few examples were presented in the second row of Figure 7, where the images are either low-resolution or blurred, which results in misclassification.

6 Conclusion

The classification of ancient Roman Republican coins via recognizing objects on their reverse sides is performed on a new dataset comprising diverse coin images. Our method outperformed the traditional BoVWs model and its spatial extensions, which previously gave state-of-the-art results on the task of ancient coin classification. It was experimentally shown that on a large-scale image dataset the BoVWs model performs worse and tends to overfit. We also compared our proposed CoinNet architecture with current state-of-the-art CNN models, which lag in accuracy. In addition, our CoinNet outperformed the competing CNNs on the unseen disjoint test set. In the future, we plan to recognize other visual cues of the reverse motifs that will ultimately support the current classification system for a more detailed classification of the coins.


References

  • (1) M. H. Crawford, Roman republican coinage, Vol. 1, Cambridge University Press, 1974.
  • (2) H. Anwar, S. Zambanini, M. Kampel, A bag of visual words approach for symbols-based coarse-grained ancient coin classification, arXiv preprint arXiv:1304.6192 (2013).
  • (3) H. Anwar, S. Zambanini, M. Kampel, K. Vondrovec, Ancient coin classification using reverse motif recognition: Image-based classification of roman republican coins, IEEE Signal Process. Mag. 32 (4) (2015) 64–74.
  • (4) O. Arandjelovic, Automatic attribution of ancient roman imperial coins, in: CVPR, 2010, pp. 1728–1734.
  • (5) O. Arandjelović, Reading ancient coins: automatically identifying denarii using obverse legend seeded retrieval, in: ECCV, 2012, pp. 317–330.
  • (6) A. Kavelar, S. Zambanini, M. Kampel, Word detection applied to images of ancient roman coins, in: International Conference on Virtual Systems and Multimedia, 2012, pp. 577–580.
  • (7) J. Kim, V. Pavlovic, Ancient coin recognition based on spatial coding, in: ICPR, 2014, pp. 321–326.
  • (8) J. Kim, V. Pavlovic, Improving ancient roman coin recognition with alignment and spatial encoding, in: ECCV, 2014, pp. 149–164.
  • (9) S. Zambanini, M. Kampel, Coarse-to-fine correspondence search for classifying ancient coins, in: ACCV, 2012, pp. 25–36.
  • (10) S. Zambanini, A. Kavelar, M. Kampel, Improving ancient roman coin classification by fusing exemplar-based classification and legend recognition, in: ICIAP, 2013, pp. 149–158.
  • (11) S. Zambanini, A. Kavelar, M. Kampel, Classifying ancient coins by local feature matching and pairwise geometric consistency evaluation, in: ICPR, 2014, pp. 3032–3037.
  • (12) J. Cooper, O. Arandjelovic, Understanding ancient coin images, arXiv preprint arXiv:1903.02665 (2019).
  • (13) J. Kim, V. Pavlovic, Discovering characteristic landmarks on ancient coins using convolutional networks, Journal of Electronic Imaging 26 (2015).
  • (14) I. Schlag, O. Arandjelovic, Ancient roman coin recognition in the wild using deep learning based recognition of artistically depicted face profiles, in: ICCV Workshops, 2017, pp. 2898–2906.
  • (15) M. Nölle, H. Penz, M. Rubik, K. Mayer, I. Holländer, R. Granec, Dagobert-a new coin recognition and sorting system, in: DICTA, 2003, pp. 329–338.
  • (16) M. Reisert, O. Ronneberger, H. Burkhardt, An efficient gradient based registration technique for coin recognition, in: Proceedings of the Muscle CIS Coin Competition Workshop, Berlin, Germany, 2006, pp. 19–31.
  • (17) R. Huber-Mörk, S. Zambanini, M. Zaharieva, M. Kampel, Identification of ancient coins based on fusion of shape and local features, Machine Vision and Applications 22 (6) (2011) 983–994.
  • (18) M. Zaharieva, M. Kampel, S. Zambanini, Image based recognition of ancient coins, in: CAIP, 2007, pp. 547–554.
  • (19) M. Kampel, M. Zaharieva, Recognizing ancient coins based on local features, in: ISVC, 2008, pp. 11–22.
  • (20) D. G. Lowe, Object recognition from local scale-invariant features, in: ICCV, 1999, pp. 1150–1157.
  • (21) H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, Speeded-up robust features (surf), Comput. Vis. Image Underst. 110 (3) (2008) 346–359.
  • (22) K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615–1630.
  • (23) C. Liu, J. Yuen, A. Torralba, Sift flow: Dense correspondence across scenes and its applications., IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 978–994.
  • (24) S. Zambanini, M. Kampel, A local image descriptor robust to illumination changes, in: SCIA, 2013, pp. 11–21.
  • (25) A. Kavelar, S. Zambanini, M. Kampel, Reading the legends of roman republican coins, J. Comput. Cult. Herit. 7 (1) (2014) 5:1–5:20.
  • (26) G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop on statistical learning in computer vision, ECCV, Vol. 1, 2004, pp. 1–22.
  • (27) W. Hollstein, Die stadtrömische Münzprägung der Jahre 78-50 v. Chr. zwischen politischer Aktualität und Familienthematik: Kommentar und Bibliographie, Tuduv-Verlag-Ges., 1993.
  • (28) B. Woytek, Arma et nummi - forschungen zur römischen finanzgeschichte und münzprägung der jahre 49 bis 42 v. chr (1993).
  • (29) Roman coins database, accessed: 2019-08-15.
  • (30) A. Kavelar, S. Zambanini, M. Kampel, K. Vondrovec, K. Siegl, The ilac-project: Supporting ancient coin classification by means of image analysis, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences - XXIV International CIPA Symposium 5 (2013) W2.
  • (31) J. B. Tenenbaum, W. T. Freeman, Separating style and content with bilinear models, Neural Computation 12 (6) (2000) 1247–1283.
  • (32) Q. Sun, Y. Liu, T.-S. Chua, B. Schiele, Meta-transfer learning for few-shot learning, in: CVPR, 2019, pp. 403–412.
  • (33) G.-S. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, L. Shao, Attentive region embedding network for zero-shot learning, in: CVPR, 2019, pp. 9384–9393.
  • (34) Y. Gao, O. Beijbom, N. Zhang, T. Darrell, Compact bilinear pooling, in: CVPR, 2016, pp. 317–326.
  • (35) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, in: Conference on Empirical Methods in Natural Language Processing, 2016, pp. 457–468.
  • (36) M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in: International Colloquium on Automata, Languages, and Programming, 2002, pp. 693–703.
  • (37) N. Pham, R. Pagh, Fast and scalable polynomial kernels via explicit feature maps, in: International Conference on Knowledge Discovery and Data mining, 2013, pp. 239–247.
  • (38) G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: CVPR, 2017, pp. 4700–4708.
  • (39) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
  • (40) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, IEEE, 2009, pp. 248–255.
  • (41) K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: ICML, 2015, pp. 2048–2057.
  • (42) Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: ECCV, 2018, pp. 286–301.
  • (43) Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in: CVPR, 2016, pp. 21–29.
  • (44) L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
  • (45) B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: CVPR, 2018, pp. 8697–8710.
  • (46) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
  • (47) R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: ICCV, 2017, pp. 618–626.