In recent years, researchers have become increasingly interested in multi-label and hierarchical learning approaches, which have found applications in several domains, including classification wehrmann2018hierarchical; cesa2006incremental, image annotation dimitrovski2011hierarchical, and bioinformatics valentini2009true; yan2019zhejiang; chen2018deep; abacha2019vqa. Machine learning is now commonly used to solve complex pattern recognition problems, in which an object is classified by assigning it a label according to the rules learned by the model. However, classes are not always disjoint, and the objects within them can be related to one another through a hierarchical structure silla2011survey. Human beings perceive the world at different levels of granularity and can translate information from coarse-grained to fine-grained and vice versa, perceiving different levels of abstraction of the acquired information hobbs1990granularity; mccalla1992granularity. This concept is reflected in the taxonomy of general multi-label approaches under the idea of structured output prediction su2015multilabel.
In terms of neural models, the main difference between structured output prediction and flat multi-label classification lies in which layers of neurons carry the label predictions. With a structured output, the predictions are spread over several layers, each corresponding to a different level of abstraction, whereas in the flat multi-label approach they are produced at a single level.
Hierarchical multi-label classification (HMC) is a variant of the classification task in which instances may belong to multiple classes at the same time and the classes are organized in a hierarchy. In HMC approaches, the relationships among classes can be formalized as a tree or a directed acyclic graph (DAG). Our approach to HMC exploits the annotation hierarchy by building a single neural network that simultaneously predicts all categorizations of an input by using multiple layers of the neural model. For example, when predicting the class labels of an image containing a tiger, the proposed system can predict that a "tiger" has been found and, at the same time, that the same object is also a "feline" and a "mammal".
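The tiger example can be sketched in a few lines of Python. This is only an illustration of how a leaf label implies all of its ancestors in a tree-shaped hierarchy; the `PARENT` map and the helper `labels_with_ancestors` are hypothetical names introduced here, not part of the proposed system.

```python
# Hypothetical child -> parent map; the class names are illustrative only.
PARENT = {
    "tiger": "feline",
    "cat": "feline",
    "feline": "mammal",
    "mammal": None,  # root of this small tree
}


def labels_with_ancestors(leaf, parent=PARENT):
    """Return the leaf label followed by all of its ancestors up to the root."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


print(labels_with_ancestors("tiger"))  # ['tiger', 'feline', 'mammal']
```

In a tree, each label has exactly one parent, so the full multi-label target is determined by the finest label; in a DAG a node may have several parents and the traversal would need to follow all of them.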
Two main approaches to the HMC problem exist in the literature, known as local and global costa2007comparing; xu2019survey; silla2011survey. In the global approach, a single classifier sees the information globally, without any local information, and the output of its final layer gives the prediction for a test instance. In the local approach, a set of classifiers is trained following a top-down strategy; in particular, each base classifier is trained independently.
Different local approaches have been proposed in the literature, such as Local Classifier per Node (LCN) valentini2009true, Local Classifier per Parent Node (LCPN), and Local Classifier per Level (LCL) cerri2011hierarchical. The LCN strategy trains a local classifier for each node of the graph, so that each node provides a local decision for the prediction. LCPN uses a multi-class classifier for each internal class to discriminate among its sub-classes, and LCL methods train one multi-class classifier per hierarchical level. In contrast with the local (LCN, LCL, LCPN) and global approaches, we use a single trained model and a single error back-propagation pass through several fully connected layers, each of which is responsible for one level of concepts in the given hierarchical structure.
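To make the difference between the local strategies concrete, the following sketch counts how many separate classifiers each one would train on a small tree. The hierarchy used here and the helper names (`CHILDREN`, `depth_levels`, `classifier_counts`) are illustrative assumptions; the counting rules follow the usual definitions (one classifier per non-root node for LCN, one per parent node for LCPN, one per level for LCL).

```python
# Toy hierarchy as parent -> children; "root" is an artificial top node.
CHILDREN = {
    "root": ["mammalia", "reptilia"],
    "mammalia": ["felidae", "ursidae"],
    "reptilia": ["crocodylidae", "iguanidae", "emydidae", "pythonidae"],
    "felidae": ["tiger", "cat"],
    "ursidae": ["giant_panda", "polar_bear"],
    "crocodylidae": ["crocodile"],
    "iguanidae": ["iguana"],
    "emydidae": ["turtle"],
    "pythonidae": ["python"],
}


def depth_levels(children, root="root"):
    """Group the non-root nodes by their depth in the tree."""
    levels = []
    frontier = children[root]
    while frontier:
        levels.append(frontier)
        frontier = [c for node in frontier for c in children.get(node, [])]
    return levels


def classifier_counts(children, root="root"):
    """Number of classifiers each local strategy would need to train."""
    levels = depth_levels(children, root)
    return {
        "LCN": sum(len(level) for level in levels),  # one per non-root node
        "LCPN": len(children),                       # one per parent node
        "LCL": len(levels),                          # one per level
    }


print(classifier_counts(CHILDREN))  # {'LCN': 16, 'LCPN': 9, 'LCL': 3}
```

A single network with one output layer per level, as proposed here, replaces all of these separately trained models with one model trained end to end.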
A recent work wehrmann2018hierarchical describes a novel method for the HMC problem that preserves local and global information simultaneously in order to discover the local hierarchical relationships among classes. Unlike that work, our architecture performs a multi-class prediction at each of its deep layers, capturing the local context by following the hierarchical structure of the information. In our approach, we have a cascade of fully connected linear layers, each with softmax plus cross-entropy, where the output of layer $l$ is the input of layer $l+1$; in wehrmann2018hierarchical, instead, the input of each layer is fused with the original input. The last difference is that our model uses the local classifications as the final prediction, in accordance with the hierarchical multi-label classification task, whereas in HMCN-F the final layer is used as a flat layer, followed by another layer that jointly uses the local and global outputs to obtain the final prediction.
Our work can be summarized in the following key contributions:
We propose a new hierarchical deep loss (HDL) approach as an extension of convolutional neural networks to assign hierarchical multi-labels to images. Our extension can be attached to a generic convolutional neural network as its final step.
To prove the effectiveness of our hierarchical classification approach, we conduct empirical studies on three different datasets. First, we created the Animals_Taxonomy8 dataset, based on real animal images from Flickr organized into three taxonomy groups (Class, Family, Species) with their corresponding label annotations. Second, we used a well-known biomedical dataset (VQA-Med 2019), which contains real radiology images at different levels of a hierarchy. Third, we created Geometry_shapes_annotations, which contains thousands of images of shapes over a hierarchy of depth three. Furthermore, the datasets have different numbers of instances (2.8k, 8k, 40k), which is useful for assessing the robustness of our approach.
2 The Proposed Approach
As mentioned above, our solution is an architectural extension that can be adapted to a generic neural network. In this paper, we use a standard convolutional neural network, ResNet18, as the base model, to which we add our extension to solve a hierarchical image classification problem. As shown in Figure 1, we extend the output layer with a number of fully connected layers equal to the number of levels in the class hierarchy tree of the problem at hand, and we associate a loss function with each of these new layers. In practice, we construct a mapping between the levels of the class hierarchy and the new layers of the neural network (shown in Figure 1), so that the network can learn to discriminate between all class labels belonging to a given level of the hierarchy. To minimize the intra-class variance while keeping the features of different classes separated, we compute the Center Loss wen2016discriminative on each training mini-batch and update all class centers after each training epoch. More formally, we compute the center loss as follows:
$$L_C = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2 \qquad (1)$$
where $x_i$ is the $i$-th deep feature of a mini-batch of size $m$ and $c_{y_i}$ denotes the center of class $y_i$ in the feature space of the deep model. In our experiments, we chose a ResNet18 as the general model and apply the Center Loss after the adaptive pooling layer. Finally, let $l_1$ be the linear layer at the first level, with output dimension equal to the number of classes $n_1$ at the first level of the hierarchy. More formally:
$$l_1 = W_1^{T} x + b_1 \qquad (2)$$
where $W_1 \in \mathbb{R}^{d \times n_1}$ is the weight matrix, $b_1$ is the bias vector, the activation function is linear, and $d$ is the number of features. Then, we add a linear layer $l_l$ for each hierarchical level of a generic dataset and use the cross-entropy loss to maximize the inter-class variance. Precisely, we apply the softmax function to the logits of layer $l_l$ and use the cross-entropy loss as in Eq. 3:
$$L_{S_l} = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}} \qquad (3)$$
where $l$ denotes the $l$-th layer, $m$ and $n$ are the mini-batch size and the number of classes respectively, $x_i$ denotes the $i$-th deep feature, belonging to the $y_i$-th class, and $b$ is the bias.
Finally, our total loss is:
$$L = L_C + L_{S_1} + L_{S_2} + L_{S_3} \qquad (4)$$
where $L_C$ is the center loss value and $L_{S_l}$ is the cross-entropy loss value of layer $l$. The general formulation with $N$ layers is defined as in Eq. 5:
$$L = L_C + \sum_{l=1}^{N} L_{S_l} \qquad (5)$$
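The loss described in this section can be sketched in plain Python for a tiny example. This is a minimal sketch, not the authors' implementation: it assumes that each added layer receives the raw logits of the previous layer, that the center loss is computed against one chosen set of labels (`center_labels`), and that the terms are summed without weights. All function and variable names are introduced here for illustration.

```python
import math


def linear(x, weights, bias):
    """Affine map; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def cross_entropy(logits, target):
    """Softmax followed by negative log-likelihood for one sample."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def center_loss(features, labels, centers):
    """Half the summed squared distance between features and class centers."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((a - c) ** 2 for a, c in zip(x, centers[y]))
    return 0.5 * total


def hierarchical_loss(features, level_labels, layers, centers, center_labels):
    """Center loss on the shared features plus one cross-entropy per level.

    layers is a list of (weights, bias) pairs. Each layer consumes the
    output of the previous one (assumed here to be its logits).
    """
    loss = center_loss(features, center_labels, centers)
    for i, x in enumerate(features):
        hidden = x
        for level, (weights, bias) in enumerate(layers):
            hidden = linear(hidden, weights, bias)
            loss += cross_entropy(hidden, level_labels[level][i])
    return loss


# Tiny demo: one 2-d feature, a 3-class level followed by a 2-class level.
features = [[1.0, 2.0]]
centers = {0: [1.0, 0.0]}
layers = [
    ([[0.0, 0.0]] * 3, [0.0] * 3),       # level 1: 2 features -> 3 classes
    ([[0.0, 0.0, 0.0]] * 2, [0.0] * 2),  # level 2: 3 logits -> 2 classes
]
demo_loss = hierarchical_loss(features, [[2], [1]], layers, centers, [0])
print(demo_loss)
```

With all weights set to zero, every level predicts a uniform distribution, so the demo value is the center loss (2.0) plus log 3 plus log 2, which is a convenient sanity check for the summation.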
To evaluate the proposed method, we created our own datasets, as no standard benchmark dataset for hierarchical multi-label image classification is available in the literature.
The medical Visual Question Answering task (VQA-Med 2019) abacha2019vqa focuses on radiology images (example in Fig. 12) grouped into four main classes: Modality, Plane, Organ system, and Abnormality. The original challenge is to classify an image based on a question linked to it; indeed, each image in the training set is paired with a question. Our focus is on the hierarchical multi-label classification of images, so we exclude the text classification task from our experiments. We use the whole training set and use the validation set as the test set (because the test set is not labelled with all labels), with 2816 and 340 objects respectively. In total, we consider three levels of hierarchy (Modality Class, Plane Class, Organ Class), each with its own set of concepts. These levels contain 44, 15, and 10 classes respectively. In these experiments, our goal is to show experimentally the effectiveness and robustness of our model in discriminating different concepts even when only a few training examples per class are available.
We created a synthetic geometric shapes dataset that contains 2 different shapes (Triangle, Square; some sample images are shown in Fig. 6) at the first level of the hierarchy. Each shape has 6 different fill colours and 6 different out-fill colours, which represent the second and third levels of the hierarchy respectively. There are 72 possible configurations, and we generate training and test sets with 20000 and 6000 objects respectively. The dimension of the images is 128x128x3. In these experiments, we want to answer the questions "Which kind of shape is this? What is the fill colour? What is the out-fill colour?".
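The count of 72 configurations follows from the product of the three levels (2 shapes, 6 fill colours, 6 out-fill colours). A minimal sketch of the enumeration is given below; the colour names are placeholders, since the source only states how many colours there are.

```python
from itertools import product

SHAPES = ["triangle", "square"]
# Placeholder colour names; only their number (six each) comes from the text.
FILL_COLOURS = [f"fill_{i}" for i in range(6)]
OUT_FILL_COLOURS = [f"out_fill_{i}" for i in range(6)]

# Every (shape, fill colour, out-fill colour) triple is one configuration.
configurations = list(product(SHAPES, FILL_COLOURS, OUT_FILL_COLOURS))

print(len(configurations))  # 72
```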
This dataset was created from Flickr animal images. Its hierarchy represents a small taxonomy with class, family and species, as in Fig. 22. The selected classes are mammalia and reptilia. The second level of the hierarchy is the family: felidae and ursidae for mammalia, and crocodylidae, iguanidae, emydidae and pythonidae for reptilia. The last level of the hierarchy represents the species (example images in Fig. 21): Malaysian tiger, felis catus (known as cat), ailuropoda melanoleuca (known as giant panda), ursus maritimus (known as polar bear), python molurus (known as green python), trachemys scripta (a small turtle), iguana iguana, and crocodylus niloticus (well known as the Nile crocodile). A complete representation of the dataset is given in Fig. 7.
To evaluate the proposed method, we carry out four empirical studies.
In the first, we use a well-known dataset (VQA-Med 2019) to test our approach on real biomedical images, including the case in which little data is available.
In the second, we test the abstraction capability of our approach on a synthetic dataset, in a setting where thousands of instances are available.
In the third, we extract the hierarchical structure from a real-image dataset containing images at three animal taxonomy levels (Class, Family, Species) and assess the robustness of our HDL when the images are hard to recognize and contain noise.
In the fourth experiment, we compare our HDL with a ResNet18 to show the effectiveness of our approach.
First experiment In this experiment, we test our model when few instances are available and the images are highly complex. Our hypothesis is that the accuracy of a layer is higher when it has fewer concepts to distinguish than when it has many concepts to recognize. As shown in the first row of Table 1 (VQA-Med 2019), we obtain accuracies of 38.05, 74.04, and 66.66 for layers of size 44, 15, and 10 respectively. The accuracy of the second layer is about 1.95 times that of the first layer, and the accuracy of the third layer is about 1.75 times that of the first; this supports the claim that our model performs better when each layer has few concepts to learn. Similar results are found for Animals_Taxonomy8, where the accuracy of the third layer (third row of Table 1) is higher than that of the other layers because it has only two concepts (mammals or reptiles) to distinguish, compared with the second layer (8 concepts) (Figs. 23 and 24).
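The two ratios quoted for VQA-Med 2019 can be checked directly from the reported accuracies. The snippet below is only an arithmetic check; the dictionary keys are the layer sizes given in the text.

```python
# Accuracies (%) reported for VQA-Med 2019, keyed by concepts per layer.
accuracy = {44: 38.05, 15: 74.04, 10: 66.66}

ratio_second_vs_first = accuracy[15] / accuracy[44]
ratio_third_vs_first = accuracy[10] / accuracy[44]

print(f"{ratio_second_vs_first:.3f}")  # 1.946
print(f"{ratio_third_vs_first:.3f}")   # 1.752
```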
Second experiment In the second experiment, we use a synthetic dataset with simple geometric shapes and 7.10 times as many training instances as VQA-Med 2019. Our intuition is that providing more samples per class improves the training of our model and therefore yields better accuracy than in the first experiment. To test this conjecture, we train our HDL with 20k instances and test it with 6k instances. The results in the second row of Table 1 confirm our expectations. The higher number of instances, together with the simplicity of the images, allows the model to reach high accuracy within the first ten epochs. Furthermore, we conduct three different runs with learning rates of 0.005, 0.001, and 0.01, using a batch size of 64.
Third experiment In this experiment, we test our model using more instances than in the first experiment, on animal images (Animals_Taxonomy8) that contain noise. In particular, our model performs well even when the images are not as simple as in the second experiment and when they contain noise or are hard to interpret; indeed, many images are unclear, for example a snake completely hidden by the forest or a bar sign with a panda logo. Nevertheless, as shown in Table 1, the accuracy of the third layer, which is responsible for recognizing mammals or reptiles, is very high. We conclude that, given the noise and the number of images that are hard to interpret and recognize, the experimental results support the robustness of our model.
Fourth experiment HDL is designed to maximize the learning capacity and to extract the hierarchical structure from the labelled data. Our intuition is that our model, driven by a separate loss at each level and able to reduce the intra-class variance while maximizing the inter-class variance, can obtain better accuracy than a classical convolutional neural network. To test this, we conduct six different experiments using a classic ResNet18 and our HDL on Animals_Taxonomy8, with two learning rates and a batch size of 64. The results in Fig. 2 and Figs. 25 and 26 confirm our expectations. In all cases, the accuracy of HDL is higher than that of the classical ResNet18, which supports the effectiveness of our proposed model.
4.1 Experiments settings
We build our hierarchical multi-label classifier as an extension of a ResNet18, but it can be applied to any convolutional neural network. We implement our extension in Python using the PyTorch framework. Fig. 1 shows the architecture used for the experiments. The input images are re-scaled to 64x64x3 for the Geometry dataset and to 256x256x3 for the VQA-Med 2019 and Animals_Taxonomy8 datasets. We do not apply any image preprocessing such as data augmentation, rotation or normalization. The first convolutional layer has a kernel size of 7x7 with a stride of 2 pixels, and is followed by a batch normalization layer and a non-linear layer with the ReLU activation function. Next comes a max-pooling operation over 3x3 regions with a stride of 2 pixels. Then, we have four convolutional blocks, with 64, 128, 256, and 512 planes respectively, followed by an adaptive average pooling to a 1x1 region. Finally, we add three fully connected linear layers, whose sizes correspond to the number of concepts at each level of our hierarchical dataset. In the forward pass, we take the output of the adaptive average pooling and apply the Center Loss function to it, and for each linear layer we apply the softmax function followed by the cross-entropy loss. The total loss is the sum of the local losses of the layers. Our network was trained with the Adam optimizer kingma2014adam. The batch size, learning rate, and number of epochs are reported together with the results for each dataset.
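The layer sizes described above can be traced with a short calculation. The sketch below is an illustration under stated assumptions: the paddings (3 for the 7x7 convolution, 1 for the max-pooling and the 3x3 convolutions) and the per-block strides (1, 2, 2, 2) are those of a standard ResNet-18 and are not stated in the text; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_spatial_sizes(input_size):
    """Follow the spatial resolution through the backbone described above."""
    sizes = {"input": input_size}
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # padding assumed
    sizes["conv1"] = size
    size = conv_out(size, kernel=3, stride=2, padding=1)  # padding assumed
    sizes["maxpool"] = size
    # Assumed block strides of a standard ResNet-18: 1, 2, 2, 2.
    for planes, stride in zip((64, 128, 256, 512), (1, 2, 2, 2)):
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        sizes[f"block_{planes}"] = size
    sizes["adaptive_avgpool"] = 1  # pooled to a 1x1 region
    return sizes


def cascade_shapes(feature_dim, level_sizes):
    """(in_features, out_features) of each added linear layer in a cascade."""
    shapes = []
    in_features = feature_dim
    for out_features in level_sizes:
        shapes.append((in_features, out_features))
        in_features = out_features
    return shapes


print(trace_spatial_sizes(256))
print(cascade_shapes(512, [44, 15, 10]))  # [(512, 44), (44, 15), (15, 10)]
```

Because the adaptive average pooling always reduces the last block to 1x1, the feature vector has 512 entries for both the 64x64 and the 256x256 inputs, so the same added layers can be used regardless of the input resolution.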
5 Results and Discussion
This study belongs to the sub-category of multi-label classification called structured output learning. According to the experimental results in Tables 1 and 2, we achieved good results on three different datasets, finding a way to exploit the dependencies among classes and make accurate predictions, with fewer misclassifications than a classic ResNet18. The main reason we created these datasets is to test our proposal in the field of computer vision with more than 2 levels of depth; indeed, CIFAR100 contains only two levels of depth (Super Class, Classes), and other datasets with deeper hierarchies are applicable only to text classification or bioinformatics, where the inputs are not images.
Multi-label classification is an important field of machine learning and is closely related to many real-world applications, for example biomedical image annotation, document categorization, and any problem in which the instances of the classes are not disjoint but follow a hierarchical structure. In this work, we conducted four empirical studies on different datasets to show experimentally the effectiveness and robustness of our proposed model, which can be applied as an extension to any convolutional neural network.