Introducing the structural bases of typicality effects in deep learning

07/07/2021 · by Omar Vidal Pino, et al. · Universidade Federal de Minas Gerais

In this paper, we hypothesize that the effects of the degree of typicality in natural semantic categories can be generated based on the structure of artificial categories learned with deep learning models. Motivated by the human approach to representing natural semantic categories, and based on the foundations of Prototype Theory, we propose a novel Computational Prototype Model (CPM) to represent the internal structure of semantic categories. Unlike other prototype learning approaches, our mathematical framework is a first attempt to provide deep neural networks with the ability to model abstract semantic concepts such as the central semantic meaning of a category, the typicality degree of an object's image, and the family resemblance relationship. We propose several methodologies based on the typicality concept to evaluate our CPM model in image semantic processing tasks such as image classification, global semantic description, and transfer learning. Our experiments on different image datasets, such as ImageNet and COCO, show that our approach may be a viable proposition in the effort to endow machines with greater power of abstraction for the semantic representation of object categories.


1 Introduction

Memory is one of the most remarkable faculties of human beings, construed as the brain's ability to encode, store, and retrieve information Atkinson and Shiffrin (1968); Tulving (2007); Yee et al. (2018); Netto et al. (2015). For decades, understanding and simulating the basis of human learning, cognitive processing, and the human perception and vision system has been a motivation of the machine intelligence field. In recent years, pattern recognition methods with impressive performance on some specific tasks related to image interpretation have been developed in the Computer Vision and Image Processing fields. However, these methods still fall short of human proficiency in other capabilities. Image semantic understanding is influenced by how the features of an image's basic components (e.g., objects) are semantically represented and how the semantic relationships between these basic components are constructed Guo et al. (2016). Knowledge extraction models (high-level vision processes) for images are highly influenced by the methods used to detect, extract, and represent the image's relevant semantic information.

With their advent, Convolutional Neural Networks (CNNs) outperformed the traditional methods Lowe (2004); Bay et al. (2008) used for image feature representation, and CNN-based methods are now the leading approaches in semantic image processing tasks such as object recognition Simonyan and Zisserman (2014), semantic segmentation Long et al. (2017), object description Li et al. (2020), and semantic correspondence Rocco et al. (2018). Although state-of-the-art CNN methods have achieved remarkable results, there are still many challenges in attaining the discriminative power and abstraction of human memory (e.g., semantic memory Tulving (2007); Yee et al. (2018)) for representing the semantics of visually acquired information. How can the behavior of human memory be emulated in the representation of learned knowledge about object features? How can such features be extracted and encoded to encapsulate the representation of the meaning (or semantic representation) of a specific object? How can semantics be inferred or ascribed to objects? How can an image's meanings and its phenomena be represented? The quest to answer some of these questions still occupies the research agenda of many researchers.

Figure 1: Schematic of our prototype-based classification and description models. The human visual system can observe, categorize, and build the semantic description of an object based on its most distinctive features within that object's category. We propose a prototype-based model to simulate this behavior through the pipeline composed of modules 1) to 6): 1) feature extraction; 2) recognition of object features; 3) prototype-based classification; 4) object features; 5) central semantic meaning of a category (the category prototype); 6) our Global Semantic Description based on Prototypes.

Object typicality effects are among the semantic phenomena that are difficult to capture, and they remain challenging for computational image processing. The typicality concept refers to the degree to which the objects under study are considered good examples of their category Rosch (1973); Rosch and Mervis (1975). For example, the pigeon is a typical case of the bird category since it has several representative features: it can fly, has feathers and a beak, lays eggs, and builds a nest. On the other hand, the penguin is an atypical member since it satisfies only some of these features, but not all. A glance is enough for human beings to perform this type of semantic ranking within the category. In contrast, machines still lack the ability to capture this semantic phenomenon among objects that belong to the same category.

The argument that category membership is a matter of degree came from cognitive psychology with the seminal studies of Rosch and colleagues Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978), who referred to the membership degree as typicality. In her seminal work Rosch and Mervis (1975), Rosch introduced the concept of the semantic prototype and presented an in-depth analysis of the internal semantic structure of categories. Rosch Rosch and Mervis (1975) holds that the representation of a category's semantic meaning is related to the category prototype, particularly for categories denoting natural objects. The Prototype Theory Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978); Geeraerts (2010) proposes that human beings think of categories in terms of abstractions (prototypes) represented by typical category members. This theory also indicates that the successful execution of object classification and description tasks in the human brain is inherently related to the learned category prototype.

This paper relies on cognitive semantic studies related to the Prototype Theory to propose a new perspective for modeling the central semantic meaning of object categories: the prototype. Unlike other prototype learning approaches Mingbo Ma et al. (2013); Ojeda-Magaña et al. (2013); Wohlhart et al. (2013); Saleh et al. (2013); Zhao and Qin (2015); Jetley et al. (2015); Saleh et al. (2016); Oyedotun and Khashman (2018); Snell et al. (2017); Drumond et al. (2017); Dong and Xing (2018); Fort (2018); Yang et al. (2018); Allen et al. (2019); Angelov and Soares (2020); Xiao et al. (2020); Garnot and Landrieu (2021), we use our prototype representation to capture the concepts of typicality and category membership degree of object images. Our proposal draws the typicality concept from cognitive psychology, assuming that it makes possible a more natural and interpretable representation of the semantics of an object's image. Specifically, we propose a mathematical framework that endeavors to represent the semantic definition of object categories and, consequently, to capture the phenomena of object typicality. To evaluate our proposal in real-world tasks, we also propose a procedure to introduce our prototype semantic representation and our typicality measure into the global semantic description of object images. Furthermore, we propose a CNN-layer architecture to evaluate our proposal in classification and transfer learning tasks. Figure 1 shows the intuition and the basic conceptual steps for applying our framework to classification and description models.

Prototype learning is a representative approach among pattern recognition methods. It has been used in image processing tasks such as Face Recognition Mingbo Ma et al. (2013); Oyedotun and Khashman (2018), Image Segmentation Ojeda-Magaña et al. (2013); Dong and Xing (2018), Static Hand Gesture Recognition Oyedotun and Khashman (2018), Few-Shot Learning Jetley et al. (2015); Snell et al. (2017); Fort (2018); Allen et al. (2019), Clustering Zhao and Qin (2015), Robust Image Classification Wohlhart et al. (2013); Saleh et al. (2013, 2016); Yang et al. (2018); Xiao et al. (2020); Garnot and Landrieu (2021), and CNN Interpretation Drumond et al. (2017); Angelov and Soares (2020). Even though these works propose a wide variety of methods for prototype learning, most approaches focus on using prototype learning to improve the performance or robustness of a specific task. As far as we know, little attention has been paid to using the prototype to capture other semantic properties of the object image, such as its typicality; this is surprising if we consider that the prototype is itself based on the notion of typicality Rosch and Mervis (1975); Rosch (1975, 1978); Geeraerts (2010).

Introducing typicality into image processing can increase the generalization power of pattern recognition models. On the one hand, this assumption is supported by theoretical foundations: Rosch's experiments Rosch (1978) showed that when humans learn a category by looking at its most typical samples, they can better recognize new members. On the other hand, some authors Saleh et al. (2013, 2016) showed that deep learning models cannot generalize to atypical images that are substantially different from training images; when a typicality measure is involved in the learning process, the image classification task improves Saleh et al. (2013, 2016). Moreover, involving typicality learning in the category learning process would allow machines both to categorize images and to know their degree of membership (is it a typical, atypical, or border image?), a type of semantic representation of object image categories that so far only human beings achieve.

Our experimental results show that our mathematical framework allows us to interpret possible semantic associations between members within the category's internal structure. Results also indicate that our method can establish a relationship between the proposed typicality measure and the representativeness of an object's image. Experiments on the ImageNet Russakovsky et al. (2015) and COCO Lin et al. (2014) datasets show that good performance can be achieved in real-world tasks while capturing other semantic properties of the object's image, such as typicality.

2 Related Work

Prototype Theory

The Prototype Theory Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978); Geeraerts (2010) analyzes the internal structure of semantic categories and proposes a categorization based on the prototype. This theory postulates that semantic categories are not homogeneous structures. According to experimental evidence Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976), semantic categories should be considered heterogeneous structures, in which members and their respective characteristics do not have the same relevance within the category.

Rosch and colleagues Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978); Geeraerts (2010) argued that semantic categories are made up of good and bad examples, according to a given criterion. The most representative members (typical members), those that are evoked when thinking of or viewing a category, are the central members (focal cases) or prototypical members (best examples) around which the rest of the category members are organized, thus exhibiting a prototypical organization of the category. Figure 2 shows an example of the prototypical organization phenomenon. According to these authors Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978); Geeraerts (2010), the prototype is based on typicality, and not all members of the same category represent it in the same way, i.e., some members are more typical than others. Accordingly, within each category there must be an internal family resemblance among category members, together with an external dissimilarity (low similarity) with the members of other categories.

Figure 2: Category’s prototypical organization. The figure shows the Sessel and Stuhl experiment conducted by Gipper (Figure adapted from Geeraerts (2010)). The experiment studies the meaning of German words Stuhl (chair) and Sessel (comfortable chair) and shows that within the chair category, the category’s internal organization (and central semantic meaning) can change according to the given criterion and object typicality.

The prototype was initially defined as the most representative and distinctive member of a category Rosch (1973), as it is the element that shares the most features with the other category members and the fewest with members of other categories. However, defining the prototype as a single element (a prototype-object) raises a problem: who is the category's prototype when two members are equally representative? Consequently, the prototype began to be defined as a cognitive entity, specifically, as prototypicality effects Rosch and Mervis (1975); Rosch (1975); Geeraerts (2010).

The category prototype was formally defined as a cognitive entity comprising the clearly central members of a given category Rosch and Mervis (1975); Rosch (1975) (for example, in Figure 2 the prototype is defined by the typical members within the circle). The attributes of those central members are structurally the most salient category properties. Rosch's experiments Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978) showed that human beings store category knowledge as a semantic organization around the category's prototype. The categorization of an object is obtained based on the similarity of a new exemplar to one of the learned prototypes (cognitive abstractions).

According to Geeraerts Geeraerts (2010), the concept of prototypicality is itself prototypically clustered, based on up to four characteristics, in which the concepts of non-discreteness and non-equality (either on the intensional or on the extensional level) play a major distinctive role. Four characteristics are frequently mentioned as typical of prototypicality in semantic categories Rosch (1975); Geeraerts (2010): i) categories exhibit degrees of typicality; not every member is equally representative of the category (extensional non-equality); ii) categories are blurred at the edges (extensional non-discreteness); iii) categories cluster into family resemblance structures, i.e., the category's semantic structure takes the form of a radial set of clustered and overlapping members (intensional non-equality); and iv) categories cannot be defined by a single set of (necessary and sufficient) criterial attributes (intensional non-discreteness). The prototypicality effects (Table 1) summarize the importance of the distinction between the central and peripheral meanings of object categories Geeraerts (2010).

                                     Extensional                           Intensional
non-equality                         Difference of typicality and         Clustering into family
(salience effects, core/periphery)   membership salience                  resemblances
non-discreteness                     Fuzziness at the edges,              Absence of necessary and
(demarcation problems, flexibility)  membership uncertainty               sufficient definitions

Table 1: Two-dimensional conceptual map of prototypicality effects according to Geeraerts Geeraerts (2010).
Prototype Learning

Learning Vector Quantization (LVQ) is a field sprung from the seminal work of Kohonen Kohonen (1997), in which methods attempt to find optimal prototypes from labeled data. LVQ models partition the input space and assign each partition a set of prototypes Kohonen (1997). The classification of a new element is based on its proximity (similarity) to the learned prototypes. The LVQ approach has been widely studied Kohonen (1997); Seo and Obermayer (2003), and it has many variations that normally differ in the proposal used for feature extraction (handcrafted features) and in the approach used for prototype construction and update. Works such as Yang et al. (2018); Liu and Nakagawa (2001) present a more detailed review of this family of prototype-based learning methods.

With the rise of deep neural networks, handcrafted features were replaced with CNN features in prototype learning, thus achieving end-to-end integration in deep networks and high precision and robustness in various image processing tasks. The differences between the great variety of existing approaches can be roughly grouped by: i) the number of prototypes used to represent a category (one per class Ojeda-Magaña et al. (2013); Dong and Xing (2018); Jetley et al. (2015); Snell et al. (2017); Fort (2018); Wohlhart et al. (2013); Garnot and Landrieu (2021), n per class Oyedotun and Khashman (2018); Allen et al. (2019); Yang et al. (2018); Xiao et al. (2020); Drumond et al. (2017), sparse Mingbo Ma et al. (2013)); ii) the distance measure (or combination of measures) used to stand for the similarity between each instance-prototype pair (Euclidean distance Mingbo Ma et al. (2013); Oyedotun and Khashman (2018); Ojeda-Magaña et al. (2013); Snell et al. (2017); Allen et al. (2019); Wohlhart et al. (2013); Yang et al. (2018); Xiao et al. (2020), Mahalanobis distance Ojeda-Magaña et al. (2013); Wohlhart et al. (2013), covariance distance Fort (2018), cosine distance Snell et al. (2017), learned distance Dong and Xing (2018); Drumond et al. (2017), hand-designed distance Xiao et al. (2020); Garnot and Landrieu (2021)); and iii) the approach used for prototype representation (prototype-template image Mingbo Ma et al. (2013); Jetley et al. (2015), mean vector of embedded features Oyedotun and Khashman (2018); Dong and Xing (2018); Snell et al. (2017); Fort (2018); Allen et al. (2019), learned centroid vector Ojeda-Magaña et al. (2013); Wohlhart et al. (2013); Yang et al. (2018); Garnot and Landrieu (2021); Drumond et al. (2017), learned CNN tensor Xiao et al. (2020)).

All these previous approaches improve the performance of a specific task, and they share two characteristics: 1) prototype learning uses all images of the training dataset (regardless of whether they are good or bad examples of the category); and 2) although these methods assume that categories are prototypical, they also treat categories as homogeneous (and consequently use similarity measures that do not take into account the relevance of the attributes for each category). Note that these two characteristics show that current prototype learning methods Mingbo Ma et al. (2013); Angelov and Soares (2020) do not consider the theoretical foundations established by Rosch Rosch and Mervis (1975); Rosch (1975) for representing the semantic structure of prototypical categories; therefore, they are still unable to capture higher-level semantic phenomena, such as object image typicality. In the next section, we present a mathematical framework based on the foundations of Prototype Theory that constitutes a first approach to using prototypes to capture object typicality.

3 Computational Prototype Model

The semantic structure of a category (i.e., its core and peripheral meaning) is related to differences of typicality and membership salience among category members (extensional non-equality in Table 1) Rosch (1978); Geeraerts (2010). Rosch's experiments showed that the heterogeneous internal structure of natural semantic categories relates to the concepts of prototype (the core meaning of the category), typicality, and family resemblance. The family resemblance relationship Rosch and Mervis (1975); Rosch et al. (1976) consists of a set of items of the form AB, BC, CD, DE; i.e., each item has one or more attributes in common with one or more other items, but no attribute needs to be common to all items Rosch and Mervis (1975). The abstract nature of these semantic concepts has made them difficult to simulate, even with the most powerful current techniques: deep learning.

Rosch's experiments also concluded that family resemblance is a function of the frequency (or learned relevance) of attributes and of the distribution of attributes Rosch et al. (1976). It is worth noting that attribute relevance (category weights) and attribute distribution (category feature distribution) are characteristics that can currently be modeled with CNN models. In the following, we use Rosch's results as assumptions to build our Computational Prototype Model (CPM).

3.1 Semantic Prototype Representation

In general, prototype learning methods Mingbo Ma et al. (2013); Ojeda-Magaña et al. (2013); Wohlhart et al. (2013); Saleh et al. (2013); Zhao and Qin (2015); Jetley et al. (2015); Saleh et al. (2016); Oyedotun and Khashman (2018); Snell et al. (2017); Drumond et al. (2017); Dong and Xing (2018); Fort (2018); Yang et al. (2018); Allen et al. (2019); Angelov and Soares (2020); Xiao et al. (2020); Garnot and Landrieu (2021) represent the prototype as a centroid vector computed using all category members. In contrast to those proposals that assume the prototype as a centroid element (prototype-object), and based on Rosch’s prototype definition, we propose representing the prototype as a semantic entity with a center and boundaries computed using only the typical members.

Definition (Notation). Let $O$ be a set of objects and $C = \{c_1, \ldots, c_N\}$ be the set of object categories that partitions $O$; $O_i \subseteq O$ is the set of objects that share the same $i$-th category $c_i$, and $F_o = [f_1, \ldots, f_m]$ is the set of features of an object $o$.

Semantic prototype. We define a semantic prototype as the central meaning of category $c_i$. Thus, the semantic prototype is given by the average and standard deviation of each feature of all typical objects within the $i$-th category, along with a measure of the relevance of those features. Formally, the semantic prototype is represented by the tuple $P_i = (M_i, \Omega_i)$, where $M_i = [(\mu_{i1}, \sigma_{i1}), \ldots, (\mu_{im}, \sigma_{im})]$ and $\Omega_i = [\omega_{i1}, \ldots, \omega_{im}]$ such that:

  1. $\mu_{ij}$ is the mean of the $j$-th feature considering only typical objects of the $i$-th category;

  2. $\sigma_{ij}$ is the standard deviation of the $j$-th feature considering only typical objects of the $i$-th category;

  3. $\omega_{ij}$ is the relevance value of the $j$-th feature for the category $c_i$.

An abstract prototype is defined as the ideal element, or the most prototypical element, of the $i$-th category; it is given by the $m$-dimensional vector $b_i = [\mu_{i1}, \ldots, \mu_{im}]$ composed of the expected values of the most salient features of the $i$-th category, since $M_i$ was computed using only typical members.
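To make the definition concrete, the sketch below (ours, not the paper's code) builds a semantic prototype from the CNN features of typical category members. The names SemanticPrototype, typical_feats, and softmax_weights are illustrative, and taking the absolute value of the softmax weights to enforce strictly positive relevance is our assumption.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SemanticPrototype:
    """Semantic prototype P_i = (M_i, Omega_i); illustrative container."""
    mu: np.ndarray     # per-feature mean over typical members, shape (m,)
    sigma: np.ndarray  # per-feature std over typical members, shape (m,)
    omega: np.ndarray  # per-feature relevance for the category, shape (m,)

    @property
    def abstract_prototype(self) -> np.ndarray:
        # The ideal member b_i: expected values of the salient features.
        return self.mu

def build_semantic_prototype(typical_feats: np.ndarray,
                             softmax_weights: np.ndarray) -> SemanticPrototype:
    """typical_feats: (n_typical, m) features of typical members only.
    softmax_weights: (m,) weights of the category's softmax neuron, used
    as feature relevance (a CPM assumption); abs() is our choice to keep
    omega strictly positive."""
    return SemanticPrototype(mu=typical_feats.mean(axis=0),
                             sigma=typical_feats.std(axis=0),
                             omega=np.abs(softmax_weights))
```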

Rosch and colleagues Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978), besides arguing that the prototype is the semantic nucleus of natural categories, also stated that, from this nucleus, the categorical continuum can be characterized by two gradations: i) the relative importance that each attribute has for the category, and ii) the relevance (or salience) of each category member, which coincides with the amount and type of features that the element presents. In this way, it is possible to establish the prototypicality degree of a given element within the category Rosch and Mervis (1975); Rosch (1975). For example, within the color category, the weight and height attributes are not relevant (null relevance); conversely, within the category "light objects", the relevance of the color and height attributes is null while that of weight is very high, and consequently an object's relevance is evaluated based on its weight attribute. Note that this type of semantic distance relative to the semantic prototype cannot be modeled with similarity measures that assume that category attributes are homogeneous (e.g., the Euclidean distance and other classical measures used in prototype learning).

3.2 Semantic Distance

Formal models from experimental psychology, such as the Prototype Model Homa and Vosburgh (1976), the Multiplicative Prototype Model (MPM) Minda and Smith (2002), and the Generalized Context Model (GCM) Medin and Schaffer (1978); Zaki et al. (2003), proposed measures of semantic distance between stimuli that correspond to the foundations of Prototype Theory. Below, we present a semantic distance between objects as a measure of family resemblance. Our proposal is a generalization of the psychological distance between two stimuli proposed in the GCM formal model. Unlike the original formal Context Model Medin and Schaffer (1978), we assume that object features (stimuli) are not binary values ($f_j \in \mathbb{R}$) and that the relevance $\omega_{ij}$ (or cost of attention) of each $j$-th unitary object feature is forced to be strictly positive but has no upper limit, i.e., $\omega_{ij} \in (0, +\infty)$.

Objects dissimilarity. Let $o_a, o_b$ be representative objects of the $i$-th category $c_i$, and let $F_a, F_b \in \mathbb{R}^m$ be the features of objects $o_a$ and $o_b$, respectively. We define the objects dissimilarity (or objects distance) between $o_a$ and $o_b$ as the semantic distance given by

$$d_i(o_a, o_b) = \lVert \Omega_i \circ (F_a - F_b) \rVert_1 = \sum_{j=1}^{m} \omega_{ij}\, \lvert f_{aj} - f_{bj} \rvert \qquad (1)$$

where $\circ$ denotes the Hadamard (element-wise) product and $\lVert \cdot \rVert_1$ is the L1-norm.

Prototypical distance. Let $o$ be a representative object of the $i$-th category $c_i$, let $F_o \in \mathbb{R}^m$ be the features of object $o$, and let $P_i = (M_i, \Omega_i)$ be the semantic prototype of the $i$-th category. We define as the prototypical distance between $o$ and $P_i$ the semantic distance:

$$\delta_i(o, P_i) = \lVert \Omega_i \circ (F_o - b_i) \rVert_1 = \sum_{j=1}^{m} \omega_{ij}\, \lvert f_{oj} - \mu_{ij} \rvert \qquad (2)$$

where $b_i = [\mu_{i1}, \ldots, \mu_{im}]$ is the abstract prototype of category $c_i$.

The proposed prototypical distance is a generalization of the semantic distance of the MPM formal model Minda and Smith (2002). Differently from the MPM model assumptions, we assume that the prototype features are the features of the ideal member (the abstract prototype $b_i$) of the $i$-th category. Note that our prototypical distance (Eq. 2) is a specific case of our dissimilarity measure between objects (Eq. 1), in which one of the elements is the abstract prototype.
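A minimal numpy sketch of Eqs. (1) and (2), assuming the SemanticPrototype container sketched above; the function names are ours.

```python
import numpy as np

def objects_dissimilarity(f_a: np.ndarray, f_b: np.ndarray,
                          omega: np.ndarray) -> float:
    """Eq. (1): relevance-weighted L1 distance between two objects."""
    return float(np.sum(omega * np.abs(f_a - f_b)))

def prototypical_distance(f_o: np.ndarray, proto) -> float:
    """Eq. (2): Eq. (1) with the abstract prototype b_i as second object."""
    return objects_dissimilarity(f_o, proto.abstract_prototype, proto.omega)
```

Because the relevance weights are not uniform across features, two objects that are equidistant from the prototype under the Euclidean metric can have very different prototypical distances, which is precisely the heterogeneity the CPM model requires.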

Since the distance function $d_i: \mathcal{F}_i \times \mathcal{F}_i \rightarrow \mathbb{R}$ (where $\mathcal{F}_i$ is the non-empty set of all object features of category $c_i$) satisfies the axioms of non-negativity, identity of indiscernibles, symmetry, and the triangle inequality, $d_i$ is a metric in the features domain $\mathcal{F}_i$. Consequently, $(\mathcal{F}_i, d_i)$ is a metric space, or features metric space. Notice that $(\mathcal{F}_i, d_i)$ is also a measurable space (see proof in the supplementary material).

Since $(\mathcal{F}_i, d_i)$ is a measurable space, we can use a generalization of Chebyshev's inequality to define the boundary of our semantic prototype representation. Stellato et al. Stellato et al. (2017) approached the problem of formulating an empirical Chebyshev inequality given $n$ i.i.d. samples from an unknown distribution together with their empirical mean and standard deviation, deriving a Chebyshev bound with respect to a new sample. The multivariate Chebyshev inequality Stellato et al. (2017) can define the boundary of an ellipsoidal set centered at the mean. Consequently, we construct a confidence ellipsoidal set from the sample mean ($\mu_{ij}$) and standard deviation ($\sigma_{ij}$) of only typical object samples by computing a threshold vector $\lambda_i = [\lambda_{i1}, \ldots, \lambda_{im}]$. Let $T_i$ be the set of features extracted from only typical objects of the $i$-th category, and let $F_o$ be the features of a typical object $o \in T_i$. We weakly define as the edges of our semantic prototype the threshold vector $\lambda_i$ that meets a Chebyshev-type expression of the form:

$$\Pr\left( \lvert f_{oj} - \mu_{ij} \rvert \geq \lambda_{ij} \right) \leq \epsilon_j, \qquad \lambda_{ij} = k_j\, \sigma_{ij}, \quad j = 1, \ldots, m \qquad (3)$$

where $k_j > 1$ and $\epsilon_j$ is the confidence bound given by the empirical Chebyshev inequality.
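The sketch below illustrates the idea behind Eq. (3) using the classical per-feature Chebyshev bound rather than the empirical multivariate bound of Stellato et al. (2017); the scaling factor k and the axis-aligned approximation of the ellipsoidal set are our simplifications.

```python
import numpy as np

def prototype_edges(sigma: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Threshold vector lambda_i = k * sigma_i; classical Chebyshev puts
    at most 1/k**2 of the probability mass beyond k stds per feature."""
    return k * sigma

def inside_prototype_edges(f_o: np.ndarray, mu: np.ndarray,
                           lam: np.ndarray) -> bool:
    # An object lies inside the category edges if every feature deviation
    # stays below its threshold (axis-aligned stand-in for the ellipsoid).
    return bool(np.all(np.abs(f_o - mu) <= lam))
```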

Figure 3 shows the theoretical representation of a category's internal structure based on our CPM model. Our approach considers important concepts of the Prototype Theory, namely: i) our semantic prototype encoding is computed using only typical samples; ii) category prototype edges are well defined; iii) category edges nevertheless provide a fuzzy definition, because our semantic prototype is not computed with all category elements; iv) an object's representativeness degree (typicality) within the category is simulated with our prototypical distance; and v) the family resemblance relationship is simulated with our objects dissimilarity measure.

Figure 3: Category internal structure. The expected semantic representation of a category’s internal structure. The diagram also shows the key definitions and constraints of our Computational Prototype Model.

3.3 Prototype Construction

In this paper, we exploit the capability of CNNs in image semantic processing and classification tasks and use them as the backbone of the components of our CPM framework. We assume as the category attribute distribution the feature distribution ($M_i$) computed from the object-image features extracted from the last dense layer (before the softmax layer) of a CNN model, considering only object images that belong to the same category. In addition, we assume as the relevance ($\Omega_i$) of the category's attributes the category weights learned by the softmax layer.

Figure 4: Off-line construction of the semantic prototype dataset. Given a labeled images dataset, we compute our semantic prototype representation for each object category present in the dataset. The diagram details the offline computation of a semantic prototype for a specific category.

Since we also consider the element's typicality within the category to compute our semantic prototype representation, the prototype construction requires image datasets of objects with annotations of the image typicality score. Considering that large image datasets Russakovsky et al. (2015); Lin et al. (2014); Lecun et al. (1998); Krizhevsky and Hinton (2010) do not have annotations of image typicality, we use as typical objects of a category those elements that are classified as belonging to that category with a high probability of membership.

Figure 4 shows the main steps and concepts of our prototype construction procedure. Given a labeled dataset with images of objects, we extract the features and typicality score of the object images for each object category. Next, for those object image features (typical features) whose typicality scores are higher than a fixed threshold, we compute our semantic prototype representation. The resulting semantic prototype dataset is used as prior knowledge in our procedures to introduce our CPM framework into deep learning models and evaluate it in real-world image processing tasks.
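A sketch of the offline construction of Figure 4, reusing the SemanticPrototype container from Section 3.1. The array layout and the typicality threshold tau=0.95 are placeholders, since the exact threshold value is not given here.

```python
import numpy as np

def build_prototype_dataset(features, probs, labels, weights, tau=0.95):
    """features: (n, m) CNN features; probs: (n, K) softmax outputs;
    labels: (n,) ground-truth category ids; weights: (m, K) softmax
    weight matrix; tau: placeholder typicality threshold."""
    true_class_prob = probs[np.arange(len(labels)), labels]
    prototypes = {}
    for c in np.unique(labels):
        mask = (labels == c) & (true_class_prob >= tau)  # typical members
        typical = features[mask]
        prototypes[c] = SemanticPrototype(mu=typical.mean(axis=0),
                                          sigma=typical.std(axis=0),
                                          omega=np.abs(weights[:, c]))
    return prototypes
```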

4 CPM model semantics in classification and description of object images

Rosch's experiments Rosch (1973); Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); Rosch (1978) indicated that category prototypes are cognitive reference points in the construction of concepts. We apply the Prototype Theory as a theoretical foundation to represent the semantics of the visual information lying in the basic components of a scene: objects. The observations on the Prototype Theory raise the following two questions: i) Can a perception-system model be developed in which objects are described using the same semantic features that are learned to identify and classify them? ii) How can the category's prototype be included as a reference point in the object's global semantic description and classification tasks?

We address these two questions inspired by the human approach to classifying and globally describing objects. Humans use generalization and discrimination processes to build object descriptions that highlight the objects' most distinctive features within the category. For instance, a typical human description: a dalmatian is a dog (generalization ability to recognize the central semantic meaning of the dog category) distinguished by its unique black or liver-colored spotted coat (discrimination ability to detect the semantic distinctiveness of the object within the dog category). Figure 1 depicts our prototype-based classification and description hypothesis, and Figure 5 illustrates our workflow for modeling its main concepts.

4.1 Prototype-based Classification using CPM Model

This section introduces our CPM framework to simulate the prototype-based concept of categorization of Prototype Theory (Figure 1, steps 1-3). To evaluate our framework in the image classification task, we propose a CNN layer that converts a common CNN classification model into a prototype-based classifier. The diagram in Figure 5-c illustrates the internal structure of our new Prototypical Similarity Layer (PS-Layer) and shows the process of using the PS-Layer in a common CNN classification model, highlighting in purple the mathematical model of a PS-Layer neuron. Notice how the $i$-th neuron body keeps, as prior knowledge, the semantic prototype ($P_i$) of the $i$-th category. The PS-Layer has as many neurons as prototypes and categories (see Figure 5-c) and uses as its neuron output activation our prototypical distance, to measure the object's semantic distinctiveness.

Figure 5: Prototype-based Models Workflow. Our methodology comprises two main stages: 1) feature extraction and categorization based on prototypes, and 2) transformation of CNN object features into our Global Semantic Signature based on Prototypes. a) input image; b)-c) feature extraction and classification using a pre-trained CNN classification model; our Prototypical Similarity Layer (PS-Layer) is used to convert a common CNN model into a prototype-based CNN classification model; d) prototype dataset; e) category prototype selection; f) global semantic description of an object using the category prototype; g) graphic representation of our Global Semantic Descriptor signature resulting from the dimensionality reduction function ($\varphi$); and h) Global Semantic Signature.

Similar to the MPM model Zaki et al. (2003); Minda and Smith (2002), the PS-Layer computes the probability with which an object $o$ is classified into the $i$-th category using the equation $P(c_i \mid o) = \eta_i^{\gamma} / \sum_{k=1}^{N} \eta_k^{\gamma}$, where $\gamma$ is the response-scaling parameter and $\eta_i$ is the similarity between object $o$ and the $i$-th prototype $P_i$. Without loss of generality, and using the same MPM model assumptions Minda and Smith (2002); Zaki et al. (2003), we set $\gamma = 1$ and $\eta_i$ equal to $e^{-\delta_i(o, P_i)}$. The classification probability of our PS-Layer can then be rewritten as:

$$P(c_i \mid o) = \frac{e^{-\delta_i(o, P_i)}}{\sum_{k=1}^{N} e^{-\delta_k(o, P_k)}} \qquad (4)$$

where $\delta_i(o, P_i)$ is our prototypical distance (Eq. 2). Note that our PS-Layer is a softmax function over the negative prototypical distances, interpreted as a probability distribution. To simplify the computation of the PS-Layer neuron gradient, we added several constraints: i) neuron weights must be non-negative; w.l.o.g., this allows us to eliminate the absolute-value sign in the weights term of our prototypical distance expression; ii) L2-regularization is used to guarantee small weight values Zaki et al. (2003). Consequently, since $b_i$ is a constant, our PS-Layer neuron gradient:

$$\frac{\partial\, \delta_i(o, P_i)}{\partial\, \omega_{ij}} = \lvert f_{oj} - \mu_{ij} \rvert \qquad (5)$$

is as simple as the gradient of a common (linear) CNN neuron. The model that uses our PS-Layer can therefore be trained under the same training conditions as a baseline CNN model. Our prototype-based classification approach differs from other works in the literature Seo and Obermayer (2003); Wohlhart et al. (2013); Jetley et al. (2015); Snell et al. (2017) as follows: i) the prototype representation is based on the structural bases of typicality effects Rosch and Mervis (1975); Rosch (1975); Rosch et al. (1976); ii) the similarity measure is based on psychological measures that assume categories are heterogeneous semantic structures; iii) regarding simplicity and scalability, our approach is less complex than other works in the literature: it is easy to convert a common CNN model into a prototype-based approach without making substantial changes to the original CNN architecture; and iv) regarding interpretability, since we try to capture the image membership degree, our PS-Layer provides greater interpretive power to CNN classification models due to the simplicity and clear geometric interpretation of the object typicality concept.
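A PyTorch sketch of a PS-Layer consistent with Eqs. (2), (4), and (5); the tensor shapes, the clamping used to keep the weights non-negative, and the use of log-probabilities are our implementation choices.

```python
import torch
import torch.nn as nn

class PSLayer(nn.Module):
    """Prototypical Similarity Layer: one neuron per category prototype."""
    def __init__(self, prototypes_mu: torch.Tensor):
        # prototypes_mu: (K, m) abstract prototypes, fixed prior knowledge.
        super().__init__()
        self.register_buffer("mu", prototypes_mu)
        self.omega = nn.Parameter(torch.ones_like(prototypes_mu))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, m). Non-negative weights remove |.| on omega.
        omega = self.omega.clamp(min=0.0)
        # Eq. (2): delta_i = sum_j omega_ij * |f_j - mu_ij| -> (batch, K)
        delta = (omega.unsqueeze(0) *
                 (feats.unsqueeze(1) - self.mu.unsqueeze(0)).abs()).sum(-1)
        # Eq. (4): softmax over negative prototypical distances.
        return torch.log_softmax(-delta, dim=1)  # log-probs for NLLLoss
```

Replacing the final softmax layer of a baseline CNN with this module, and training with a negative log-likelihood loss plus L2 regularization on the weights, mirrors the training setup described above.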

4.2 Global Semantic Descriptor based on Prototypes

4.2.1 Semantic Meaning Vector

Many works in cognitive neuroscience have studied the effect of semantic meaning on the object recognition task Tulving (2007); Martin (2007); Collins and Curby (2013). They observed that when an object has previously been associated with semantic meaning in the brain, people are more prone to identify the object correctly. They have also shown that semantic associations allow much faster recognition of an object, even when the object recognition task becomes increasingly hard (varying points of view, occlusion). Moreover, the impressive performance of CNNs in object classification tasks has fostered studies of possible links between CNN models and the visual system in the human brain. Cichy et al. Cichy et al. (2017), for instance, suggested that deep neural networks perform spatial arrangement representations like those performed by a human being. Khaligh et al. Khaligh-Razavi and Kriegeskorte (2014), in turn, concluded that the weighted combination of features in the last fully connected layer of CNNs can thoroughly explain the inferior temporal cortex in the human brain. We lay hold of these theoretical foundations to model our representation of the object's semantic meaning.

We redefine as the semantic value of an object $o$ in the context of the $i$-th category the image score $sv_i(o) = F_o \cdot \Omega_i + \beta_i$, where $\omega_{ij}$ is the relevance of the $j$-th object feature and $\beta_i$ is the learned bias of the $i$-th category. Note that the semantic value of an object is the value commonly used for object categorization in the softmax layer of CNN classification models. Hence, our prototype-based descriptor approach assumes as the object semantic meaning vector the vector $\vec{sm}_i(o) = \Omega_i \circ F_o + \vec{\beta}_i$, constructed using the Hadamard product involved in computing the object's semantic value. Our semantic meaning representation applies a bias vector $\vec{\beta}_i = [\beta_i/m, \ldots, \beta_i/m]$ to dissolve the bias value uniformly over the semantic vector components. Thus, we use the sum of the semantic meaning vector components to recover the semantic value, i.e., $sv_i(o) = \sum_{j=1}^{m} \vec{sm}_{ij}(o)$.

4.2.2 Semantic Distinctiveness Vector

We stand for the semantic distinctiveness of an object $o$ for a specific $i$-th category as the semantic discrepancy between the object features and the features of the most prototypical element of the category (the $i$-th abstract prototype $b_i$). Consequently, our approach assumes as the semantic distinctiveness vector of an object the semantic difference vector $\vec{sd}_i(o) = \Omega_i \circ \lvert F_o - b_i \rvert$, constructed with the element-wise operations used to compute the object prototypical distance. The semantic difference vector is the weighted residual vector composed of the absolute difference between each object feature and each feature of the $i$-th category abstract prototype, i.e., $\vec{sd}_{ij}(o) = \omega_{ij}\,\lvert f_{oj} - \mu_{ij} \rvert$. Therefore, similar to the semantic meaning vector, we use the sum of the semantic difference vector components to retrieve the object prototypical distance, i.e., $\delta_i(o, P_i) = \sum_{j=1}^{m} \vec{sd}_{ij}(o)$.
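Both vectors reduce to a few lines of numpy; the function names are ours, and by construction the component sums recover the semantic value and the prototypical distance.

```python
import numpy as np

def semantic_meaning_vector(f_o, omega, beta):
    """sm_i(o) = Omega_i o F_o + beta/m per component."""
    m = len(f_o)
    sm = omega * f_o + beta / m   # bias dissolved uniformly over components
    return sm                     # sm.sum() == semantic value sv_i(o)

def semantic_difference_vector(f_o, mu, omega):
    """sd_i(o) = Omega_i o |F_o - b_i|, the weighted residual vector."""
    sd = omega * np.abs(f_o - mu)
    return sd                     # sd.sum() == prototypical distance
```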

1: Input: Image of an object $o$
2: Output: GSDP signature $\Gamma_i(o)$
3: Prior-Data: Trained CNN-model, semantic prototype dataset
4: $F_o \leftarrow$ CNN features of the object image
5: $c_i, P_i = (M_i, \Omega_i) \leftarrow$ prototype-based categorization of $F_o$
6: $\vec{sm}_i(o) \leftarrow \Omega_i \circ F_o + \vec{\beta}_i$
7: $\vec{sd}_i(o) \leftarrow \Omega_i \circ \lvert F_o - b_i \rvert$
8: return $\Gamma_i(o) = [\varphi(\vec{sm}_i(o)),\, \varphi(\vec{sd}_i(o))]$
Algorithm 1 Global Semantic Descriptor

Figure 5 shows an overview of our prototype-based description model. After the feature extraction and categorization processes (Figure 5a-c), we use the corresponding category prototype to describe the object features semantically. The steps that introduce the category prototype into the global semantic description of the object's features are shown in Figure 5d-h. A drawback of this semantic representation (Figure 5-f) is its high dimensionality, since it is based on the semantic meaning vector ($\vec{sm}_i$) and the semantic difference vector ($\vec{sd}_i$). The large dimensionality of our feature vectors could make their use unfeasible in common computer vision tasks Han et al. (2017); Kim et al. (2017). Figure 5 and Algorithm 1 detail the main steps of our approach; note that the steps follow the same workflow as the human description hypothesis depicted in Figure 1.

4.2.3 Dimensionality Reduction

Discarding features, from the Prototype Theory perspective, is not suitable when applied to the semantic space, due to the absence of necessary and sufficient definitions to categorize an object (Table 1: intensional non-discreteness). Discarding features might discard discriminatory ability among category elements Geeraerts (2010), since some objects within a category do not have some of the category's typical features. For example, flying is a typical feature of the bird category, but a penguin is a bird that does not fly.

1: Input: $m$-dim vector $\vec{v}$, type, size-option
2: Output: semantic signature $\vec{s}$
3: // Bias m-dimensional vector initialization
4: $\vec{\beta} \leftarrow [\beta/m, \ldots, \beta/m]$
5: // Auxiliary matrix initialization
6: $k \leftarrow$ kernel dimension given by size-option
7: $A \leftarrow k \times k$ square auxiliary matrix
8: // Computing angles matrix $\Theta$ from auxiliary matrix $A$
9: $\Theta \leftarrow$ angle of each cell of $A$ with respect to the center of $A$
10: // Computing semantic-vector $\vec{v}$ using the Hadamard product $\circ$
11: // Finding the optimal configuration (rows, cols), multiples of $k$
12: rows, cols $\leftarrow$ best 2D configuration of $\vec{v}$
13: // Reshape semantic-feature to (rows $\times$ cols) dimension
14: $V \leftarrow$ reshape($\vec{v}$, rows, cols)
15:
16: // Sliding the angles-matrix (kernel) across features-vector
17: for each $k \times k$ patch $G$ in $V$ do // Computing the semantic-gradient matrix
18:     build the semantic gradient from $\Theta$ and the magnitude and sign of the patch values
19:     // Computing 8D-histogram of gradients
20:     for each cell in $G$ do
21:         // Gradients are quantified with each angular bin
22:         accumulate the cell value into its angular bin of an 8D histogram
23:     // Adding 8d-histogram to final semantic signature
24:     append the 8D histogram to $\vec{s}$
25: return $\vec{s}$
Algorithm 2 Dimensionality Reduction

Several dimensionality reduction algorithms, such as PCA Abdi and Williams (2010) and NMF Lee and Seung (2001), are based on discarding features that do not generate meaningful variation. Although these approaches work for some tasks, we can lose the ability to interpret the data after applying them Abdi and Williams (2010). In order to encode our representation with low dimensionality while encapsulating some interpretable properties of the object's image, we propose a simple transformation function $\varphi$ to compress our global semantic representation of the object's features (Figure 5-f) into a low-dimensional signature (Figure 5-h).

We propose the transformation $\varphi$ to reduce the dimensionality of our image semantic representation (Figure 5-f) while retaining, in the final descriptor, properties such as the object's semantic meaning and the object's semantic difference (typicality vector). Our final descriptor $\Gamma_i(o)$ is computed by concatenating the signatures of the semantic meaning vector $\vec{sm}_i(o)$ and the semantic difference vector $\vec{sd}_i(o)$, each compressed with the transformation $\varphi$. Algorithm 2 describes the steps of our dimensionality transformation function. The workflow can be summarized in nine main steps: 1) transform the learned bias value into the $m$-dimensional vector $\vec{\beta} = [\beta/m, \ldots, \beta/m]$ such that its components sum to $\beta$; 2) compute the auxiliary matrix based on the desired descriptor signature size (size-option parameter); 3) compute the angles matrix using the angles formed by the position of each auxiliary matrix cell with respect to the auxiliary matrix center; to achieve uniqueness, the diagonal angles are evenly distributed among their neighboring angle magnitudes (max and min angles); 4) compute our high-dimensional semantic representation based on prototypes (Figure 5-f); 5) resize the semantic representation into the best 2D matrix configuration whose dimensions are multiples of the auxiliary matrix dimension; 6-7) slide the angles matrix (as a kernel) across the features matrix and create a unitary semantic gradient for each angles-matrix position mapped within the features matrix, where each semantic gradient is constructed using the angles matrix and the magnitude and sign of the features-matrix values; 8) reduce each semantic gradient to an 8D histogram, similarly to SIFT Lowe (2004); and 9) concatenate, for each semantic gradient, the corresponding 8D histogram resulting from steps 6-8.
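A simplified sketch of the transformation $\varphi$, under several assumptions of ours: a fixed 4x4 kernel, angular bins of 45 degrees, an input length that is already a multiple of the kernel area, and no perturbation of the diagonal angles (the real algorithm also selects an optimal 2D layout).

```python
import numpy as np

def angles_matrix(k: int = 4) -> np.ndarray:
    """Angle of each kernel cell w.r.t. the kernel center, in [0, 360)."""
    c = (k - 1) / 2.0
    ys, xs = np.mgrid[0:k, 0:k]
    return np.degrees(np.arctan2(ys - c, xs - c)) % 360.0

def phi(v: np.ndarray, k: int = 4) -> np.ndarray:
    """Compress v into concatenated 8D histograms of semantic gradients."""
    bins = (angles_matrix(k) // 45).astype(int)   # 8 angular bins
    assert v.size % (k * k) == 0, "padding/reshape step omitted in sketch"
    signature = []
    for patch in v.reshape(-1, k * k):
        hist = np.zeros(8)
        # each cell contributes its signed magnitude to its angular bin
        np.add.at(hist, bins.ravel(), patch)
        signature.append(hist)
    return np.concatenate(signature)              # length 8 * v.size / k**2
```

Because every component of v contributes exactly once to one angular bin, the signature sum equals the input sum; this is what lets the two halves of the GSDP signature still recover the semantic value and the prototypical distance, as noted next.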

Note that, like our global semantic representation of object features (Figure 5-f), the GSDP signature holds important properties: the first half of our semantic signature preserves the object's semantic meaning, i.e., its components sum to $sv_i(o)$; the second half retains the object's semantic difference (object typicality), i.e., its components sum to $\delta_i(o, P_i)$. Additionally, our descriptor can construct semantic representations for: i) an object, and ii) an abstract prototype (the ideal category member).

5 Experiments and Results

To verify that our prototype framework contributes to successful semantic extraction, captures the semantic core of the category, simulates the visual representativeness degree of each image, and produces semantic information that is usable in image processing tasks, we performed the following experiments. First, we qualitatively analyze the semantics captured by our CPM model. Second, we evaluate the performance and usability of our CPM model in the image classification task. Third, to validate the assumption that the core semantic information of a category must be invariant to the training dataset, we carry out cross-dataset transfer-learning experiments based solely on the information captured by our prototype representation. Finally, since our GSDP descriptor is based on the object typicality concept, we evaluate the performance of our semantic description approach in image clustering and classification tasks.

Datasets

We conducted our experiments on five image datasets. The offline prototype computation process and the CPM model representation were conducted using the MNIST Lecun et al. (1998), CIFAR10, CIFAR100 Krizhevsky and Hinton (2010), and ImageNet Russakovsky et al. (2015) datasets. PS-Layer classification performance was evaluated using MNIST Lecun et al. (1998), CIFAR10, and CIFAR100 Krizhevsky and Hinton (2010). We evaluated the prototype-based transfer learning performance and our GSDP descriptor performance using ImageNet Russakovsky et al. (2015) and COCO Lin et al. (2014) as real-image datasets.

Backbone Networks

We evaluated our CPM representation using CNN architectures based on LeNet Lecun et al. (1998) and Deep Belief Networks Krizhevsky and Hinton (2010) for the MNIST and CIFAR datasets, respectively. We assessed the PS-Layer performance with the following networks: sMNIST Lecun et al. (1998), sCF10 Krizhevsky and Hinton (2010), sCF100 Krizhevsky and Hinton (2010), vggCF10 Liu and Deng (2015), and vggC100 Liu and Deng (2015). We also conducted cross-dataset experiments on ImageNet Russakovsky et al. (2015) and COCO Lin et al. (2014) using VGG16 Simonyan and Zisserman (2014) and ResNet50 He et al. (2016) as the backbone networks for our GSDP representation and for the prototype-based transfer learning experiments.

5.1 Computational Prototype Model

Due to the lack of both a well-defined metric for quantifying whether a framework correctly captures the semantic meaning of a category and of images annotated with object typicality scores, we used four assessment approaches to analyze the semantics underlying our CPM model: semantic prototype encoding, central and peripheral meaning, prototypical organization, and image typicality score.

5.1.1 Semantic prototype encoding

We analyzed the semantics behind our semantic prototype representation by conducting hierarchical clustering of the categories' semantic prototypes, which illustrates the hierarchical semantic organization of a specific image dataset. Figure 6 shows an example of a dendrogram obtained when using the semantic prototypes computed on CIFAR10. Notice that our semantic category representations distribute the CIFAR10 dataset into a hierarchical semantic organization. For example, two macro-categories are visible in Figure 6: animals and transport vehicles. It is noteworthy that this last macro-category is also semantically interpreted by our representation as non-ground vehicles and ground vehicles.

Figure 6: Hierarchical clustering of CIFAR10 semantic prototypes.
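The dendrogram of Figure 6 can be reproduced with scipy from any semantic prototype dataset; the sketch below reuses the prototypes dictionary from the construction sketch in Section 3.3 and clusters the abstract prototypes under the L1 metric, our choice for consistency with the CPM distances.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# protos: (K, m) matrix stacking the abstract prototypes b_i of K categories
protos = np.stack([p.abstract_prototype for p in prototypes.values()])
Z = linkage(protos, method="average", metric="cityblock")  # L1 metric
dendrogram(Z, labels=list(prototypes.keys()))
plt.title("Hierarchical clustering of semantic prototypes")
plt.show()
```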

5.1.2 Central and Peripheral meaning

We observed the visual representativeness of the elements allocated by our CPM model to the center and periphery of the category. We aim to study the visual representativeness, i.e., typicality, of the category members closest to and farthest from the category's semantic center (our abstract prototype). We extract features from an object's image with a CNN model and compute the prototypical distance for all $i$-th category members. Next, the object images are ranked in ascending order based on their prototypical distance score. Figure 7 shows examples of central (top-$n$ closest) and peripheral (top-$n$ farthest) meaning captured by our CPM model in ImageNet categories using the VGG16 model as an image feature extractor. Note that our proposal finds typical elements (top-$n$ closest), images with distinctive features in the category. Members with fewer characteristic features, or less legible ones, are placed in the periphery (top-$n$ farthest), away from the category's central semantic meaning; however, they keep the category features, since they still belong to the category.
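Ranking category members by prototypical distance is straightforward given the distance sketch above; top_n is an illustrative parameter.

```python
import numpy as np

def rank_by_typicality(features, proto, top_n=10):
    """Return indices of the top_n most central and most peripheral images.
    features: (n, m) CNN features of one category's members."""
    deltas = (proto.omega * np.abs(features - proto.mu)).sum(axis=1)
    order = np.argsort(deltas)     # ascending prototypical distance
    return order[:top_n], order[-top_n:]
```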

Figure 7: Central and peripheral meaning captured by our CPM model. From left to right: top-$n$ elements closest (in blue) to the semantic prototype of the corresponding category; and top-$n$ elements furthest (in red) from the category's semantic prototype. The index value represents the image position within the category dataset.
Figure 8: Prototypical organization within categories. The internal structure of the Persian cat category in ImageNet. Each category member is represented by its VGG16 image features. Color degrees represent the category's internal disposition with respect to its prototype. At the bottom and at the top, from left to right, the mapped top-$n$ elements closest (in blue) and furthest (in red) to the mapped semantic prototype (in black). The image dataset index of the first top-$n$ element is annotated inside the black box.

5.1.3 Prototypical Organization

We conducted experiments to analyze the internal semantic structure of a category applying the CPM model constraints. Visualizing the category's internal structure is infeasible in the $m$-dimensional feature space. Since most data visualization methods are based on feature discarding, we used topological techniques to perform continuous deformations of the object features that preserve some of the object's semantic properties. The proposed map allows interpreting an object's image based on all observed features.

Our function maps an object's image features from the features metric space $(\mathcal{F}_i, d_i)$ to the plane (under the L1-norm) using the object's semantic value and its prototypical distance as coordinates. Using our CPM constraints, we can show that this map is continuous, which means that every element of a neighborhood in the features space also belongs to a neighborhood in the plane. Consequently, the observed behavior of the $i$-th category's internal structure, in terms of distance metrics, in the plane is equivalent to its behavior in the features metric space. Figure 8 shows an example of the internal semantic structure captured by our CPM model. Our experiments showed that the semantic value and the prototypical distance place the object in a unique semantic position within the category's internal structure. Notice that the internal structure of the category shows a prototypical organization of its elements in the metric space.
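The internal-structure maps of Figure 8 amount to plotting each member at the coordinates (semantic value, prototypical distance); the sketch below assumes the prototype container defined earlier and treats the category bias as an optional scalar.

```python
import numpy as np
import matplotlib.pyplot as plt

def semantic_map(features, proto, beta=0.0):
    """Map each member of a category to the 2D plane (sv, delta)."""
    sv = features @ proto.omega + beta                               # x-axis
    delta = (proto.omega * np.abs(features - proto.mu)).sum(axis=1)  # y-axis
    plt.scatter(sv, delta, c=delta, cmap="coolwarm", s=8)
    # the abstract prototype maps to (sv(b_i), 0) by definition
    plt.scatter(proto.mu @ proto.omega + beta, 0.0, c="k", marker="*")
    plt.xlabel("semantic value"); plt.ylabel("prototypical distance")
    plt.show()
```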

Figure 9: Typicality score analysis. Images with the same prototypical distance and different semantic values (in red) have similar representativeness within the category, while category members with different prototypical distances and the same semantic value (in blue) are visually different. We also observe that the image's visual representativeness (typicality) decreases as the prototypical distance increases. The VGG16 model was used to extract image features.

5.1.4 Image Typicality Score

Our approach to visualizing the internal structure of a category also allows observing other semantic phenomena related to the visual representativeness degree of an object's image. We conducted qualitative experiments to investigate the influence of variations in semantic value and prototypical distance on the image's visual representativeness. The experiments showed a weak linear association between those two variables (low Pearson coefficient values). Figure 9 shows an example of our experiment.

Lake et al. Lake et al. (2015) show that the semantic value can be used as an indicator of the typicality of an input image. In contrast to Lake et al.'s results, our experiments on the ImageNet dataset with the VGG16 and ResNet50 models showed that using the semantic value as the typicality score of an object's image can be problematic, mainly because equal semantic values do not imply equal image typicality (e.g., the images highlighted in blue in Figure 9). Selecting the semantic value as the image typicality score shares the same serious problems that CNNs still suffer from: adding small noise or making small changes to the initial samples generates different, highly confident predictions for these samples (adversarial samples Szegedy et al. (2014)), thus producing drastically different semantic values for very similar images. Our experiments did not allow us to generalize a behavior pattern between the semantic value and the image typicality score. However, the experiments carried out suggest that our prototypical distance can capture the representativeness degree of an object's image: we observe that, as the prototypical distance increases, the visual typicality of the object's image decreases.
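The weak linear association reported above can be checked with scipy's Pearson coefficient; here sv and delta stand for the per-image semantic values and prototypical distances of one category, computed as in the earlier sketches.

```python
from scipy.stats import pearsonr

r, p_value = pearsonr(sv, delta)   # sv, delta: 1-D arrays over one category
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
# |r| close to 0 supports treating the two scores as complementary axes.
```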

5.2 Prototype-based Classification

We used our PS-Layer to assess the performance of our CPM framework in classification tasks. The experiments were performed using as baseline models the following CNN architectures: sMNIST Lecun et al. (1998), sCF10 Krizhevsky and Hinton (2010), sCF100 Krizhevsky and Hinton (2010), vggCF10 Liu and Deng (2015), and vggC100 Liu and Deng (2015). The CNN baseline models differ in the number of output categories, model architecture, model depth, accuracy, and dataset size, allowing us to evaluate our approach in different environments. For each CNN baseline model, we replaced the softmax layer with our PS-Layer. We trained the resulting PS-Layer models using the same training conditions as their baseline CNN models (i.e., batch size, epochs, without data augmentation, etc.). We evaluated several versions of the PS-Layer models, changing the weight initialization method and the version of our semantic distance function used as the prototypical similarity. For each weight initialization method (from scratch, freezing, and pre-train), we used two versions of our semantic distance function inside the PS-Layer: a) the prototypical distance; and b) a penalized prototypical distance, in which we penalize peripheral elements using our semantic edge constraints (see Equation 3). Consequently, for each baseline CNN model, we evaluated six PS-Layer model versions: fromscratch-a, fromscratch-b, freezing-a, freezing-b, pretrain-a, and pretrain-b. Notice that, unlike other prototype learning approaches Mingbo Ma et al. (2013); Ojeda-Magaña et al. (2013); Wohlhart et al. (2013); Saleh et al. (2013); Zhao and Qin (2015); Jetley et al. (2015); Saleh et al. (2016); Oyedotun and Khashman (2018); Snell et al. (2017); Drumond et al. (2017); Dong and Xing (2018); Fort (2018); Yang et al. (2018); Allen et al. (2019); Angelov and Soares (2020); Xiao et al. (2020); Garnot and Landrieu (2021), our prototype representation is not updated during the training process. Because the lack of annotated data with typicality information prevents end-to-end training, our main goal is to evaluate the performance and robustness of the semantic information encapsulated in our prototype representation.

Model                                 Test Top-1          Test Top-5          Train Top-1   Train Top-5
                                      Mean±Std     Max    Mean±Std     Max    Mean±Std      Mean±Std
sCF10 Krizhevsky and Hinton (2010)    69.53±2.18   72.11  97.43±.42    98.09  74.32±2.74    98.23±.41
fromscratch-a                         73.42±.47    74.05  98.0±.26     98.23  79.57±.76     98.95±.14
fromscratch-b                         73.71±.71    74.96  98.01±.40    98.75  79.49±.81     98.90±.20
freezing-a                            64.54±.13    64.80  96.56±.04    96.64  68.53±.09     97.26±.05
freezing-b                            67.44±.08    67.54  97.29±.03    97.33  71.37±.06     98.05±.04
pretrain-a                            75.84±.62    76.84  98.45±.14    98.64  82.54±1.06    99.23±.11
pretrain-b                            75.87±.42    76.47  98.30±.14    98.55  82.25±.55     99.20±.07
Table 2: Accuracy achieved by sCF10 versions using our PS-Layer on the CIFAR10 dataset (best in bold).
Figure 10: PS-Layer performance summary. Classification accuracy overview of each baseline CNN model versus our PS-Layer versions. Each circle summarizes the metrics performance (Test Top-1, Test Top-5, Train Top-1, Train Top-5) of each case study analyzed. Accuracy values were normalized to the interval [0, 1].

Table 2 shows the performance of each PS-Layer version based on the sCF10 model architecture (baseline). The baseline CNN model (without the PS-Layer) is in the first table row, separated from the PS-Layer models. Mean and std accuracy values were computed over several trained instances of each model version. Figure 10 summarizes the performance of each PS-Layer model version for each CNN baseline model used as a case study. The experimental results show that the PS-Layer pre-train versions (highlighted in magenta and red) outperform the baseline CNN model (black) in all architectures analyzed. It is noteworthy that our PS-Layer can achieve good performance while providing greater interpretive power to CNN models.

5.3 Prototype-based Transfer-Learning

Figure 11: Prototype-based transfer-learning summary. Classification accuracy overview of VGG16-based model versus our VGG16-based PS-Layer model in cross-dataset transfer learning.

Assuming that the central meaning of a learned category is invariant to the appearance of atypical members Rosch and Mervis (1975); Rosch (1975), we also evaluated, on the transfer learning (TFL) task, the robustness of the semantic information encapsulated in our prototype representation. Classic transfer learning involves taking a pre-trained neural network and adapting it to a new, different dataset. In contrast, we performed transfer learning experiments using only the semantic information encapsulated in pre-computed prototypes. Similar to the previous experiments, we used our PS-Layer to compare the classical transfer learning approach against a prototype-based knowledge transfer approach.

Using our category prototyping approach (see Figure 4), we computed category prototypes on the ImageNet dataset (see details in the supplementary material) and used them as prior knowledge to classify images in the COCO dataset. The classical TFL approach was evaluated using a CNN model with VGG16 as the backbone network. The prototype-based TFL approach used the VGG16 architecture plus our PS-Layer, with the network weights initialized randomly. We performed a performance analysis of both approaches, training (under the same conditions) different model instances while changing (increasing) some parameters, such as the number of trainable/frozen layers. These experiments can be understood as a simple ablation study that observes how performance degrades in both approaches as model components are removed or frozen. Figure 11 summarizes the results of each experiment performed.
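As an illustration of this setup, the sketch below shows how the prototype-based TFL variant could be assembled: a randomly initialized VGG16 backbone whose first layers are progressively frozen, with the classifier head replaced by the PS-Layer sketched earlier. The helper name build_tfl_model and the freezing granularity are our assumptions; the prototypes passed to the PS-Layer must match the dimension of VGG16's flattened convolutional features:

```python
import torch.nn as nn
from torchvision import models

def build_tfl_model(ps_layer: nn.Module, n_frozen: int = 10) -> nn.Module:
    """Sketch of the prototype-based transfer-learning model:
    VGG16 backbone (random weights) + fixed-prototype PS-Layer head."""
    backbone = models.vgg16(weights=None)  # random init (torchvision >= 0.13)
    # Freeze the first `n_frozen` feature layers (the ablation knob we vary).
    for layer in list(backbone.features.children())[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
    # Replace the classification head: the PS-Layer consumes the flattened
    # 25088-dim features produced by VGG16's convolutional trunk.
    backbone.classifier = ps_layer
    return backbone
```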

5.4 Prototype-based image Global Descriptor

Figure 12: t-SNE visualization. t-SNE visualizations of the first categories of the ImageNet dataset using features constructed with the a) VGG16 and b) ResNet50 models. Feature lengths are shown in the corresponding captions.

5.4.1 Signature Information Analysis

Our GSDP descriptor uses category prototypes as a semantic distinctiveness generator of signatures for category members. Elements that have similar semantic meanings and share similar semantic differences with the abstract prototype will have similar GSDP semantic signatures (the family resemblance concept). In other words, the abstract prototype can be interpreted as a DNA chain that stands for the typical features of category members. Since the t-SNE algorithm Maaten and Hinton (2008) preserves local structure, we used it to analyze each element's neighborhood in the low-dimensional embedded space. We analyzed the discriminative power and t-SNE visualization performance of our GSDP semantic representation against features extracted with CNN models. We performed the t-SNE visualization experiment for a feature family composed of CNN features, the corresponding GSDP semantic signatures, and PCA-reduced versions of the CNN features (reduced to the same dimensions as the GSDP features). Figure 12 shows an example of the t-SNE algorithm's performance with the VGG16 and ResNet feature families, using the Euclidean distance as the similarity measure and a fixed perplexity value. Note how the GSDP representations achieved the best performance in each feature family.
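For reference, a minimal scikit-learn sketch of this visualization experiment is shown below. Since the perplexity value used in our runs is not reproduced here, the sketch falls back to scikit-learn's default, and the random features are stand-ins for the real VGG16/ResNet50 and GSDP feature families:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def tsne_embed(features: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Project a feature family into 2-D with t-SNE (Euclidean metric)."""
    return TSNE(n_components=2, metric="euclidean", perplexity=perplexity,
                init="pca", random_state=0).fit_transform(features)

# Stand-in feature family: CNN features and their PCA-reduced version
# (the real experiment uses VGG16/ResNet50 features and GSDP signatures).
cnn_feats = np.random.rand(500, 4096).astype(np.float32)
pca_feats = PCA(n_components=256).fit_transform(cnn_feats)
emb_cnn, emb_pca = tsne_embed(cnn_feats), tsne_embed(pca_feats)
```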

5.4.2 Performance Evaluation

We evaluated the performance of our image semantic encoding in image clustering applications. According to the observations of Yang et al. Yang et al. (2016), image feature representations generalize well when transferred to other tasks if they achieve good performance in the image clustering task. Based on these observations, we evaluated the GSDP descriptor's performance in the clustering task by comparing its K-Means clustering metrics on the ImageNet and Coco datasets. On the ImageNet dataset, we compared our representation against: 1) traditional handcrafted global image descriptors: GIST Oliva and Torralba (2001), LBP Ojala et al. (2002), HOG Dalal and Triggs (2005), Color64 Li (2007), Color_Hist Song et al. (2004), and Hu_H_CH Haralick et al. (1973); Hu (1962); Song et al. (2004); and 2) deep learning image features trained on ImageNet: VGG16 features and ResNet50 features (and their PCA-reduced versions).
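The evaluation protocol itself is straightforward; a minimal scikit-learn sketch of the per-descriptor scoring step (the five metrics reported in Table 3) might look as follows, where clustering_scores is our hypothetical helper name:

```python
from sklearn.cluster import KMeans
from sklearn import metrics

def clustering_scores(features, labels, n_clusters):
    """Cluster the image features with K-Means and score the partition
    against the ground-truth category labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(features)
    return {
        "H": metrics.homogeneity_score(labels, pred),
        "C": metrics.completeness_score(labels, pred),
        "V": metrics.v_measure_score(labels, pred),
        "ARI": metrics.adjusted_rand_score(labels, pred),
        "AMI": metrics.adjusted_mutual_info_score(labels, pred),
    }
```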

Descriptor | Size | FPS | H | C | V | ARI | AMI

Handcrafted Features Performance on ImageNet Russakovsky et al. (2015)
GIST Oliva and Torralba (2001) | 960 | 0.82 | 0.05 | 0.05 | 0.05 | 0.01 | 0.05
LBP Ojala et al. (2002) | 512 | 0.72 | 0.02 | 0.03 | 0.03 | 0.01 | 0.02
HOG Dalal and Triggs (2005) | 1960 | 33 | 0.04 | 0.04 | 0.04 | 0.01 | 0.03
Color64 Li (2007) | 64 | 8 | 0.12 | 0.12 | 0.12 | 0.04 | 0.11
Color_Hist Song et al. (2004) | 512 | 26 | 0.08 | 0.08 | 0.08 | 0.03 | 0.07
Hu_H_CH Haralick et al. (1973); Hu (1962); Song et al. (2004) | 532 | 6.9 | 0.04 | 0.04 | 0.04 | 0.01 | 0.02

Deep Features Performance on ImageNet Russakovsky et al. (2015)
VGG16 Simonyan and Zisserman (2014) | 4096 | 15 | 0.87 | 0.88 | 0.88 | 0.78 | 0.87
VGG_PCA_256 | 256 | 12.5 | 0.89 | 0.90 | 0.89 | 0.82 | 0.89
VGG_PCA_1024 | 1024 | 12.5 | 0.89 | 0.89 | 0.89 | 0.81 | 0.89
GSDP_VGG_256 (our) | 256 | 12.8 | 0.97 | 0.99 | 0.98 | 0.93 | 0.97
GSDP_VGG_1024 (our) | 1024 | 11.6 | 0.94 | 0.98 | 0.96 | 0.84 | 0.94
ResNet50 He et al. (2016) | 2048 | 10.6 | 0.88 | 0.90 | 0.89 | 0.78 | 0.88
ResNet50_PCA_128 | 128 | 12.5 | 0.88 | 0.88 | 0.88 | 0.81 | 0.88
ResNet50_PCA_512 | 512 | 12.5 | 0.89 | 0.90 | 0.90 | 0.82 | 0.89
GSDP_RNet_128 (our) | 128 | 9.6 | 0.97 | 0.98 | 0.98 | 0.93 | 0.97
GSDP_RNet_512 (our) | 512 | 9 | 0.91 | 0.97 | 0.94 | 0.73 | 0.91

Deep Features Performance on Coco Lin et al. (2014) (cross-dataset)
VGG16 Simonyan and Zisserman (2014) | 4096 | 15 | 0.32 | 0.34 | 0.33 | 0.15 | 0.31
VGG_PCA_256 | 256 | 12.5 | 0.35 | 0.37 | 0.36 | 0.19 | 0.34
VGG_PCA_1024 | 1024 | 12.5 | 0.35 | 0.37 | 0.36 | 0.18 | 0.34
GSDP_VGG_256 (our) | 256 | 12.8 | 0.47 | 0.72 | 0.57 | 0.23 | 0.56
GSDP_VGG_1024 (our) | 1024 | 11.6 | 0.46 | 0.54 | 0.49 | 0.17 | 0.49
ResNet50 He et al. (2016) | 2048 | 10.6 | 0.29 | 0.36 | 0.32 | 0.17 | 0.31
ResNet50_PCA_128 | 128 | 12.5 | 0.32 | 0.34 | 0.33 | 0.17 | 0.31
ResNet50_PCA_512 | 512 | 12.5 | 0.34 | 0.35 | 0.34 | 0.20 | 0.33
GSDP_RNet_128 (our) | 128 | 9.6 | 0.43 | 0.69 | 0.53 | 0.26 | 0.52
GSDP_RNet_512 (our) | 512 | 9 | 0.34 | 0.47 | 0.40 | 0.09 | 0.39
Table 3: K-Means clustering metrics achieved by each evaluated global image representation in the first categories of the ImageNet and Coco datasets (best in bold).
Figure 13: K-Means metrics on ImageNet. History of the K-Means metrics reached by ResNet50 features versus our GSDP representation in the first categories of the ImageNet dataset.

Table 3 shows the results achieved by each global image descriptor at the final iteration of the experiments, using the K-Means clustering metrics: Homogeneity (H), Completeness (C), V-measure (V), Adjusted Rand Index (ARI), and Adjusted Mutual Information (AMI). Figure 13 depicts an example of the K-Means metrics history achieved by ResNet50 features against our GSDP signatures in the first categories of the ImageNet dataset. The experiments show that, as the diversity of object images increases, our GSDP semantic encoding significantly outperforms the other global image encodings in terms of cluster metrics on the ImageNet dataset. Furthermore, we conducted the same experiment on Coco (cross-dataset) to evaluate each image representation's performance and generalization ability on unseen data. Even though all evaluated image representations performed poorly on the Coco dataset, our GSDP representations performed best. The experiments also show that the lowest-dimensional GSDP representations (for each CNN model) achieved the best size-performance trade-off.

6 Conclusion

In this paper, we introduced a Computational Prototype Model (CPM) based on the foundations of Prototype Theory. Our approach provides another point of view for the semantic representation of the internal structure of object categories. Our proposal draws on experimental psychology results to model semantic properties of an object's image (e.g., typicality) that had not yet been analyzed by current prototype learning approaches.

We presented a straightforward Prototypical Similarity Layer (PS-Layer) that uses the constraints of the CPM model to learn object categories; it allowed us to evaluate the CPM model in classification and transfer learning tasks. The experiments used pre-computed semantic prototypes as prior knowledge, which were not updated during the training process. Cross-dataset experiments showed that, even under these unfavorable training conditions, the semantic information captured by the CPM model is robust and achieves reasonable performance.

Furthermore, using the CPM model components (semantic prototype and semantic distance), we proposed a prototype-based description model (GSDP) that introduces a new approach to the semantic description of object images. The GSDP descriptor builds discriminative signatures that semantically describe an object's image by highlighting its most distinctive features within the category. Experiments on large image datasets showed that the GSDP descriptor is discriminative, low-dimensional, and encodes/preserves the semantic information of category members captured by the CPM model.

In summary, our experiments showed that the CPM model can encapsulate prominent semantic features of an object category in our semantic prototype representation, features that allow simulating the central and peripheral meaning of the category. Moreover, we showed that the semantic distance metric proposed in this article can simulate semantic relationships, in terms of visual typicality, between category members. Our prototypical distance can be understood as a typicality score of an object's image, through which our CPM model captures the object's degree of visual representativeness. The experiments also showed that it is possible to build robust semantic entities using little data (only typical images). All source code, prototype datasets, the GSDP tutorial, and PS-Layer experiment examples will be publicly available on the project page: https://www.verlab.dcc.ufmg.br/global-semantic-description/.

Limitations and Future works

The lack of datasets with annotations of image typicality prevented a more robust evaluation of our approach. This limitation also required deep learning feature engineering (or post-processing) to evaluate our CPM model, since end-to-end training was not feasible. Consequently, as future work, we intend to construct a new image dataset with typicality annotations based on human interpretation criteria. With this initial work, we hope to encourage the pattern recognition community to delve into how to capture an image's typicality: a semantic property known to influence the learning process but which, to date, remains a skill exclusive to human beings.

Acknowledgment

This research was supported by funding from the Brazilian agencies CAPES, CNPq, and FAPEMIG.

References

  • H. Abdi and L. J. Williams (2010) Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2 (4), pp. 433–459. Cited by: §4.2.3.
  • K. Allen, E. Shelhamer, H. Shin, and J. Tenenbaum (2019) Infinite mixture prototypes for few-shot learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 232–241. Cited by: §1, §1, §2, §3.1, §5.2.
  • P. Angelov and E. Soares (2020) Towards explainable deep neural networks (xdnn). Neural Networks 130, pp. 185–194. Cited by: §1, §1, §2, §3.1, §5.2.
  • R. C. Atkinson and R. M. Shiffrin (1968) Human memory: a proposed system and its control processes. Psychology of learning and motivation 2, pp. 89–195. Cited by: §1.
  • H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool (2008) Speeded-up robust features (surf). Computer Vision and Image Understanding (CVIU) 110 (3), pp. 346–359. Cited by: §1.
  • R. M. Cichy, A. Khosla, D. Pantazis, and A. Oliva (2017) Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage 153, pp. 346–358. Cited by: §4.2.1.
  • J. A. Collins and K. M. Curby (2013) Conceptual knowledge attenuates viewpoint dependency in visual object recognition. Visual Cognition 21 (8), pp. 945–960. Cited by: §4.2.1.
  • N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 886–893. Cited by: §5.4.2, Table 3.
  • N. Dong and E. P. Xing (2018) Few-shot semantic segmentation with prototype learning.. In British Machine Vision Conference (BMVC), Vol. 3. Cited by: §1, §1, §2, §3.1, §5.2.
  • T. Drumond, T. Viéville, and F. Alexandre (2017) Using prototypes to improve convolutional networks interpretability. In Annual Conference on Neural Information Processing Systems(NIPS): Transparent and interpretable machine learning in safety critical environments Workshop, Cited by: §1, §1, §2, §3.1, §5.2.
  • S. Fort (2018) Gaussian prototypical networks for few-shot learning on omniglot. External Links: Link Cited by: §1, §1, §2, §3.1, §5.2.
  • V. S. F. Garnot and L. Landrieu (2021) Leveraging class hierarchies with metric-guided prototype learning. External Links: Link Cited by: §1, §1, §2, §3.1, §5.2.
  • D. Geeraerts (2010) Theories of lexical semantics. Oxford University Press. Cited by: §1, §1, Figure 2, §2, §2, §2, §2, Table 1, §3, §4.2.3.
  • Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew (2016) Deep learning for visual understanding: a review. Neurocomputing 187, pp. 27–48. Cited by: §1.
  • K. Han, R. S. Rezende, B. Ham, K. K. Wong, M. Cho, C. Schmid, and J. Ponce (2017) SCNet: learning semantic correspondence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1831–1840. Cited by: §4.2.2.
  • R. M. Haralick, K. Shanmugam, et al. (1973) Textural features for image classification. IEEE Transactions on systems, man, and cybernetics 6 (6), pp. 610–621. Cited by: §5.4.2, Table 3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §5, Table 3.
  • D. Homa and R. Vosburgh (1976) Category breadth and the abstraction of prototypical information.. Journal of Experimental Psychology: Human Learning and Memory 2 (3), pp. 322. Cited by: §3.2.
  • M. Hu (1962) Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8 (2), pp. 179–187. Cited by: §5.4.2, Table 3.
  • S. Jetley, B. Romera-Paredes, S. Jayasumana, and P. Torr (2015) Prototypical priors: from improving classification to zero-shot learning. In British Machine Vision Conference (BMVC), Cited by: §1, §1, §2, §3.1, §4.1, §5.2.
  • S. Khaligh-Razavi and N. Kriegeskorte (2014) Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS Computational Biology 10 (11), pp. e1003915. Cited by: §4.2.1.
  • S. Kim, D. Min, B. Ham, S. Lin, and K. Sohn (2017) FCSS: fully convolutional self-similarity for dense semantic correspondence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6560–6569. Cited by: §4.2.2.
  • T. Kohonen (1997) Learning vector quantization. In Self-organizing maps, pp. 203–217. Cited by: §2.
  • A. Krizhevsky and G. Hinton (2010) Convolutional deep belief networks on cifar-10. Unpublished manuscript 40. Cited by: §3.3, §5, §5, §5.2, Table 2.
  • B. Lake, W. Zaremba, R. Fergus, and T. Gureckis (2015) Deep neural networks predict category typicality ratings for images. In 37th Annual Conference of the Cognitive Science Society, Cited by: §5.1.4.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document, ISSN 0018-9219 Cited by: §3.3, §5, §5, §5.2.
  • D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pp. 556–562. Cited by: §4.2.3.
  • L. Li, S. Zhu, H. Fu, P. Tan, and C. Tai (2020) End-to-end learning local multi-view descriptors for 3d point clouds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • M. Li (2007) Texture moment for content-based image retrieval. In IEEE International Conference on Multimedia and Expo, pp. 508–511. Cited by: §5.4.2, Table 3.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §1, §3.3, §5, §5, Table 3.
  • C. Liu and M. Nakagawa (2001) Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition 34 (3), pp. 601–615. Cited by: §2.
  • S. Liu and W. Deng (2015) Very deep convolutional neural network based image classification using small training sample size. In Asian Conference on Pattern Recognition (ACPR), pp. 730–734. Cited by: §5, §5.2.
  • J. Long, E. Shelhamer, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39 (4), pp. 640. Cited by: §1.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60 (2), pp. 91–110. Cited by: §1, §4.2.3.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.4.1.
  • A. Martin (2007) The representation of object concepts in the brain. Annual Review of Psychology 58, pp. 25–45. Cited by: §4.2.1.
  • D. L. Medin and M. M. Schaffer (1978) Context theory of classification learning.. Psychological review 85 (3), pp. 207. Cited by: §3.2.
  • J. P. Minda and J. D. Smith (2002) Comparing prototype-based and exemplar-based accounts of category learning and attentional allocation.. Journal of Experimental Psychology: Learning, Memory, and Cognition 28 (2), pp. 275. Cited by: §3.2, §3.2, §4.1.
  • M. Ma, M. Shao, X. Zhao, and Y. Fu (2013) Prototype based feature learning for face image set classification. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. Cited by: §1, §1, §2, §2, §3.1, §5.2.
  • C. G. Netto, L. H. Andrade, and H. E. Toma (2015) Association of pseudomonas putida formaldehyde dehydrogenase with superparamagnetic nanoparticles: an effective way of improving the enzyme stability, performance and recycling. New Journal of Chemistry 39 (3), pp. 2162–2167. Cited by: §1.
  • T. Ojala, M. Pietikainen, and T. Maenpaa (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24 (7), pp. 971–987. Cited by: §5.4.2, Table 3.
  • B. Ojeda-Magaña, R. Ruelas, M. A. C. Nakamura, D. W. C. Finch, and L. Gómez-Barba (2013) Pattern recognition in numerical data sets and color images through the typicality based on the gkpfcm clustering algorithm. Mathematical Problems in Engineering 2013 (11), pp. 160–171. Cited by: §1, §1, §2, §3.1, §5.2.
  • A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV) 42 (3), pp. 145–175. Cited by: §5.4.2, Table 3.
  • O. K. Oyedotun and A. Khashman (2018) Prototype-incorporated emotional neural network. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3560–3572. Cited by: §1, §1, §2, §3.1, §5.2.
  • I. Rocco, R. Arandjelović, and J. Sivic (2018) End-to-end weakly-supervised semantic alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6917–6925. Cited by: §1.
  • E. H. Rosch (1973) On the internal structure of perceptual and semantic categories. In Cognitive Development and the Acquisition of Language, pp. 111–144. Cited by: §1, §1, §2, §2, §2, §2, §3.1, §4.
  • E. Rosch and C. B. Mervis (1975) Family resemblances: studies in the internal structure of categories. Cognitive psychology 7 (4), pp. 573–605. Cited by: §1, §1, §1, §2, §2, §2, §2, §2, §3.1, §3, §4.1, §4, §5.3.
  • E. Rosch, C. Simpson, and R. S. Miller (1976) Structural bases of typicality effects.. Journal of Experimental Psychology: Human perception and performance 2 (4), pp. 491. Cited by: §1, §2, §2, §2, §3.1, §3, §3, §4.1, §4.
  • E. Rosch (1975) Cognitive representations of semantic categories.. Journal of Experimental Psychology: General 104 (3), pp. 192. Cited by: §1, §1, §2, §2, §2, §2, §2, §2, §3.1, §4.1, §4, §5.3.
  • E. Rosch (1978) Principles of categorization. In Cognition and Categorization, E. Rosch and B. B. Lloyd (Eds.), pp. 27–48. Cited by: §1, §1, §1, §2, §2, §2, §3.1, §3, §4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §1, §3.3, §5, §5, Table 3.
  • B. Saleh, A. M. Elgammal, and J. Feldman (2016) Incorporating prototype theory in convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 3446–3453. Cited by: §1, §1, §1, §3.1, §5.2.
  • B. Saleh, A. Farhadi, and A. Elgammal (2013) Object-centric anomaly detection by attribute-based reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 787–794. Cited by: §1, §1, §1, §3.1, §5.2.
  • S. Seo and K. Obermayer (2003) Soft learning vector quantization. Neural computation 15 (7), pp. 1589–1604. Cited by: §2, §4.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §5, Table 3.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pp. 4080–4090. Cited by: §1, §1, §2, §3.1, §4.1, §5.2.
  • Y. Song, W. Park, D. Kim, and J. Ahn (2004) Content-based image retrieval using new color histogram. In International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 609–611. Cited by: §5.4.2, Table 3.
  • B. Stellato, B. P. Van Parys, and P. J. Goulart (2017) Multivariate Chebyshev inequality with estimated mean and variance. The American Statistician 71 (2), pp. 123–127. Cited by: §3.2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR). Cited by: §5.1.4.
  • E. Tulving (2007) Coding and representation: searching for a home in the brain. Science of Memory: Concepts, pp. 65–68. Cited by: §1, §1, §4.2.1.
  • P. Wohlhart, M. Köstinger, M. Donoser, P. M. Roth, and H. Bischof (2013) Optimizing 1-nearest prototype classifiers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 460–467. Cited by: §1, §1, §2, §3.1, §4.1, §5.2.
  • M. Xiao, A. Kortylewski, R. Wu, S. Qiao, W. Shen, and A. Yuille (2020) Tdapnet: prototype network with recurrent top-down attention for robust object classification under partial occlusion. In ECCV 2020 Workshop on Visual Inductive Priors for Data-Efficient Deep Learning, Cited by: §1, §1, §2, §3.1, §5.2.
  • H. Yang, X. Zhang, F. Yin, and C. Liu (2018) Robust classification with convolutional prototype learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3474–3482. Cited by: §1, §1, §2, §2, §3.1, §5.2.
  • J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5147–5156. Cited by: §5.4.2.
  • E. Yee, M. N. Jones, and K. McRae (2018) Semantic memory. Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience 3, pp. 1–38. Cited by: §1, §1.
  • S. R. Zaki, R. M. Nosofsky, R. D. Stanton, and A. L. Cohen (2003) Prototype and exemplar accounts of category learning and attentional allocation: a reassessment. Journal of Experimental Psychology: Learning, Memory and Cognition 29 (6), pp. 1160–1173. Cited by: §3.2, §4.1.
  • H. Zhao and Z. Qin (2015) Clustering data and vague concepts using prototype theory interpreted label semantics. In Integrated Uncertainty in Knowledge Modelling and Decision Making, pp. 236–246. Cited by: §1, §1, §3.1, §5.2.