The extraction of relevant image features has been the subject of Computer Vision research for decades. The advent of Convolutional Neural Networks (CNNs) made it possible to build visual recognition models whose behavior resembles that of semantic memory in classification tasks [18, 47, 51], and sparked a trend toward semantic processing of images using deep-learning techniques. Until then, hand-crafted features and machine learning [36, 46, 50] were the methods of choice for image feature description tasks. The impressive success of CNN models spawned numerous CNN descriptors, produced by different approaches that learn effective representations for describing image features [16, 21, 45, 60, 62]. Consequently, representations of image features extracted using deep classification models [18, 47, 51] or CNN descriptors are commonly referred to as semantic features or semantic signatures.
The term semantic feature has been extensively studied in the field of linguistic semantics, where it is defined as the representation of the basic conceptual components of the meaning of any lexical item. In her seminal work, Rosch analyzed the semantic structure of the meaning of words and introduced the concept of prototype semantics (or Prototype Theory). According to Rosch [41, 43], the representation of the semantic meaning of a category is related to the category prototype, particularly for categories that name natural objects.
Some CNN description models [16, 25, 45, 60, 62] (and semantic description models [4, 15, 21, 40]) represent the semantic information of image features using a range of different approaches. Nevertheless, none of these models builds its representations by encoding visual semantic information on the extensive theoretical foundation that Cognitive Science offers for representing semantic meaning. We rely on cognitive semantics studies related to the Prototype Theory to model the central semantic meaning of a category. Our approach uses this representation of the central semantic meaning of a category to simulate human behavior in the task of describing object features.
In this paper, we propose a novel approach to the semantic description of object features. We bring to light the Prototype Theory as a theoretical basis to represent the semantic meaning of the visual information in an image. We develop a prototype-based description model that uses the category's prototype to find a global semantic representation of the basic conceptual components (objects) of the image semantic meaning. Human beings can learn the most distinctive features of a specific category [30, 52]. These learned features (or properties) are used by the human brain to identify, classify and describe objects. The Prototype Theory proposes that human beings think of a category in terms of abstract prototypes, defined by the central case of a category [13, 41, 42].
Successful execution of the object recognition and description tasks in the human brain is inherently related to the learned prototype of the object category [32, 41, 42, 61]. This raises the following two questions: i) Can a model of the perception system be developed in which objects are described using the same semantic features that are learned to identify and classify them? ii) How can the category prototype be included in the object global semantic description? We address these two questions motivated by the human approach for describing objects highlighting their most distinctive features within the category. For example, a typical human description: a dalmatian is a dog (semantic meaning) that is distinguished by its unique black or liver colored spotted coat (semantic difference with respect to the central semantic meaning of dog category). Figure 1 depicts our prototype-based description model.
We evaluate our approach using CNN models on both the MNIST and ImageNet datasets. The experiments show that our prototype-based description model can simulate the prototypical organization of object categories. Furthermore, our descriptor can construct semantic signatures that are discriminative, interpretable and low-dimensional, and that encapsulate and retain the meaning of object features.
2 Related works
The CNN descriptor family showed that it is possible for a learning approach to outperform the best techniques based on carefully hand-crafted features [2, 28, 53]. These models differ among themselves in how they compute the descriptors in their deep architectures, in their similarity functions and in their feature extraction methods. Some approaches extract intermediate activations of the model as a descriptor signature [6, 8, 14, 27]. Other methods directly learn a measure of similarity to compare image patches using a similarity convolutional network [16, 45, 59, 60]. Siamese networks were used to learn discriminative representations and to learn a similarity metric [16, 60, 62]. The deep model LIFT learns each of the tasks involved in feature management: detection, orientation estimation, and feature description. Lin et al. constructed a compact binary descriptor for efficient object matching based on the features extracted with the VGG16 model.
Semantic descriptors and correspondence.
Finding correspondences between different scenes that share similar or semantically related features is a challenging problem. Liu et al. proposed SIFT Flow, which gave rise to a family of semantic flow methods that address the high degree of variation involved in semantic correspondence [3, 20, 26, 37, 54, 57]. Several of these methods combine their approaches with the extraction of hand-crafted features [28, 53]. Some works [4, 15, 63] exploit the robustness of CNN models to train deep learning architectures that address the problem of semantic correspondence. Kim et al. tackled the problem of semantic correspondence by constructing a semantic descriptor. The FCSS descriptor is robust to intra-class appearance variation due to its local self-similarity (LSS) and its ability to keep the precise localization of deep neural networks. The performance of CNN models used in description tasks is still not on par with the performance achieved by CNNs used in classification. In general, CNN descriptors and semantic descriptors are trained to learn their own semantic representations and use different deep learning architectures. Most of these feature description models do not use the discriminative power of the features extracted by the well-known CNN classification models [18, 47, 51]. Moreover, none of these feature description approaches incorporates a foundation in the cognitive sciences to introduce meaning into the representations of image features.
The Prototype Theory analyzes the internal structure of categories and introduces the prototype-based concept of categorization. It proposes that the representation of categories is heterogeneous and not discrete, where the features and category members do not all have the same relevance within the category. Rosch obtained evidence that humans store the semantic meaning of a category based on the degrees of representativeness (typicality) of category members. The author showed that human beings store category knowledge as a semantic organization around the category prototype (prototypical organization). The prototype or prototypical concept was formally defined as the clear central member of a category [13, 41]. Rosch showed that human beings first learn the core semantic meaning of the object (the prototype) and then its specificities. In this paper, we model the central semantic meaning of a category based on the four types of prototypicality effects [12, 13]: extensional non-equality, intensional non-equality, extensional non-discreteness, and intensional non-discreteness. The prototypicality effects underline the importance of the distinction between central and peripheral meaning.
Rosch [41, 42] showed that humans learn the central semantic meaning of categories (the prototype) and include it in their cognitive processes. Based on these assumptions, our proposal follows the flow of conceptual processes presented in Figure 1 as a hypothesis for simulating human behavior in the description of object features. We propose to describe an object by highlighting the global features that distinguish it within a category. In other words, after recognizing the category to which the object belongs, how do we find the features that distinguish it from other members of the category? And how can we model a global object description that behaves like the diagram in Figure 1?
To address these issues, and due to their good performance, we use CNN classification models for feature extraction, recognition and classification of the visual information received as input (processes a, b, c, and d in Figure 1). CNN models, analogous to human memory, make associations that keep the knowledge in their connection structures. Our method transfers the knowledge of pre-trained CNN models into a semantic structure (the semantic prototype) that stands for the central semantic meaning of the learned categories (Figure 1e). Our method proposes a representation (signature) that describes an object by encapsulating the semantic meaning of the extracted features and their semantic differences with respect to the central semantic meaning of the category. In the following sections, we present the part of our method that encapsulates the central meaning of the category (the prototype). We also present how to introduce the prototype representation into the semantic description of object features. Figure 2 shows an architecture overview of our prototype-based description model.
3.1 Prototype Construction
The semantic structure of a category, i.e., its central/peripheral meaning, is related to differences in typicality and membership salience among category members (extensional non-equality). The prototype is an “average” of the abstraction of all objects in the category. It summarizes the most representative members (or features) of the category. The combination of the observed features and their relevance for the category enables the grouping of objects by family resemblance (intensional non-equality). This approach justifies the object's position within the semantic structure of the category and allows typical objects to be grouped at the semantic center of the category (prototypical organization).
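Let $C = \{c_1, c_2, \dots, c_n\}$ be a finite set of categories of objects, $F = \{f_1, f_2, \dots, f_m\}$ be a finite set of distinguishing features of an object, and $O_i = \{o_1, o_2, \dots, o_k\}$ be the set of objects that share the same category $c_i$ (where $i = 1, \dots, n$).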
Definition 1 (Semantic prototype). We call the central meaning of the category $c_i$, semantic prototype of the $c_i$-category or simply semantic prototype $P_i$, the “average” and standard deviation of each of the features of all objects within the category, along with a “measure” of the relevance of those features. Formally, the semantic prototype is a 3-tuple $P_i = (M_i, \Sigma_i, \Omega_i)$ where: i) $M_i = (\mu_{i1}, \dots, \mu_{im})$ is a nonempty $m$-dimensional vector, where $\mu_{ij}$ is the mean of the $j$-th feature extracted for only typical objects of the category $c_i$; ii) $\Sigma_i = (\sigma_{i1}, \dots, \sigma_{im})$ is a nonempty $m$-dimensional vector, where $\sigma_{ij}$ is the standard deviation of the $j$-th feature extracted for only typical objects of the category $c_i$; iii) $\Omega_i = (\omega_{i1}, \dots, \omega_{im})$ is a nonempty $m$-dimensional vector, where $\omega_{ij}$ is the relevance value of the $j$-th feature for the category $c_i$.
Definition 2 (Convolutional semantic prototype). The convolutional semantic prototype of a category $c_i$ is a 4-tuple $P_i = (M_i, \Sigma_i, \Omega_i, b_i)$ where $M_i$ and $\Sigma_i$ are computed using features of category $c_i$ extracted from the fully convolutional layer of CNN models, and $\Omega_i$ and $b_i$ are the learned parameters of the $i$-th category in the softmax layer. In what follows, we refer to the convolutional semantic prototype of the category simply as the semantic prototype.
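The two prototype definitions above can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 1: the names `SemanticPrototype` and `build_prototype` are hypothetical, the features are assumed to be plain lists of floats already extracted from the network, and the use of the population standard deviation is an assumption.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import List, Sequence


@dataclass
class SemanticPrototype:
    """Hypothetical container for the prototype tuple of one category."""
    mean: List[float]     # per-feature mean over typical members
    std: List[float]      # per-feature standard deviation over typical members
    weights: List[float]  # softmax-layer weights of the category (feature relevance)
    bias: float           # softmax-layer bias of the category


def build_prototype(typical_features: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    bias: float) -> SemanticPrototype:
    """Summarise the feature vectors of the typical members of one category."""
    if not typical_features:
        raise ValueError("at least one typical member is required")
    columns = list(zip(*typical_features))  # one tuple per feature index j
    if len(columns) != len(weights):
        raise ValueError("feature and weight dimensions differ")
    return SemanticPrototype(
        mean=[fmean(col) for col in columns],
        std=[pstdev(col) for col in columns],  # population std (assumption)
        weights=list(weights),
        bias=bias,
    )


# Two typical members with two features each (toy values).
proto = build_prototype([[1.0, 4.0], [3.0, 4.0]], weights=[0.5, -2.0], bias=0.1)
print(proto.mean, proto.std)
```

Keeping the learned weights and bias next to the statistics means that everything needed for the later definitions travels with the prototype.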
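Definition 3 (Semantic value). The semantic meaning of the observed features $F_o = (f_1, \dots, f_m)$ for category $c_i$, summary value of the observed features $F_o$, or simply semantic value of $F_o$ in the $c_i$-category, is an abstract value: $z_i(F_o) = \sum_{j=1}^{m} \omega_{ij} f_j + b_i$, where $\omega_{ij} \in \Omega_i$. Consequently, the central semantic meaning of the category, or summary value of the semantic prototype, is the semantic value $z_i(M_i) = \sum_{j=1}^{m} \omega_{ij} \mu_{ij} + b_i$, where $\mu_{ij} \in M_i$.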
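Definition 4 (Prototypical distance). Let $o \in O_i$ be a representative object of category $c_i$, $F_o = (f_1, \dots, f_m)$ the features of object $o$, and $P_i = (M_i, \Sigma_i, \Omega_i, b_i)$ the semantic prototype of the category $c_i$. We define the prototypical distance between $o$ and $P_i$ as the semantic distance: $\delta(o, P_i) = \sum_{j=1}^{m} |\omega_{ij}|\,|f_j - \mu_{ij}|$.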
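Definition 5 (Distance between objects). Let $o_1, o_2 \in O_i$ be representative objects of category $c_i$, and $F_{o_1}, F_{o_2}$ the features of objects $o_1, o_2$, respectively. We define the distance between the objects $o_1$ and $o_2$ as the semantic distance given by: $\delta(o_1, o_2) = \sum_{j=1}^{m} |\omega_{ij}|\,|f^1_j - f^2_j|$, where $f^1_j \in F_{o_1}$ and $f^2_j \in F_{o_2}$. (We introduce the learned weights of CNN models into the psychological distance between two stimuli defined by Medin and Schaffer.)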
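To make the three quantities concrete, here is a minimal sketch. It assumes that the semantic value is the pre-softmax score of the category (weighted sum of features plus bias) and that both distances are weighted L1 distances using the absolute softmax weights; the function names and toy values are illustrative only.

```python
from typing import Sequence


def semantic_value(features: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Weighted sum of the features plus the category bias (assumed form)."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def prototypical_distance(features: Sequence[float],
                          proto_mean: Sequence[float],
                          weights: Sequence[float]) -> float:
    """Weighted L1 distance between an object and the prototype mean."""
    return sum(abs(w) * abs(f - m)
               for w, f, m in zip(weights, features, proto_mean))


def object_distance(a: Sequence[float],
                    b: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Weighted L1 distance between two objects of the same category."""
    return sum(abs(w) * abs(x - y) for w, x, y in zip(weights, a, b))


# Toy category with two features.
weights, bias = [0.5, -2.0], 0.1
proto_mean = [2.0, 4.0]
obj_a, obj_b = [3.0, 3.5], [1.0, 4.5]

z_a = semantic_value(obj_a, weights, bias)
d_a = prototypical_distance(obj_a, proto_mean, weights)
d_ab = object_distance(obj_a, obj_b, weights)
print(z_a, d_a, d_ab)
```

Note that the prototypical distance is the object distance evaluated against the prototype mean, so one implementation could serve both roles.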
Definition 6 (Feature metric space). Let $\mathcal{F}_i$ be a nonempty set of all object features of the category $c_i$. Since the distance function $\delta$ satisfies the axioms of non-negativity, identity of indiscernibles, symmetry and triangle inequality, $\delta$ is a metric on the feature set $\mathcal{F}_i$. Consequently, $(\mathcal{F}_i, \delta)$ is a metric space, or feature metric space.
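As a worked check of the least obvious axiom, and assuming the distance between objects takes the weighted L1 form $\delta(o_1, o_2) = \sum_j |\omega_{ij}|\,|f^1_j - f^2_j|$ (the symbols are notation assumed for illustration), the triangle inequality follows feature by feature from the scalar inequality $|a - c| \le |a - b| + |b - c|$:

```latex
\begin{align*}
\delta(o_1, o_3) &= \sum_{j=1}^{m} |\omega_{ij}|\,|f^1_j - f^3_j| \\
 &\le \sum_{j=1}^{m} |\omega_{ij}|\,\bigl(|f^1_j - f^2_j| + |f^2_j - f^3_j|\bigr) \\
 &= \delta(o_1, o_2) + \delta(o_2, o_3).
\end{align*}
```

Non-negativity and symmetry are immediate from the absolute values. Identity of indiscernibles additionally requires every weight $\omega_{ij}$ to be nonzero: if a learned weight were exactly zero, two objects differing only in that feature would be at distance zero, and $\delta$ would only be a pseudometric.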
3.2 Global Semantic Descriptor
Our prototype-based approach to the semantic description of objects takes as the semantic meaning vector the semantic vector built from the element-wise operations used to compute the semantic value (Definition 3). Furthermore, we represent the semantic difference vector as the weighted residual vector composed of the absolute values of the difference between each object feature and the corresponding feature of the category prototype.
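The two vectors can be sketched as follows, under the assumption that the semantic meaning vector holds the element-wise products of weights and features and that the semantic difference vector holds the weighted absolute residuals with respect to the prototype mean. Function names are hypothetical, and how the bias enters the representation is not specified in this text, so it is left out here.

```python
from typing import List, Sequence


def semantic_meaning_vector(features: Sequence[float],
                            weights: Sequence[float]) -> List[float]:
    """Element-wise terms whose sum (plus the bias) gives the semantic value."""
    return [w * f for w, f in zip(weights, features)]


def semantic_difference_vector(features: Sequence[float],
                               proto_mean: Sequence[float],
                               weights: Sequence[float]) -> List[float]:
    """Weighted absolute residuals whose sum gives the prototypical distance."""
    return [abs(w) * abs(f - m)
            for w, f, m in zip(weights, features, proto_mean)]


def global_semantic_description(features: Sequence[float],
                                proto_mean: Sequence[float],
                                weights: Sequence[float]) -> List[float]:
    """Concatenate both vectors: a description with twice the feature dimension."""
    return (semantic_meaning_vector(features, weights)
            + semantic_difference_vector(features, proto_mean, weights))


description = global_semantic_description(
    features=[3.0, 3.5], proto_mean=[2.0, 4.0], weights=[0.5, -2.0])
print(description)
```

The doubling of the dimension visible here is the drawback that motivates the compression step described next.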
Figure 2 shows an overview of our prototype-based description model. Our Global Semantic Descriptor model requires prior knowledge of the prototype of each category of the CNN model (prototypes are computed using Algorithm 1). After the categorization process, we use the corresponding category prototype for the semantic description of object features. Figure 2d shows graphically how the category prototype is introduced into the global semantic description of the object's features. A drawback of this representation (Figure 2d) is its high dimensionality, since it is based on the semantic meaning vector and the semantic difference vector. The high dimensionality of our feature vectors makes them impractical for common computer vision tasks such as semantic correspondence [15, 21].
Several dimensionality reduction algorithms, such as PCA and NMF, are based on discarding features that do not generate meaningful variation. Although this approach works on some tasks, after applying these algorithms we lose the ability to interpret the data. From the perspective of the Prototype Theory, discarding features is not suitable in the semantic space, due to the absence of necessary and sufficient definitions to categorize an object (intensional non-discreteness). Sometimes discarding features may mean discarding elements of the category. For instance, there may be objects within the category that lack some of the typical features of the category (flying is a typical feature of the bird category; however, a penguin is a bird that does not fly).
We propose a simple transformation to compress our global semantic description representation of the object’s features (Figure 2d) in a global semantic signature (Figure 2f). The final descriptor signature preserves the semantic meaning (Property 1) and the semantic difference (Property 2) present in the first global semantic description representation. Depending on the input values, our descriptor uses the transformation to construct global semantic signatures with different meanings within the category (Property 3).
The descriptor signature is computed by concatenating the corresponding signatures of the semantic meaning vector and the semantic difference vector obtained with our transformation (see Algorithm 2). Figure 3 shows the main steps of the transformation: 1) resizing the input vector into the best configuration of square auxiliary matrices, and concatenating the output signatures of the subsequent steps for each auxiliary matrix; 2) and 3) constructing the semantic gradient using the angle matrix formed by the position of each feature with respect to the center of the matrix; 4) reducing the gradient to 8 vectors, similarly to SIFT. Algorithm 3 details the steps.
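The general idea of a sum-preserving angular reduction can be sketched as below. This is not the paper's Algorithm 3, which is not reproduced in this text: the sketch assumes a single zero-padded square matrix instead of the "best configuration" of auxiliary matrices, assigns each cell to one of 8 angular sectors around the matrix centre, and sums the values per sector. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def angular_signature(vector: Sequence[float], bins: int = 8) -> List[float]:
    """Reduce a vector to `bins` values by summing entries per angular sector."""
    if not vector:
        raise ValueError("vector must not be empty")
    # Smallest square that holds the vector; zero padding is an assumption.
    side = math.isqrt(len(vector) - 1) + 1
    padded = list(vector) + [0.0] * (side * side - len(vector))
    centre = (side - 1) / 2.0
    signature = [0.0] * bins
    for index, value in enumerate(padded):
        row, col = divmod(index, side)
        # Angle of the cell position with respect to the matrix centre,
        # wrapped to one full turn.
        angle = math.atan2(row - centre, col - centre) % (2.0 * math.pi)
        sector = int(angle / (2.0 * math.pi) * bins) % bins
        signature[sector] += value
    return signature


def descriptor_signature(meaning: Sequence[float],
                         difference: Sequence[float]) -> List[float]:
    """Concatenate the reduced signatures of both description vectors."""
    return angular_signature(meaning) + angular_signature(difference)


meaning = [1.5, -7.0, 0.25, 2.0, 1.0]
difference = [0.5, 1.0, 0.0, 0.25, 3.0]
signature = descriptor_signature(meaning, difference)
print(len(signature), sum(signature[:8]), sum(signature[8:]))
```

Because every input entry lands in exactly one sector, the sum of each half of the signature equals the sum of the corresponding input vector, which is the kind of invariant that the preservation properties below rely on.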
Property 1 (Semantic preservation). The semantic descriptor signature preserves the semantic value.
To prove this, it suffices to follow the steps of Algorithm 3 backward. ∎
Property 2 (Prototypical distance preservation). The object signature preserves the prototypical distance.
The proof is similar to the previous one. ∎
Property 3 (Structural polymorphism). Our Global Semantic Descriptor has the polymorphic property of describing, with the same structural representation, distinctly different semantic meanings within the category. Consequently, our descriptor uses the category prototype to construct different semantic signature taxonomies: i) the signature of an object; ii) the signature of the central semantic meaning (abstract prototype) of the category; iii) the signature of the semantic meaning of the category.
4.1 Experimental Setup
We used a CNN-MNIST model based on the LeNet architecture for digit classification on the MNIST dataset. The CNN-MNIST model served as the pilot model in our experiments, while the VGG16 model was the basis of our semantic description model. We used the VGG16 model because its features are the basis of a variety of image processing tasks such as object detection, image annotation, video emotion recognition, style transfer, image alignment [15, 39], clustering, and scene classification. Our prototype-based descriptor model is scalable and can easily be adapted to any other CNN classification model.
4.2 Prototype construction
In the experiments, we computed the prototypes with the CNN-MNIST and VGG16 models on the MNIST and ImageNet datasets, respectively. We take as object features those extracted from the model layer immediately before the softmax layer (see Feature Layer in Figure 2b). To properly build the proposed semantic prototype, we need typical objects of the category, or some information about the typicality value (or typicality degree) of each object in a specific category. However, none of the datasets used provides this information. For this reason, we take as typical objects of a category only those elements that are unequivocally classified as category members (Top 1) by the CNN classification models. For each category in the datasets, we extracted the features of typical members and computed the semantic prototype (see Definition 1) using Algorithm 1.
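The selection of typical members can be sketched as a simple filter. The sketch assumes that "unequivocally classified (Top 1)" means the highest-scoring class equals the ground-truth label of the sample; the function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def select_typical(features: Sequence[Sequence[float]],
                   scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   category: int) -> List[Sequence[float]]:
    """Keep the feature vectors of samples that belong to `category`
    and whose top-1 prediction is also `category`."""
    typical = []
    for feature_vector, class_scores, label in zip(features, scores, labels):
        top1 = max(range(len(class_scores)), key=class_scores.__getitem__)
        if label == category and top1 == category:
            typical.append(feature_vector)
    return typical


# Four samples, two classes: per-sample features, class scores and labels.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
scores = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 0, 1, 1]

typical_0 = select_typical(features, scores, labels, category=0)
typical_1 = select_typical(features, scores, labels, category=1)
print(typical_0, typical_1)
```

A stricter filter, for example a minimum softmax confidence, would be a natural variation, but the text does not state that one was used.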
4.3 Semantic information analysis
Achieving prototypical behavior of the members within the category is one of the motivations and theoretical bases of our work. Nevertheless, there is no defined metric to quantify whether our representation correctly captures the semantic meaning of the category. This is a consequence of the fact that there is no defined metric to robustly evaluate the typicality level of an object with respect to a category; this skill is still reserved for human beings.
Our prototype model (semantic prototype + prototypical distance) tries to capture the central semantic meaning of the category. In a way comparable to human beings, we want visually typical elements of the category to be organized close to the category prototype (based on the prototypical distance metric).
Figure 4 and Table 1 present an example of the semantic meaning captured by our prototype model for members of the number five category in the MNIST dataset. As shown in Figure 4, our proposal finds as typical elements of the number five (top-5 closest) the handwritten digits with features that are, undoubtedly, distinctive of the category. Our model can also find the peripheral meaning of the category. Members with less representative features of the category, or that are barely legible, are placed in the periphery (top-5 farthest), away from the central meaning, while keeping the features of the category (they still belong to it). Our model finds, as a human being would, that such a digit can be a five, but not a typical five.
Based on our experimental results (on the MNIST and ImageNet datasets), we assume that the proposed semantic prototype correctly captures the central semantic meaning of the category. Our prototypical distance influences the arrangement of the elements around the semantic prototype of the category. The top-5 most typical objects of the category are positioned close to the prototype, and the top-5 least typical ones are positioned farther from the semantic center. But does our model organize all category members with this prototypical organization?
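Ranking members by typicality reduces to sorting by prototypical distance. The sketch below assumes the weighted L1 form of that distance; `rank_members` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_members(members: Sequence[Sequence[float]],
                 proto_mean: Sequence[float],
                 weights: Sequence[float],
                 k: int = 5) -> Tuple[List[int], List[int]]:
    """Return the indices of the k closest and the k farthest members."""
    distances = [
        sum(abs(w) * abs(f - m) for w, f, m in zip(weights, feats, proto_mean))
        for feats in members
    ]
    order = sorted(range(len(members)), key=distances.__getitem__)
    closest = order[:k]
    farthest = order[::-1][:k]  # most peripheral member first
    return closest, farthest


members = [[2.0, 4.0], [3.0, 3.5], [2.5, 4.0], [0.0, 6.0]]
closest, farthest = rank_members(
    members, proto_mean=[2.0, 4.0], weights=[0.5, -2.0], k=2)
print(closest, farthest)
```

With `k=5` this yields the kind of top-5 closest and top-5 farthest lists discussed above.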
Visualizing the semantic position of each category member with respect to the central semantic meaning of the category (the abstract prototype), constitutes a simple approach to see the internal semantic structure of the entire category. The experiments in this section aim to visualize the internal semantic structure of the category using the semantic meaning encapsulated by our model for each category member. First, we need to corroborate that our prototype model can correctly interpret the object features and position it semantically within the category, keeping a prototypical organization. Second, we want to verify if the proposed semantic descriptor encodes and preserves the semantic information contained in the object features, while preserving the prototypical organization within the category.
Visualizing the category internal structure is infeasible in the m-dimensional features space since most techniques of data visualization are based on the discarding of features. From the perspective of the Prototype Theory foundations, this approach can be problematic (intensional non-discreteness). For this reason, we used topology techniques to show that our model simulates the prototypical organization within the category.
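Let $(\mathcal{F}_i, \delta)$ and $(\mathbb{R}^2, d)$ be metric spaces, and let $g: \mathcal{F}_i \to \mathbb{R}^2$ be the map $g(F_o) = (z_i(F_o), \delta(o, P_i))$, where $F_o$ are the object features, $z_i(F_o)$ is the object semantic value (see Definition 3), and $\delta(o, P_i)$ is the prototypical distance; $p = g(F_o)$ is a point of $\mathbb{R}^2$ and $d$ is the distance induced by the L1 norm. The map $g$ sends the object to the metric space $(\mathbb{R}^2, d)$ through its semantic value and its prototypical distance.
Let $F_{o_1}, F_{o_2} \in \mathcal{F}_i$, and let $p_1 = g(F_{o_1})$ and $p_2 = g(F_{o_2})$ be the mapped points in the $(\mathbb{R}^2, d)$ metric space. Then the Sum of Absolute Differences (SAD) is $d(p_1, p_2) = |z_i(F_{o_1}) - z_i(F_{o_2})| + |\delta(o_1, P_i) - \delta(o_2, P_i)|$. Using Definitions 3, 4 and 5, we end up with the expression $d(p_1, p_2) \le 2\,\delta(o_1, o_2)$. Consequently, for every $\varepsilon > 0$ there exists an $\eta = \varepsilon / 2$ such that $\delta(o_1, o_2) < \eta$ implies $d(p_1, p_2) < \varepsilon$; that is, $g$ is continuous. This means that if $F_{o_1} \to F_{o_2}$, then $g(F_{o_1}) \to g(F_{o_2})$.
Let $(\mathcal{S}_i, d_S)$ be the metric space of object descriptor signatures. Similarly, using Properties 1 and 2 we can show that the map $h: \mathcal{S}_i \to \mathbb{R}^2$ is continuous. Since $g$ and $h$ are continuous, the behavior in the $(\mathbb{R}^2, d)$ metric space is equivalent to the behavior in the feature metric space $(\mathcal{F}_i, \delta)$ and in the descriptor's metric space $(\mathcal{S}_i, d_S)$.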
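One way to spell out the continuity argument is the following derivation. It is a sketch under assumed notation: the map sends the features $F_o$ to the pair formed by the semantic value $z_i(F_o) = \sum_j \omega_{ij} f_j + b_i$ and the prototypical distance $\delta(o, P_i) = \sum_j |\omega_{ij}|\,|f_j - \mu_{ij}|$, the plane carries the L1 distance $d$, and $\delta(o_1, o_2) = \sum_j |\omega_{ij}|\,|f^1_j - f^2_j|$.

```latex
\begin{align*}
d\bigl(g(F_{o_1}), g(F_{o_2})\bigr)
 &= \bigl|z_i(F_{o_1}) - z_i(F_{o_2})\bigr|
  + \bigl|\delta(o_1, P_i) - \delta(o_2, P_i)\bigr| \\
 &= \Bigl|\sum_{j=1}^{m} \omega_{ij}\,(f^1_j - f^2_j)\Bigr|
  + \Bigl|\sum_{j=1}^{m} |\omega_{ij}|\,\bigl(|f^1_j - \mu_{ij}| - |f^2_j - \mu_{ij}|\bigr)\Bigr| \\
 &\le \sum_{j=1}^{m} |\omega_{ij}|\,|f^1_j - f^2_j|
  + \sum_{j=1}^{m} |\omega_{ij}|\,|f^1_j - f^2_j| \\
 &= 2\,\delta(o_1, o_2).
\end{align*}
```

The bias cancels in the first term, and the second term uses the reverse triangle inequality $\bigl||a| - |b|\bigr| \le |a - b|$. Under these assumptions the map is Lipschitz with constant 2, so any tolerance $\varepsilon$ in the plane is met whenever the two objects are closer than $\varepsilon / 2$ in the feature space.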
Figure 5 shows examples of the internal semantic structure of categories, mapped from the feature space and from the descriptor signature space. The experiments demonstrate a prototypical organization within the category in the mapped metric space. Note how the semantic value and the prototypical distance organize all category elements prototypically. The top-5 most visually representative members of the number five in the feature metric space (see Figure 4) are the same top-5 most representative members in the mapped metric space. The top-5 closest members (in blue) are positioned near the mapped abstract prototype (in black) (see Figure 5). Likewise, the top-5 least representative members (in red) remain positioned at the periphery. Even with different models and datasets, the internal prototypical organization of the category achieved in the descriptor signature domain (right) is identical to the prototypical organization in the feature domain (left). This means that our descriptor signature preserves in its taxonomy the semantic information contained in the object features.
Figure 6 shows an example of the signature taxonomies constructed with our descriptor using the CNN-MNIST model (signatures with size ). We use the structural polymorphism property of our descriptor (Property 3) to construct signatures of the central semantic meaning (abstract prototype), of the semantic meaning of the category, and of the meaning of a category member. The abstract prototype signature is a degenerate version of the category signature. It can be understood as the distribution of numbers (or DNA chain) that stands for the category. The category members have a semantic meaning whose representation is similar to the DNA chain of the category. The semantic difference of the category signature can be understood as the feature boundary of all category members. Consequently, the semantic information encoded in our global semantic descriptor signatures makes it easy to recover the semantic information of the object (Properties 1 and 2), and it also makes it possible to interpret the typicality of the object within the category (typicality score ).
4.4 Performance evaluation
We evaluate the proposed semantic encoding of our Global Semantic Descriptor (GSDP) (the version based on the VGG16 model) by comparing our representation against the following global image descriptors: GIST, LBP, HOG, Color64, Color_Hist, Hu_H_CH [17, 19, 48], and VGG16. Yang showed that when feature representations achieve good metrics in clustering tasks, they generalize well when transferred to other tasks. Based on this observation, we evaluate our semantic encoding to verify its usefulness and suitability in image clustering tasks.
We used the K-means algorithm for clustering images of the first categories of ImageNet ( category) using the descriptor signatures. The experiment was conducted incrementally, starting with cluster (for category) and adding one category in each iteration. Table 2 shows a snapshot of the K-means metrics achieved by the selected descriptors in the first categories. Figure 7 shows the behavior of the K-means metrics for VGG16 and GSDP signatures as the number of clusters (categories) increases with each execution of the algorithm. Our GSDP descriptor keeps the semantic information of VGG16 signatures (see Figure 5) with a more discriminative representation and an even lower feature dimension (). The results show that our descriptor encoding significantly outperforms the other global image encodings in terms of cluster metrics. The results achieved in clustering tasks encourage us to evaluate the generalization ability of our semantic representation in other computer vision tasks.
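The incremental protocol can be sketched with a small self-contained K-means. Several parts are assumptions, since the text does not specify them: the starting number of clusters (`start=2`), the deterministic farthest-first initialisation, and cluster purity as a stand-in for the unnamed K-means metrics of Table 2. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Lloyd's algorithm with farthest-first initialisation (assumption)."""
    centres = [list(points[0])]
    while len(centres) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centres))
        centres.append(list(far))
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                      for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def purity(assignment: Sequence[int], labels: Sequence[str]) -> float:
    """Fraction of points that share the majority label of their cluster."""
    total = 0
    for cluster in set(assignment):
        counts = Counter(l for a, l in zip(assignment, labels) if a == cluster)
        total += counts.most_common(1)[0][1]
    return total / len(labels)


def incremental_clustering(signatures: Dict[str, List[Sequence[float]]],
                           start: int = 2) -> List[Tuple[int, float]]:
    """Cluster the first n categories for n = start, start + 1, ..."""
    categories = list(signatures)
    results = []
    for n in range(start, len(categories) + 1):
        points, labels = [], []
        for category in categories[:n]:
            points.extend(signatures[category])
            labels.extend([category] * len(signatures[category]))
        results.append((n, purity(kmeans(points, n), labels)))
    return results


toy = {"a": [[0.0, 0.0], [0.0, 1.0]],
       "b": [[10.0, 10.0], [10.0, 11.0]],
       "c": [[20.0, 0.0], [21.0, 0.0]]}
results = incremental_clustering(toy)
print(results)
```

Running the same loop once per descriptor and plotting the metric against the number of categories would give curves of the kind described for Figure 7.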
We introduced a novel Global Semantic Descriptor that is based on the foundations of the Prototype Theory (all source code and data used will be made publicly available on our lab's website: https://www.verlab.dcc.ufmg.br/global-semantic-description/wacv2019/). Our prototype-based description model does not need to be trained, and it can easily be adapted to any other existing CNN classification model. As shown in the experiments, our semantic descriptor is discriminative and low-dimensional, encodes the semantic information of the category, and achieves a prototypical organization of the category members. We further showed how to interpret and retrieve the object typicality information encoded in our representation. Our model offers a starting point for introducing into the CNN descriptor family the theoretical foundation of the Prototype Theory regarding the representation of semantic meaning and the learning of visual concepts.
-  H. Abdi and L. J. Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
-  H. Bristow, J. Valmadre, and S. Lucey. Dense semantic correspondence where every pixel is a classifier. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4024–4031, 2015.
-  C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE, 2005.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning (ICML), pages 647–655, 2014.
-  W. Estes. Memory storage and retrieval processes in category learning. Journal of Experimental Psychology: General, 115(2):155, 1986.
-  P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to sift. arXiv preprint arXiv:1405.5769, 2014.
-  V. Fromkin, R. Rodman, and N. Hyams. An introduction to language. Cengage Learning, 2018.
-  J. M. Fuster. Network memory. Trends in neurosciences, 20(10):451–459, 1997.
-  L. Gatys, A. Ecker, and M. Bethge. A neural algorithm of artistic style. Nature Communications, 2015.
-  D. Geeraerts. Diachronic prototype semantics: A contribution to historical lexicology. Oxford University Press, 1997.
-  D. Geeraerts. Theories of lexical semantics. Oxford University Press, 2010.
-  Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of the of the European Conference on Computer Vision (ECCV), pages 392–407. Springer, 2014.
-  K. Han, R. S. Rezende, B. Ham, K.-Y. K. Wong, M. Cho, C. Schmid, and J. Ponce. Scnet: Learning semantic correspondence. arXiv preprint arXiv:1705.04043, 2017.
-  X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.
-  R. M. Haralick, K. Shanmugam, et al. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(6):610–621, 1973.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  M.-K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179–187, 1962.
-  J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2307–2314, 2013.
-  S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn. Fcss: Fully convolutional self-similarity for dense semantic correspondence. arXiv preprint arXiv:1702.00926, 2017.
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
-  D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.
-  Texture moment for content-based image retrieval. In Multimedia and Expo, 2007 IEEE International Conference on, pages 508–511. IEEE, 2007.
-  K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1183–1192, 2016.
-  C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5):978–994, 2011.
-  J. L. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pages 1601–1609, 2014.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
-  X. Lu, Y. Yuan, and J. Fang. Jm-net and cluster-svm for aerial scene classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2386–2392. AAAI Press, 2017.
-  A. Martin. The representation of object concepts in the brain. Annu. Rev. Psychol., 58:25–45, 2007.
-  D. L. Medin and M. M. Schaffer. Context theory of classification learning. Psychological review, 85(3):207, 1978.
-  J. P. Minda and J. D. Smith. Comparing prototype-based and exemplar-based accounts of category learning and attentional allocation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(2):275, 2002.
-  V. N. Murthy, S. Maji, and R. Manmatha. Automatic image annotation using deep learning representations. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 603–606. ACM, 2015.
-  T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI, 24(7):971–987, 2002.
-  A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
-  C. B. Perez and G. Olague. Genetic programming as strategy for learning image descriptor operators. Intelligent Data Analysis, 17(4):561–583, 2013.
-  W. Qiu, X. Wang, X. Bai, A. Yuille, and Z. Tu. Scale-space sift flow. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 1112–1119. IEEE, 2014.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, volume 2, 2017.
-  I. Rocco, R. Arandjelović, and J. Sivic. End-to-end weakly-supervised semantic alignment. In CVPR, 2018.
-  E. Rosch. Cognitive representations of semantic categories. Journal of experimental psychology: General, 104(3):192, 1975.
-  E. Rosch. Principles of categorization. In E. Rosch and B. B. Lloyd, editors, Cognition and Categorization, pages 27–48, 1978.
-  E. Rosch and C. B. Mervis. Family resemblances: Studies in the internal structure of categories. Cognitive psychology, 7(4):573–605, 1975.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 118–126, 2015.
-  K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(8):1573–1585, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Y.-j. Song, W.-b. Park, D.-w. Kim, and J.-h. Ahn. Content-based image retrieval using new color histogram. In Intelligent Signal Processing and Communication Systems, 2004. ISPACS 2004. Proceedings of 2004 International Symposium on, pages 609–611. IEEE, 2004.
-  R. J. Sternberg and K. Sternberg. Cognitive psychology. Nelson Education, 2016.
-  C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. Ldahash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(1):66–78, 2012.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
-  S. L. Thompson-Schill. Neuroimaging studies of semantic memory: inferring how from where. Neuropsychologia, 41(3):280–292, 2003.
-  E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
-  E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2890–2897, 2013.
-  E. Tulving. Coding and representation: searching for a home in the brain. Science of memory: Concepts, pages 65–68, 2007.
-  B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal. Video emotion recognition with transferred deep feature encodings. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages 15–22. ACM, 2016.
-  H. Yang, W.-Y. Lin, and J. Lu. Daisy filter flow: A generalized discrete approach to dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3413, 2014.
-  J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5147–5156, 2016.
-  K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. Lift: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), pages 467–483. Springer, 2016.
-  S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4353–4361, 2015.
-  S. R. Zaki, R. M. Nosofsky, R. D. Stanton, and A. L. Cohen. Prototype and exemplar accounts of category learning and attentional allocation: A reassessment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(6):1160–1173, 2003.
-  J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1592–1599, 2015.
-  T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 117–126, 2016.