Making a Bird AI Expert Work for You and Me

by Dongliang Chang, et al.

As powerful as fine-grained visual classification (FGVC) is, responding to your query with a bird name of "Whip-poor-will" or "Mallard" probably does not make much sense. This, however commonly accepted in the literature, underlines a fundamental question interfacing AI and human: what constitutes transferable knowledge for humans to learn from AI? This paper sets out to answer this very question using FGVC as a test bed. Specifically, we envisage a scenario where a trained FGVC model (the AI expert) functions as a knowledge provider in enabling average people (you and me) to become better domain experts ourselves, i.e. those capable of distinguishing between "Whip-poor-will" and "Mallard". Fig. 1 lays out our approach to answering this question. Assuming an AI expert trained using expert human labels, we ask (i) what is the best transferable knowledge we can extract from AI, and (ii) what is the most practical means to measure the gains in expertise given that knowledge? On the former, we propose to represent knowledge as highly discriminative visual regions that are expert-exclusive. For that, we devise a multi-stage learning framework, which starts with modelling the visual attention of domain experts and novices before discriminatively distilling their differences to acquire the expert-exclusive knowledge. For the latter, we simulate the evaluation process as a book guide to best accommodate the learning practice humans are accustomed to. A comprehensive human study of 15,000 trials shows our method is able to consistently help people of divergent bird expertise recognise once unrecognisable birds. Interestingly, our approach also leads to improved conventional FGVC performance when the extracted knowledge is utilised as a means to achieve discriminative localisation. Codes are available at:



1 Introduction

Figure 1: AI Bird Expert Enriches Human Bird Knowledge. By retreating from the common goal of a FGVC model in pursuing better expert label predictions, we envision a human-centred FGVC endeavour and propose a first solution.

AI is great – arguably the debate is on how it ultimately benefits mankind. Progress on computer vision has predominantly followed the “Human for AI” trend, where human data are used to train AI models that replace humans in some capacity. In this paper, we are interested in the complete opposite direction, i.e. “AI for Human”, and ask the question “can trained AI models help to enrich human knowledge instead?”.

We pick the problem of fine-grained visual classification (FGVC) as a test bed on this quest. FGVC is a good fit as one of the few areas in computer vision where AI agents (we use FGVC model and AI expert interchangeably throughout the paper) can already reasonably replace human experts [biederman1999subordinate, deng2012hedging, berg2013poof, lin2015bilinear, zheng2017recognition, sun2018multi, yang2018learning, huang2020interpretable, joung2021learning], e.g. in identifying species of birds [wah2011caltech], recognising models of cars and aircraft [krause20133d, maji13fine], and telling one flower from another [nilsback2008automated]. The question then becomes: can the expert knowledge learned by AI be transferred across to an average human, so that “you and me” become experts too, i.e. those who can tell that the eight birds in Fig. 1 are in fact from different species?

Fig. 1 illustrates the ambition of this paper: to complete the three-way transfer cycle among human expert, AI expert and average human (you and me). The link where human experts provide labels to train an AI bird expert is the known part and precisely what FGVC in its conventional form strives for. The key to this paper is how to make the remaining two connections: (i) how to extract knowledge from AI that is digestible to a human (like a book), and (ii) how to measure the progress of “you and me” becoming more expert-like using that knowledge.

On making the first link, we first stand with past works [ordonez2013large, mac2018teaching, chang2021your] on the lack of interpretability of fine-grained expert labels to an average human (Fig. 2(a)). As such, they do not constitute good “knowledge” in our context, e.g. telling me the top left bird in Fig. 2 is a “California Gull” probably does not say much. Our key innovation here is resorting to the highly discriminative regions that experts exclusively attend to as transferable knowledge (Fig. 2(b)). This echoes well with psychological findings on the importance of using visual highlights for novices to learn in complex visual tasks [grant2003eye, roads2016using, hommel2019no].

To form the second connection and therefore close the loop, we literally take inspiration from a “book”, an expert bird guide in this context. More specifically, we present knowledge extracted from the AI expert as a bird guide to a human. The idea is that a better bird guide (i.e. better knowledge) will result in “you and me” becoming more expert-like. We therefore take the degree to which the human has improved in telling different species of birds apart as a measure of how good the extracted knowledge actually is.

Figure 2: Capitalising on knowledge in visual form (instead of abstract label inherent to a FGVC model), we show positive human feedback in digesting it towards better recognition.

It follows that we define knowledge as highly discriminative visual regions that are exclusively attended to by domain experts, i.e. the parts of a photo experts focus on upon recognition. More specifically, we represent this expert attention as an optimal subset from a mixed pool of potentially discriminative regions (Fig. 3(a)) that leads to maximum recognisability (Fig. 3(c)). Our goal is then to eliminate the non-expert regions that are shared between experts and novices so that expert-exclusive parts (knowledge) can be identified (Fig. 3(b)). In accounting for novice knowledge, we show that fine-grained image captions work best amongst alternatives such as human annotated bubble regions (Sec. 3.3). Taken together, our technical solution is a multi-stage learning framework that (i) first conducts fine-grained representation learning in the visual domain, (ii) then associates human captions with corresponding image regions, and (iii) finally distils cross-modal attention differences to model expert-exclusive knowledge.

On measuring the efficacy of our bird guide (i.e. knowledge learned from AI), we conduct a large-scale human study with a total of 15,000 trials on a fine-grained bird dataset [wah2011caltech]. Results show that, of the 2407 trials in which participants initially failed fine-grained bird recognition, an average of 53.39% were later successfully corrected after participants were presented with our bird guide. We provide further analysis showing that our approach is not constrained to bird species, and that it improves conventional FGVC performance when exploited as a way to achieve explicit discriminative localisation.

1.1 In Connection to Existing FGVC Works

Our work is a general extension to the extensive FGVC literature (see [wei2021survey] for a recent survey), where we re-envisage the role of FGVC from achieving better label classification accuracy to providing useful knowledge for human consumption. It also raises the important question of whether current FGVC methodologies, in the traditional benchmarking sense, have in fact learned any fine-grained knowledge; we leave the exact implications of that to future work.

From a knowledge dissemination perspective, the way we reason about and dissect a FGVC model also seems to resonate with recent literature on generating FGVC visual explanations [selvaraju2017grad, chang2018explaining, wagner2019interpretable, chen2019looks, huang2020interpretable, huang2021stochastic], especially at first glance. A closer inspection however reveals the fundamentally different purposes they each serve. The goal of the existing works is machine explainability, i.e. looking into the pixels responsible for a model’s decision and judging whether they align with human intuition or not (e.g. by trying to make sense of visualised attention maps). We instead take a human-centred view and only care whether the extracted information can be instilled into the human brain as transferable knowledge.

More precisely, existing works generally present a pixel selection function $h(\cdot)$ that either explains the decision of the black-box FGVC model in the form of post-hoc visualisation, $p(y \mid h(x))$ [selvaraju2017grad, chang2018explaining, wagner2019interpretable], or makes FGVC an explainable model itself, $p(h(x) \mid y)$ [chen2019looks, huang2020interpretable, huang2021stochastic]. As a result, $h(x)$ inevitably contains many visual cues that most human novices can already perceive. Our solution instead models $p(y \mid h(x \mid x_{human}))$, i.e. we take into account the human (non-expert) prior knowledge of an image and exclude it from our knowledge base to ensure a well-defined expert representation. We verify the importance of doing so in Sec. 3.2.

Figure 3: Schematic illustration of how to obtain expert-exclusive but highly discriminative visual regions via our approach (Sec. 2).

2 Methodology

In the traditional FGVC setting, given an image $x$ and its fine-grained label $y$ (e.g. bird species name), a deep feature extractor $F(\cdot)$ first processes $x$ into a feature pool $P$ containing a total of $N$ visual features that cover visual regions at diverse locations and scales. A classifier $C(\cdot)$ is then appended on top of the rich visual information provided in $P$ and optimised to predict $y$ under a cross-entropy classification objective. The composition of $F(\cdot)$ and $C(\cdot)$ is therefore what we often regard as an AI-enabled domain expert.

Our goal is to extract highly discriminative visual regions that experts exclusively attend to in classifying a fine-grained image. Denoting the visual attentions of experts and novices over the feature pool as $A_E$ and $A_N$, this equates to learning an attention re-sampling operation, denoted $A_K$, that can successfully bridge the gap between $A_E$ and $A_N$ in expert label prediction (Fig. 3). Putting it formally:


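We model the expert attention $A_E$, the novice attention $A_N$ and the expert-exclusive attention $A_K$ each as a learnable probability row vector over the features in the pool in practice, i.e. a vector of non-negative entries that sum to one. We shall now detail below how to obtain each component in Eq. 1.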
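To make the probability-row-vector view concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an attention vector is obtained by Softmax over one raw score per region and that the attended feature is the attention-weighted sum of the pool; the names `softmax`, `attend`, `pool` and the toy numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability vector (non-negative, sums to 1)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(attention, pool):
    """Attention-weighted sum of the pool: (1 x N) times (N x d) gives (1 x d)."""
    dim = len(pool[0])
    return [sum(a * feat[j] for a, feat in zip(attention, pool)) for j in range(dim)]

# Hypothetical pool of N = 3 region features, each of dimension d = 2.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = softmax([2.0, 0.5, 0.5])  # one weight per region
attended = attend(attention, pool)
```

Under this reading, the three attentions differ only in how their scores are produced, while all of them re-weight the same pool.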

2.1 Stage I: Visual Learning for Expert Attention

Obtaining the feature pool.  Though obtaining the feature pool $P$ is not restricted to one specific method, it does need careful consideration given that $P$ will be subjected to bi-modal sampling from both the expert attention $A_E$ and the novice attention $A_N$. We find that the trivial workflow of learning the feature extractor, the expert attention and the classifier end-to-end biases $P$ towards $A_E$ and makes it incompatible with $A_N$ later. We therefore propose a simple yet effective solution that decouples the learning of $P$ from that of $A_E$. Specifically, given an image $x$, we divide it into uniform image blocks at several grid granularities and use the feature extractor $F(\cdot)$ to extract a feature for each local block to build the visual feature pool $P$. We introduce an auxiliary classifier to guide the learning of $F(\cdot)$ towards ground-truth label predictions. $F(\cdot)$ is then fixed, and the auxiliary classifier is discarded after this stage.
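As a rough illustration of how such a multi-granularity pool could be laid out, the sketch below enumerates the bounding boxes of uniform blocks for a few grid sizes. The grid sizes `(1, 2, 3)`, the 224 x 224 input size and the function name `grid_blocks` are assumptions made for this example only; the block counts used in the paper are not given here.

```python
def grid_blocks(height, width, grids=(1, 2, 3)):
    """Return (top, left, bottom, right) boxes of uniform blocks for each grid size."""
    boxes = []
    for g in grids:
        for row in range(g):
            for col in range(g):
                top, bottom = row * height // g, (row + 1) * height // g
                left, right = col * width // g, (col + 1) * width // g
                boxes.append((top, left, bottom, right))
    return boxes

# Hypothetical 224 x 224 input: 1 + 4 + 9 = 14 blocks in total.
boxes = grid_blocks(224, 224)
```

Each box would then be cropped and passed through the feature extractor, giving one pooled feature per block.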

Obtaining the expert attention.  We compute the expert attention $A_E$ by first conducting self-attention (SA) on the feature pool to better capture long-range visual spatial dependency. We implement SA in the form of the popular Scaled Dot-Product Attention [vaswani2017attention], $\mathrm{SA}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V$. We then append one fully-connected (FC) layer normalised with Softmax to simulate expert visual attention upon recognising a fine-grained object:
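A minimal sketch of this step is given below under simplifying assumptions: queries, keys and values are all taken to be the pool itself (no learned projections), and the FC layer is read as a map from each contextualised feature to a single score, followed by Softmax across regions. The names `self_attention`, `expert_attention` and `fc_weight` are hypothetical, and the weights are toy values rather than trained parameters.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(pool):
    """Scaled dot-product self-attention with Q = K = V = pool."""
    d = len(pool[0])
    out = []
    for q in pool:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in pool])
        out.append([sum(w * v[j] for w, v in zip(weights, pool)) for j in range(d)])
    return out

def expert_attention(pool, fc_weight, fc_bias=0.0):
    """One FC layer (d -> 1) per region, then Softmax across the N regions."""
    contextual = self_attention(pool)
    scores = [dot(fc_weight, feat) + fc_bias for feat in contextual]
    return softmax(scores)

pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
att = expert_attention(pool, fc_weight=[0.5, -0.5])
```

In a trainable version the pool would be wrapped in a stop-gradient so that only the attention parameters receive updates.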


Here $\mathrm{sg}(\cdot)$ is a stop_gradient operation that forbids gradients from flowing through the variable it is applied to, which we use throughout. Denoting the visual feature attended by the expert attention $A_E$ as $f_E$, we optimise $A_E$ in the multi-label classification formulation:


2.2 Stage II: Visual Grounding for Novice Attention

To bypass the otherwise fatal lack of human novice annotations on their perceivable visual regions of an image, we exploit the existing human fine-grained image caption dataset [reed2016learning] to model the novice attention $A_N$. Given an image, ten single-sentence visual descriptions are collected from different crowdsourced workers and we use their aggregate as a summary of the best possible visual perceptive zones from human novices. To obtain the caption aggregate, we use TextBlob [textblob] to extract noun phrases from each human description and combine them into one caption by eliminating duplicates. The question is how to ground this human language input to the visual representation in the feature pool. We first process the caption aggregate with an off-the-shelf pre-trained language model [devlin2019bert] to get its semantic embedding, and append a multi-layer perceptron that projects this embedding into a space compatible with the feature pool. $A_N$ is then formulated as the broadcast element-wise cosine similarity between the projected caption embedding and the feature pool:


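Since the role of the novice attention $A_N$ is to ensure that human intentions expressed in language transfer visually, we require the training objective to maximise the cross-modal feature-wise mutual information between the projected caption embedding and the visual feature attended by $A_N$.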
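The broadcast cosine similarity can be sketched as follows, assuming the caption has already been embedded and projected into the same dimensionality as the pooled visual features. The Softmax normalisation at the end is an assumption of this sketch, added so that the output is a probability vector; `novice_attention` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def novice_attention(text_embedding, pool):
    """Cosine similarity of one caption embedding against every pooled feature,
    then Softmax-normalised (the normalisation is an assumption of this sketch)."""
    sims = [cosine(text_embedding, feat) for feat in pool]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return sims, [e / total for e in exps]

pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims, att = novice_attention([2.0, 0.0], pool)
```

Regions whose features point in the same direction as the caption embedding receive the largest weights, independent of feature magnitude.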

Noise contrastive learning.  Mutual information is notoriously intractable to optimise, so we resort to noise contrastive estimation as a surrogate loss function. In particular, we implement it as InfoNCE [oord2018representation] due to its wide adoption in the weakly-supervised visual grounding literature [xiao2017weakly, gupta2020contrastive, wang2021improving]. InfoNCE takes the familiar cross-entropy form and measures how well the model can classify one positive representation amongst a set of unrelated negative samples:


With some slight abuse of notation, we use batched versions of the caption embedding and the attended visual feature in the loss. Each contrastive set has a fixed number of samples, of which exactly one is positive and the rest are negatives.
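For reference, a minimal InfoNCE sketch in its common batch form is shown below. It assumes dot-product scores between caption embeddings and attended visual features, with the matching index as the positive and all other items in the batch as negatives; the `temperature` argument and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(text_batch, visual_batch, temperature=1.0):
    """Mean cross-entropy of picking the matching visual feature for each caption.

    text_batch[i] and visual_batch[i] form the positive pair; every other
    visual_batch[j] in the batch serves as a negative for caption i.
    """
    losses = []
    for i, t in enumerate(text_batch):
        logits = [dot(t, v) / temperature for v in visual_batch]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [3.0, 0.0]])
```

The loss is small when each caption scores highest against its own image and grows when the pairing is broken, which is the behaviour the surrogate relies on.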

2.3 Stage III: Knowledge Distillation for Expert-Exclusive Attention

Recall the two key traits of the expert-exclusive attention $A_K$ that we defined conceptually. First, $A_K$ is expert-exclusive visual attention on the feature pool. This gives us important prior information on the element-wise importance of the pool for $A_K$: it attends to a subset of the visual regions in the expert attention $A_E$ that is disjoint from the novice attention $A_N$, i.e. $A_K$ corresponds to an attention re-sampling operation over the non-zero entries of $A_E$ that $A_N$ does not cover. Similar to Eq. 2, we model $A_K$ with feature-wise self-attention followed by one FC layer for output normalisation:


The second trait of the expert-exclusive attention $A_K$ is being highly discriminative, so that it bridges the recognition gap between the expert attention $A_E$ and the novice attention $A_N$. Denoting the visual feature attended by $A_K$ as $f_K$, we portray the learning as a process of knowledge distillation, where the student tries to distil expert-exclusive knowledge from the teacher:



Here $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between two distributions and $\tau$ is the temperature hyperparameter [hinton2015distilling, muller2019does, zhou2021rethinking] balancing the quality (sharpness) of the knowledge distilled, with a smaller $\tau$ corresponding to narrower coverage of the teacher’s knowledge base and a larger $\tau$ risking over-smoothing the teacher’s focus. We use a single fixed value of $\tau$ throughout.
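The effect of the temperature can be illustrated with a small sketch of a temperature-scaled distillation loss. It assumes the teacher and student each output class logits and that the loss is the KL divergence between their softened distributions; the logits and the temperature values below are hypothetical.

```python
import math

def softmax(logits, tau):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, tau):
    """KL(teacher || student) between temperature-softened class distributions."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.0]
student = [2.0, 2.0, 0.0]
sharp = softmax(teacher, tau=1.0)   # peaked: little mass on non-top classes
smooth = softmax(teacher, tau=4.0)  # flatter: more of the teacher's ranking exposed
loss = kd_loss(teacher, student, tau=2.0)
```

A low temperature leaves the teacher distribution close to one-hot, whereas a high temperature flattens it, which matches the trade-off described above.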

Inference without reliance on the novice attention.  There is one shortcoming in Eq. 7 when practically deployed: the expert-exclusive attention $A_K$ relies critically on the outcome of the novice attention $A_N$, which requires a fine-grained human language description of the image from the user. We argue it is a big ask for users to offer the same level of descriptive comprehensiveness as the captions we use for training: “This bird has a yellow, long, pointy beak, grayish feathers and grayish feathers, with white on the crown and black on the wingbars.” We provide a simple solution to address this. A post-hoc approach is adopted that learns to produce expert-exclusive discriminative visual attentions, denoted $A_{K'}$, that are similar to $A_K$ directly from the feature pool:


We choose Maximum Mean Discrepancy (MMD) as a discrepancy metric for its ability to distinguish between two distributions with finite samples:


where the kernel is parameterised by a bandwidth hyperparameter. We show in our empirical evaluations that replacing $A_K$ with its caption-free alternative $A_{K'}$ brings only a marginal performance drop.
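As an illustration of the discrepancy measure, the sketch below computes the standard biased empirical estimate of squared MMD between two sample sets. The Gaussian (RBF) kernel and the bandwidth `sigma` are assumptions of this example; the kernel form used in the paper is not reproduced here.

```python
import math

def gaussian_kernel(u, v, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased empirical estimate of squared MMD between two sample sets."""
    def mean_kernel(a, b):
        return sum(gaussian_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
shifted = mmd_squared([[0.0], [1.0]], [[3.0], [4.0]])
```

Identical sample sets give a discrepancy of zero, and the value grows as the two sets drift apart, so minimising it pulls the caption-free attention towards the caption-dependent one.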

2.4 Incorporating the Learned Knowledge as FGVC Booster

The expert-exclusive attention $A_K$ also provides an answer to the FGVC debate on what is the best way to achieve discriminative localisation [zhang2014part, lin2015deep, deng2015leveraging, wang2018learning, ji2020attention]. Our speculation is that if $A_K$ has successfully encoded expert-exclusive visual attentions, the visual regions corresponding to $A_K$ are then already locally discriminative with expert endorsement. There are of course many ways to embed $A_K$ as explicit localisation information into existing FGVC frameworks, of which we choose perhaps the most intuitive: treating the regions selected by $A_K$ as another pixel-space input together with the original image $x$. We find that such a simple solution suffices to bring improvements on top of many existing FGVC frameworks. To avoid ambiguity with existing notations, we abstract a FGVC solver into two parts, a feature extractor and a classifier. We now formulate a new generic way of FGVC training with both $x$ and the regions selected by $A_K$ as input:


where in practice we set $K = 1$, i.e. we only select the single most discriminative region in $A_K$ as one more image input for explicit localisation.

3 Experiments

Our main experiments are conducted on the CUB-Bird-200 dataset [wah2011caltech], which contains 11,788 images across 200 bird species. Due to the page limit, we omit implementation details here and welcome readers to find more in our codes upon paper publication. We first show how the learned knowledge, in the visual form of expert-exclusive discriminative regions, can help people with divergent levels of bird expertise to better recognise once unrecognisable birds. We then confirm that the learned attention $A_K$ is indeed attending to visual regions exclusive to domain experts, and that only this type of visual feedback (as opposed to non-exclusive attention) yields practically more interpretable and digestible knowledge for human participants. Given the lack of suitable baselines from existing FGVC research (elaborated in Sec. 1.1), we move on to ablate our key technical choice of adopting the image caption aggregate to represent the otherwise hard-to-quantify visual attention of domain novices. We wrap up the experiments by providing further analysis on $A_K$.

Figure 4: Top row: Box plots demonstrating the efficacy of $A_K$ / $A_{K'}$ in helping people reverse their failed decisions in bird recognition. The green triangle represents the mean performance. Bottom two rows: Sample illustrations of the Top K visual regions that $A_K$ / $A_{K'}$ attend to.

Data & participant setup.  We recruit 200 participants across different ages, genders and education levels, each of whom is expected to complete a questionnaire with 300 bird recognition tasks. In each task trial, the participant is given a query bird image and five gallery bird images, and asked to select the only image in the gallery that he/she believes belongs to the same bird sub-class as the query (Fig. 6). The difficulty of the task lies in the similarity level between the query and gallery images, where we define three challenge levels based on the biological bird stratification of Order-Family-Species: (i) Easy: gallery samples come from different bird orders. (ii) Medium: gallery samples are from different bird families but all belong to the same bird order. (iii) Hard: gallery samples only differ at the finest species level, i.e. they come from the same bird order and family. We assign a different score to a correct answer depending on the difficulty of the question (0.5, 1 and 1.5 points for easy, medium and hard respectively). This means that when the 300 tasks are decomposed into three subsets of [90, 120, 90] for each challenge level (the setting we adopt), the full mark is 300 points. We plot the normalised scoring histogram by counting the number of people falling in each of 10 discretised bins and observe an intuitive Gaussian-like bell-shaped distribution. Since our goal is to simulate a study that fairly covers people with divergent levels of bird expertise, we form three representative population groups by randomly selecting [15, 20, 15] participants from the first three bins (worst scorers), the fourth to seventh bins (medium scorers) and the last three bins (best scorers) respectively. These 50 participants and their bird recognition cases then form the basis for the experiments that follow.
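The scoring arithmetic above can be checked with a few lines. The per-level points and task counts are taken from the text; the equal-width binning in `histogram_bin` and the example answer counts are assumptions made for illustration.

```python
# Points per correct answer and number of tasks per level, as stated above.
points = {"easy": 0.5, "medium": 1.0, "hard": 1.5}
tasks = {"easy": 90, "medium": 120, "hard": 90}

full_mark = sum(points[level] * tasks[level] for level in points)  # 45 + 120 + 135
selected_trials = (15 + 20 + 15) * sum(tasks.values())             # 50 people x 300 tasks

def normalised_score(correct_counts):
    """Score in [0, 1] given the number of correct answers per level."""
    earned = sum(points[level] * correct_counts[level] for level in points)
    return earned / full_mark

def histogram_bin(score, n_bins=10):
    """Index (0-based) of the equal-width bin that a normalised score falls into."""
    return min(int(score * n_bins), n_bins - 1)

# Hypothetical participant: 80 easy, 60 medium and 20 hard answers correct.
example = normalised_score({"easy": 80, "medium": 60, "hard": 20})
```

With this weighting, the 90 hard tasks contribute 135 of the 300 points, so performance on hard cases dominates a participant's normalised score.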

Figure 5: Visualisations of the typical supporting regions for the classifiers in existing FGVC works. Panels: CVPR18 [wang2018learning], ECCV18 [yang2018learning], CVPR19 [chen2019destruction], CVPR20 [huang2020interpretable], ICCV21 [huang2021stochastic], ICCV21 [rao2021counterfactual]. Examples shown are copied directly from the original papers.
Figure 6: Sample questionnaire for measuring the efficacy of AI-empowered knowledge by simulating it as a book guide. In the data and participant setup stage, the expert-exclusive attention $A_K$ is not shown.

3.1 The Learned Knowledge Is Your Expertly Curated Bird Guide

Experimental method.  We now take the failed recognition cases from the 50 participants at the setup stage and examine the efficacy of the expert-exclusive attention $A_K$ in improving their recognisability. We also test the performance of $A_{K'}$, a practical alternative to $A_K$ when a fine-grained visual description of the image is lacking. We follow a similar “query-gallery” experimental procedure, the only difference being that the query is now highlighted with the knowledge provided by $A_K$ (Fig. 6). Paying extra attention to the AI-empowered knowledge, participants are required to re-make their decision of selecting the target image from the gallery that shares the same sub-class as the query. Participants may intentionally alter their decision because they are aware of their past failure, or keep it unchanged because of behavioural inertia and memorisation. We take three measures to mitigate this: (i) we repeat the tests of their successful recognition cases and interleave them between failed ones, while not including these results in our statistics; (ii) we require participants to leave a gap of at least 24 hours between any two experiments with different purposes; (iii) we randomise the order of the individual tasks and the display order of their gallery images. Our evaluation metric is twofold: average human correction percentage (CP) and average weighted human correction percentage (WCP). The former calculates the percentage of cases (2407 in total) in which a human participant has successfully reverted their erroneous recognition under the guidance of $A_K$, whereas the latter is a weighted version that first assesses the correction rate for cases in each challenge level and then weights it by the corresponding challenge points. Lastly, we rank the visual attentions of $A_K$ in descending order and always present the Top K visual regions to human participants. We experiment with K in [1,2,3,4,5,6,7]; in our pilot study, K = 7 already brings notably degraded performance, as people tend to feel uncomfortable and fail to focus when faced with too many visual cues.
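The two metrics can be sketched for a single participant as follows. The sketch assumes that WCP is the point-weighted mean of per-level correction rates, normalised by the total weight of the levels that actually occur; this normalisation is an assumption, and the case list is hypothetical.

```python
POINTS = {"easy": 0.5, "medium": 1.0, "hard": 1.5}

def correction_percentages(cases):
    """cases: list of (level, corrected) pairs for one participant's failed trials."""
    cp = sum(1 for _, corrected in cases if corrected) / len(cases)
    weighted, weight_total = 0.0, 0.0
    for level, point in POINTS.items():
        outcomes = [corrected for lvl, corrected in cases if lvl == level]
        if outcomes:
            weighted += point * sum(outcomes) / len(outcomes)
            weight_total += point
    return 100.0 * cp, 100.0 * weighted / weight_total

# Hypothetical participant: half of the easy and medium failures corrected,
# and both hard failures corrected.
cases = ([("easy", True)] * 2 + [("easy", False)] * 2 +
         [("medium", True)] * 2 + [("medium", False)] * 2 +
         [("hard", True)] * 2)
cp, wcp = correction_percentages(cases)
```

In this toy case CP is 60% while WCP is 75%, showing how WCP rises above CP whenever corrections concentrate on the harder levels.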


We graphically describe CP and WCP for each human participant via box plots of the five-number quartile summary [tukey1977exploratory] in Fig. 4. The following observations can be made: (i) Under both CP and WCP, the expert-exclusive attention $A_K$ provides the best performance when used to display its top 3 visual attentions, with mean values of 53.39% and 54.24% respectively. This provides compelling evidence that our learned $A_K$ is indeed extracting knowledge from an AI agent in a way that guides people towards better recognising an unknown bird. We further calculate the mean CP (mCP) and mean WCP (mWCP) values for people from the three scorer groups formed at the setup stage, where the small differences among groups (52.63%, 52.91%, 54.91% @mCP; 52.96%, 53.88%, 56.53% @mWCP) confirm that $A_K$ is friendly and effective for users with divergent levels of bird expertise. (ii) The difference between $A_K$ and its caption-free alternative $A_{K'}$ is marginal in the eyes of human participants. This is an important message, indicating that our framework can withstand the fatal but common lack of per-image fine-grained language descriptions with little performance sacrifice. (iii) WCP values are slightly larger than those of CP. This is expected. Given that $A_K$ / $A_{K'}$ is designed to offer the most subtle expert-exclusive visual cues for successful fine-grained recognition, it naturally works better on harder cases with more bonus points (California Gull vs. Western Gull) than on easier ones (Gull vs. Flamingo). (iv) There seems to exist a safety threshold (K < 7) on how many visual regions to display; when the threshold is violated, humans start to show general failure in digesting the visual knowledge provided by $A_K$ / $A_{K'}$. In line with psychological findings [grant2003eye, roads2016using], we ascribe this phenomenon to the fact that redundant visual distractors superimposed upon the most compact visual highlights can be very detrimental for people trying to gain attentional expertise in practice. Interestingly, we also show some common visualisation results of existing FGVC works in Fig. 5. We can see how they generally cover the full attention map of a bird and correspond roughly to a K ≥ 7 scenario (or even worse!) under $A_K$ / $A_{K'}$, indicating their natural unsuitability for human consumption as argued in Sec. 1.1. We also consult our human participants on why they perform drastically worse when K grows beyond a certain value, and their response is unanimous: “we don’t know how to make sense from the knowledge manifested in crowded and cluttered visual regions.”

3.2 The Learned Knowledge Works Because It Is Expert-Exclusive

In this section, we conduct a deeper probe into the learned attention $A_K$. Our goal is to show that $A_K$ is indeed distilling unique visual attentions from experts ($A_E$) that are not shared by domain novices ($A_N$), and that this very property of $A_K$ consequently helps human participants to better recognise a bird. A detailed analysis follows.

We first adopt Intersection over Union (IoU) to measure the correlation between the Top K rankings of two visual attention sequences: if the IoU between the expert attention $A_E$ and the novice attention $A_N$ is significantly larger than that between the expert-exclusive attention $A_K$ and $A_N$ before K grows impractically large, we know $A_K$ has successfully extracted the exclusive parts from $A_E$. Results in Fig. 7(a)(b) confirm that $A_K$ indeed shares negligible attentional overlap with $A_N$ for K up to 20, in stark contrast with the strongly correlated interplay between $A_E$ and $A_N$. To shed further light on the importance of refining $A_E$ to $A_K$ and its practical meaning as a form of knowledge to human participants, we calculate the expert label prediction accuracy with the combined visual cues from $A_N$ plus the Top K of $A_K$, and from $A_N$ plus the Top K of $A_E$, respectively (to obtain the accuracy at different values of K, we first compute a normalised mean feature representation of the selected regions before feeding it into the classifier). Our intuition is that if $A_K$ provides practically more useful complementary knowledge to what a human already knows, the former should reach a considerably satisfying performance at a much smaller K than the latter. In other words, the knowledge encoded in $A_K$ is more condensed and effective for humans to digest because of its expert-exclusive nature. Fig. 7(c) shows this is exactly the case, where 91.40% of label prediction performance is retained with only the single best visual attention in $A_K$, and up to 94.05% when K = 3. We also repeat the “query-gallery” experiment in Sec. 3.1 to figure out to what extent the two non-exclusive attentions ($A_E$ and $A_N$) can improve people’s bird recognisability in a human study. By examining their performance under both mCP and mWCP and comparing them with $A_K$ (40.02% and 47.05% vs. 53.39% @mCP, 39.56% and 45.51% vs. 54.24% @mWCP), we can fairly conclude that fine-grained visual knowledge being expert-exclusive does matter for practically more effective human consumption.
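The Top K overlap measure can be sketched as below, under the assumption that IoU is computed between the index sets of the K highest-weighted regions of two attention vectors (rather than between pixel masks). The three attention vectors are toy values chosen only to mimic the qualitative pattern reported above.

```python
def top_k_indices(attention, k):
    """Indices of the k largest attention weights."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return set(ranked[:k])

def top_k_iou(att_a, att_b, k):
    """Intersection over Union between the Top-k region sets of two attentions."""
    a, b = top_k_indices(att_a, k), top_k_indices(att_b, k)
    return len(a & b) / len(a | b)

# Hypothetical attentions over five regions.
expert = [0.40, 0.30, 0.15, 0.10, 0.05]
novice = [0.05, 0.35, 0.40, 0.15, 0.05]
knowledge = [0.60, 0.02, 0.03, 0.05, 0.30]

expert_vs_novice = top_k_iou(expert, novice, k=2)        # one shared region out of three
knowledge_vs_novice = top_k_iou(knowledge, novice, k=2)  # no shared region
```

A low overlap between the knowledge attention and the novice attention, alongside a noticeably higher overlap between expert and novice attention, is the pattern that indicates expert-exclusiveness.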

Figure 7: (a): Exemplified comparisons among the Top 3 visual attentions encoded by $A_E$, $A_N$ and $A_K$. (b)(c): Understanding our learned $A_K$ from two different aspects. Comfort zone: maximum number of visual regions for display that we find humans can practically make sense of (K < 7). More details in text.

3.3 Good Knowledge Solver Needs Human Language Input

A critical part of our framework at the design level is how to define and quantify the novice attention $A_N$ with the data at hand. Given the rich visual elements of an image and the subjective nature of human vision regarding their relative importance, deciding the best form of representing domain novice visual attention is indeed an art of choice. Our proposed method advocates the use of the human fine-grained caption aggregate of an image for learning $A_N$, and we compare it with several competitors below.

Table 1: Human annotated bubbles and drawings as alternatives to caption aggregate (Ours) for representing the novice attention $A_N$. Columns: Raw Image, Annotated Bubbles, Human Drawing, Caption Aggregate. Example caption aggregates: “This bird has a yellow-breasted, a black cheek patch, and a white superciliary…” and “This bird has long legs, short tan wings, a small bill, and grayish-brown belly feathers…”

Table 2: Performance comparisons (%) between different realisations of $A_N$ defined in Sec. 3.3. Columns: Ours, Bubble [deng2015leveraging], Drawing [wang2020progressive], Beginner [chang2021your]. Rows: mCP, mWCP.

Figure 8: Typical Top 3 visual attentions of $A_K$ when trained on the fine-grained flower dataset.

Table 3: $A_K$ improves existing FGVC methods when exploited for providing localisation information. Columns: Baseline and Ours on CUB-Bird-200, Baseline and Ours on Oxford-Flower-102. The table additionally marks results obtained using publicly released code and methods that use fine-grained text descriptions as side information. Rows:
B-CNN (ICCV15 [lin2015bilinear])
NTS (ECCV18 [yang2018learning])
PC (ECCV18 [dubey2018pairwise])
DCL (CVPR19 [chen2019destruction])
CrossX (ICCV19 [luo2019cross])
PMG (ECCV20 [du2020fine])
DeiT-B (ICML2021 [touvron2021training])
ViT-B-16 (ICLR2021 [dosovitskiy2020image])
CVL (CVPR17 [he2017fine])
PMA (TIP20 [song2020bi])

Competitors.  We include three competitors for different conceptualisations of the novice attention $A_N$: (1) Discriminative bird bubbles: These annotated circular bird regions (Tab. 1), namely “bubbles” [deng2015leveraging], are collected via a novel online game aiming to reveal the most discriminative parts of a bird image. We aggregate the available bubbles of an image from multiple players and use their mean ImageNet pre-trained feature representation to learn $A_N$. (2) Human bird drawings: CUB-200-Painting [wang2020progressive] is an extension of the CUB-200-2011 bird dataset, which contains diverse forms of human drawing (Tab. 1) aiming to visually interpret a fine-grained bird species, including watercolors, oil paintings, sketches and cartoons. We aggregate the human drawings under one bird species and use their mean ImageNet pre-trained feature representation to learn $A_N$. (3) Junior bird expert: We also model $A_N$ as a beginner-level bird specialist that can differentiate between 13 bird subclasses at order level [chang2021your], instead of the finer recognition of 200 subclasses at species level required of a bird expert. For this, we train a 13-way classification model and adopt it (like the ImageNet pre-trained feature) to learn $A_N$.

Results.  We follow the same “query-gallery” experimental procedure as in Sec. 3.1 to evaluate the efficacy of the knowledge $A_K$ obtained under the different realisations of the novice attention $A_N$ described above. We report the results in Tab. 2 and confirm the significance of our technical choice of using the fine-grained caption aggregate to represent what a domain novice can perceive from an image. Interestingly, even the worst choice of $A_N$ (Bubble) still yields knowledge that outperforms the two non-exclusive attentions (48.40% vs. 47.05% and 40.02% @mCP, 46.98% vs. 45.51% and 39.56% @mWCP), stressing again the importance of our expert-exclusive modelling.

3.4 Further Analysis on the Learned Knowledge

$A_K$ works beyond birds.  We repeat the learning process of the expert-exclusive attention $A_K$ on the fine-grained flower dataset [nilsback2008automated] and conduct the same human study pipeline as with birds. The mCP performance of 69.39% when displaying the Top 3 visual regions from $A_K$ confirms its efficacy as digestible knowledge in helping human participants to better recognise an unknown flower type. We show some visualisations of $A_K$ in Fig. 8.

$A_K$ improves FGVC performance.  We embed the expert-exclusive attention $A_K$ into existing FGVC frameworks to help the model better localise the most discriminative visual regions (details in Sec. 2.4). We confirm in Tab. 3 that $A_K$ is indeed a promising universal FGVC booster regardless of the base model it is built upon. Notably, our result also improves over CVL and PMA, two methods that have already specifically bet on fine-grained textual information for more discriminative attention modelling.

4 Conclusion

Results from a human study indicate that our method is able to obtain useful human-digestible knowledge from a FGVC model that significantly improves participants’ ability to distinguish between fine-grained objects. This is made possible by first representing knowledge as the visual regions attended to exclusively by domain experts and then modelling it with a multi-stage cross-modal learning framework. By taking a first and firm step towards enabling a FGVC model as a human knowledge provider, this work could apply to a wide range of end-user applications that require fine-grained recognition outputs to be accessible for human consumption. Last but not least, we hope to have caused a stir and to help trigger discussions on how to make a heavily-invested AI expert system do more good for all.