Local Explanations for Clinical Search Engine results

10/19/2021
by   Edeline Contempré, et al.
Vrije Universiteit Amsterdam

Health care professionals rely on treatment search engines to efficiently find adequate clinical trials and early access programs for their patients. However, doctors lose trust in the system if its underlying processes are unclear and unexplained. In this paper, a model-agnostic explainable method is developed to provide users with further information regarding the reasons why a clinical trial is retrieved in response to a query. To accomplish this, the engine generates features from clinical trials using a knowledge graph, clinical trial data, and additional medical resources, and a crowdsourcing methodology is used to determine their importance. Grounded on the proposed methodology, the rationale behind retrieving the clinical trials is explained in layman's terms so that healthcare professionals can readily understand it. In addition, we compute an explainability score for each of the retrieved items, according to which the items can be ranked. Experiments validated by medical professionals suggest that the proposed methodology induces trust in targeted as well as non-targeted users, and provides them with reliable explanations and a ranking of retrieved items.



1. Introduction

When healthcare professionals (HCPs) use a treatment search engine to find treatment options for their patients, they need to develop a certain trust in the system. While accuracy, performance, and design are essential for establishing such trust, more may be needed to reach the threshold at which HCPs trust the system enough to use it in critical scenarios, e.g., when a patient's life may be at risk.

A search engine typically provides a ranked list of items related to a given query. However, a lack of explanations can lead to a lack of trust, as users do not understand the underlying logic of retrieving an item in response to a query. In the medical domain, where the pressure to make no mistakes is high, incorrectly attributing the cause of a mistake could be fatal. As a result, without the ability to interpret the model, HCPs' trust in the model decreases and they will, ultimately, not use the model's outputs (Pu and Chen, 2006). In addition, due to compliance regulations, most search engines in the medical domain provide unordered lists of related items in response to a query, making it difficult for users to identify the items most relevant to their needs. For instance, treatment search engines such as clinicaltrials.gov do not offer relevance-based ranking of documents, partly because ranking in-criteria treatment options may be suspected of favoritism, which is strictly prohibited for clinical trials (5). Moreover, for an efficient and thorough search, a status-, date-, or title-based ordering may not always be the most practical for end users, as it tends to scatter similar results away from each other.

A potential solution for providing explanations is to use existing explainable methods. One class of such methods provides only global explanations, so that the search engine can be evaluated as a whole, but cannot provide proper explanations for each individual query (Dam et al., 2018). Another option is to use methods for local explainability such as LIME (Das and Rad, 2020) and SHAP (Lundberg and Lee, 2017), as these explain how or why a specific result is produced. However, these methods are designed for machine learning problems such as classification and regression, not for explaining search engine results. Where local explanations have been used to explain search results, they were either based on a single document feature such as word prominence (Verma and Ganguly, 2019), or not applicable to clinical trials because they rely on user reviews for each document (Catherine et al., 2017), which are unavailable for clinical trials.

Another major challenge is to provide HCPs with user-friendly, reliable, and easy-to-understand explanations. A major drawback of current explainability techniques, such as LIME (Das and Rad, 2020), is that they tend to focus on helping users with technical backgrounds interpret the system. HCPs cannot universally be expected to understand the detailed workings of a complex retrieval model. Thus, HCPs require explanations that are high level and intuitive, while these explanations do not necessarily need to reflect the exact inner workings of retrieval models. This opens up an opportunity to build explanation models that are not sensitive to minor changes in the search engine's underlying model. For instance, with a treatment search engine, we may offer a local explanation such as "the queried disease is mentioned in the retrieved clinical trial's title", and as long as the retrieval model relies on this as a feature, the explanation method remains generic, and we can develop the search engine and the explanation module fairly separately in practice.

In this paper, an explainability method is developed that provides medical practitioners with tailored explanations for retrieved items. To that end, meaningful features are extracted from clinical trials using different data sources, and the preferences of different users are elicited through a crowdsourcing-based methodology. We then put forward a method to translate these preferences into feature importance levels. Based on the features' importance levels, tailored explanations are generated for each specific query and presented to users via sentence templates. In addition, we introduce explainability scores, according to which retrieved items are ordered. The results suggest that applying local explainability to clinical search engines promotes HCPs' trust, search experience, and result-ordering satisfaction.

2. Related Work

The General Data Protection Regulation (GDPR) has, since 2016, required all systems that collect personal data to be transparent about how the data are used. Researchers have since created surveys on explainability (Zhang and Chen, 2018; Došilović et al., 2018; Adadi and Berrada, 2018), defined different types of explainability (Sheh and Monteath, 2018), and used explainability in a variety of sectors. However, to date there is no definition of explainability in the Oxford Dictionary (see https://www.oxfordlearnersdictionaries.com/spellcheck/english/?q=explainability, visited on 15/10/2020) (Rosenfeld and Richardson, 2019). Researchers still debate what explainability means and what shape it takes, which indicates that it is not a monolithic concept.

To clarify our vision of explainability, we identified three main dimensions of explainability that recur throughout researchers' definitions: audience, understanding, and transparency. Understanding refers to the user's ability to understand the model's results. However, not all users can interpret all models through explanations, as models can be domain specific. For example, users without a background in biology would struggle to understand the highly technical biological terms generated by the explanations of a model attempting to diagnose a certain type of lung cancer. Conversely, explainable AI (XAI) could use overly simple terms, leaving out details the doctor needs to assess the diagnosis. The user is therefore required to have a certain amount of knowledge to understand the explanation itself, making it crucial for developers using explainability to target their audience (Rosenfeld and Richardson, 2019). Lastly, an explainable method should increase the model's transparency by making it more interpretable for its users, and not generate seemingly arbitrary explanations that do not fit how the model works (Dimanov et al., 2020).

One such audience is health care professionals (HCPs), who increasingly rely on artificial intelligence models for decisions such as diagnosing or providing treatment options. The use of AI further increases when the machine provides explanations, as HCPs use these elaborations to justify their decisions. For example, if a model suggests diagnosing a patient with lung cancer, the HCP would want a justification for this suggestion. HCPs have also reported a need to understand the cause of learned representations (Holzinger et al., 2019). Moreover, HCPs reported that a somewhat less accurate model does not cause major complications, but that they always want to know its potential shortcomings. Therefore, local explanations help HCPs construct their conclusions.

Local explanations increase HCPs' trust. Although providing local explanations about features is a straightforward solution, it does not aid all clinicians in all departments. HCPs working in the intensive care unit and in the emergency department reported that local explanations were not useful. The two departments have time as a common denominator: HCPs in these departments often lack time because patients with a short prognosis may only have a few hours to live if the HCP does not act. Local explanations may, therefore, not be useful to users lacking time.

Current state-of-the-art local explainability techniques do not provide user-friendly explanations. These techniques are based on feature importance, such as LIME (Das and Rad, 2020) and SHAP (Lundberg and Lee, 2017), rules (Verma and Ganguly, 2019), saliency maps (Mundhenk et al., 2019), prototypes (Gee et al., 2019), examples (Dave et al., 2020), or counterfactual explanations (Dave et al., 2020). To date, feature importance (Zhang et al., 2019) and rule-based techniques (Verma and Ganguly, 2019) have been applied to search engines, but they do not meet the criterion of being user friendly.

LIME is a local explainability method aiming to increase transparency for specific decisions given by an opaque model. It explains a single result by letting users know why they are getting this specific result over another (Verma and Ganguly, 2019). Although LIME offers one way to address the black-box problem, it has limitations. First, it is most commonly used for linear or classification models (Arrieta et al., 2020), which limits the types of models it can meaningfully be applied to, and its explanations are not user friendly. Consequently, this research does not use LIME, but instead develops a local explainability method that orders results and generates user-friendly explanations.

3. Explainable Search Engine

This section presents the proposed model, which provides explanations to its users and orders a clinical search engine's results. This enables users to efficiently find potentially relevant clinical trials while understanding the underlying processes of the model. The proposed method generates local explainability scores for each clinical trial and uses these scores to order the search engine's results. Moreover, users are shown user-friendly explanations describing which features are available in each clinical trial. The prerequisite for computing the explanations and the ordering is the engineering of features from different data sources.

Figure 1 shows the pipeline of the proposed methodology. The search engine takes the user's query as input and returns explainability-ordered results with explanations. Figure 1 shows that the local explainability search engine combines resources with the HCP's query to engineer features. These features are then attributed local explainability scores, which are used to order the list of clinical trials. In addition, the engineered features' outputs fill template sentences. These explanations inform the user how much of a clinical trial the search engine can explain. In the following sections, each module in Figure 1 is discussed in more detail.

Figure 1. Overview of the methods’ pipeline. Resources include the knowledge graph, data from UMLS, clinical trials from CT.gov, and data from Pubmed.

3.1. Feature Engineering

Before engineering the features linked to clinical trials, we first present the data sources according to which features for each clinical trial are extracted.

Table 1 provides an overview of the data sources used in the proposed model. First, we use data from the UMLS (Unified Medical Language System) (Bodenreider, 2004), an official medical database in which conditions, diseases, infections, and more are associated with Concept Unique Identifiers (CUIs). Second, clinical trial sources are used, primarily clinicaltrials.gov (https://www.clinicaltrials.gov/), the largest clinical trial repository. Third, the database comprises PubMed publications associated with clinical trials. Lastly, it comprises the company's conditions graph (a knowledge graph), in which parent-child relations between diseases are defined and terms, their clarifications, and their synonyms are specified.

Data source Data
UMLS Concept Unique Identifiers, disease terms, and relations between these
clinicaltrials.gov Clinical trials’ detailed descriptions, summaries, phase, title, overall status, primary purpose
Pubmed Publications associated to clinical trials
Knowledge graph Parent-child relationships between diseases taken from UMLS, disease concepts (with the diseases’ preferred term and synonyms), language.
Table 1. Description of the data sources for the feature engineering.

Table 1 shows the properties from the different data sources that were used to engineer features. The properties by themselves do not measure how explainable a clinical trial is. Therefore, features were created using the conditions graph and the UMLS and PubMed databases to assess how much of a clinical trial the AI can explain. The engineered features are listed in Table 2.

Binary features. Query dependent: query in title, preferred term in title. Query independent: clinical stage present, stage is recruiting, overall status given. Output: 0 or 1.
Numeric features. Query dependent: query in summary, preferred term in summary, query in detailed description. Query independent: number of publications. Output: between 0 and infinity.

Table 2. Classification of features created for the local explainability based search engine.

To facilitate the explainability calculations, features were assigned to different categories. Table 2 shows these classifications: the first is based on the user's query, i.e., a feature is either query dependent or query independent. For example, the feature query in title is query dependent because it depends on a match between the query and the title of the clinical trial; if a clinical trial about breast cancer mentions the condition in the title, the feature receives a score of 1. In contrast, query independent features do not depend on the query: regardless of what the query is, their scores remain unchanged. For example, the number of publications attributed to a clinical trial remains fixed regardless of the user's query. Second, Table 2 shows that the engineered features have two distinct output types, binary or numeric. Binary features assess the presence of a feature in a study, so their output is either 0 or 1. Numeric features count the occurrences of a feature, so their outputs range between 0 and infinity.
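To make the classification in Table 2 concrete, the sketch below computes the listed features for a single clinical trial in Python. The trial dictionary, its field names, and the helper function are assumptions made for this illustration, not the engine's actual data model.

```python
import re

def count_occurrences(term: str, text: str) -> int:
    """Count case-insensitive whole-word occurrences of a term in a text."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE))

def engineer_features(query: str, preferred_term: str, trial: dict) -> dict:
    """Query-dependent and query-independent features, binary (0/1) or numeric (>= 0)."""
    return {
        # Query-dependent, binary
        "query_in_title": int(count_occurrences(query, trial["title"]) > 0),
        "preferred_term_in_title": int(count_occurrences(preferred_term, trial["title"]) > 0),
        # Query-dependent, numeric
        "query_in_summary": count_occurrences(query, trial["summary"]),
        "preferred_term_in_summary": count_occurrences(preferred_term, trial["summary"]),
        "query_in_detailed_description": count_occurrences(query, trial["description"]),
        # Query-independent, binary
        "stage_available": int(bool(trial.get("phase"))),
        "overall_status_given": int(bool(trial.get("overall_status"))),
        "trial_is_recruiting": int(trial.get("overall_status") == "Recruiting"),
        # Query-independent, numeric
        "number_of_publications": len(trial.get("publications", [])),
    }

# Example usage with a toy trial record
trial = {
    "title": "A Phase II Study of Drug X in HIV Patients",
    "summary": "This study evaluates Drug X in adults living with HIV.",
    "description": "Participants with HIV infection will receive Drug X over 12 weeks.",
    "phase": "Phase 2",
    "overall_status": "Recruiting",
    "publications": ["PMID:12345", "PMID:67890"],
}
print(engineer_features("HIV", "Human immunodeficiency virus infection", trial))
```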

3.2. Feature Importance Identification: A Crowdsourcing Approach

We created a statistical approach to determine the weights of our features and conducted a crowdsourcing task on Amazon SageMaker as an alternative way to collect data on feature importance. Compliance regulations prohibit pharmaceutical companies from retaining data on their users, especially when the data relate to drugs (5). However, previous research has shown that crowd workers provide answers of quality comparable to domain experts when conducting medical labeling tasks (Dumitrache et al., 2013, 2017). Hence, in this experiment, 1116 responses were collected from participants to determine users' feature preferences, which were then used to order and explain results returned by the clinical search engine.

Feature importance was measured using a cold-start implicit strategy in which we asked participants to rate explainability sentences. Ratings were given on a 5-point Likert scale, from "Not convincing at all" to "Very convincing". Each sentence explained the prominence or availability of a feature from Table 2. In addition, we varied the format of the sentences to implicitly measure which sentence format users preferred, in order to create user-friendly explanations. Thus, when participants rated how convincing an explanation was in persuading them to read the clinical trial in further detail, we were implicitly measuring how important a certain feature was for our users.

We hypothesized that search features are not equally preferred by users, and that different formulations of explanations are not equally preferred either. We tested three dimensions of sentence formulation: numeric vs. non-numeric entities ('3 times' vs. 'multiple times' in an explanatory sentence), action-oriented vs. fact-driven formulations ('retrieved' vs. 'clearly mentioned'), and disease-specific vs. non-disease-specific outputs ('HIV' vs. 'condition').

3.2.1. Results

Features Mean Std dev
Query in detailed description 3.69 0.82
Query in summary 3.53 0.91
Primary purpose availability 3.53 0.84
Number of publications 3.51 0.92
Stage availability 3.44 0.99
Query in title 3.15 0.93
Trial is recruiting 3.13 1
Table 3. Features' means and standard deviations.


Features Title Summary Description Publications Stage Recruiting Primary purpose
Title / / / / / / /
Summary 0.007 / / / / / /
Description 0.00002 0.51 / / / / /
Publications 0.02 0.61 0.35 / / / /
Stage 0.038 0.67 0.15 0.88 / / /
Recruiting 0.81 0.006 0.00001 0.01 0.054 / /
Primary purpose 0.012 0.82 0.42 0.82 0.43 0.006 /
Table 4. Results of the feature labeling task (pairwise p-values of chi-square tests). Note: p-values below 0.05 indicate a statistically significant difference in preference.

The results suggest that a partial ordering of feature importance can be obtained via crowdsourcing tasks and statistical tests. The results in Table 3 show that the feature with the highest mean score (3.69 on a 5-point Likert scale) was Query in detailed description, whereas the two least convincing features were Query in title and Trial is recruiting (3.15 and 3.13, respectively). We determined the weights of our features using chi-square tests. Table 4 provides the results of these tests; for example, the features Query in title and Query in summary were not equally preferred (p-value of 0.007).

The data obtained on feature importance thus show that at least a partial ordering can be obtained via crowdsourcing and statistical tests. We determined the features' weights based on these chi-square test results: if two features were not statistically equally preferred, they were attributed different weights.
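As an illustration of how such a pairwise comparison could be carried out, the sketch below runs a chi-square test on hypothetical rating counts for two features. The paper reports chi-square tests (Table 4) but does not specify the exact contingency layout, so the 2x5 table of Likert-level counts used here is an assumption.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of Likert ratings (1 = "Not convincing at all" ... 5 = "Very convincing")
ratings_query_in_title = [10, 25, 60, 45, 20]    # e.g. feature "Query in title"
ratings_query_in_summary = [5, 15, 50, 60, 30]   # e.g. feature "Query in summary"

# 2x5 contingency table: rows are features, columns are Likert levels.
observed = [ratings_query_in_title, ratings_query_in_summary]
chi2, p_value, dof, expected = chi2_contingency(observed)

# If the rating distributions differ significantly, the two features receive
# different weights; otherwise they are treated as equally preferred.
alpha = 0.05
equally_preferred = p_value >= alpha
print(f"chi2={chi2:.2f}, p={p_value:.3f}, equally preferred: {equally_preferred}")
```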


Entity Mean (1) Mean (2) P-value
(1) Non-numerical vs. (2) Numerical 3.70 3.34 0.01
(1) 'Clearly mentioned' vs. (2) 'Retrieved' 3.65 3.33 0.036
(1) Specify disease vs. (2) Do not specify disease 3.40 3.48 0.44
Table 5. Experiment results for entities. Note: p-values below 0.05 are statistically significant.

For the three formulation dimensions (Table 5), the chi-square tests show that:

  • Users prefer non-numerical sentences (e.g. sentences mentioning that there are 'multiple' articles linked to the clinical trial rather than '2').

  • Users prefer factual sentences (‘clearly mentioned’) compared to actions related to the search procedure (‘retrieved’).

  • No preference was found between specifying the condition in a sentence (e.g. 'the condition HIV was mentioned in the title') and using the generic phrase ('the condition').

3.3. Explainability Score: Ordering Retrieved Items

In this section, we use the importance of the extracted features to compute an explainability score for each clinical trial and order the trials accordingly. To do so, features are first assigned a weight, the weights are used to calculate the explainability score, and the clinical trials are ultimately ordered based on these scores.

Each clinical trial was attributed an explainability score based on the availability or occurrence of its features. Certain scores are fixed, while others depend on the user's query; the former come from query independent features and the latter from query dependent features. Query dependent and query independent contributions were therefore calculated separately (Table 2 reports which features belong to each category).

Although query dependent and query independent scores were calculated separately, all explainability feature scores were calculated in the same manner:

$E_f = w_f \cdot x_f$,

where the explainability score $E_f$ of a feature $f$ is determined by its weight $w_f$ (which depends on whether the feature is binary or numeric, and whether it is of high or low importance) and the feature's score $x_f$. Binary scores are determined identically for all binary features: $x_f = 1$ if the feature is present and $x_f = 0$ if it is unavailable. Numeric feature scores, however, are calculated from each feature's number of occurrences.

As previously mentioned, the process of calculating the scores differs for query dependent and query independent features. For query dependent features, because all terms related to one CUI refer to the same condition, all scores related to one CUI were grouped per clinical trial:

$S^{dep}_{t,c} = \sum_{f \in F_{dep}} E_f$,

where the explainability scores of the query dependent features $F_{dep}$ are grouped per CUI $c$ for each clinical trial $t$. On the other hand, the scores of features belonging to the query independent category were calculated as:

$S^{indep}_{t} = \sum_{f \in F_{indep}} E_f$,

where all scores are attributed to their respective studies and linked to all of the study's CUIs. Therefore, as long as the HCP queries a condition related to the clinical trial, this score for that clinical trial remains unchanged. Finally, the two scores are summed per CUI, which gives the final explainability score per clinical trial per CUI:

$XAI_{t,c} = S^{dep}_{t,c} + S^{indep}_{t}$,

where $XAI_{t,c}$ is used to order the clinical trials via linear feature ranking. Explainability scores range between 0 and 1; clinical trials with scores close to 1 are trials about which the search engine can explain more than about trials with scores close to 0. Hence, the clinical trials linked to the queried CUI with the highest XAI scores for that CUI appear higher in the result list.

The algorithm therefore orders the clinical trials by their explainability scores $XAI_{t,c}$. To do so, it takes the HCP's queried condition as input, searches for the query term in the database, and identifies the CUI associated with the condition. It then filters the clinical trials to keep only the studies related to that CUI, producing a list of all trials related to the condition the HCP queried. Lastly, this list is ordered by explainability score: the trials with the highest scores receive the highest positions and those with the lowest scores receive the lowest positions.
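A condensed sketch of the scoring and ordering step is shown below, reusing the feature names from the earlier sketch and the notation above. The weight values, the clipping of numeric feature scores to 1, and the normalization by the total weight are assumptions introduced so that the final score stays in [0, 1] as described; they are not the authors' exact implementation.

```python
QUERY_DEPENDENT = {"query_in_title", "preferred_term_in_title", "query_in_summary",
                   "preferred_term_in_summary", "query_in_detailed_description"}
QUERY_INDEPENDENT = {"stage_available", "overall_status_given",
                     "trial_is_recruiting", "number_of_publications"}

# Hypothetical weights, roughly following the preference order in Table 3.
WEIGHTS = {"query_in_detailed_description": 3.0, "query_in_summary": 2.0,
           "preferred_term_in_summary": 2.0, "number_of_publications": 2.0,
           "stage_available": 2.0, "query_in_title": 1.0,
           "preferred_term_in_title": 1.0, "overall_status_given": 1.0,
           "trial_is_recruiting": 1.0}

def explainability_score(features: dict) -> float:
    """XAI(t, c) = S_dep(t, c) + S_indep(t), normalized to [0, 1].

    Numeric feature scores are clipped to 1 so that a single feature cannot
    dominate (an assumption; the paper does not detail the normalization)."""
    def weighted(names):
        return sum(WEIGHTS[n] * min(float(features.get(n, 0)), 1.0) for n in names)
    s_dep = weighted(QUERY_DEPENDENT)      # grouped per CUI for each trial
    s_indep = weighted(QUERY_INDEPENDENT)  # fixed per trial, independent of the query
    return (s_dep + s_indep) / sum(WEIGHTS.values())

def order_trials(cui: str, trials_by_cui: dict, features_by_trial: dict) -> list:
    """Return trial ids linked to the queried CUI, ordered by explainability score."""
    candidates = trials_by_cui.get(cui, [])
    return sorted(candidates,
                  key=lambda t: explainability_score(features_by_trial[t]),
                  reverse=True)
```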

3.4. Retrieval Explanations

Having extracted the features and computed their importance, this section describes how the retrieved items are explained in response to a user's query. The explanations must be phrased so that HCPs can readily understand them. To this end, we developed a list of template sentences, shown in Table 6. These sentences are simple, user friendly, hierarchically structured, short and straightforward; they state the source of the information and cover why a result was returned.


Feature Template sentence
Query in title The condition is mentioned in the title
Preferred term in title The preferred term of the condition is mentioned in the title
Query in summary The condition is mentioned in the summary
Preferred term in summary The preferred term of the condition is mentioned in the summary
Query in detailed description The condition is mentioned in the detailed description
Preferred term in detailed description The preferred term of the condition is mentioned multiple times in the description
Number of publications The clinical trial has multiple publications
Stage availability The clinical trial’s stage is clearly mentioned
Overall status availability The clinical trial’s status is clearly mentioned
Trial is recruiting The clinical trial’s status is recruiting
Table 6. Template sentences created for the explainability based search engine.

Sentences were created according to the following rules (a sketch of how they combine is shown after the list):

  1. A maximum of three sentences at a time are displayed. We assume that, given the limited amount of time HCPs spend on the search engine, a maximum of three sentences will be enough for the HCP to read.

  2. Sentences are only displayed if certain conditions are met. Given the limited time available for this research, the thresholds were determined based on domain intuition. This ensures users only see relevant explanations.

  3. Sentences are ordered by feature preference. The order in which sentences are displayed relies on the results of the experiment described in sub-section 3.2.1.

  4. The sentences are kept simple. To understand which formulation of sentences users prefer, we researched entity preferences as described in sub-section 3.2.
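Taken together, the rules above can be expressed as a small selection routine. The sketch below is an illustration only: the importance ordering roughly follows Table 3 and the thresholds are assumed, not taken from the authors' implementation.

```python
# Template sentences per feature (abridged from Table 6).
TEMPLATES = {
    "query_in_detailed_description": "The condition is mentioned in the detailed description",
    "query_in_summary": "The condition is mentioned in the summary",
    "number_of_publications": "The clinical trial has multiple publications",
    "stage_available": "The clinical trial's stage is clearly mentioned",
    "query_in_title": "The condition is mentioned in the title",
    "trial_is_recruiting": "The clinical trial's status is recruiting",
}

# Features ordered by crowdsourced preference (most convincing first, cf. Table 3),
# and assumed minimum feature values required before a sentence is shown.
PREFERENCE_ORDER = ["query_in_detailed_description", "query_in_summary",
                    "number_of_publications", "stage_available",
                    "query_in_title", "trial_is_recruiting"]
THRESHOLDS = {"number_of_publications": 2}   # e.g. only mention publications if >= 2

def select_explanations(features: dict, max_sentences: int = 3) -> list:
    """Pick at most three template sentences, ordered by feature preference."""
    sentences = []
    for name in PREFERENCE_ORDER:
        value = features.get(name, 0)
        if value >= THRESHOLDS.get(name, 1):   # default: feature must be present
            sentences.append(TEMPLATES[name])
        if len(sentences) == max_sentences:
            break
    return sentences

print(select_explanations({"query_in_summary": 2, "stage_available": 1,
                           "trial_is_recruiting": 1, "number_of_publications": 1}))
# -> summary, stage, and recruiting sentences (publications suppressed by its threshold)
```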

4. Model Evaluation

We evaluated our model by comparing it to other simulated clinical search engines in terms of users' trust, search experience, and result-ordering satisfaction. Our null hypotheses were that all search engines are equally preferred in all three dimensions. We therefore simulated 5 different search engines named after cities, where each engine was queried for either Lyme disease, breast cancer, or HIV:

  1. Amsterdam: Search engine with ordered results and with explainable sentences

  2. Berlin: Search engine with ordered results and without explainable sentences

  3. Copenhagen: Search engine without ordered results and with explainable sentences

  4. Dublin: Search engine without ordered results and without explainable sentences

  5. Edinburgh: Search engine with titles ordered by alphabetical order

The engines used data from myTomorrows (https://search.mytomorrows.com/public) in order to create scenarios that are as realistic as possible. Each query concept was issued to the search engine, the top 10 results were extracted, and these were placed in the simulated environments to imitate the first page of a search engine showing 10 results at a time.

4.1. Experiment setup

Participants were recruited via social media platforms such as Facebook and LinkedIn, or within the company itself. Participants received a link to a questionnaire focusing on one of the query concepts (Lyme disease, HIV, or invasive breast cancer). In each questionnaire, participants were shown the different simulated search engines related to the query concept one by one, in random order. Participants were asked to:

  • Assess if they trusted the search engine.

    • Question asked: When looking at the search engine, how much do you trust the search engine?

    • Possible answers:

      1. I trust this search engine very much

      2. I trust this search engine

      3. My trust is neutral

      4. I do not trust this search engine

      5. I do not trust this search engine at all

  • Assess if they were satisfied with the ordering of the search engines’ results

    • Question asked: When looking at the search engine, are you satisfied with the ordering of clinical trials?

    • Possible answers:

      1. I am very satisfied with the ordering

      2. I am satisfied with the ordering

      3. I feel neutral

      4. I am not satisfied with the ordering

      5. I am highly not satisfied with the ordering

  • Assess their search experience while using the search engine

    • Question asked: What is your search experience when using the search engine?

    • Possible answers:

      1. I have a great search experience

      2. I have a good search experience

      3. My search experience is neutral

      4. My search experience is not good

      5. My search experience is not good at all

In the end of the questionnaire, participants were asked to order the different search engines by:

  • Trust: users had to order the search engines from most trustworthy to least trustworthy.

  • Result ordering satisfaction: users were asked to order the search engines from highest result ordering satisfaction to lowest result ordering satisfaction.

  • Search experience: users were asked to order the search engines from best search experience to least favourite search experience.

An example of one of the simulated search engines is shown in Figure 2.

Figure 2. Example of a simulated search engine.

In total, we created 9 questionnaires which were randomly allocated to 55 participants, among which 34 completed the experiment.

4.2. Results

Dimension All participants HCPs Non-HCPs
Trust 0.29 0.44 0.62
Search experience 0.04* 0.09 0.37
Ordering 0.31 0.40 0.48

Note: p-values marked with * are statistically significant under the assumption p < 0.05. "All participants" refers to the combined results of HCPs and non-HCPs.

Table 7. Statistical test results of the comparison of all search engines to each other.

In the experiment, we asked participants to evaluate the different search engines one by one and report their experience. We conducted statistical tests of our null hypotheses that all search engines are equally preferred in all three dimensions. Table 7 reports the test results and shows that, when combining the responses of HCPs and non-HCPs, the null hypothesis that all search engines provide an equal search experience is rejected (p-value of 0.04). This suggests that users have different search experiences across the search engines. The remainder of this section investigates how all participants (HCPs and non-HCPs), HCPs alone, and non-HCPs alone evaluated the different search engines.
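The paper does not name the statistical test behind Table 7. As an illustration, a Kruskal-Wallis test is one common choice for comparing ordinal Likert responses across several conditions; the sketch below applies it to made-up search-experience ratings for the five simulated engines.

```python
from scipy.stats import kruskal

# Hypothetical 1-5 search-experience ratings per simulated engine (1 = best experience).
ratings = {
    "Amsterdam":  [1, 2, 2, 1, 3, 2],
    "Berlin":     [3, 3, 2, 4, 3, 2],
    "Copenhagen": [2, 1, 2, 2, 3, 1],
    "Dublin":     [4, 3, 4, 3, 5, 4],
    "Edinburgh":  [3, 4, 3, 4, 3, 5],
}

# Null hypothesis: all engines yield the same distribution of ratings.
statistic, p_value = kruskal(*ratings.values())
print(f"H={statistic:.2f}, p={p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis: search experience differs between engines.")
```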

The results of the tasks asking participants to order the search engines from most to least trusted and from best to worst search experience suggest that both HCPs and non-HCPs reported more trust and a better search experience for the search engines with explainability sentences (Amsterdam and Copenhagen) than for the search engines that do not explain their results (Berlin, Dublin and Edinburgh). Figure 3 and Figure 4 illustrate these preferences.

Figure 3. Results: order the search engines from most trusted to least trusted.

Note: Results close to 1 indicate the most preferred search engines. On the contrary, results close to 5 are the least preferred search engines.

Figure 4. Results of ordering task: order the search engines from best search experience to worst search experience.

Note: Results close to 1 indicate the most preferred search engines. On the contrary, results close to 5 are the least preferred search engines.

The results of the task asking participants to order the search engines from best to worst result-ordering satisfaction are reported in Figure LABEL:figure_results_order_ordering. Similar to the results for search experience, search engines that include explanations of the results (Amsterdam and Copenhagen) improve ordering satisfaction. In addition, HCPs ranked the search engine with explainability-based ordering but without explanations (Berlin) last. This further demonstrates that failing to explain how the result ordering works reduces result-ordering satisfaction.

5. Discussion

When asked to report their preferred order of search engines, participants consistently preferred, in all three dimensions, the search engines Amsterdam and Copenhagen. These two engines include explainability-based ordering with explanation sentences, and explanation sentences only, respectively. However, we noticed that the search engine Berlin scored low, even last, in the dimension of ordering satisfaction with HCPs. This suggests that explainability-based ordering is preferred only when it is explained with user-friendly sentences. This is in line with (Pu and Chen, 2006), where the authors demonstrated that explainability increases trust overall. A likely reason is that, without explainability sentences, users understand less of the logic behind the model and therefore attribute lower scores to the search engine Berlin. This reasoning is further supported by the significant difference in ordering preference between Dublin and Edinburgh for non-HCPs, who prefer Edinburgh because the logic of that search engine is straightforward, which can increase user satisfaction.

HCPs' results on user experience, trust, and ordering satisfaction could be influenced by their expertise. When assessing the different dimensions of the search engines, HCPs may look for familiarity and therefore search for clinical trials within their field of experience; for example, radiologists would pursue clinical trials related to radiology. In addition, a common problem in the medical domain is the discrepancy in patient diagnosis between HCPs. Research has shown that subjective preferences influence a diagnosis' outcome (Gierada et al., 2008), which could explain why HCPs' responses vary widely when diagnosing a patient. It follows that subjective preferences could have influenced the results of the evaluation of our explainability-based search engine.

Although the model is scalable and generalizable, the features created for this use case are not transferable to other search engines. Features need to be adapted to other models' use cases, as most features created in this research are specific to clinical trials. For example, a search engine returning a list of travel destinations would not benefit from the feature 'the clinical trial is recruiting'. Transferring the model as-is to another dataset would therefore require adapting the method to the new use case. In addition, developers would need to collect data on feature preferences for their use case.

6. Future work

This research aimed to measure the influence of explainability-based search engines on users' trust, search experience, and result-ordering satisfaction. Two experiments were conducted: the first ordered explainability-based features by importance, and its results were translated into feature weights used to order the results returned by the search engine; the second evaluated the explainability-based search engine by measuring users' experience with it. Overall, the results suggest that search engines with explanations are more trusted and provide a better user experience and higher ordering satisfaction than search engines without explanations. In addition, users are satisfied with explainability-based ordering of results when the results carry explanations, whereas not explaining the ordering decreases users' trust, search experience, and ordering satisfaction. The results thus urge developers to explain search engine results.

Recommended future work is to investigate whether HCPs need different explanations than non-HCPs. HCPs could require more detailed explanations, such as additional information on sources like UMLS and PubMed. Non-HCPs, on the contrary, would rather have fewer medical terms in order to interpret the explanations. Investigating the different explanation needs depending on users' backgrounds would therefore benefit explainability research.

Although our model provides explainability sentences, these are not personalized to the profile of the HCP. To make the explanations more personal, results could be ordered based on the HCP's profile and preferences. To achieve this, future work should collect data on user profiles and use machine learning to identify users' personal preferences. Moreover, this could be combined with knowledge graphs to create a relationship between clinical trials and user profiles, as shown in (Catherine et al., 2017), where users were provided personal explanations using a knowledge graph based on item reviews and user profiles.

References

  • A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160.
  • A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al. (2020) Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, pp. 82–115.
  • O. Bodenreider (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl_1), pp. D267–D270.
  • R. Catherine, K. Mazaitis, M. Eskenazi, and W. Cohen (2017) Explainable entity-based recommendations with knowledge graphs. arXiv preprint arXiv:1707.05254.
  • [5] FDA. Clinical trials guidance documents. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-trials-guidance-documents (accessed on 09/14/2021).
  • H. K. Dam, T. Tran, and A. Ghose (2018) Explainable software analytics. In Proceedings of the 40th International Conference on Software Engineering: New Ideas and Emerging Results, pp. 53–56.
  • A. Das and P. Rad (2020) Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv preprint arXiv:2006.11371.
  • D. Dave, H. Naik, S. Singhal, and P. Patel (2020) Explainable AI meets healthcare: a study on heart disease dataset. arXiv preprint arXiv:2011.03195.
  • B. Dimanov, U. Bhatt, M. Jamnik, and A. Weller (2020) You shouldn't trust me: learning models which conceal unfairness from multiple explanation methods. In SafeAI@AAAI, pp. 63–73.
  • F. K. Došilović, M. Brčić, and N. Hlupić (2018) Explainable artificial intelligence: a survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215.
  • A. Dumitrache, L. Aroyo, C. Welty, R. Sips, and A. Levas (2013) "Dr. Detective": combining gamification techniques and crowdsourcing to create a gold standard in medical text. In Proceedings of the 1st International Conference on Crowdsourcing the Semantic Web, Vol. 1030.
  • A. Dumitrache, L. Aroyo, and C. Welty (2017) Crowdsourcing ground truth for medical relation extraction. arXiv preprint arXiv:1701.02185.
  • A. H. Gee, D. Garcia-Olano, J. Ghosh, and D. Paydarfar (2019) Explaining deep classification of time-series data with learned prototypes. arXiv preprint arXiv:1904.08935.
  • D. S. Gierada, T. K. Pilgram, M. Ford, R. M. Fagerstrom, T. R. Church, H. Nath, K. Garg, and D. C. Strollo (2008) Lung cancer: interobserver agreement on interpretation of pulmonary findings at low-dose CT screening. Radiology 246 (1), pp. 265–272.
  • A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller (2019) Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (4), pp. e1312.
  • S. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
  • T. N. Mundhenk, B. Y. Chen, and G. Friedland (2019) Efficient saliency maps for explainable AI. arXiv preprint arXiv:1911.11293.
  • P. Pu and L. Chen (2006) Trust building with explanation interfaces. In Proceedings of the 11th International Conference on Intelligent User Interfaces, pp. 93–100.
  • A. Rosenfeld and A. Richardson (2019) Explainability in human–agent systems. Autonomous Agents and Multi-Agent Systems 33 (6), pp. 673–705.
  • R. Sheh and I. Monteath (2018) Defining explainable AI for requirements analysis. KI-Künstliche Intelligenz 32 (4), pp. 261–266.
  • M. Verma and D. Ganguly (2019) LIRME: locally interpretable ranking model explanation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1281–1284.
  • Y. Zhang and X. Chen (2018) Explainable recommendation: a survey and new perspectives. arXiv preprint arXiv:1804.11192.
  • Y. Zhang, J. Mao, and Q. Ai (2019) SIGIR 2019 tutorial on explainable recommendation and search. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1417–1418.