
Automatically Assessing Quality of Online Health Articles

by   Fariha Afsana, et al.

The information ecosystem today is overwhelmed by an unprecedented quantity of data on diverse topics and of varied quality. In particular, the quality of information disseminated in the field of medicine has been questioned, as the consequences of health misinformation can be life-threatening. There is currently no generic automated tool for evaluating the quality of online health information across a broad range of topics. To address this gap, in this paper we apply a data mining approach to automatically assess the quality of online health articles based on 10 quality criteria. We have prepared a labeled dataset with 53012 features and applied different feature selection methods to identify the best feature subset, with which our trained classifier achieved an accuracy of 84 semantic analysis of features shows the underpinning associations between the selected features and assessment criteria and further rationalizes our assessment approach. Our findings will help in identifying high-quality health articles, thus aiding users in shaping their opinions and making the right choice when seeking health-related help online.



I Introduction

The tremendous advancement of digital technology and the widespread use of the Internet have made information accessible worldwide. Consequently, the majority of people are turning to the Internet to search for a diverse range of health-related information. According to a study by the Australian Institute of Health and Welfare, 78% of Australian adults were found to search for health-related information in 2015 [AustralianInstituteofHealthandWelfare]. However, the reliability of information from web sources is questionable due to the unregulated nature of the Internet.

In this era of the Internet, misinformation (dubious, low-quality or fabricated information) spreads much faster than the truth. A plethora of information from online health articles (OHA) and other sources (blogs, Facebook, Twitter, YouTube, etc.) is available to health information seekers. But not all of this information is reliable, as it stems from various individuals and organizations [Sbaffi2017, Kitchens2014, Eysenbach2002]. Hence, distinguishing unreliable health information from reliable information poses a substantial challenge for individuals [Bhatt2017], [in1], [in2].

The extensive spread of unreliable information can negatively affect public health. Decisions based on misinformation lead people to hold erroneous beliefs and opinions in place of irrefutable evidence [Shu2017]. Sometimes, such articles fail to deliver the intended information to the reader. This may result in misinterpretation of concepts and eventually trigger fear and incite people to change their regular habits overnight. However, online networks are not going anywhere, and seeking and sharing health information online will not stop. Misinformation will persist as well [l15]. For this reason, assessing and assuring the quality of health information on the World Wide Web has become a fundamental issue for users [Eysenbach]. The better the quality of health information, the more reliable and accessible it is and the more effective it will be in shaping users' behaviour towards health care.

To curb this situation, several approaches have been proposed to assess the quality of health-related information. Some of these conducted the assessment manually and relied on users' perceptions to rate a health news item. A number of studies estimated the quality of web sources as a whole rather than evaluating each article published on them [l18, new1]. A few others evaluated the quality of articles in a specific disease domain, which narrowed the scope of their work [l11], [l41], [l42]. Some studies proposed evaluation criteria frameworks, and some assessed quality based on such frameworks [l10, l15]. In criteria selection, however, there is always a question about applicability to the medical domain, as selecting criteria for health-specific articles necessitates the involvement of health professionals. Moreover, given the ever-changing landscape of the Internet, no universal framework for automatically assessing the quality of OHA has been proposed to date. With this context in mind, this study attempts to automate the quality assessment of OHA based on the ideas and efforts of HealthNewsReview.org. This organization manually evaluates health-related articles through a team of 50 experts from various disciplines, including journalism, medicine, health services research, public health and patient perspectives. The performance of this organization is excellent, but its manual process is not scalable relative to the speed of the information explosion worldwide. In this paper, we apply a data mining based approach to assess the quality of online health articles automatically. Our main contributions can be summarized as follows:

  • We have developed a labelled dataset of health-related news articles that were finely annotated by health experts from HealthNewsReview.org. So far, no generic health-related dataset suitable for assessing the quality of OHA is available. Our dataset, once released, will be a valuable resource for health and research communities conducting future studies on misinformation in the field of medicine.

  • We have explored multifaceted feature spaces through systematic content analysis to identify appropriate features for automating the quality assessment process. We have also identified criteria-wise discriminating features by analyzing feature importance.

  • We have examined the applicability of various data mining techniques in assessing the quality of OHA automatically and achieved state-of-the-art performance on it.

  • We have also provided an explanation of the feature subset corresponding to each criterion to justify the value of the assessment.

II Related Work

Quality of online health-related information has been a major concern since the dawn of the World Wide Web (WWW) era [l27], [l28]. Numerous tools have been developed to facilitate quality measurement of health-related information, most of which target a particular disease (e.g., cancer, diabetes, etc.) and lack robust validity and reliability testing. In [l5], Keselman et al. conducted an exploratory study with a view to developing a methodological approach to analyzing health-related web pages and applying it to a set of relevant web pages. This qualitative study analysed web pages about natural treatment of diabetes to highlight the challenges faced by consumers in seeking health information. It also underscored the importance of developing support tools that help users seek, evaluate, and analyze information in the changing digital ecosystem.

We have summarized the relevant research in three categories. The first comprises quality assessment tools built on inter-rater agreement among expert panels. These studies aimed at judging the quality of written consumer health information [le], [l2], [l3], [l4], [l34] and health reports in lay media [l36], [l37], [l29], [l38], [l1], [l40]. The second comprises quality assessment approaches built on a checklist of factors. These studies focused on identifying an appropriate list of criteria for evaluating online health information [l30], [l25], [l31], [l15], [Eysenbach], [l6], [l32], [l7], [l11], [l12], [l10], [l33]. The third comprises approaches built on machine learning techniques to automate health information analysis, including health dataset preparation [l19], improving the veracity of medical information [l16], tracking misinformation for a specific disease domain [l17], [l35] and analysis of the structure of reliable health media [l18].

II-A Statistical Analysis Based Quality Assessment Approach

DISCERN [l2], a short instrument, was developed for judging the quality of written consumer health information about treatment choices by producers, health professionals and patients, and for facilitating the production of high-quality evidence-based patient information. The DISCERN approach combined qualitative methods with a statistical measure of inter-rater agreement among an expert panel representing a range of expertise, including the production and use of consumer health information [le]. To establish face and content validity and inter-rater reliability, this approach administered a questionnaire to information providers and self-help organizations. Later, the authors of [l2] developed an explicit scheme for calculating a 5-star quality rating for consumer health information based on DISCERN [l3].

The Ensuring Quality Information for Patients (EQIP) tool [l4] assesses the presentation quality of all types of written health care information in a more rigorous way and prescribes the action required following the evaluation. The EQIP tool was developed through several processes of item generation and testing for concurrent validity, inter-rater reliability and utility, using large, diverse samples of written health care information.

The Quality Index for health-related Media Reports (QIMR) was developed as an evaluation tool to monitor the quality of health research reporting in the lay media, more specifically in Canadian media. Themes from interviews with health journalists and researchers were used to develop QIMR [l1]. However, the QIMR approach is limited in sample size and scope, and failed to evaluate the quality of news sources whose content varies in quality.

The specific focus of these approaches on treatment information or particular media narrows their scope and calls into question their applicability to online content about other aspects of health and illness. In contrast, our approach is applicable to all health-related information domains. Moreover, the existing approaches rely on manual labour, whereas ours is a fully automated system that assesses the quality of health articles in the shortest possible time.

II-B Criteria Based Quality Assessment Approach

To date, there is no clear universal standard for assessing the quality of web-based health information [l25]. Kim et al. conducted an extensive review to identify criteria that had already been proposed or employed specifically for evaluating health-related information worldwide [l26]. Eysenbach et al. conducted a systematic review to compile criteria actually used to measure the quality of health information on the Web and synthesized evaluation results from studies containing quantitative data on structure and process [Eysenbach]. Comparing the methodological frameworks of existing approaches, the authors concluded that operational criteria for quality assessment need to be defined. [Sbaffi2017] is another systematic review, in which the authors reviewed empirical studies on trust and credibility in the use of web-based health information (WHI) with the aim of identifying factors that impact judgments of trustworthiness and credibility, and of exploring the role of demographic factors affecting trust formation.

The Code of Conduct for medical websites (HONcode), initiated by the Health On the Net Foundation, was the first attempt to propose guidelines to information providers for raising the quality of medical and health information available on the World Wide Web [l6]. Adopting a set of eight criteria to certify websites containing health information, its creators also developed a Health Website Evaluation Tool, which offered users an indication of the providers' commitment to quality.

There are several criteria-based assessment tools, but few of them have been properly validated [l7]. The Quality Evaluation Scoring Tool (QUEST) is the first quantitative tool that supports a broad range of health information and has undergone a validation process [l10]. Based on a review of existing tools [l11, l12], QUEST quantitatively measures six criteria: authorship, attribution, conflicts of interest, currency, complementarity and tone, which can be used by health care professionals and researchers alike. QUEST's reliability and validity were demonstrated by evaluating online articles on Alzheimer's disease. In a Fuzzy VIKOR based approach, Afful-Dadzie et al. [l15] proposed a new criteria framework for measuring the quality of information provided by each site. The authors demonstrated a decision-making model for assessing and ranking online health information providers based on their quality.

However, some criteria proposed in the existing literature are specific to a particular domain, while others are common and can be considered for standardizing a universal set of criteria. At HealthNewsReview.org, a group of experts reviewed ten criteria based on analysis from previous studies combined with viewpoints from health care journalism. These ten criteria address the issues that consumers should know about in order to form their opinions on health-related interventions. The characteristics of the ten criteria, defined from the perspective of health reporting standards and covering the basic points needed to serve the interests of the public, have convinced us to adopt this set as the standard for evaluating health-related articles.

II-C Machine Learning Based Analysis and Miscellaneous

Apart from the aforementioned approaches, there are a few more studies that are not directly aligned with our research but provide us with valuable insights.

In [l19], authors developed a new labelled dataset of misinformative and non-misinformative comments from a medical health forum, MedHelp, with a view to making a resource for medical research communities to study the spread of medical misinformation. Preliminary feature analysis of the dataset was also presented to develop a real-time automated system for identifying and classifying medical misinformation in online forums.

An applied machine learning based approach is proposed in [l16], where the authors addressed the veracity of online health information by automating systematic approaches in conjunction with Evidence-Based Medicine (EBM). Based on EBM and trusted medical information sources, the authors proposed an algorithm, MedFact, which recommends trusted medical information within health-related social media and empowers online users to determine the veracity of health information using machine learning techniques. Their aim was to address the factual accuracy of online health information in social media discourse based on keyword extraction. In contrast, our objective is to evaluate the quality of online health-related articles from a data mining perspective. We have focused on identifying the discriminating features of health-related articles in order to assess their quality automatically.

Ghenai et al. [l17] proposed a tool for tracking misinformation around health concerns on Twitter based on a case study about Zika. The tool discovered health-related rumours in social media by involving professional health experts through crowdsourcing to annotate the dataset and by using machine learning for rumour classification. Our aim is different from that of this study. Rather than focusing on health-related rumours, we focus on all types of health-related articles available online and evaluate their quality so that people can identify which articles to read and which to avoid when making decisions.

A recent study by Dhoju et al. [l18] identified structural, topical and semantic differences between health-related news articles from reliable and unreliable media by conducting a systematic content analysis. By leveraging a large-scale dataset, the authors identified some discriminating features that separate reliable health news from unreliable news.

However, our study is quite different from these existing methodologies. Our work is based on the initiative of HealthNewsReview.org, which aimed to manually evaluate the quality of health-related articles as soon as they appeared in the news media. Keeping the arrival rate of new information in mind, our aim is to automate this quality assessment process from a data mining perspective, which to our knowledge has not been examined to date. Our goal is to use their carefully curated, manually annotated health information to examine the performance of an automated quality assessment approach in the health domain. This organization proposed ten criteria as a standard for judging the quality of health articles. Though various criteria frameworks have been proposed in the literature for assessing health information quality, the criteria list proposed by HealthNewsReview.org is more standardized. Considering existing frameworks and expert opinion regarding the quality of health information, this organization aimed to address the basic issues of health interventions through ten criteria so that consumers could develop informed opinions about these interventions and how and whether they matter in their lives. Our objective is to address each individual criterion from a data mining perspective and discover to what degree each criterion can be automated.

III Dataset Description

There is currently no single dataset for assessing the quality of online health articles (OHA). For this study, we have prepared a dataset based on 1720 health-related articles from HealthNewsReview.org. The mission of this website is to take a significant step towards meaningful health care reform by evaluating the accuracy of medical news and examining the quality of the evidence it provides. Since its foundation in 2006, HealthNewsReview.org has provided reviews of health news reporting from major U.S. news organizations, conducted by a multi-disciplinary team of reviewers from the journalism, medicine, health services research and public health domains.

According to the editorial team of HealthNewsReview.org, all stories and news releases about public health interventions should be evaluated against ten different criteria to ensure the quality of information in terms of accuracy, balance and completeness. Reviewers rated each criterion as 'Satisfactory' or 'Not Satisfactory'. In some cases, a criterion was rated as 'Not Applicable' where it was impossible or unreasonable for an article to address it. Below we provide a list of those criteria.


Criterion 1

Does the story adequately discuss the costs of the intervention?

Criterion 2

Does the story adequately quantify the benefits of the intervention?

Criterion 3

Does the story adequately explain/quantify the harms of the intervention?

Criterion 4

Does the story seem to grasp the quality of the evidence?

Criterion 5

Does the story commit disease-mongering?

Criterion 6

Does the story use independent sources and identify conflicts of interest?

Criterion 7

Does the story compare the new approach with existing alternatives?

Criterion 8

Does the story establish the availability of the treatment/test/product/procedure?

Criterion 9

Does the story establish the true novelty of the approach?

Criterion 10

Does the story appear to rely solely or largely on a news release?

Based on these ten specific criteria, the reviewers have graded stories about surgical procedures, drugs or devices, dietary recommendations, vitamins or nutritional supplements, diagnostic and screening tests, and psychotherapy or mental health interventions. We have considered these reviews as gold standard records for our approach.

III-A Data Collection

To collect data from HealthNewsReview.org, we created a GUI application using the C# .NET framework. Since the website has no API, we built our scraper using HTML Agility Pack, a free and open source tool for extracting data from websites, and stored the data in an MS SQL database. For each review, we gathered the title of the original news item, the corresponding link to the original news item, the category and the criteria-wise score ('Satisfactory', 'Not Satisfactory' or 'Not Applicable'). We collected all reviewed stories from 2006 to 2018 and all reviewed news releases from 2015 to 2018 from the website and removed duplicates, as the same story may appear under different categories. The source URL was then accessed using Newspaper3k, a Python 3 library for extracting and curating articles.

Overall, our dataset consists of three class labels for each of the ten criteria: Satisfactory, Not Satisfactory and Not Applicable. Figure 1 shows the criteria-wise distribution of class labels over the corpus of 1720 articles.

Fig. 1: Criteria wise class distribution over entire corpus.

Since the number of observations belonging to the 'Not Applicable' class is significantly lower than that of the other two classes for every criterion, we have omitted this class from our initial study.

IV Feature Engineering

In this section, we explain the data pre-processing, feature extraction and feature selection processes used to establish baseline performance for our approach. All data pre-processing and feature extraction have been conducted using Python and some useful libraries, e.g., scikit-learn and NLTK.

IV-A Data Pre-Processing

Certain refinement of raw data is essential for removing irrelevant information and reducing the size of actual data [l46]. To enhance the accuracy and performance of our classification model, we have run step by step data pre-processing tasks on each article.

Fig. 2: Data pre-processing pipeline.

Figure 2 shows the text pre-processing pipeline for cleaning our data. The following three refinement steps are adopted:

IV-A1 Contraction Expansion

Contractions are shortened versions of words or syllables, which pose problems in text analytics. To standardize the text to the original form of words, each contraction has been expanded to its full form. For instance, 'I'd' and 'you've' are expanded to 'I would' and 'you have', respectively.
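A minimal sketch of dictionary-based contraction expansion is given here for illustration. The `CONTRACTIONS` table and the function name are hypothetical and deliberately tiny; the table used in the study is not specified.

```python
import re

# Hypothetical, deliberately small contraction table (an assumption for
# illustration); a practical table would contain many more entries.
CONTRACTIONS = {
    "i'd": "I would",
    "you've": "you have",
    "isn't": "is not",
    "can't": "cannot",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    # Normalise typographic apostrophes so that "you’ve" matches "you've".
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```

Note that an ambiguous contraction such as 'I'd' ('I would' or 'I had') is resolved to a single fixed expansion in this sketch.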

IV-A2 Noise Removal

Noise removal is one of the most important text pre-processing steps. URLs, special characters and symbols usually add noise to unstructured text. We removed punctuation, special characters, HTML formatting and numbers to get rid of this noise. Because they carry little significance in the corpus, we removed stop words (words like a, the, is, me, etc.) as well.
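The noise removal steps can be sketched with the Python standard library as follows. The stop-word list here is a short hypothetical stand-in (a full list such as the one shipped with NLTK would normally be used), and the exact order of operations is an assumption.

```python
import html
import re

# Hypothetical short stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "me", "of", "and", "to", "in"}


def remove_noise(text):
    """Return lower-cased tokens with markup, URLs, digits, punctuation
    and stop words removed."""
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = html.unescape(text)                            # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # drop digits, punctuation, symbols
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `remove_noise("<p>The drug costs $40 &amp; is safe: see http://x.org/a</p>")` keeps only the tokens `drug`, `costs`, `safe` and `see`.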

IV-A3 Word Normalization

In text analytics, tokenization of a document is required for identifying meaningful keywords. Apart from tokenizing documents, stemming and lemmatization have also been used to reduce inflectional forms of a word (connect, connected, connection, etc.) and derivationally related forms of a word to a common base form.

The remaining chunks of cleaned text are then fed to feature extraction.

IV-B Feature Extraction

Multiple categories of features have been extracted for classifying the criteria. For model construction, we have identified features that might help in predicting the classes. The complete set of extracted features with their corresponding descriptions is given in Table I.

| Scope | Feature Name | Description | Feature Number | Output Type |
|---|---|---|---|---|
| Linguistic measure | LIWC | Measures textual features | 93 | Real |
| Word frequency | TF-IDF | Measures the importance of a word in a document | 4000 | Real |
| Word-category disambiguation | POS Tag | Counts the number of parts of speech in a document | 35 | Integer |
| | POS Word | Defines the part of speech of each word separately and then counts its number within the document | 47450 | Integer |
| Citation and Ranking | Internal Links | Defines the number of self-citations | 1 | Integer |
| | External Links | Defines the number of external citations | 1 | Integer |
| | Alexa Rating | Ranks every document having a link according to Alexa rating | 1428 | Real |
| Similarity measure | Cosine similarity | Measures the relation between headline and body | 1 | Real |
| Miscellaneous | Normalized distinct word count | Measures how many distinct words were used in the text | 1 | Real |
| | Per_num_count | Counts the number of persons mentioned in the text | 1 | Integer |
| | Org_num_count | Counts the number of organizations mentioned in the text | 1 | Integer |

TABLE I: List of extracted features with brief description

We have considered the following features:

IV-B1 Linguistic Inquiry and Word Count (LIWC)

To obtain a wide variety of psychological and linguistic features, we apply LIWC2015 [liwc], a transparent text analysis program to score words in psychologically meaningful categories, on original news texts in our dataset. LIWC calculates the following dimensions:

  • Summary Dimension (Consists of 8 features; e.g., word count, word per sentence)

  • Punctuation mark (Consists of 12 features; e.g., comma, colon, quote)

  • Function words (Consists of 15 features; e.g., Pronoun, article, conjunction)

  • Perceptual process (Consists of 4 features; e.g., see, hear)

  • Biological process (Consists of 5 features; e.g., Body, health)

  • Drives (Consists of 6 features; e.g., reward, risk, power)

  • Other grammar (Consists of 6 features; e.g., interrogatives, numbers)

  • Time orientation (Consists of 3 features; e.g., past, present, future)

  • Relativity (Consists of 4 features; e.g., motion, time)

  • Affect (Consists of 6 features; positive emotion, negative emotion (e.g. anger))

  • Personal concerns (Consists of 6 features; e.g., work, leisure, money)

  • Social (Consists of 5 features; e.g., Family, friend)

  • Informal language (Consists of 6 features; e.g., filler, swear)

  • Cognitive process (Consists of 7 features; e.g., Differ, Insight)

IV-B2 Term Frequency and Inverse Document Frequency

We have used this weighting metric to measure the importance of a term in a document within the entire dataset [l46]. Term Frequency (TF) quantifies the frequency of a word in a particular document, whereas Inverse Document Frequency (IDF) measures the importance of a term within the corpus. Let $D$ symbolize the whole corpus with $N$ documents. If $f_{t,d}$ denotes the number of times term $t$ appears in a document $d$, then TF, denoted by $tf(t,d)$, can be calculated by equation (1):

$$tf(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{1}$$

And IDF, denoted by $idf(t,D)$, can be calculated by equation (2):

$$idf(t,D) = \log \frac{N}{\left|\{d \in D : t \in d\}\right|} \tag{2}$$

For a particular term, the product of TF and IDF gives the TF-IDF weight of that term. The higher the TF-IDF score, the rarer the term is. We applied a unigram tokenizer to our text and eliminated features with extremely low as well as extremely high frequency to achieve better accuracy [tf1] [tf2]. Since words that are too frequent or too rare are not influential in characterizing an article, we ignored all words that appeared in more than 90% of the documents or in fewer than 3 documents. To keep the dimensionality of our feature set manageable, we capped the feature count at the top 4000 terms by frequency.
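The weighting and filtering described above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: it assumes length-normalized term frequency and a natural-log IDF without smoothing, and the function and parameter names (`tfidf`, `min_df`, `max_df_ratio`, `max_features`) are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant and would give different numbers.

```python
import math
from collections import Counter


def tfidf(docs, min_df=3, max_df_ratio=0.9, max_features=4000):
    """docs: list of token lists. Returns (vocabulary, list of {term: weight})."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    total = Counter(term for doc in docs for term in doc)     # corpus frequency
    # Keep terms that are neither too rare nor too common.
    kept = [t for t in df if df[t] >= min_df and df[t] <= max_df_ratio * n_docs]
    # Cap the vocabulary at the most frequent terms (ties broken alphabetically).
    kept.sort(key=lambda t: (-total[t], t))
    vocab = kept[:max_features]
    vocab_set = set(vocab)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1
        weights.append({
            t: (c / length) * math.log(n_docs / df[t])
            for t, c in counts.items() if t in vocab_set
        })
    return vocab, weights
```

With the default arguments, a term is kept only if it occurs in at least 3 documents and in at most 90% of them, which mirrors the thresholds stated above.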

IV-B3 Part Of Speech Tagging

Part Of Speech Tagging (POST), also known as word-category disambiguation, annotates each word with the appropriate part of speech based on both its definition and its context in order to resolve lexical ambiguity [post]. To perform POST, we applied the Stanford POS tagger. We found a tagset of 35 part-of-speech tags (e.g., CC, CD, NP, RBR, etc.) in the corpus, from which we derived two sets of features: POS tag count and POSWord count. For POS tag count, we measured the document-wise count of words belonging to a particular POS tag and thus obtained 35 individual features. For POSWord count, on the other hand, we measured the count of each tag associated with each individual word within a document and obtained 47,450 non-overlapping features. POSWord features are capable of performing rudimentary word sense disambiguation in situations where a word can have several meanings.
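The two feature sets can be illustrated with a short sketch that takes already tagged tokens as input (the tagger itself is not invoked here). The `word_TAG` naming of POSWord features is an assumption made for illustration.

```python
from collections import Counter


def pos_features(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs for one document.

    Returns (tag_counts, posword_counts).
    """
    # POS tag count: how many words carry each tag.
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    # POSWord count: one feature per distinct (word, tag) pair, e.g. "report_NN".
    posword_counts = Counter(
        f"{word.lower()}_{tag}" for word, tag in tagged_tokens
    )
    return tag_counts, posword_counts
```

In this sketch, 'report' tagged as a verb and 'report' tagged as a noun fall into two different POSWord features, which is the rudimentary disambiguation effect mentioned above.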

IV-B4 Citation and Ranking

We analysed the presence of hyperlinks to determine the credibility of an article. We extracted three features from the link attribute: internal link, external link and rank. We counted the number of internal links to measure the amount of self-citation in a document, which may indicate bias. Conversely, the number of external links was counted to characterize the citation network of an article. We derived the rank attribute to estimate the quality of an article from the standing of the web pages it cites. We used the Alexa Global Ranking as an indicator of the standing of a web page, as it gives an estimate of a website's popularity. We counted outgoing links from all the documents within the corpus and found 1428 distinct domains. We replaced each domain with its associated Alexa rank value and thus obtained 1428 distinct rank features for the overall corpus.
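A possible way to count internal and external links and to collect the distinct outgoing domains is sketched below using only the standard library. The function name and the treatment of relative links as internal are assumptions; looking up the Alexa rank of each domain is outside the scope of this sketch.

```python
from urllib.parse import urlparse


def link_features(article_url, hrefs):
    """Return (internal_count, external_count, external_domains)."""

    def domain(url):
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    home = domain(article_url)
    internal, external, domains = 0, 0, set()
    for href in hrefs:
        host = domain(href)
        if not host or host == home:      # relative links count as internal
            internal += 1
        else:
            external += 1
            domains.add(host)
    return internal, external, domains
```

The union of `external_domains` over all articles gives the set of distinct domains, each of which could then be mapped to a rank value.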

IV-B5 Similarity Measure

An ambiguous or misleading headline can degrade the quality of an article. To measure the relevance between the headline and body of each article, we used the TF-IDF cosine similarity metric to extract a similarity feature [coss]. It quantifies the similarity between the headline and the body of a document irrespective of their size by measuring the cosine of the angle between their two vectors projected in a multi-dimensional space.
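The cosine measure itself can be sketched as follows. The vectors are stored as sparse dictionaries; the optional `idf` argument is a hypothetical mapping from term to IDF weight, and when it is omitted the sketch falls back to raw term counts, which is a simplification of the TF-IDF weighting described above.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def headline_body_similarity(headline_tokens, body_tokens, idf=None):
    """Similarity between headline and body; idf maps term -> weight (default 1)."""
    idf = idf or {}

    def vectorize(tokens):
        return {t: c * idf.get(t, 1.0) for t, c in Counter(tokens).items()}

    return cosine_similarity(vectorize(headline_tokens), vectorize(body_tokens))
```

Because the dot product is divided by both vector norms, a long body and a short headline can still score close to 1 when they use the same terms in similar proportions.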

IV-B6 Miscellaneous

We computed the normalized distinct word count as a feature to capture how the use of rare words contributes to the classification problem, since health-related articles contain many different medical terms. We also counted the number of organizations and persons mentioned in articles as indicators of bias. We used the Stanford Named Entity Recognizer (NER) to extract these features.
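These three features can be sketched as follows. Two points are assumptions, since the text does not specify them: the distinct word count is normalized by the total number of tokens, and entity mentions are counted by merging consecutive tokens that share the same label. The input format, a list of (token, label) pairs with labels such as `PERSON` and `ORGANIZATION`, is assumed to come from an NER tagger that is not invoked here.

```python
from itertools import groupby


def normalized_distinct_word_count(tokens):
    """Share of distinct words among all tokens (0.0 for an empty text)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_entities(ner_tokens, label):
    """Count mentions of `label`, treating a run of consecutive tokens
    with the same label as a single mention."""
    return sum(
        1 for key, _ in groupby(ner_tokens, key=lambda pair: pair[1])
        if key == label
    )
```

A limitation of this simple grouping is that two different entities of the same type appearing back to back are counted as one mention.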

IV-C Feature Selection

We aim to predict ten different criteria using numerous features (53,012 in total), some of which might be redundant or irrelevant for making predictions. A dataset containing irrelevant features can result in over-fitting and can also undermine the modelling power of a method. Thus, it is critically important to select the most relevant features from the feature set. To select the features that contribute most to our classification task, we employed three automatic feature selection techniques. The first is correlation-based attribute evaluation, which evaluates the worth of a feature by measuring Pearson's correlation between it and the class. The second is classifier-based attribute evaluation using a Logistic Regression classifier. The third is classifier-based attribute evaluation using a Random Forest classifier. For each of these three attribute evaluators, a rank search method was applied, which ranks features by their individual evaluations to find the most correlated feature set.
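The correlation-based evaluator followed by a rank search can be illustrated with the sketch below. It is not the tool used in the study: ranking by the absolute value of Pearson's correlation and breaking ties alphabetically are assumptions, and the function names are hypothetical. Class labels are encoded as 0 and 1.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, labels, top_k):
    """columns: {feature_name: [value per article]}; labels: 0/1 per article.

    Returns the names of the top_k features by absolute correlation with the class.
    """
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_k]
```

Each feature is scored independently of the others here, so two highly redundant features can both appear near the top of the ranking.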

V Experimental Evaluation

The core contribution of our work is to assess the quality of online health articles automatically applying various data mining techniques. In this section, we have quantified and evaluated the performance of a number of classification techniques, for different feature selection methods and variable feature sizes, to achieve the best result.

V-A Evaluate Classification Techniques

We experimented with four prominent classification techniques on our dataset and report their results. We performed binary classification (Satisfactory and Not Satisfactory) using three supervised learning methods and one ensemble method to obtain better accuracy in assessing the quality of OHA. First, the Support Vector Machine (SVM) algorithm, which uses the kernel trick to implicitly map inputs into a high-dimensional feature space. We used PolyKernel as the kernel to control the projection and the amount of flexibility in separating classes in our dataset. Second, the Naive Bayes classification algorithm, which calculates the posterior probability of each class using a simple implementation of Bayes’ theorem and predicts the class with the highest probability. For each numerical attribute, a Gaussian distribution is assumed by default. Third, the Random Forest classifier, which constructs a multitude of decision trees at training time and merges them to get a more accurate and stable prediction [11]. We considered trees for Random Forest implementation. Fourth, EnsembleVoteClassifier, a meta-classifier that combines similar or conceptually different machine learning classifiers via majority voting. We combined the three aforementioned classifiers to build our ensemble estimator and examined its performance on our dataset. All methods were evaluated by 10-fold cross-validation, where in each fold 90% of the dataset was used for training and the remaining 10% for testing. Various combinations of the extracted features were tried to evaluate how accurately our approach can automatically classify each criterion.
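Two of the mechanisms mentioned above, majority voting and the 10-fold split, can be shown with a small pure-Python sketch. The function names, the label strings and the toy predictions are hypothetical, and the split below is neither shuffled nor stratified, which a real evaluation tool may do differently.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by majority vote."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th sample, offset by the fold number
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

# Hypothetical predictions of three classifiers for four articles.
svm_pred = ["S", "NS", "S", "NS"]
nb_pred = ["S", "S", "NS", "NS"]
rf_pred = ["NS", "NS", "S", "NS"]
ensemble_pred = majority_vote([svm_pred, nb_pred, rf_pred])

# With 100 samples and k=10, each fold trains on 90 samples and tests on 10.
folds = list(k_fold_indices(100, k=10))
```

With three voters and two classes a tie cannot occur, so the vote is always well defined in this setting.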

V-B Identify Feature Selection Method and Feature Size

To identify the feature selection method and the feature size that yield the best classification accuracy on our dataset, we examined the impact of different feature selection methods and varied feature sizes on classification accuracy.

V-B1 Identify Feature Selection Method

We ran three feature selection methods on our feature space to determine which one selects the feature subset that yields the best classification performance. Table II presents the comparative study of the three feature selection methods over four classifiers (SVM, Naive Bayes, Random Forest and EnsembleVote), carried out against a feature subset with feature size . Here, we report the weighted Precision ($P_w$), weighted Recall ($R_w$) and weighted F-Measure ($F_w$) from the Weka [Zeng2014] output to give a better estimate of overall classification performance. Weka calculates the weighted average by averaging the per-class values, weighted by the proportion of instances in each class. So, for our binary class problem, $P_w$, $R_w$ and $F_w$ are calculated from equations (3), (4) and (5) respectively.

$$P_w = \frac{P_s N_s + P_{ns} N_{ns}}{N_s + N_{ns}} \tag{3}$$

$$R_w = \frac{R_s N_s + R_{ns} N_{ns}}{N_s + N_{ns}} \tag{4}$$

$$F_w = \frac{F_s N_s + F_{ns} N_{ns}}{N_s + N_{ns}} \tag{5}$$

where $P_s$ and $P_{ns}$ are the Precisions for class ‘Satisfactory’ and ‘Not Satisfactory’; $R_s$ and $R_{ns}$ are the Recalls for class ‘Satisfactory’ and ‘Not Satisfactory’; $F_s$ and $F_{ns}$ are the F-Measures for class ‘Satisfactory’ and ‘Not Satisfactory’; and $N_s$ and $N_{ns}$ are the number of instances in class ‘Satisfactory’ and ‘Not Satisfactory’ respectively.
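The class-weighted averaging described above can be checked with a short worked example. The sketch assumes a hypothetical binary confusion matrix (the counts are invented for illustration and are not results from the paper), and the function names are likewise assumptions.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F-measure for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def weighted_average(value_s, value_ns, n_s, n_ns):
    """Average of two per-class values, weighted by the class sizes."""
    return (value_s * n_s + value_ns * n_ns) / (n_s + n_ns)

# Hypothetical test set: 40 'Satisfactory' and 160 'Not Satisfactory' articles.
# 30 Satisfactory articles are predicted correctly and 10 are missed;
# 140 Not Satisfactory articles are predicted correctly and 20 are missed.
n_s, n_ns = 40, 160
p_s, r_s, f_s = per_class_metrics(tp=30, fp=20, fn=10)
p_ns, r_ns, f_ns = per_class_metrics(tp=140, fp=10, fn=20)

weighted_p = weighted_average(p_s, p_ns, n_s, n_ns)
weighted_r = weighted_average(r_s, r_ns, n_s, n_ns)
weighted_f = weighted_average(f_s, f_ns, n_s, n_ns)
```

In this example the weighted recall equals the overall accuracy (170 of 200 correct), which is a general property of class-size weighting. It is also why a classifier that mostly predicts the majority class can still report a high weighted score.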

For all criteria, SVM clearly performed well compared with the other three classifiers. We also observe from Table II that, for all criteria except criteria 3 and 5, the Pearson’s correlation feature selection method performs best with the SVM classifier. For criteria 3 and 5, the Logistic Regression feature selection performs slightly better than the Pearson’s correlation method.

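| Criterion | FS method | SVM $P_w$ | SVM $R_w$ | SVM $F_w$ | Random Forest $P_w$ | Random Forest $R_w$ | Random Forest $F_w$ | Naive Bayes $P_w$ | Naive Bayes $R_w$ | Naive Bayes $F_w$ | Ensemble $P_w$ | Ensemble $R_w$ | Ensemble $F_w$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Pearson’s correlation | 0.901 | 0.903 | 0.899 | 0.794 | 0.786 | 0.718 | 0.861 | 0.756 | 0.774 | 0.868 | 0.803 | 0.816 |
| 1 | Logistic Regression | 0.887 | 0.888 | 0.881 | 0.826 | 0.786 | 0.711 | 0.813 | 0.684 | 0.708 | 0.859 | 0.827 | 0.836 |
| 1 | Random Forest | 0.877 | 0.881 | 0.875 | 0.831 | 0.792 | 0.724 | 0.792 | 0.637 | 0.666 | 0.848 | 0.813 | 0.823 |
| 2 | Pearson’s correlation | 0.855 | 0.857 | 0.854 | 0.739 | 0.707 | 0.621 | 0.793 | 0.721 | 0.730 | 0.799 | 0.738 | 0.747 |
| 2 | Logistic Regression | 0.826 | 0.828 | 0.821 | 0.765 | 0.707 | 0.615 | 0.766 | 0.686 | 0.696 | 0.789 | 0.739 | 0.747 |
| 2 | Random Forest | 0.795 | 0.801 | 0.794 | 0.751 | 0.703 | 0.610 | 0.691 | 0.567 | 0.587 | 0.729 | 0.674 | 0.685 |
| 3 | Pearson’s correlation | 0.835 | 0.837 | 0.832 | 0.752 | 0.720 | 0.655 | 0.790 | 0.712 | 0.720 | 0.793 | 0.727 | 0.735 |
| 3 | Logistic Regression | 0.851 | 0.849 | 0.841 | 0.779 | 0.722 | 0.651 | 0.764 | 0.698 | 0.707 | 0.773 | 0.727 | 0.735 |
| 3 | Random Forest | 0.804 | 0.808 | 0.803 | 0.780 | 0.721 | 0.648 | 0.704 | 0.612 | 0.624 | 0.729 | 0.674 | 0.685 |
| 4 | Pearson’s correlation | 0.847 | 0.848 | 0.846 | 0.707 | 0.695 | 0.635 | 0.783 | 0.713 | 0.718 | 0.790 | 0.728 | 0.733 |
| 4 | Logistic Regression | 0.824 | 0.824 | 0.818 | 0.746 | 0.693 | 0.615 | 0.758 | 0.689 | 0.694 | 0.769 | 0.720 | 0.726 |
| 4 | Random Forest | 0.779 | 0.783 | 0.778 | 0.741 | 0.692 | 0.615 | 0.680 | 0.583 | 0.592 | 0.704 | 0.643 | 0.650 |
| 5 | Pearson’s correlation | 0.894 | 0.887 | 0.843 | 0.769 | 0.877 | 0.819 | 0.864 | 0.722 | 0.767 | 0.873 | 0.890 | 0.863 |
| 5 | Logistic Regression | 0.888 | 0.901 | 0.888 | 0.769 | 0.877 | 0.819 | 0.829 | 0.757 | 0.786 | 0.869 | 0.887 | 0.873 |
| 5 | Random Forest | 0.856 | 0.880 | 0.862 | 0.892 | 0.877 | 0.820 | 0.805 | 0.707 | 0.748 | 0.845 | 0.873 | 0.852 |
| 6 | Pearson’s correlation | 0.835 | 0.835 | 0.835 | 0.688 | 0.688 | 0.688 | 0.754 | 0.744 | 0.742 | 0.747 | 0.737 | 0.735 |
| 6 | Logistic Regression | 0.733 | 0.733 | 0.732 | 0.678 | 0.676 | 0.674 | 0.689 | 0.644 | 0.623 | 0.693 | 0.648 | 0.628 |
| 6 | Random Forest | 0.719 | 0.719 | 0.719 | 0.687 | 0.686 | 0.685 | 0.665 | 0.636 | 0.621 | 0.667 | 0.638 | 0.624 |
| 7 | Pearson’s correlation | 0.855 | 0.854 | 0.854 | 0.669 | 0.666 | 0.663 | 0.737 | 0.733 | 0.732 | 0.747 | 0.737 | 0.735 |
| 7 | Logistic Regression | 0.716 | 0.715 | 0.715 | 0.678 | 0.666 | 0.659 | 0.669 | 0.630 | 0.611 | 0.669 | 0.630 | 0.611 |
| 7 | Random Forest | 0.690 | 0.689 | 0.688 | 0.644 | 0.638 | 0.632 | 0.627 | 0.607 | 0.595 | 0.627 | 0.607 | 0.595 |
| 8 | Pearson’s correlation | 0.875 | 0.876 | 0.870 | 0.768 | 0.737 | 0.638 | 0.813 | 0.777 | 0.786 | 0.813 | 0.781 | 0.789 |
| 8 | Logistic Regression | 0.814 | 0.821 | 0.807 | 0.780 | 0.737 | 0.638 | 0.710 | 0.642 | 0.661 | 0.737 | 0.712 | 0.721 |
| 8 | Random Forest | 0.762 | 0.775 | 0.765 | 0.750 | 0.736 | 0.639 | 0.657 | 0.523 | 0.563 | 0.718 | 0.699 | 0.706 |
| 9 | Pearson’s correlation | 0.867 | 0.869 | 0.867 | 0.695 | 0.698 | 0.607 | 0.782 | 0.765 | 0.770 | 0.782 | 0.765 | 0.770 |
| 9 | Logistic Regression | 0.827 | 0.826 | 0.816 | 0.754 | 0.710 | 0.621 | 0.689 | 0.608 | 0.621 | 0.727 | 0.689 | 0.699 |
| 9 | Random Forest | 0.769 | 0.777 | 0.769 | 0.752 | 0.704 | 0.608 | 0.620 | 0.477 | 0.507 | 0.661 | 0.610 | 0.623 |
| 10 | Pearson’s correlation | 0.880 | 0.886 | 0.878 | 0.815 | 0.805 | 0.722 | 0.848 | 0.765 | 0.787 | 0.854 | 0.794 | 0.811 |
| 10 | Logistic Regression | 0.867 | 0.875 | 0.865 | 0.777 | 0.804 | 0.721 | 0.779 | 0.689 | 0.717 | 0.822 | 0.814 | 0.818 |
| 10 | Random Forest | 0.828 | 0.842 | 0.830 | 0.783 | 0.806 | 0.730 | 0.757 | 0.632 | 0.679 | 0.796 | 0.793 | 0.795 |

Legend: FS – Feature Selection; $P_w$ – Weighted Precision; $R_w$ – Weighted Recall; $F_w$ – Weighted F-Measure.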

TABLE II: Comparison study of three feature selection methods over four classifiers (Feature Size: )

For all criteria except criteria 6 and 7, we observed that the Random Forest classifier consistently misclassified the minority class (e.g., ‘Satisfactory’ for criterion 1) as the majority class (e.g., ‘Not Satisfactory’ for criterion 1), which lowers the recall for the minority class. This happened because of the imbalanced class distribution of our dataset (see Fig. 1). In our dataset, criteria 6 and 7 are very close to balanced, and thus the Random Forest classifier performed moderately on them.

V-B2 Identify Feature Size

We also varied the feature size to see how it affects classification performance. In this part of the experiment, we used the Pearson’s correlation feature selection method with the SVM classifier, as we found this combination best for our classification problem (see Table II). Figure 3 shows the performance of the SVM classifier with the selected feature set at various feature sizes: 1000, 2000, 3000, 4000, 5000, 10000, and 53012.

Fig. 3: Changes of performance with respect to feature size (Criteria 1-10)

We observe from Figure 3 that, for all criteria, the feature set comprising all features (53012 in total) performs worse because it contains irrelevant and redundant features. Performance improves with a reduced feature subset. For criterion 1, we achieved accuracy for the feature size in terms of F-measure. For criteria and , we achieved accuracy for the feature size . For criterion 3, we achieved accuracy for the feature size . Criterion 6 achieved accuracy for sized feature set. For criteria 8 and 9, and accuracy were achieved for the feature size and respectively. For the remaining two criteria ( and ), we noticed that the performance curves differ somewhat from the other curves because of the imbalanced nature of their datasets. Criterion begins with the highest accuracy () at feature size and performance varied with feature size. Criterion achieved highest accuracy () for the feature size . Overall, all reduced feature subsets (varied in size) achieved at least accuracy with our explored feature combination.

V-C Class Balancing

We noticed class imbalance in our dataset. As criteria 5 and 10 have the most imbalanced class distributions, we combated this class imbalance problem by adopting three class balancing techniques: under-sampling, over-sampling and the Synthetic Minority Over-Sampling Technique (SMOTE). As over-sampling duplicates minority class instances, it can lead to model over-fitting. Similarly, under-sampling can degrade performance if it discards important instances. Thus, we also experimented with SMOTE, which generates synthetic samples of the minority class rather than using duplicates. However, SMOTE still does not prevent over-fitting, as it generates synthetic data from existing data points.
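The three balancing strategies can be contrasted with a small pure-Python sketch. It is a simplified illustration rather than the implementation used in the study: the function names and toy data are hypothetical, and the SMOTE-style function picks its base instances at random instead of iterating over every minority instance.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Duplicate random minority rows until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

rng = random.Random(0)
# Toy data: 12 majority rows and 4 minority rows with two features each.
majority = [[float(i), float(i % 3)] for i in range(12)]
minority = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 10.5]]

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)
new_rows = smote_like(minority, n_new=len(majority) - len(minority), k=2, rng=rng)
smote_minor = minority + new_rows
```

Because every synthetic row lies on a segment between two existing minority rows, the new points stay inside the region already covered by the minority class. This is consistent with the remark above that SMOTE does not fully prevent over-fitting.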

(a) Criterion 5
(b) Criterion 10
Fig. 4: The ROC curve of balanced and imbalanced class for the feature size (Criteria 5 and 10)

Figure 4 compares the performance of these three methods in terms of the Receiver Operating Characteristic (ROC) curve. All of the sampling techniques performed better than the imbalanced dataset. For SMOTE, we have got and accuracy for criteria 5 & 10, respectively.
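For readers unfamiliar with how a ROC curve is built, the sketch below derives the curve points and the area under the curve from classifier scores. The scores and labels are invented toy values, and the function names are assumptions; the curves in Fig. 4 were not produced with this code.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold
    over the distinct scores; labels are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical classifier scores for six articles (1 = minority class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

The curve does not depend on a single decision threshold, which makes it a convenient way to compare classifiers trained on balanced and imbalanced versions of the same data.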

VI Semantic Analysis of Features

In this section, we semantically analyze the correlations between each criterion and its most significant feature set to show how the feature set justifies our assessment of that criterion.

To analyze each criterion, we itemized the top 16 most discriminating features by combining the results of the Pearson’s correlation, Logistic Regression and Random Forest feature selection algorithms. The top feature list is presented in Table III. Overall, POSWord count and TF-IDF features are found to be the most significant, and the other features vary from criterion to criterion. Insights gained from determining the relevant features are as follows:

Table III lists, for each criterion (Cr) 1 to 10, the top 16 most correlated features.

Legend: Cr – Criterion; feature types: LIWC features, TF-IDF features, POSWord counts and miscellaneous features; features common to all three feature sets are indicated in bold.

TABLE III: Most discriminating Features [Criteria 1-10]

As criterion 1 is about coverage of the cost of the intervention, it is intuitive to find features associated with money, cost, price, dollars, dollar amounts (thousand, hundred), insurance, etc., each of which appears among the top discriminating features in this study.

Including absolute numbers when quantifying benefit gives readers a better understanding of an intervention. For example, the sentence ‘New drug reduces heart failure risk in half’ can give the reader an overly rosy idea of the intervention. But if the sentence read ‘ risk dropping to a ’ (showing the risk halved), it would sound less dramatic while giving the reader a clearer idea. From the top selected feature subset for criterion 2, we find the TF-IDF feature ‘percent’ and the LIWC feature ‘number’ to be pertinent to the use of absolute numbers in an article. In addition, the TF-IDF features ‘compar’ and ‘trust’ and the LIWC features ‘differ’ and ‘quant’ are also relevant in explaining benefits.

When reading a story about a new intervention, readers expect an explanation of its potential harms and side effects. In our feature set, we found that the POSWord features ‘risks’, ‘cause’, ‘nausea’ (a common side effect of drugs), ‘side’ and ‘died’; the TF-IDF features ‘common’, ‘effect’ and ‘high’; and the LIWC features ‘negate’ and ‘tentat’ are quite meaningful for describing criterion 3.

For readers to grasp the quality of the evidence, a story needs to present a detailed explanation of the study it reports (source, size, type, limitations, etc.). For example, a report published in The Wall Street Journal on an Ebola vaccine stated that this was the ‘first placebo-controlled study of two vaccines against the Ebola virus’ and mentioned its shortcomings as well, and the reviewers rated it ‘Satisfactory’ for criterion 4. In our feature subset, the POSWord features ‘randomly’, ‘assigned’, ‘placebo’, ‘study’, ‘evidence’ and ‘group’ and the TF-IDF feature ‘random-assign’ are closely aligned with this criterion in describing evidentiary details.

Identifying disease mongering is a matter of judgement. From the feature subset found for criterion 5, we can relate the TF-IDF features ‘dry-eye’, ‘suffer’, ‘need’ and ‘inform’ and the POSWord features ‘revealed’ and ‘excessive’ to inflating the seriousness of a condition. For example, using rating scales to diagnose chronic dry-eye is simply an exaggeration of a common disorder. As only articles from our dataset were rated ‘Not Satisfactory’ on this criterion, we found the extracted features less aligned with the definition of disease mongering.

According to criterion 6, news stories about health care interventions should include independent experts, and conflicts of interest of the people quoted should be explored and disclosed. To explore this criterion, we defined a new feature, ‘per_ner_count’, which counts the number of persons referred to in a document, and we found it to be the most relevant feature for describing this criterion. The same holds for the feature ‘Org_ner_count’, which counts the organizations cited in a document. Apart from these, the POSWord features ‘university’, ‘that’, ‘said’, ‘national’, ‘professor’, ‘study’ and ‘involve’ are also aligned with this criterion.

As criterion 7 is about comparing a new intervention with existing alternatives, it is usual for a document to contain comparison words. From our feature subset, we found that the LIWC feature ‘differ’ and the POSWord features ‘than’, ‘not’ and ‘but’ are the most relevant for describing this criterion.

We observe that the feature subsets found for criteria 8, 9 and 10 could not properly describe the properties of these criteria. We found that only , and stories from the criteria 8, 9 and 10 respectively were rated ‘Not Satisfactory’ in our dataset, which may be why our feature selection algorithms failed to identify the discriminating features.

VII Discussion

In this study, we examined the application of a machine learning approach to automate the quality assessment process for web-based health-related information. We found that it is feasible to apply machine learning classifiers to estimate the quality of health-related articles if the classifier can be trained properly. This work is not directly comparable to existing studies because most of them examined the quality of health information from a single-domain perspective (e.g., vaccination [l43], [l44], [l45]; diabetic neuropathy [l11]; reproductive health information [l41]; nutrition coverage [l42]) through a manual process and statistical analysis. We examined articles across the entire health domain, which supports applicability to a broad range of health-related categories. In this context, our work can make the manual reviewing process scalable and save manual labour and time. Our dataset will help researchers contribute to the growing field of health care research. Overall, this automated quality assessment approach may help search engines promote high-quality health information and discourage low-quality articles.

However, our study has some limitations. The expert reviewers used three labels, ‘Satisfactory’, ‘Not Satisfactory’ and ‘Not Applicable’, to characterize the criteria. Criteria that were impossible or unreasonable to apply to a story were rated ‘Not Applicable’ by the review experts. In our study, we removed stories with ‘Not Applicable’ criteria from our training set, as those stories constituted a small part of the whole corpus, and trained our classifiers on two class labels, ‘Satisfactory’ and ‘Not Satisfactory’. As a result, we could not use all articles for each criterion, and the dataset size varied from criterion to criterion (e.g., our dataset for criterion 1 comprised of articles after removing class instances of ‘Not Applicable’ label). In our future study we plan to address this shortcoming.

Another limitation is that our dataset is not large enough for a deep learning framework. We nevertheless trained a deep learning classifier on our dataset and found approximately accuracy over all criteria. In our future work, we plan to enrich our dataset to examine its feasibility from a deep learning perspective.

VIII Conclusion and Future Work

In this paper, we applied a data mining approach to automatically assess the quality of online health articles. Our dataset comprises health-related articles extensively reviewed by a group of experts. Through a pipeline of data pre-processing steps, we refined our data and extracted features to train classifiers. We identified the best feature selection technique for selecting the most relevant feature subset from our feature space, and applied four different classifiers (SVM, Naive Bayes, Random Forest and EnsembleVote) to train models. For our dataset, we found SVM is the best performer achieving accuracy upto to for ten different criteria. We also analyzed the top 16 most correlated features for each of the ten criteria to justify the feasibility of our assessment. We found that our selected features are capable of characterizing the criteria successfully. From our experimental results and analysis, it can be concluded that it is feasible to apply data mining techniques to automate the quality assessment process for online health articles. Given the richness of the dataset and the domain-independent nature of the analysis, the proposed model may serve as a universal standard for appraising the quality of OHA and may reduce, to some extent, the negative impact of misinformation dissemination.

As future work, we will further investigate this problem with a deep learning approach. We also plan to explore the multinomial classification problem to evaluate health-related articles that cannot address some of the specific criteria.