A Machine Learning Application for Raising WASH Awareness in the Times of Covid-19 Pandemic

03/16/2020 ∙ by Rohan Pandey, et al. ∙ IIIT Delhi

Raising awareness proactively while preventing misinformation is a modern-day challenge in all domains, including healthcare. Awareness and sensitization approaches to prevention and containment are important components of a strong healthcare system, especially during outbreaks such as the ongoing Covid-19 pandemic. However, there is a fine balance between continuously raising awareness by providing new information and the risk of spreading misinformation. In this work, we address this gap by creating a life-long learning application that delivers authentic information to users in Hindi, the most widely used local language in India. It does this by matching sources of verified and authentic information, such as WHO reports, against daily news using machine learning and natural language processing. It delivers the narrated content in Hindi using state-of-the-art text-to-speech engines. Finally, the approach allows user input for continuous improvement of news feed relevance on a daily basis. We demonstrate a focused application of this approach to Water, Sanitation, and Hygiene (WASH), which is critical to containing the currently raging Covid-19 pandemic, through the WashKaro Android application. Thirteen combinations of pre-processing strategies, word embeddings, and similarity metrics were evaluated by eight human users via calculation of agreement statistics. The best-performing combination achieved a Cohen's Kappa of 0.54 and was deployed in the WashKaro application back-end. Interventional studies to evaluate the effectiveness of the WashKaro application in preventing WASH-related diseases are planned in the Mohalla clinics, which provided 3.5 million consults in Delhi, India, in 2019. Additionally, the application features human-curated and vetted information to reach out to the community as audio-visual content in local languages.




1 Introduction

Raising healthcare awareness for primary prevention of diseases is a challenge across the globe. Hygiene promotion is the most cost-effective health intervention if accurate content is delivered effectively. A majority of preventable diseases result from unhygienic practices. Water, Sanitation and Hygiene (WASH) measures such as hand-washing are also important in limiting the spread of pandemics such as the currently raging Covid-19. Further, awareness-raising content is often unavailable to those who need it the most, or is not in a format that they easily understand, and this lack has wide socio-economic impacts. In 2017, around 55% of the global population did not use a safely managed sanitation service, in part due to lack of awareness in addition to the lack of facilities at home[8]. Around 827,000 people in low- and middle-income countries die each year as a result of inadequate water, sanitation, and hygiene. A significant proportion of these deaths can be averted by disseminating authentic information, in local languages, about WASH practices and their critical role in preventing diseases. This cuts across Sustainable Development Goals 3 (Good Health and Well Being for All) and 6 (Adequate Sanitation and Hygiene for All).

India is the second-most populous country in the world, with more than 1 billion citizens, of whom a staggering 344 million lack hygienic defecation facilities[2]. The World Health Organisation states that more than 500 children under the age of five die each day from diarrhoea in India alone[2] and estimates that 21 per cent of communicable diseases in India are linked to unsafe water and the lack of hygiene practices.


Ironically, India is also one of the largest and fastest-growing markets for digital consumers, with 560 million internet subscribers in 2018[3], and about 60% of Indian users anticipate that m-Health technologies will improve healthcare within the next three years[4]. This offers a unique opportunity to bridge the gap in information availability through m-Health technologies, reaching those who need the information most in a medium they understand best, e.g. audio delivered in local languages, thus narrowing the divide between these resources and the masses.

The recent pandemic outbreak of Coronavirus (Covid-19) has demonstrated the need for proactive containment and prevention measures, including repeated hand-washing. Every day of proactive intervention lost has an exponential impact, and countries that acted early were able to contain the disease effectively, thus saving thousands of lives and dollars[9]. Therefore, there is a dire need for proactive information, in addition to proactive testing, while preventing the spread of misinformation.

In this work, we demonstrate an awareness-raising solution, WashKaro, that uses NLP approaches, machine learning, and m-Health to combine authentic sources of information with daily news and deliver them in Hindi, the most widely understood local language across India. The application also hosts human-curated and vetted information to reach out to the community as audio-visual content in local languages.

2 Dataset

We have validated our approach using the following datasets:

2.1 WHO Guidelines

This dataset comprises WHO guidelines obtained from publicly available WHO reports, with special emphasis on Water, Sanitation and Hygiene (WASH). It contains more than 400 WHO articles, scraped manually from individual reports owing to the varied format of each report. Each entry consists of the title of the guideline, the guideline text, the category to which it belongs, and the URL of the published WHO report.

Some of the broad categories of these reports are

  • Corona virus Prevention

  • Guidelines for safe recreational water environments

  • Water, sanitation and hygiene in health care facilities

  • Progress on Sanitation and Drinking Water

  • A practical guide to Auditing water safety plans

  • Progress on Drinking Water, Sanitation and Hygiene

2.2 News Article Dataset

This dataset comprises news articles scraped from publicly available news sources. Each entry consists of the article headline, the article text, the URL of the article, and the date of publication. We maintain the following news article datasets:

2.2.1 English

The English News Article dataset consists of news articles extracted from ’The Hindu’. The news articles are filtered using the following keywords: ’Handwash’, ’Hygiene’, ’sanitation’, and ’health’.

2.2.2 Hindi

The Hindi News Article dataset consists of news articles extracted from ’jagran’. The news articles are filtered using the following keywords: ’svachta’, ’safai’, ’haath dhona’, ’saaf’, ’haath ragad’, which are Hindi translations of ’Cleanliness’, ’handwash’, ’hygiene’ and ’sanitation’.

3 Methodology

In this section, we present our proposed public healthcare intervention workflow, designed around imparting healthcare information effectively. Our methodology is represented in Figure 2 and explained further in the following sections.

3.1 Preprocessing

The datasets described in Section 2 need to be preprocessed to convert text from human language into a machine-readable format for further processing. The following preprocessing steps are applied:

Figure 1: Stages in preprocessing
Figure 2: Methodology. The pipeline takes in news articles and WHO reports and constructs two-level sentence similarity between titles and the full-text to construct a relevance score. The relevance score thresholds are continuously tuned as more user data are collected. Finally the relevant texts are subject to text to speech translation for consumption in local language (Hindi).

3.1.1 Removal of Unwanted Characters

All characters except the letters A-Z and the punctuation marks ’.’, ’,’ and ’:’ are removed.

3.1.2 Conversion to lowercase

The entire text is converted into lowercase. Lowercasing significantly helps with consistency of expected output.

3.1.3 Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

3.1.4 Stop word removal

Stop words are the most common words in a language like “the”, “a”, “on”, “is”, “all”. These words do not carry important meaning and are usually removed from texts.

3.1.5 Stemming

Stemming is the process of reducing words to their stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm, which removes common morphological and inflexional endings from words [14], and the Lancaster stemming algorithm, which is more aggressive. We use the Porter stemming algorithm.
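The preprocessing stages of Sections 3.1.1 to 3.1.5 might be sketched in Python as follows. This is a minimal illustration rather than the deployed code: it assumes that "A-Z" covers letters of either case, that whitespace is kept as a token separator, and that the stop-word list is only the small set quoted above. The names `preprocess`, `STOP_WORDS`, and the `stem` parameter are hypothetical; by default tokens are left unstemmed, and a Porter stemmer from an NLP library could be passed in as `stem`.

```python
import re

# Illustrative stop-word subset taken from the examples in the text (assumption:
# a real deployment would use a full stop-word list).
STOP_WORDS = {"the", "a", "on", "is", "all"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Clean, lowercase, tokenize, drop stop words, and optionally stem."""
    # 3.1.1: keep only letters, '.', ',', ':' (whitespace is kept as a separator).
    cleaned = re.sub(r"[^A-Za-z.,:\s]", "", text)
    # 3.1.2: convert to lowercase.
    lowered = cleaned.lower()
    # 3.1.3: split into word tokens and single punctuation tokens.
    tokens = re.findall(r"[a-z]+|[.,:]", lowered)
    # 3.1.4 and 3.1.5: remove stop words, then apply the stemmer callable.
    return [stem(token) for token in tokens if token not in stop_words]


tokens = preprocess("The WHO says: wash hands for 20 seconds!")
```

Passing the stemmer as a callable keeps the sketch free of third-party dependencies while leaving room for the Porter algorithm named in the text.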

Figure 3: Text Similarity Model

3.2 Text Similarity Model

3.2.1 Embeddings

An embedding transforms text into a numerical form that a machine can process. It is one of the most popular representations of document vocabulary and can capture the context of a word in a document, semantic and syntactic similarity, and relations with other words. An embedding is a learned representation of text in which words with the same meaning have similar representations, bridging human understanding of language and that of a machine. Our methodology employs the following embeddings:


Word2Vec[6]: is a statistical method for efficiently learning a standalone word embedding from a text corpus. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another.

Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding: the Continuous Bag of Words (CBOW) model and the skip-gram model. The CBOW model learns the embedding by predicting the current word based on its context. The continuous skip-gram model learns by predicting the surrounding words given a current word. The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings (more dimensions) to be learned from much larger corpora of text (billions of words).
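To make the difference between the two training objectives concrete, the sketch below lists the examples each model would see for a short token sequence. It only builds the (input, output) pairs and does not train any network; the function name `training_examples` and the `window` parameter are illustrative choices, not part of the original work.

```python
def training_examples(tokens, window=2):
    """Build CBOW and skip-gram training examples from a token list.

    CBOW examples are (context_words, target_word): predict the word from its context.
    Skip-gram examples are (target_word, context_word): predict each neighbour from the word.
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            cbow.append((context, target))
        skipgram.extend((target, neighbour) for neighbour in context)
    return cbow, skipgram


cbow, skipgram = training_examples(["wash", "hands", "with", "soap"], window=1)
```

With a window of 1, the word "hands" yields one CBOW example (context "wash", "with" predicting "hands") and two skip-gram examples ("hands" predicting "wash", and "hands" predicting "with").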

GloVe (Global Vectors for Word Representation)[7]: is an extension to the word2vec method for efficiently learning word vectors. Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA), which make good use of global text statistics but are not as good as learned methods like word2vec at capturing meaning and demonstrating it on tasks such as calculating analogies. GloVe marries the global statistics of matrix factorization techniques like LSA with the local context-based learning of word2vec. Rather than training on one local context window at a time, GloVe first aggregates word co-occurrence counts across the whole text corpus into an explicit word-context matrix and then fits word vectors to those global counts. The result is a learning model that may yield generally better word embeddings.

Google Sentence Encoder[1]: encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. The universal-sentence-encoder-large model is trained with a Transformer encoder.

3.2.2 TF-IDF (Term frequency — inverse document frequency)

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

This is composed of 2 parts: TF, which measures how frequently a term occurs in a document, and IDF, which measures how important a term is by giving higher weight to words occurring only in a few documents.


$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times term $i$ appears in document $j$ and $\sum_{k} n_{k,j}$ is the total number of terms in document $j$.

Model Embedding Preprocessing TF-IDF Similarity Metric
1 Word2Vec X X Cosine
2 Word2Vec X Cosine
3 Word2Vec X Cosine
4 Word2Vec Cosine
5 Word2Vec X X Word Mover Distance
6 Word2Vec X Word Mover Distance
7 Glove X X Cosine
8 Glove X Cosine
9 Glove X Cosine
10 Glove Cosine
11 Glove X X Word Mover Distance
12 Glove X Word Mover Distance
13 Google Sentence Encoder X X Cosine
Table 1: Combinations of approaches tested. These included pre-processing, word-embedding models, and similarity metrics for evaluation by eight human users.

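$$\mathrm{IDF}_{i} = \log\frac{N}{df_{i}}$$

where $N$ is the total number of documents and $df_{i}$ is the number of documents in which word $i$ occurs. The tf-idf weight of a term in a document is the product of its TF and IDF values.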
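As a small worked illustration of the TF and IDF definitions in Section 3.2.2, the following Python sketch computes tf-idf weights for a toy corpus of already tokenized documents. It assumes the natural logarithm and no smoothing, which the text does not specify; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document.

    docs is a list of token lists. TF is the term count divided by the document
    length; IDF is log(N / df), with N documents and df documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = len(doc)
        weights.append({
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf([["wash", "hands", "wash"], ["clean", "hands"]])
```

In this toy corpus "hands" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, whereas "wash" and "clean" each occur in only one document and are weighted up.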

3.2.3 Similarity Metric

To generate a similarity score between a news article and a WHO guideline, the following similarity metrics have been used:

Cosine Similarity: is a metric used to measure the similarity between two documents, independent of their size. It calculates similarity by measuring the cosine of the angle between two vectors $A$ and $B$:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Mathematically speaking, cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to document size), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
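The three reference cases mentioned above (same orientation, orthogonal, diametrically opposed) can be checked with a few lines of Python. This is a plain-Python sketch on small toy vectors; in the pipeline the inputs would be document embedding vectors, and the function name `cosine_similarity` is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])   # differs only in magnitude
orthogonal = cosine_similarity([1, 0], [0, 1])       # 90 degrees apart
opposed = cosine_similarity([1, 0], [-1, 0])         # diametrically opposed
```

The first pair differs only in length, which is why the result depends on orientation alone.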

Word Mover Distance: builds on the observation that distances between embedded word vectors are to some degree semantically meaningful. It utilizes this property of word vector embeddings and treats text documents as weighted point clouds of embedded words. The WMD measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of the other document. This distance metric can be cast as an instance of the Earth Mover’s Distance, a well-studied transportation problem for which several highly efficient solvers have been developed.

WMD enables us to assess the (dis)similarity between two documents in a meaningful way, even when they have no words in common. The method uses the bag-of-words representation of the documents (simply put, the word frequencies in the documents). The intuition is that we find the minimum “traveling distance” between documents, in other words the most efficient way to “move” the distribution of document 1 to the distribution of document 2.
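For reference, the transportation problem behind WMD can be written as a linear program, following the standard formulation of the Word Mover's Distance. The symbols are introduced here for illustration and do not appear in the original text: $d_i$ and $d'_j$ are the normalized bag-of-words weights of word $i$ in the first document and word $j$ in the second, $x_i$ is the embedding of word $i$, $n$ is the vocabulary size, and $T_{ij}$ is the amount of word $i$ that "travels" to word $j$.

```latex
\begin{aligned}
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \; & \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(i, j) \\
\text{subject to} \quad & \sum_{j=1}^{n} T_{ij} = d_i \quad \forall i, \\
& \sum_{i=1}^{n} T_{ij} = d'_j \quad \forall j, \\
\text{with} \quad & c(i, j) = \lVert x_i - x_j \rVert_2 .
\end{aligned}
```

The two constraints state that all of the weight of each word in the first document must be moved out, and that each word in the second document must receive exactly its own weight.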

3.3 Cut-off Score

A baseline cut-off similarity score is maintained. Every news article is matched against the WHO guidelines, and each pairing yields a similarity score. All pairs scoring above the cut-off are accepted and published to the user, ensuring that only relevant pairs are delivered. User feedback is obtained for each pair: the cut-off score increases by a small margin when a user marks a pair irrelevant and decreases when a user marks a pair relevant.
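The feedback rule described above could be sketched as follows. The step size of 0.01 and the clamping of the cut-off to the range [0, 1] are assumptions made for illustration, since the text only speaks of "a small margin"; the function names `update_cutoff` and `publishable` are likewise hypothetical.

```python
def update_cutoff(cutoff, marked_relevant, step=0.01, lower=0.0, upper=1.0):
    """Nudge the cut-off after one piece of user feedback.

    A pair marked irrelevant raises the cut-off (stricter filtering); a pair
    marked relevant lowers it. The step size and bounds are assumed values.
    """
    cutoff = cutoff - step if marked_relevant else cutoff + step
    return min(upper, max(lower, cutoff))


def publishable(scored_pairs, cutoff):
    """Keep only (news, guideline, score) triples scoring above the cut-off."""
    return [pair for pair in scored_pairs if pair[2] > cutoff]


cutoff = 0.5
cutoff = update_cutoff(cutoff, marked_relevant=False)  # user marked a pair irrelevant
feed = publishable([("news-1", "who-1", 0.7), ("news-2", "who-2", 0.4)], cutoff)
```

A fixed step keeps the rule simple; a decaying step would be one alternative if the cut-off should settle over time, though the source does not describe such a scheme.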

3.4 Translation and Text2Speech

The news articles and matched WHO guidelines with a relevance score greater than the cut-off are translated into the local language (Hindi) using Google Cloud Platform. Google’s pre-trained neural machine translation delivers fast and dynamic translation results, and Google Cloud’s Text-to-Speech converts the text into human-like speech.

4 Evaluation Metrics

As our entire methodology is aimed at efficiently providing healthcare information to the masses, our success needs to be measured by its acceptability to the masses. Inter-rater reliability is the extent to which two or more raters (or observers, coders, examiners) agree. It addresses the consistency of the implementation of a rating system and can be evaluated using a number of different statistics. High inter-rater reliability values indicate a high degree of agreement between raters, and low values indicate a low degree of agreement. We have evaluated our performance using the following commonly accepted statistics:

4.1 Percentage agreement

Percentage agreement among raters is calculated as the number of agreeing scores divided by the total number of scores. With multiple raters the calculation becomes more complex: for n raters, n(n-1)/2 pairs of raters have to be compared. Moreover, it does not take chance agreement into account and therefore overestimates the level of agreement. Hence it is not reliable on its own.
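One way to extend percentage agreement to several raters, consistent with the n(n-1)/2 comparisons mentioned above, is to pool the agreements over every pair of raters. The sketch below assumes this pooled pairwise definition; the text does not state exactly how the eight raters' scores were combined, and the function name `percent_agreement` is hypothetical.

```python
from itertools import combinations


def percent_agreement(ratings):
    """Pooled pairwise percentage agreement.

    ratings holds one list of labels per rater, all of the same length.
    Every pair of raters is compared item by item.
    """
    agreements = 0
    comparisons = 0
    for rater_a, rater_b in combinations(ratings, 2):
        agreements += sum(x == y for x, y in zip(rater_a, rater_b))
        comparisons += len(rater_a)
    return 100.0 * agreements / comparisons


# Three raters labelling three pairs as relevant (1) or irrelevant (0).
score = percent_agreement([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
```

For three raters there are 3 rater pairs and 9 item comparisons in total, of which 5 agree in this toy example.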

4.2 Cohen Kappa

Cohen’s kappa coefficient [5] is a statistic used to measure inter-rater reliability for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement, since κ takes into account the agreement occurring by chance. Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. To find the coefficient, we use the following formula:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the relative observed agreement among raters (identical to accuracy), and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category.
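A short Python sketch of the two-rater kappa computation defined above follows. It covers exactly two raters; how the scores of the eight raters were aggregated is not specified in the text, so no multi-rater extension is attempted here. The function name `cohen_kappa` is an illustrative choice.

```python
from collections import Counter


def cohen_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_1)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement: sum over categories of the product of each rater's label rates.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    categories = set(counts_1) | set(counts_2)
    p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this example the raters agree on 3 of 4 items ($p_o = 0.75$), chance agreement is $0.5 \cdot 0.25 + 0.5 \cdot 0.75 = 0.5$, and so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.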

5 Experimentation

The first step of our experimentation involved the creation of the datasets mentioned in section 2. Manual scraping of the WHO Guidelines dataset was done and a total of nearly 400 WHO articles were stored in a local CSV file. English News Articles and Hindi News Articles of the current day were also scraped and stored in their respective datasets.

In the first set of experiments, the WHO Guidelines and English News Article datasets were used. These datasets were preprocessed using the steps described in Section 3.1, beginning with the removal of unwanted characters.

After preprocessing, pairs were generated by matching the current day's news articles against the entire WHO guidelines database. Pair generation is followed by evaluating the sentence similarity between the news article text and the WHO article text.

After the pairs are generated, the models listed in Table 1 are applied to each set of pairs. Table 1 specifies the following technical details:

  • Embedding used for each model

  • If preprocessing was done

  • If TF-IDF weighting was done

  • Similarity Metric used

Running these models on each of the four sets of generated pairs yields the relevance scores. The text similarity models work as shown in Figure 3. The scores differ from one model to another.

The relevance scores are sorted, and the top 5 pairs from each similarity model on a particular set of pairs are selected. These top pairs were presented to reviewers to classify as relevant or irrelevant (binary classification). A reviewer classifies a pair as relevant if the news article and the WHO guideline are related and can be grouped together, and as irrelevant if they have nothing in common. Since relevance is subjective, a total of 8 reviewers were used. The reviewers were given a set of 65 stories (the top 5 stories of each of the 13 models) and were asked to press 1 if they thought the news article and the WHO report were relevant to each other and 0 otherwise. To find the best model, percentage agreement and Kappa score were calculated for each model, and the model with the highest Kappa score was selected. The selected model is also used for subsequent tasks.
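The pair generation and top-5 selection described in this section might look like the sketch below. The `similarity` argument stands in for any of the models in Table 1; the `jaccard` function used here is only a word-overlap placeholder so that the example runs without trained embeddings, and it is not one of the evaluated models. All function names are hypothetical.

```python
from itertools import product


def top_pairs(news_articles, guidelines, similarity, k=5):
    """Score every (news article, guideline) pair and return the k best."""
    scored = [
        (news, guideline, similarity(news, guideline))
        for news, guideline in product(news_articles, guidelines)
    ]
    # Highest relevance score first.
    scored.sort(key=lambda pair: pair[2], reverse=True)
    return scored[:k]


def jaccard(text_a, text_b):
    """Placeholder similarity: word-set overlap (not a model from Table 1)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


news = ["wash hands with soap"]
guidelines = ["wash hands often", "boil drinking water"]
best = top_pairs(news, guidelines, jaccard, k=1)
```

Running each of the 13 models through such a function with k=5 would give the 65 pairs that were shown to the reviewers.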

After the model is selected, we translate both the WHO dataset and the news articles into Hindi. The translated text is converted into speech for ease of access. The methodology incorporates a feedback system in which, for each news article and WHO report pair presented, the user can classify the match as relevant or irrelevant. After learning from the feedback, only the stories with a similarity score above the learned baseline (cut-off score) are provided to the end user. This threshold shifts with each piece of feedback, so that only relevant stories are pushed and effective content is delivered.

The same workflow is followed by scraping Hindi news articles, which are converted into English for running sentence similarity models. As done above, the resultant pairs are provided to the users in the form of Hindi text and speech.

Model No. Kappa Score % Agreement
2 0.51746 77.3809
5 0.47997 77.3809
4 0.400599 70.23809
3 0.391801 69.64285
8 0.352647 67.85714
10 0.296398 64.88095
11 0.262283 63.69047
7 0.240861 63.69047
9 0.238095 61.90476
13 0.206349 70.23809
1 0.184815 59.52380
12 0.085714 59.52380
6 0.020408 57.14285
Table 2: Results of inter-rater agreement between eight users on English News Articles. Cohen’s Kappa and Percentage agreement were the two metrics used to evaluate the models.

6 Observations and Results

To evaluate the performance of our healthcare intervention methodology, we relied on the inter-rater reliability metrics described in Section 4. After calculating the similarity scores for a set of news articles and WHO guidelines using the models defined in Table 1, 8 different users were asked to classify the matched pairs as relevant or irrelevant. At this time, the cut-off was set to 0.5, and all pairs with a relevance score greater than the cut-off were provided to these users for rating.

Figure 4: Percentage Agreement Score for models enumerated in Table 1. It is seen that Cohen’s Kappa provided a better discrimination among models and was further used for model selection.
Model No. Kappa Score % Agreement
2 0.54912 79.2420
5 0.47789 77.9810
4 0.389124 68.91235
3 0.41358 70.85732
8 0.38273 66.52417
10 0.332817 65.88290
11 0.261023 62.22067
7 0.278911 64.43729
9 0.258790 62.74201
13 0.23483 67.78201
1 0.167124 60.47632
12 0.10573 58.37529
6 0.0453067 55.78439
Table 3: Results of inter-rater agreement between eight users on Hindi News Articles. Cohen’s Kappa and Percentage agreement were the two metrics used to evaluate the models.

We calculated percentage agreement and Kappa score for these 8 raters on the pairs of news articles and WHO guidelines. The results are shown in Table 2 and Table 3 for the English and Hindi news article datasets, respectively, as described in Section 2.

Figure 5: Cohen’s Kappa Score for models enumerated in Table 1. The model yielding highest agreement among humans was selected for deployment.

In both cases, model 2 (Preprocessing + Word2Vec Embedding + Cosine Similarity) gave the best results, reaching a Kappa score of 0.54912 and 79.2420% agreement on the Hindi dataset (Table 3). As seen in Fig. 5 and Fig. 4, the Hindi News Article dataset provided better results in terms of both metrics. Hence, for the purpose of this app, model 2 was chosen.

7 WashKaro Application

Our Android application WashKaro is available for free download on the Google Play Store at https://play.google.com/store/apps/details?id=inspire2connect.inspire2connect. The application feed displays various news articles related to sanitation and hygiene. Every news article has a corresponding WHO guideline matched with it. The entire news article is provided as text and as an audio file in the local language. The user can switch between the news article and the matched WHO guideline. After going through the article, the user can rate the corresponding match as relevant or irrelevant. The feedback from the user is used to improve the cut-off score.

As this application is targeted towards the lesser-educated section of society, an onboarding section is added to help first-time users. This is followed by an optional questionnaire comprising a few basic questions on patient demographics to help us understand the prevalence and impact of the interventions planned.

8 Conclusion

To the best of our knowledge, this is the first application that demonstrates the use of state-of-the-art machine learning and m-Health technologies to specifically address the issue of ongoing WASH awareness in a local language in India. This is a daily-learning platform that allows user feedback on the relevance of content. The results of the technical approach taken in this work were evaluated by a panel of eight human users to choose the most appropriate model. However, this study has several limitations. The models have been trained on a relatively small corpus and we have only implemented the approach in Hindi, which is the most widely understood language in India. We do plan to incorporate more languages and local context to the application. All the humans evaluating the models were from a similar educational background. We hope to overcome this limitation through the feedback obtained from users of the WashKaro app. We also plan to devise a ranking score for the feedback providers based upon their reputation score for Public Health published via an accompanying website. Finally, the most important limitation is the lack of assessment of the interventional impact of this application. We plan to address this through a phased roll out with the primary health clinics in Delhi and appropriate partnerships delivering digital health interventions on-ground. Regardless, our current work highlights the potential of machine learning, m-Health and natural language processing in addressing primary health challenges and provides a framework for replicating such studies in a variety of public health challenges including the Covid-19 pandemic.

9 Acknowledgements

This work was partly supported by the Wellcome Trust/DBT India Alliance Fellowship IA/CPHE/14/1/501504 awarded to Tavpritesh Sethi. Tavpritesh Sethi also acknowledges support from the Center for Artificial Intelligence at IIIT-Delhi, Mr. Rajesh Ranjan Singh from Wadhwani Initiative for Sustainable Healthcare (WISH, the knowledge partner of Delhi government’s Aam Aadmi Mohalla Clinics), Prof. Rakesh Lodha from AIIMS New Delhi and Mr. Roshan Shankar, advisor to the Government of NCT of Delhi. Rohan Pandey and Vaibhav Gautam acknowledge the Department of Computer Science, Shiv Nadar University.


References

  • [1] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • [2] India’s water and sanitation crisis. https://water.org/our-impact/india/
  • [3] N. Kaka, A. Madgavkar, A. Kshirsagar, R. Gupta, J. Manyika, K. Bahl, and S. Gupta (2019) Digital India: technology to transform a connected nation. McKinsey Global Institute, March.
  • [4] D. Levy, C. Wasden, D. DiFilippo, and P. Sur (2012) Emerging mHealth: paths for growth. PwC M-Health, pp. 1–44.
  • [5] M. L. McHugh (2012) Interrater reliability: the kappa statistic. Biochemia Medica 22 (3), pp. 276–282.
  • [6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • [7] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [8] Progress on household drinking water, sanitation and hygiene 2000-2017. Special focus on inequalities. New York: United Nations Children’s Fund (UNICEF) and World Health Organization, 2019. https://www.who.int/water_sanitation_health/publications/jmp-report-2019/en/
  • [9] C. J. Wang, C. Y. Ng, and R. H. Brook (2020) Response to COVID-19 in Taiwan: big data analytics, new technology, and proactive testing. JAMA.