Coronavirus disease (COVID-19) (WHO, 2020) is an infectious disease caused by a new virus that had not been previously identified in humans; this respiratory illness (with symptoms such as cough, fever and pneumonia) was first identified during an investigation into an outbreak in Wuhan, China in December 2019 and is now rapidly spreading in the U.S. and globally. The novel coronavirus and its deadly outbreak have posed grand challenges to human society. As of March 26, 2020, there have been 85,377 confirmed cases and 1,293 reported deaths in the U.S. (Figure 1), and the WHO has characterized COVID-19, which has infected more than 531,000 people and caused more than 24,000 deaths in at least 171 countries, as a global pandemic.
It is believed that the novel virus which causes COVID-19 emerged from an animal source, but it is now rapidly spreading from person to person through various forms of contact. According to the Centers for Disease Control and Prevention (CDC) (CDC, 2020c), the coronavirus seems to be spreading easily and sustainably in the community, i.e., community transmission, which means that people have been infected with the virus in an area, including some who are not sure how or where they became infected. An example of community transmission that caused the outbreak of COVID-19 in King County, Washington State (WA) is shown in Figure 2. The challenge with community transmission is that carriers are often asymptomatic and unaware that they are infected, and through their movements within the community they spread the disease. According to the CDC, before a vaccine or drug becomes widely available (which is the case for COVID-19 so far), community mitigation, a set of actions that persons and communities can take to help slow the spread of respiratory virus infections, is the most readily available intervention to help slow transmission of the virus in communities (CDC, 2020d). A growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so that we can better understand the spread of COVID-19 and thus better respond with actionable strategies for community mitigation.
Unlike the 1918 influenza pandemic (CDC, 2020a), where the global scope and devastating impacts were only determined well after the fact, COVID-19 history is being written daily, if not hourly. If the right types of data can be acquired and analyzed, there is the potential to improve self-awareness of the risk to the population and to develop proactive (rather than reactive) interventions that can halt the exponential growth in the disease that is currently being observed. With the opportunity of real-time surveillance comes a challenge: the available data are uncertain and incomplete, while we need to provide mitigation strategies objectively, with caution and rigor (i.e., enable people to select appropriate actions to protect themselves at increased risk of COVID-19 while minimizing disruptions to daily life to the extent possible).
To address the above challenge, leveraging our long-term and successful experience in combating and mitigating widespread malware attacks using AI-driven techniques (Ye et al., 2019; Hou et al., 2019; Li et al., 2019; Fan et al., 2018; Ye et al., 2017b, a; Hou et al., 2017; Chen et al., 2017a, b; Fan et al., 2016; Ye et al., 2011; Ye et al., 2010a, b, 2009, 2008, 2007), in this work we propose to design and develop an AI-driven system that provides hierarchical community-level risk assessment, as a first attempt to help combat the fast evolving COVID-19 pandemic by using the large-scale and real-time data generated from heterogeneous sources. In our developed system (named -Satellite), we first develop a set of tools to collect and preprocess the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media; we then devise advanced AI-driven techniques to provide hierarchical community-level risk assessment to enable actionable strategies for community mitigation. More specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life.
The framework of our proposed and developed system is shown in Figure 3. In the system of -Satellite, (1) we first construct an attributed heterogeneous information network (AHIN) to model the collected large-scale and real-time pandemic related data in a comprehensive way; (2) based on the constructed AHIN, to address the challenge of limited data that might be available for learning (e.g., social media data to learn public perceptions towards COVID-19 in a given area might not be sufficient), we then exploit conditional generative adversarial nets (cGANs) to learn the public perceptions towards COVID-19 in each given area; and finally (3) we utilize meta-path based schemes to model both vertical and horizontal information associated with a given area, and devise a novel heterogeneous graph auto-encoder (GAE) to aggregate information from its neighborhood areas to estimate the risk of the given area in a hierarchical manner. The developed system -Satellite and the generated benchmark datasets have been made publicly accessible through our website.
2. Related Work
There have been several works on using AI and machine learning techniques to help combat COVID-19: in the biomedical domain, (Xu et al., 2020; Chen et al., 2020; Wang et al., 2020; Song et al., 2020b; Randhawa et al., 2020) use deep learning methods for COVID-19 pneumonia diagnosis and genome study, while (Yan et al., 2020; Shi et al., 2020) develop learning-based models to predict severity and survival for patients. Another research direction is to utilize publicly accessible data to help estimate infection cases or forecast the COVID-19 outbreak (Hu et al., 2020; Hermanowicz, 2020; Jahanbin and Rahmanian, 2020; Majumder and Mandl, 2020; Song et al., 2020a; Rao and Vazquez, 2020; Zhu et al., 2020). However, most of these existing works focus on Wuhan, China; studies using computational models to combat COVID-19 in the U.S. are scarce, and there has been no work so far on community-level risk assessment to assist with community mitigation. To meet this urgent need and to bridge the research gap, in this work, by advancing capabilities of AI and leveraging the large-scale and real-time data generated from heterogeneous sources, we propose and develop an AI-driven system, named -Satellite, to provide hierarchical community-level risk assessment as a first attempt to help combat the deadly and fast evolving COVID-19 pandemic.
3. Proposed Method
In this section, we will introduce our proposed method integrated in the system of -Satellite to automatically provide hierarchical community-level risk assessment related to COVID-19 in detail.
3.1. Data obtained from Heterogeneous Sources
Realizing the true potential of real-time surveillance requires identifying the proper data sources, based on which we can devise models to extract meaningful and actionable information for community mitigation. Since relying on a single data source for estimation and prediction often results in unsatisfactory performance, we develop a set of crawling tools and preprocessing methods to collect and parse the large-scale and real-time pandemic related data from multiple sources, which include the following.
Disease related data. We collect the up-to-date county-based coronavirus related data including the numbers of confirmed cases, new cases, deaths and the fatality rate, from i) official public health organizations such as WHO, CDC, and county government websites, and ii) digital media with real-time updates of COVID-19 (e.g., 1point3acres, https://coronavirus.1point3acres.com/en). The collected up-to-date county-based COVID-19 related statistical data can be an important element for risk assessment of an associated area.
Demographic data. The United States Census Bureau (https://www.census.gov/quickfacts/fact/table/US/PST045219) provides the demographic data including basic population, business, and geography statistics for all states and counties, and for cities and towns with more than 5,000 people. The demographic information will contribute to the risk assessment of an associated area: for example, as older adults may be at higher risk for more serious complications from COVID-19 (CDC, 2020b; Surveillances, 2020), the age distribution of a given area can be considered as an important input. In this work, given a specific area, we mainly consider the associated demographic data including the estimated population, population density (e.g., number of people per square mile), age and gender distributions.
Mobility data. Given a specific area (either user input or automatic positioning), a mobility measure that estimates how busy the area is in terms of traffic density will be obtained from location service providers (i.e., Google maps).
User generated data from social media. As users in social media are likely to discuss and share their experiences of COVID-19, the data from social media may contribute complementary knowledge such as public perceptions towards COVID-19 in the area they associate with. In this work, we initialize our efforts with the focus on Reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of COVID-19. In particular, we consider i) three subreddits with general discussion (i.e., r/Coronavirus (https://www.reddit.com/r/Coronavirus/), r/COVID19 (https://www.reddit.com/r/COVID19/) and r/CoronavirusUS (https://www.reddit.com/r/CoronavirusUS/)); ii) four region-based subreddits, which are r/CoronavirusMidwest, r/CoronavirusSouth, r/CoronavirusSouthEast and r/CoronavirusWest; and iii) 48 state-based subreddits (i.e., Washington, D.C. and 47 states). To analyze public perceptions towards COVID-19 for a given area (note that all users are anonymized for analysis using hash values of usernames), we first exploit the Stanford Named Entity Recognizer (Finkel et al., 2005) to extract the location-based information (e.g., county, city), and then utilize tools such as NLTK (Bird et al., 2009) to conduct sentiment analysis (i.e., positive, neutral or negative). More specifically, positive denotes being well aware of COVID-19, while negative indicates being less aware of COVID-19. For example, consider the post by a user (with hash value of “CF***6”) in the subreddit r/CoronaVirusPA on March 14, 2020: “I live in Montgomery County, PA and everyone here is acting like there’s nothing going on.” From this post, the location-related information of Montgomery County and the state of Pennsylvania (i.e., PA) can be extracted, and a user’s perception towards COVID-19 in Montgomery County, PA can be learned (i.e., negative, indicating less aware of COVID-19). Such automatically extracted knowledge will be incorporated into the risk assessment of the related area; meanwhile, it can also provide important information to help inform and educate about the science of coronavirus transmission and prevention.
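The anonymization and per-area aggregation described above can be sketched as follows. This is a minimal illustration rather than the system's actual pipeline: the choice of SHA-256, the label-to-awareness mapping in `AWARENESS`, the averaging rule and the example usernames are all assumptions introduced here, and the sentiment labels are taken as given instead of being produced by NLTK.

```python
import hashlib
from collections import defaultdict

# Hypothetical mapping from a sentiment label to an awareness value in [0, 1].
AWARENESS = {"negative": 0.0, "neutral": 0.5, "positive": 1.0}

def anonymize(username: str) -> str:
    """Replace a username with its SHA-256 hex digest (hash choice is an assumption)."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()

def area_awareness(posts):
    """Average the awareness values of already-labeled posts for each extracted area."""
    by_area = defaultdict(list)
    for post in posts:
        by_area[post["area"]].append(AWARENESS[post["sentiment"]])
    return {area: sum(vals) / len(vals) for area, vals in by_area.items()}

# Illustrative records; usernames and labels are made up for the example.
posts = [
    {"user": anonymize("example_user_1"), "area": "Montgomery County, PA", "sentiment": "negative"},
    {"user": anonymize("example_user_2"), "area": "Montgomery County, PA", "sentiment": "neutral"},
]
scores = area_awareness(posts)
```

Averaging is only one way to turn post-level labels into a single value per area; a real deployment would also need to decide how to weight posts by recency or volume.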
3.2. AHIN built from Collected Data
To comprehensively describe a given area for its risk assessment related to COVID-19, based on the data collected from multiple sources above, we consider and extract higher-level semantics as well as social and behavioral information within the communities.
Attributed Features. Based on the collected data above, we further extract the corresponding attributed features.
A1: disease related feature. For a given area, its related COVID-19 pandemic data will be extracted including the numbers of confirmed cases, new cases, deaths and the fatality rate, which is represented by a numeric feature vector. For example, as of March 22, 2020, Cuyahoga County in the state of Ohio (OH) has had 125 confirmed cases, 33 new cases, 1 death and a 0.8% fatality rate, which can be represented as $\langle 125, 33, 1, 0.008 \rangle$.
A2: demographic feature. Given a specific area, we obtain its associated city’s (or town’s) demographic data from the United States Census Bureau, including the estimated population, population density (i.e., number of people per square mile), age distribution (i.e., percentage of people over 65 years old) and gender distribution (i.e., percentage of females). For example, to assist with the risk assessment of the area of Euclid Ave in Cleveland, OH, the obtained demographic data associated with it are: Cleveland with a population of 383,793, a population density of 5,107, 13.5% of people over 65 years old, and 51.8% females, which will be represented as $\langle 383793, 5107, 0.135, 0.518 \rangle$.
A3: mobility feature. Given a specific area, a mobility measure that estimates how busy the area is in terms of traffic density will be obtained from Google maps, which will be represented by five degree levels (i.e., [1,5], the larger the number the busier).
A4: representation of public perception. After performing the automatic sentiment analysis based on the collected posts associated with a given area from Reddit, the public perceptions towards COVID-19 in this area will be represented by a normalized value (i.e., [0,1]) indicating the awareness of COVID-19 (i.e., the larger the value the more aware). For the previous example of the Reddit post “I live in Montgomery County, PA and everyone here is acting like there’s nothing going on.”, a related perception towards COVID-19 in Montgomery County, PA will be formulated as a numeric value, denoting that people in this area were less aware of COVID-19 on March 14, 2020.
After extracting the above features, we concatenate them as a normalized attributed feature vector A attached to each given area for representation, i.e., $A = A_1 \oplus A_2 \oplus A_3 \oplus A_4$. Note that we zero-pad the elements for which the data are not available.
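To make the concatenation and zero-padding concrete, the sketch below builds one feature vector from the A1 to A4 blocks. The text does not state how the values are normalized, so the per-element maxima (`A1_MAX` to `A4_MAX`) and the divide-and-clip rule are assumptions chosen purely for illustration; the raw inputs reuse the Cuyahoga County and Cleveland figures quoted above, with a mobility level of 3 and no Reddit data.

```python
# Hypothetical per-element maxima used only to scale raw values into [0, 1].
A1_MAX = [1000.0, 500.0, 100.0, 1.0]        # confirmed, new cases, deaths, fatality rate
A2_MAX = [1_000_000.0, 30_000.0, 1.0, 1.0]  # population, density, share over 65, share female
A3_MAX = [5.0]                              # mobility level in [1, 5]
A4_MAX = [1.0]                              # awareness in [0, 1]

def scale(values, maxima):
    """Divide each raw value by its assumed maximum and clip the result to [0, 1]."""
    return [min(max(v / m, 0.0), 1.0) for v, m in zip(values, maxima)]

def build_feature_vector(a1=None, a2=None, a3=None, a4=None):
    """Concatenate A1..A4, zero-padding any block whose data are unavailable."""
    vector = []
    for values, maxima in ((a1, A1_MAX), (a2, A2_MAX), (a3, A3_MAX), (a4, A4_MAX)):
        if values is None:
            vector.extend([0.0] * len(maxima))
        else:
            vector.extend(scale(values, maxima))
    return vector

vec = build_feature_vector(
    a1=[125, 33, 1, 0.008],           # Cuyahoga County, March 22, 2020
    a2=[383793, 5107, 0.135, 0.518],  # Cleveland demographics
    a3=[3],                           # mobility level
    a4=None,                          # no perception data: zero-padded
)
```

One consequence of zero-padding is that a missing perception value is indistinguishable from a perception value of 0, which is part of the motivation for enriching missing values rather than leaving them padded.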
Relation-based Features. Besides the above extracted attributed features, we also consider the rich relations among different areas.
R1: administrative affiliation. According to the severity of COVID-19, available resources and impacts to the residents, different states may have different policies, actionable strategies and orders in response to COVID-19. Therefore, given an area, we accordingly extract its administrative affiliation in a hierarchical manner. Particularly, we acquire the state-include-county and county-include-city relations from City-to-County Finder (http://www.stats.indiana.edu/uspr/a/place_frame.html).
R2: geospatial relation. We also consider the geospatial relations between a given area and its neighborhood areas. More specifically, given an area, we retain its $k$-nearest neighbors at the same hierarchical level by calculating the Euclidean distances based on their global positioning system (GPS) coordinates obtained from Google maps and Wikipedia (https://en.wikipedia.org/wiki/User:Michael_J/County_table).
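The neighbor selection behind R2 can be sketched as a plain nearest-neighbor search over coordinate pairs. The area names and coordinates below are made-up placeholders, and the function name `k_nearest` is introduced here for illustration only.

```python
import math

def k_nearest(target, coordinates, k):
    """Return the k areas closest to `target` by Euclidean distance over (lat, lon)."""
    t_lat, t_lon = coordinates[target]
    distances = [
        (math.hypot(lat - t_lat, lon - t_lon), name)
        for name, (lat, lon) in coordinates.items()
        if name != target
    ]
    return [name for _, name in sorted(distances)[:k]]

# Placeholder areas at the same hierarchical level (coordinates are illustrative).
coordinates = {
    "area_a": (41.5, -81.7),
    "area_b": (41.4, -81.6),
    "area_c": (41.1, -81.5),
    "area_d": (40.0, -83.0),
}
nearest = k_nearest("area_a", coordinates, k=2)
```

Euclidean distance on raw latitude and longitude follows the description above; it is a reasonable approximation for nearby areas, though it is not a true ground distance.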
AHIN Construction. Given the rich semantics and complex relations extracted above, it is important to model them in a proper way so that different relations can be handled better and more easily. To solve this problem, we introduce an AHIN to model them, which can be composed of different types of entities associated with attributed features and different types of relations. We first present the concepts related to AHIN below.
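Definition 1.
Attributed Heterogeneous Information Network (AHIN) (Li et al., 2017): Let $\mathcal{T} = \{T_1, \ldots, T_m\}$ be a set of $m$ entity types, $\mathcal{X}_i$ be the set of entities of type $T_i$ and $\mathcal{A}_i$ be the set of attributes defined for entities of type $T_i$. An AHIN is defined as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A})$ with an entity type mapping $\phi$: $\mathcal{V} \to \mathcal{T}$ and a relation type mapping $\psi$: $\mathcal{E} \to \mathcal{R}$, where $\mathcal{V} = \bigcup_{i=1}^{m} \mathcal{X}_i$ denotes the entity set and $\mathcal{E}$ is the relation set, $\mathcal{T}$ denotes the entity type set and $\mathcal{R}$ is the relation type set, $\mathcal{A} = \bigcup_{i=1}^{m} \mathcal{A}_i$, and $|\mathcal{T}| + |\mathcal{R}| > 2$. Network Schema (Li et al., 2017): The network schema of an AHIN $\mathcal{G}$ is a meta-template for $\mathcal{G}$, denoted as a directed graph $\mathcal{T}_{\mathcal{G}} = (\mathcal{T}, \mathcal{R})$ with nodes as entity types from $\mathcal{T}$ and edges as relation types from $\mathcal{R}$.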
In this work, we have four types of entities (i.e., nation, state, county and city, $|\mathcal{T}| = 4$), two types of relations (i.e., R1 and R2, $|\mathcal{R}| = 2$), and each entity is attached with an attributed feature vector as described above. Based on the definitions, the network schema of AHIN in our case is shown in Figure 4.
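As a rough illustration of the structure just described, an AHIN with four entity types and two relation types can be held in a small typed graph. The class name `AHIN`, its methods and the 10-dimensional placeholder feature vectors are hypothetical choices made for this sketch, not the system's actual data model.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = {"nation", "state", "county", "city"}
RELATION_TYPES = {"R1", "R2"}  # R1: administrative affiliation, R2: geospatial relation

@dataclass
class AHIN:
    node_type: dict = field(default_factory=dict)  # node name -> entity type
    features: dict = field(default_factory=dict)   # node name -> attributed feature vector
    edges: set = field(default_factory=set)        # (source, relation, target) triples

    def add_node(self, name, entity_type, feature_vector):
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.node_type[name] = entity_type
        self.features[name] = list(feature_vector)

    def add_edge(self, source, relation, target):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((source, relation, target))

    def neighbors(self, node, relation):
        """Targets reachable from `node` through one edge of the given relation type."""
        return sorted(t for s, r, t in self.edges if s == node and r == relation)

graph = AHIN()
placeholder = [0.0] * 10  # stand-in for a real attributed feature vector
for name, kind in [("US", "nation"), ("OH", "state"),
                   ("Cuyahoga County", "county"), ("Cleveland", "city")]:
    graph.add_node(name, kind, placeholder)
graph.add_edge("US", "R1", "OH")
graph.add_edge("OH", "R1", "Cuyahoga County")
graph.add_edge("Cuyahoga County", "R1", "Cleveland")
```

Storing edges as typed triples keeps the relation type explicit, which is what the meta-path machinery later relies on.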
3.3. AHIN Enrichment by cGAN
Although the constructed AHIN can model the complex and rich relations among different entities attached with attributed features, a challenge remains: there might be missing values of attributed features attached to the entities in the AHIN because of the limited data available for learning. More specifically, given an area, there may not be sufficient social media data (i.e., Reddit data in this work) to learn the public perceptions towards COVID-19 in this area. For example, for the state of Montana, as of March 22, 2020, there have been only 12 posts by seven users discussing the virus in its corresponding subreddit r/CoronavirusMontana. To address this issue, we propose to exploit cGANs (Mirza and Osindero, 2014) for synthetic (virtual) social media user data generation for public perception learning to enrich the AHIN.
Different from traditional GANs (Goodfellow et al., 2014), a cGAN is a conditional model extended from GANs, where both the generator and discriminator are conditioned on some extra information. In our case, we propose to exploit cGAN to generate synthetic posts for those areas where the data are not available. In our designed cGAN, given an area where Reddit data are not available, the condition is composed of three parts: the disease related feature vector in this area $A_1$, its related demographic feature vector $A_2$ and its GPS coordinate denoted as $o$. As shown in Figure 5, the generator $G$ in the devised cGAN takes the prior noise $z$, together with the conditions $A_1$, $A_2$ and $o$, as inputs to generate the synthetic posts represented by latent vectors; while in the discriminator $D$, real post representations obtained by using doc2vec (Le and Mikolov, 2014) or generated synthetic post latent vectors, along with $A_1$, $A_2$ and $o$, are fed to a discriminative function. Both the generator and the discriminator can be a non-linear mapping function, such as a multi-layer perceptron (MLP). The generator and discriminator play the adversarial minimax game formulated as the following minimax problem:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x \mid A_1, A_2, o)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid A_1, A_2, o) \mid A_1, A_2, o)\big)\big]$$
The generator and discriminator are trained simultaneously: adjusting parameters for the generator to minimize $\log\big(1 - D(G(z \mid A_1, A_2, o) \mid A_1, A_2, o)\big)$ while adjusting parameters for the discriminator to maximize the probability of assigning the correct labels to both training examples and generated samples.
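The adversarial game described above follows the standard conditional GAN objective of Mirza and Osindero (2014), in which the discriminator's log-probability on real samples is added to the log of one minus its probability on generated samples. The sketch below evaluates an empirical version of that value on already-computed discriminator outputs; the function name `cgan_value` and the example probabilities are assumptions for illustration, and no networks are trained here.

```python
import math

def cgan_value(d_real, d_fake):
    """Empirical V(D, G): mean log D(real | cond) + mean log(1 - D(fake | cond)).

    d_real: discriminator probabilities on real post representations.
    d_fake: discriminator probabilities on generated (synthetic) post vectors.
    Both are assumed to be already conditioned on the area's features.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from synthetic posts well ...
confident = cgan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ... versus one that is being fooled by the generator.
fooled = cgan_value(d_real=[0.9, 0.8], d_fake=[0.6, 0.7])
```

The discriminator pushes this value up while the generator pushes the second term down, so `fooled` being lower than `confident` reflects a generator that is doing better.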
After applying cGAN for synthetic post latent vector generation, we further exploit a deep neural network (DNN) to learn the public perceptions towards COVID-19 in this area. More specifically, we first use doc2vec to obtain the representations of real posts collected from Reddit and feed them to train the DNN model; then, given a generated synthetic post latent vector, we use the trained model to obtain its related perception (i.e., awareness of COVID-19).
3.4. Hierarchical Risk Assessment
Meta-path Expression. To assist with the risk assessment of a given area related to the fast evolving COVID-19, it might not be sufficient if only considering its vertical information (e.g., its related city, county or state’s responses, strategies and policies); the horizontal information (i.e., information from its neighborhood areas) will also be important inputs. To comprehensively integrate both vertical and horizontal information, we propose to exploit the concept of meta-path (Sun et al., 2011) to formulate the relatedness among different areas in the constructed AHIN.
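Definition 2.
Meta-path. A meta-path $\mathcal{P}$ is a path defined on the network schema $\mathcal{T}_{\mathcal{G}} = (\mathcal{T}, \mathcal{R})$, and is denoted in the form of $T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_L} T_{L+1}$, which defines a composite relation $R = R_1 \cdot R_2 \cdot \ldots \cdot R_L$ between types $T_1$ and $T_{L+1}$, where $\cdot$ denotes the relation composition operator, and $L$ is the length of $\mathcal{P}$.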
Figure 6.(a) shows our designed meta-paths (i.e., P1-P3). For example, P1 denotes that, to assess the risk of a specific city, we not only consider the city itself, but also the information from its related county and nearby cities.
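One way to read the city-level meta-path is as a rule for collecting the nodes whose information reaches a given city: its county and its nearby cities. The sketch below implements that reading on a toy edge list. The exact composition of P1 is given in Figure 6 and is not reproduced here, so the function `p1_neighbors`, the placeholder node names, and the choice to treat R2 as symmetric are assumptions.

```python
# (source, relation, target) triples; R1 = administrative inclusion, R2 = geospatial proximity.
EDGES = [
    ("county_x", "R1", "city_a"),
    ("county_x", "R1", "city_b"),
    ("city_a", "R2", "city_b"),
    ("city_a", "R2", "city_c"),
]

def p1_neighbors(city, edges):
    """Collect the county that includes `city` and the cities linked to it by R2."""
    counties = {s for s, r, t in edges if r == "R1" and t == city}
    nearby = {t for s, r, t in edges if r == "R2" and s == city}
    nearby |= {s for s, r, t in edges if r == "R2" and t == city}  # assumed symmetric
    return sorted(counties | nearby)

neighbors_of_a = p1_neighbors("city_a", EDGES)
```

Because the neighbor set mixes a county with cities, any aggregation over it has to account for the relation type of each edge, which motivates the relation-aware encoder.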
Heterogeneous Graph Auto-encoder. Given a node (i.e., area) in the constructed AHIN, guided by its corresponding meta-path scheme (i.e., city level guided by P1, county level guided by P2, and state level guided by P3), to aggregate the information propagated from its neighborhood nodes, we propose a heterogeneous graph auto-encoder (GAE) model to achieve this goal. The designed heterogeneous GAE model consists of an encoder and a decoder: the encoder aims at encoding meta-path based propagation to a latent representation, and the decoder will reconstruct the topological information from the representation.
Encoder. We here exploit the attentive mechanism (Veličković et al., 2017; Fan et al., 2019; Wang et al., 2019) to devise the encoder: it first searches the meta-path based neighbors $\mathcal{N}_i$ of each node $i$, and then each node attentively aggregates information from its neighbors. To learn the importance of the information from neighborhood nodes, we first represent each relation type $r$ in the constructed AHIN by $R_r \in \mathbb{R}^{d \times d}$, where $d$ denotes the dimension of the attributed feature vector; the attentive weight $w_{ij}$ of node $j$ (a neighbor of $i$) then indicates the relevance of these two nodes measured in terms of the space $R_r$, computed from $a_i$ and $a_j$, the attributed feature vectors attached to nodes $i$ and $j$. We further normalize the weights across all the neighbors of $i$ by applying the softmax function:
$$\hat{w}_{ij} = \frac{\exp(w_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(w_{ik})}$$
Then, the neighbors’ representations can be formulated as the linear combination:
$$a_{\mathcal{N}_i} = \sum_{j \in \mathcal{N}_i} \hat{w}_{ij} \, a_j$$
where the weight $\hat{w}_{ij}$ indicates the information propagated from $j$ to $i$ in terms of relation $r$. Finally, we aggregate $i$’s representation $a_i$ and its neighbors’ representation $a_{\mathcal{N}_i}$ to obtain the latent representation of $i$.
Decoder. The decoder is used to reconstruct the network topological structure. More specifically, based on the latent representations generated from the encoder, the decoder is trained to predict whether there is a link between two nodes in the constructed AHIN.
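The encoder and decoder steps can be sketched end to end on small vectors. Several pieces are assumptions made only so that the example is complete: the bilinear score $a_i^\top R_r a_j$ as the unnormalized attentive weight, the element-wise sum used to merge a node with its neighbors, and the inner-product decoder with a sigmoid, which is the common choice in graph auto-encoders. None of these should be read as the exact formulation used by the system.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode(a_i, neighbors, relation_matrices):
    """Attentively aggregate neighbor features into a latent representation.

    neighbors: list of (relation_type, feature_vector) pairs.
    relation_matrices: relation_type -> d x d matrix (list of rows).
    """
    # Assumed bilinear relevance score in the relation space: a_i^T R_r a_j.
    scores = [dot(a_i, matvec(relation_matrices[r], a_j)) for r, a_j in neighbors]
    weights = softmax(scores)
    # Linear combination of neighbor features with the normalized weights.
    a_n = [sum(w * a_j[k] for w, (_, a_j) in zip(weights, neighbors))
           for k in range(len(a_i))]
    # Assumed aggregation: element-wise sum of the node and its neighborhood.
    return [x + y for x, y in zip(a_i, a_n)], weights

def decode(z_i, z_j):
    """Assumed inner-product decoder: probability of a link between two nodes."""
    return 1.0 / (1.0 + math.exp(-dot(z_i, z_j)))

identity = [[1.0, 0.0], [0.0, 1.0]]
relation_matrices = {"R1": identity, "R2": identity}
z, weights = encode([1.0, 0.0],
                    [("R1", [1.0, 0.0]), ("R2", [0.0, 1.0])],
                    relation_matrices)
link_probability = decode(z, z)
```

With identity relation matrices the score reduces to a plain dot product, so the neighbor whose features align with the node receives the larger weight.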
To this end, leveraging the latent representations learned from the heterogeneous GAE, the risk index of a given area is calculated from the elements of its representation, where $\lambda_i$ is an adjustable parameter that can be specified by human experts, indicating the importance of the $i$-th element (e.g., the number of confirmed cases, population density, age distribution, mobility measure, etc.) in the rapidly changing situation.
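As an illustration of how expert-specified importance parameters could turn a representation into a single index, the sketch below uses a normalized weighted average clipped to [0, 1]. The weighted-average form, the clipping, the function name `risk_index` and the example numbers are all assumptions; the description above only fixes that each element has an adjustable importance.

```python
def risk_index(representation, importance):
    """Weighted average of representation elements, clipped to [0, 1] (assumed form)."""
    if len(representation) != len(importance):
        raise ValueError("one importance weight is required per element")
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance weights must sum to a positive value")
    score = sum(w * x for w, x in zip(importance, representation)) / total
    return min(max(score, 0.0), 1.0)

# Illustrative values: three elements, the first weighted twice as heavily.
index = risk_index([0.2, 0.8, 0.5], importance=[2.0, 1.0, 1.0])
```

Normalizing by the sum of the weights keeps the index in [0, 1] whenever the elements are, so experts can rescale the importance values without changing the range of the output.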
4. System Development, Benchmark Datasets and Case Studies
Because of the critical need to act promptly and deliberately in this rapidly changing situation, we have deployed our developed system -Satellite (i.e., an AI-driven system to automatically provide hierarchical community-level risk assessment related to COVID-19) for public test. Given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life. The system is available at https://COVID-19.yes-lab.org, which also includes a brief description and disclaimer of the system as well as the following benchmark datasets.
4.1. Benchmark Datasets for Public Use
Data Collection and Preprocessing. We have developed a set of crawling and preprocessing tools to collect and parse the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations and digital media, demographic data, mobility data, and user generated data from social media (i.e., Reddit). We have made our collected and preprocessed data available for public use through the above link. We describe each publicly accessible benchmark dataset in detail below.
Disease related dataset. According to simplemaps (https://simplemaps.com/), the U.S. includes 50 states, Washington, D.C. and Puerto Rico as well as 3,203 counties and 28,889 cities. We have collected the up-to-date county-based coronavirus related data including the numbers of confirmed cases, new cases, deaths and the fatality rate, from official public health organizations (e.g., WHO, CDC, and county government websites) and digital media with real-time updates of COVID-19 (e.g., 1point3acres). To date, we have collected these data from 1,531 counties and 52 states (including Washington, D.C. and Puerto Rico) on a daily basis from Feb. 28, 2020 to March 25, 2020.
Demographic and mobility dataset. We parse the demographic data collected from the United States Census Bureau (data updated on July 1, 2019) in a hierarchical manner: for each city, county or state in the U.S., the dataset includes its estimated population, population density (e.g., number of people per square mile), age and gender distributions. To date, we have made the demographic and mobility dataset available for public use, including the information of estimated population, population density, and GPS coordinates for 28,889 cities, 3,203 counties and 52 states (including Washington, D.C. and Puerto Rico).
Social media data from Reddit. In this work, we initialize our efforts on social media data with the focus of public perception analysis on Reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of COVID-19. In particular, we have collected and analyzed 48 state-based subreddits (i.e., Washington, D.C. and 47 states). To date, we have crawled and automatically analyzed 22,992 posts by 8,948 users on Reddit, associated with 182,554 comments by 30,147 users, on the discussion of COVID-19 from February 17, 2020 to March 25, 2020. Along with these data, this publicized dataset also includes the extracted locations of the posts using the Stanford Named Entity Recognizer.
Constructed AHIN. Based on our designed AHIN network schema in this work (shown in Figure 4), the constructed AHIN has 32,145 nodes (i.e., 1 node with type of nation, 52 nodes with type of state, 3,203 nodes with type of county, 28,889 nodes with type of city) and 96,459 edges (including 32,144 edges with relation type of R1 and 64,315 edges with relation type of R2).
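The reported sizes of the constructed AHIN can be checked with a few lines of arithmetic. The dictionary names below are introduced only for this check; the counts are the ones stated above.

```python
nodes_by_type = {"nation": 1, "state": 52, "county": 3203, "city": 28889}
edges_by_relation = {"R1": 32144, "R2": 64315}

total_nodes = sum(nodes_by_type.values())
total_edges = sum(edges_by_relation.values())

# R1 has exactly one edge fewer than there are nodes, which is what one would
# expect if every entity other than the nation has a single administrative parent.
r1_matches_single_parent_hierarchy = edges_by_relation["R1"] == total_nodes - 1
```

Both totals agree with the stated 32,145 nodes and 96,459 edges, and the R1 count is consistent with the administrative hierarchy forming a tree rooted at the nation node.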
4.2. Case Studies
In this section, we evaluate the practical utility of the developed system -Satellite for hierarchical community-level risk assessment related to COVID-19 through a set of case studies.
Case study 1: real-time risk index of a given area. Given a specific location (either user input or automatic positioning by Google map), the developed system will automatically provide its related risk index (i.e., ranging from [0,1], the larger number indicates higher risk and vice versa) associated with the public perceptions (i.e., awareness) towards COVID-19 in this area (i.e., ranging from [0,1], the larger number denotes more aware and vice versa), demographic density (i.e., the number of people per square mile in its related county), and traffic status (i.e., ranging from [1,5], the larger number means more traffic and vice versa). Figure 7.(a) shows an example: given the location of Euclid Ave, Cleveland, OH 44106, the risk index provided by the system was 0.662 (with public perception of 0.529, demographic density of 1,389, and traffic status of 3) at 3:58pm EDT on March 24, 2020. At the same time, the risk indexes and public perceptions of corresponding county (i.e., Cuyahoga county with risk index of 0.665 and public perception of 0.585) and state (i.e., OH state with risk index of 0.554 and public perception of 0.557) will also be shown in a hierarchical manner to enable people to select appropriate actions for protection while minimizing disruptions to daily life.
Case study 2: comparisons of risk indexes on different dates. In this study, given the same area, we examine how the generated risk indexes change over time. Using the same location above, Figure 7.(b) shows the comparison results on different dates at the time of 3:58pm EDT, from which we have the following observations: 1) in general, its risk indexes increased over days from March 8, 2020 (i.e., 0.131) to March 24, 2020 (i.e., 0.662), as the confirmed cases in its related county (i.e., Cuyahoga county) and its related state (i.e., OH) continued to grow (i.e., from 0 cases in Cuyahoga county on March 8 to 167 cases and 2 deaths on March 24, and from 0 cases in OH on March 8 to 564 cases and 8 deaths on March 24); 2) after the first three cases were confirmed in Cuyahoga county, OH on March 9, there was a sharp rise of the risk index compared with March 8 (from 0.131 to 0.314); 3) the increases of risk indexes relatively slowed down after the public health and executive orders were issued in response to COVID-19. For example, the risk indexes dropped to 0.605, 0.603 and 0.662 on March 15, 16 and 23 respectively, which might be because the government declared a state of emergency on March 14, ordered Ohio bars and restaurants to close on March 15 and issued a stay-at-home order on March 22 (https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home/public-health-orders/public-health-orders).
Case study 3: comparisons of risk indexes at different areas. In this study, given the same time, we examine how the generated risk indexes change over areas. When users input the areas they are interested in (e.g., grocery stores near me) in the search bar, the system will display the nearby grocery stores using the Google maps application programming interface (API) and automatically provide the associated indexes. For example, using the same time as in the first study (i.e., 3:58pm EDT on March 24, 2020), Figure 8 shows the “grocery stores near me” (i.e., near the location of Euclid Ave, Cleveland, OH 44106) and their related indexes. From Figure 8, we can observe that the indexes of nearby areas might vary due to different public perceptions towards COVID-19 and different traffic statuses in specific areas. As shown in the right part of Figure 8, the system also provides related Reddit posts to users.
Case study 4: comparisons of different counties and states. In this study, we compare the indexes of different counties and different states given the same time. Using the time in the first study (i.e., 3:58pm EDT on March 24, 2020), Figure 9.(a) shows an example of comparisons. More specifically, at county level, using the state of OH as an example, we choose the five counties with the largest numbers of confirmed cases on March 24 for comparisons: Cuyahoga (167), Franklin (75), Hamilton (38), Summit (36) and Lorain (30). Figure 9.(b) illustrates the risk indexes associated with multiple factors versus the numbers of confirmed cases in these counties. For the comparisons of different states, we also choose five states: the two most severe states (New York (NY) with 26,376 confirmed cases and 271 deaths, California (CA) with 2,628 confirmed cases and 54 deaths), two medium severe states (OH with 564 confirmed cases and 8 deaths, Virginia (VA) with 304 confirmed cases and 9 deaths) and one least severe state (West Virginia (WV) with 39 confirmed cases and 0 deaths). Figure 9.(c) shows the risk indexes versus the numbers of confirmed cases in these states, from which we can see that there is a positive correlation between the numbers of confirmed cases and the risk indexes.
5. Conclusion and Future Work
To track the emerging dynamics of the COVID-19 pandemic in the U.S., in this work, we propose to collect and model heterogeneous data from a variety of different sources, devise algorithms to use these data to train and update the models to estimate the spread of COVID-19 and predict the risks at community levels, and thus help provide actionable information to users for community mitigation. In sum, leveraging the large-scale and real-time data generated from heterogeneous sources, we have developed the prototype of an AI-driven system (named -Satellite) to help combat the deadly COVID-19 pandemic. The developed system and generated benchmark datasets have been made publicly accessible through our website.
In the future work, we plan to continue our efforts to expand the data collection and enhance the system to help combat the fast evolving COVID-19 pandemic. We will continue to release our generated data and updates of the system to facilitate researchers and practitioners on the research to help combat COVID-19 pandemic, while assisting people to select appropriate actions to protect themselves at increased risk of COVID-19 while minimize disruptions to daily life to the extent possible.
Acknowledgments
Y. Ye, S. Hou, Y. Fan, Y. Qian, Y. Zhang, S. Sun and Q. Peng’s work is partially supported by the NSF under grants IIS-1951504, CNS-1940859, CNS-1946327, CNS-1814825 and OAC-1940855, and by DoJ/NIJ under grant NIJ 2018-75-CX-0032. This work is also partially supported by the Institute for Smart, Secure and Connected Systems (ISSACS) at Case Western Reserve University.