Social Media Analysis for Crisis Informatics in the Cloud

06/06/2020 · Gerard Casas Saez, et al.

Social media analysis of disaster events is a critical task in crisis informatics research. It involves analyzing social media data generated during natural disasters, crisis events, or other mass convergence events. Due to the large data sets generated during these events, large scale software infrastructures need to be designed to analyze the data in a timely manner. Creating such infrastructures brings the need to maintain them, and this becomes more difficult as these infrastructures grow larger and older. Maintenance costs are high since queries must be handled quickly, which requires large amounts of computational resources to be available on demand 24 hours a day, seven days a week. In this thesis, I describe an alternative approach to designing a software infrastructure for analyzing unstructured data on the cloud while providing fast queries and the reliability needed for crisis informatics research. Additionally, I discuss a new approach for more reliable Twitter stream collection using container orchestration systems. Finally, I compare the reliability, scalability, and extensibility of existing crisis informatics software infrastructures with those of my approach and my prototype.


2.1 Project EPIC

Figure 2.1: Diagram of the previous Project EPIC software infrastructure.

Project EPIC (Empowering the Public with Information in Crisis) is a project at the University of Colorado Boulder. It conducts research on crisis informatics, an area of study that examines how members of the public make use of social media during times of mass emergency to make sense of a crisis event and to coordinate and collaborate around that event. Project EPIC has a long history of performing software engineering research. Since 2009, Project EPIC has been investigating the software architectures, tools, and techniques needed to produce reliable and scalable data-intensive software systems, also known as big data software engineering [Anderson:2015b].

When large amounts of data are being collected, software engineers need to focus on structuring the data so that it is easy to analyze. They must work to ensure that the data is easily accessible to analysts. Project EPIC’s big data software engineering research explored these issues in depth during the creation of the previous Project EPIC software infrastructure, which consisted of two major components: EPIC Collect and EPIC Analyze. EPIC Collect [anderson2011design, schram2012mysql] is a 24/7 data collection system that connects to the Twitter Streaming API to collect tweets from various crisis events that need to be monitored in real time. From 2012 to 2018, this software collected tweets with an uptime of 99% and gathered over two billion tweets across hundreds of crisis events. EPIC Collect used Cassandra as its storage layer; this NoSQL database is optimized for writes and provided the high throughput needed to handle all incoming tweets. EPIC Analyze [anderson2015design, barrenechea2015getting] is a web-based system that makes use of a variety of software frameworks (e.g., Hadoop, Solr, and Redis) to provide various analysis and annotation services for analysts working with the large data sets collected by EPIC Collect. In addition, Project EPIC maintained one machine—known as EPIC Analytics—with a large amount of physical memory to allow analysts to run memory-intensive processes over the collected data.

The software architecture of EPIC Collect and EPIC Analyze is shown in Figure 2.1. Note, this is a logical architecture that does not show how these systems are deployed. For instance, Cassandra was deployed on four machines that run separately from the machines that host the EPIC Collect software, Postgres, Redis, and the Ruby-on-Rails code that makes up EPIC Analyze. In all, the existing Project EPIC infrastructure was distributed across seven machines in a single data center maintained at the University of Colorado Boulder.

2.2 REST APIs

REST (Representational State Transfer) [fielding2000architectural] is a well-known architectural style that defines a set of constraints to be used when designing and implementing a web service. It recommends a pattern of applying HTTP verbs to URLs that follow certain conventions, providing a natural way to design web services that offer many of the same benefits as the Web itself. REST APIs provide an abstraction over a set of resources that shields the client from the internal representations of the data. Interaction is done through HTTP requests, making it easy to implement web applications thanks to HTTP's adoption across the internet. Because operations are stateless, REST services help separate the user interfaces of clients from the underlying technologies.

URIs are used to identify resources. That is, URIs make it possible to find an object in the system. Each URI must be unique for a single resource. A good practice is to format each resource URI in an ordered path structure. Often, a request for a particular resource will return hyperlinks to other resources, allowing a service to be built around a well-known URI and letting clients discover other resources through service interactions. Each resource accepts a set of operations that derive from the HTTP operations: GET, HEAD, POST, PUT, PATCH, DELETE, CONNECT, OPTIONS. These can be combined with other parameters to execute a variety of actions on a resource.

RESTful services allow for various CRUD (Create, Read, Update and Delete) operations to be performed easily. Here is a list of some CRUD operations mapped to an example REST service:

  • List all resources available GET /notes/: Note the use of a plural with respect to the resource name as there are multiple notes in the folder. Also, use / at the end, as it is functioning like a container or folder in a traditional desktop file system.

  • Get a representation of a specific resource GET /notes/1: This URI and HTTP verb will retrieve a unique representation for a single resource. We do not use / at the end as we are accessing a specific resource.

  • Create a new resource POST /notes/: This should add a new element to the list of resources available. If no unique ID is provided, a new ID should be assigned by the web service and returned to the client.

  • Remove a specific resource DELETE /notes/1: This should remove the specified resource from its containing resource.

  • Update a specific resource PATCH /notes/1: This should update the existing resource using the (possibly partial) representation that accompanies this request.

These operations can be augmented with parameters. We could add a GET parameter to filter the results returned when we list all available resources. An example would be to list all notes that have not been archived; we could do so via ‘GET /notes/?archive=false’. This also helps with caching future responses if the same filter is used again. These parameters are useful when searching resources or for any other operation that filters the result returned to the user. For creating and updating elements, we use the request body to carry a representation of the object being created or updated.

A common bad practice to avoid is passing arguments in the URI path. An example is embedding a search term in the URI, as in GET /notes/search/hello. This should be avoided since the search term is a parameter to an operation, not a static resource to be retrieved or updated. Instead, GET parameters should be used, as in ‘GET /notes/?search=hello’, as described above.
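To make these conventions concrete, the following is a minimal sketch of a notes resource written with JAX-RS annotations (the style used by frameworks such as Dropwizard, discussed in Section 2.8, and assuming a JAX-RS 2.1 implementation for @PATCH). The Note class, the in-memory store, and the endpoint names are illustrative assumptions, not part of any Project EPIC API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;

// Hypothetical notes resource illustrating the REST conventions above.
@Path("/notes")
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.APPLICATION_JSON)
public class NotesResource {

    public static class Note {
        public long id;
        public String text;
        public boolean archived;
    }

    private final Map<Long, Note> store = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    // GET /notes/?archived=false -- list, optionally filtered by a query parameter
    @GET
    public List<Note> list(@QueryParam("archived") Boolean archived) {
        return store.values().stream()
                .filter(n -> archived == null || n.archived == archived)
                .collect(Collectors.toList());
    }

    // GET /notes/{id} -- retrieve a single resource
    @GET
    @Path("/{id}")
    public Note get(@PathParam("id") long id) {
        Note note = store.get(id);
        if (note == null) {
            throw new NotFoundException();
        }
        return note;
    }

    // POST /notes/ -- create a new resource; the service assigns the id
    @POST
    public Note create(Note note) {
        note.id = nextId.getAndIncrement();
        store.put(note.id, note);
        return note;
    }

    // PATCH /notes/{id} -- update an existing resource with the supplied representation
    @PATCH
    @Path("/{id}")
    public Note update(@PathParam("id") long id, Note update) {
        Note note = get(id);
        if (update.text != null) {
            note.text = update.text;
        }
        note.archived = update.archived;
        return note;
    }

    // DELETE /notes/{id} -- remove a resource
    @DELETE
    @Path("/{id}")
    public void delete(@PathParam("id") long id) {
        store.remove(id);
    }
}
```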

For this project, I decided to use REST APIs to overlay an abstraction on top of my underlying infrastructure. Each API exposes a set of resources that allows the user to interact with the collection and analysis infrastructure. This abstraction layer allows changing the underlying infrastructure without affecting its clients. Also, thanks to the separation of data storage and user interface, I can change user interfaces independently of the underlying data. This separation allows front-end focused developers to work on the user interface without needing to understand how the underlying system works.

2.3 Container Orchestration Technologies

As containerization technologies became more widely adopted—spurred by a recent migration to designing systems via microservices—large companies needed a way to manage their containers in a more convenient way. Interconnecting containers, managing their deployment, and scaling them to meet demand were all possible, but difficult to achieve. As a result, container orchestration systems were born.

To manage containers, these systems add an abstraction layer over containerization technologies, making life cycle tasks related to containers (create container, launch container, pause container, etc.) easier to perform. There are several container orchestration systems available at the moment; the most popular are Kubernetes and Apache Mesos.

In this project, I will make use of Kubernetes. The main reason for this is that—thanks to the open source community—it is easier to find tutorials and courses for Kubernetes. In addition, many large companies back the project and contribute to it; this activity provides evidence that the project will be supported well into the future. Finally, many companies offer managed Kubernetes clusters, which makes it easier to migrate an infrastructure built on top of it if there were ever a need.

Google Cloud seems like the best fit to host Kubernetes as it has a managed cluster option that makes it straightforward to install and configure. In addition, because Google helps maintain Kubernetes, there is strong support for Google Cloud services within Kubernetes.

2.4 Cloud infrastructures

Cloud infrastructures appeared in response to the need for big companies to maintain large sets of machines in their data centers to handle peak usage of their applications. The first company to create a cloud service division—and the one responsible for the popularization of cloud services—was Amazon. Amazon needed large data centers to handle the demand of the U.S. “Black Friday” shopping period. During the rest of the year, however, those servers were mostly idle. Amazon created the Amazon Web Services division to lease the computational power of those idle machines to external users.

Other cloud providers have appeared over the years; Google Cloud, Azure, and DigitalOcean are examples. With time, their offerings have expanded from simply leasing computational power to more complex tools. Services offered by all these companies include managed databases, object storage, machine learning services, and more. Prices and service offerings are usually similar across providers.

With the rise of these services, software engineers have started to change design practices to incorporate the cloud. Such architectures leverage cloud services to build more cost-effective systems. Instead of using in-house managed services, they rely on cloud-managed services for different pieces of their infrastructure. This allows savings on maintenance as the overall structure is simplified.

2.5 Big Data Storage Systems

For my thesis work, I will be collecting data from Twitter via the Streaming API. This API is limited to providing 1% of the total tweets generated on Twitter at any given moment. Based on a report from 2013 [tweetsRecord], I know that rate is approximately 5,700 tweets per second on average. As a result, I can estimate the total number of tweets per second that will flow through my collection infrastructure: approximately 57 tweets per second on average as a minimum bound. Since that corresponds to roughly 4.9 million tweets per day, I need a data storage technology that scales to handle large data sets.

Previous studies in Project EPIC pointed to the use of a NoSQL database. However, such systems can become a huge bottleneck for analysis workloads. Collection storage needs to be separated from analysis storage in order to increase reliability and avoid analysis queries causing a computational bottleneck that can affect the overall performance of the collection pipeline.

In addition, I believe it is important for the future to keep data in the raw format in which it was received. Storing raw data will allow the evaluation of new tools. In addition, storing unstructured data as documents is better than converting the original format into a specific database representation since Twitter can (and does) change its data structures at any time.

2.6 Cloud Object Storage

Most cloud service providers offer object storage services, also known as blob storage services. These services are document stores abstracted as file systems. They provide cheap and reliable file storage. In addition, they provide programmatic ways to add, download, and list files from code. Cost is based on the total storage needed, which avoids having to pay for extra disk space when it is not being used. At the same time, it functions as an abstraction on top of specific disk configurations to simplify management.

Thanks to the file system structure, folders can be used to provide fast filtering by key. This provides guidance on how data should be structured. A clear example of a use case would be to store all tweets collected during a day in a single folder. This would allow us to filter tweets by date quickly. Such keys can be used like a hash table index with prefix lookups.
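As a sketch of this idea, the snippet below lists all objects under a hypothetical event/day prefix using the Google Cloud Storage Java client; the bucket name, event name, and path layout are assumptions for illustration only.

```java
import com.google.api.gax.paging.Page;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class PrefixListingExample {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Hypothetical layout: <event-name>/<yyyy-MM-dd>/<file>. Listing by prefix
        // acts like a hash-table lookup on the key, returning only that day's files.
        Page<Blob> blobs = storage.list(
                "epic-tweets",                                     // assumed bucket name
                Storage.BlobListOption.prefix("winter-storm/2020-02-01/"));

        for (Blob blob : blobs.iterateAll()) {
            System.out.println(blob.getName() + " (" + blob.getSize() + " bytes)");
        }
    }
}
```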

Each stored object has metadata that determines how the file is stored internally as well as how it can be accessed externally. Part of the metadata is used to decide where it should be stored. Files that are accessed rarely can be kept in cold storage at a low price, while files that need to be downloaded frequently are usually stored on fast SSD disks to reduce latency. Other options include replicating files across regions, allowing files to be served from the region closest to a client.

Given that in disaster analysis we tend to explore tweets for events individually, individual data elements can sit for a long time without being accessed. This characteristic makes the data perfect for cold storage, as it keeps costs low while ensuring we do not lose any of the data. In addition, having an interface to access the data whenever we really need it makes this option worthwhile.

2.7 Cloud Document Analysis Tools

Due to the rise of unstructured data, there has been a lot of work within the data analysis world to bring SQL queries to large data sets. Various NoSQL technologies brought parts of SQL to life by providing abstractions on top of their own data access systems. The main issue with those systems is that they require two separate machine clusters, one for collection and one for analysis workloads, to ensure the two functions do not overlap or compete for resources. In addition, storage can use different formats across the two tasks, which makes it difficult to switch the data analysis tools on top of an underlying collection without migrating the whole data set to a new format. This can be a problem if a product used is discontinued or slowly abandoned over time.

To address that issue, different alternatives have been created. Hive, on top of Hadoop, was an attempt at bringing SQL to the MapReduce world. A more recent alternative is Presto, an interactive query system that can operate quickly at petabyte scale. Created inside Facebook, it is similar to Hive while operating primarily in memory. Its main goal is to allow interactive queries (i.e., fast response times) to be performed on top of large unstructured data sets.

Given that analysis workloads are sparse, adapting Presto to a serverless architecture on the cloud makes sense as it can help keep costs low. This allows computing resources to be created when needed to resolve a query and also allows for high parallelism to reduce the time to compute the result of a query. Cloud services similar to Presto include Athena from Amazon Web Services and BigQuery from Google Cloud. They both offer similar products on a pay-per-data-scanned basis and can be pointed at their own object storage services for data access.

Making use of this type of service, I can keep simple structures of files in object storage while allowing for quick yet complex analysis workloads that are run in parallel without having to support a large infrastructure to do so. This solution can bring great value, especially in fields like disaster analysis where data is not being analyzed continuously.

An example query in this system would be to count the number of tweets for each user in a data set and order the result. Using Google BigQuery, I can run this as a simple SQL query on top of a gzip-compressed set of JSON files stored in Google Cloud Storage.

Storage (GB)    Number of tweets    Query time (seconds)
0.03            5,638               2.2
0.472           72,027              4.1
2.5             408,422             5.3
23.1            3,776,311           25.2
204.2           34,356,642          111

Table 2.1: Query execution times for computing the top 50 most active users in an event, compared with the size of the overall data set.
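A minimal sketch of how such a query might be issued from the BigQuery Java client is shown below. The table name (assumed to be an external table backed by the gzipped JSON files in Cloud Storage), the field names, and the result limit are illustrative assumptions.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class TopUsersQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Assumed external table `tweets.winter_storm`; `user.screen_name` follows
        // the layout of the Twitter tweet JSON.
        String sql =
                "SELECT t.user.screen_name AS screen_name, COUNT(*) AS tweet_count "
              + "FROM `tweets.winter_storm` AS t "
              + "GROUP BY screen_name "
              + "ORDER BY tweet_count DESC "
              + "LIMIT 50";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql).build();
        TableResult result = bigquery.query(config);

        for (FieldValueList row : result.iterateAll()) {
            System.out.printf("%s: %d%n",
                    row.get("screen_name").getStringValue(),
                    row.get("tweet_count").getLongValue());
        }
    }
}
```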

2.8 Microservice Architecture

Microservices are an approach to distributed systems that promotes the use of small services with specific responsibilities that collaborate with one another, rather than large components with many responsibilities that make interactions more difficult. Microservices are thus more cohesive units of software with minimal dependencies between them. A system designed with loosely-coupled, highly-cohesive software components has always been a highly desirable goal in software design, and microservices help to achieve that goal within distributed software systems [gradybook].

Compared to monoliths—large software projects with many responsibilities—microservices allow for faster iteration cycles and deployment. This is achieved thanks to the code base for each service being separate. Because of that, developers can work on parts of the bigger system without overlapping with others. This flexibility also reduces overall system complexity [microservices]. Thanks to the popularization of container orchestrated systems, microservice architectures have become easier to manage and deploy. Container orchestrated systems provide an abstraction that matches really well with microservices.

Another advantage of microservice architectures is that they allow different frameworks and languages to be composed into a single system architecture. Even though this can be attractive for accommodating developers, it is also acknowledged that a very diverse code base can make it difficult for a developer to jump between parts of the system. For that reason, it is recommended to maintain a single language and framework when designing a microservice architecture. It is still valuable to have the option to jump to a new language or framework to use features only available within them. An example would be a stateless service that handles a large number of messages; there, it would make sense to rely on frameworks and languages focused on parallel processing and message passing to improve performance. When deciding on the main language and framework for a microservice-based architecture, developer adoption, reliability, and ease of use should be the principal criteria.

Finally, microservice architectures allow different components to be updated and optimized independently over time. This can benefit the whole infrastructure by allowing bottlenecks to be mitigated in individual services. In addition, given that the structure is loosely coupled, services can switch their internal implementation while keeping the same external functionality. An example would be changing the database used by a specific service: the service could be fully replaced on its own without affecting the rest of the system. This simplifies the contracts between teams of developers and limits dependencies to the APIs of other microservices.

Having small microservices perform specific tasks makes development cycles faster and more independent. It also makes incremental deployment much easier and less dangerous. Finally, microservices make it easier to scale software systems since one can individually scale parts of the system depending on their usage.

Most languages have frameworks that allow easy implementation of REST microservices. Frameworks for dynamic languages work well for prototyping and allow for fast development. However, they can become difficult to manage unless the code base is extensively tested: because of dynamic typing, any line of code can fail at execution time, which is risky in the long term. This makes them dangerous for production systems, as you lose the advantage of analyzing the code at compile time. On the other hand, statically typed, compiled languages make for a more reliable code base, as types are checked at compilation time. This can make the code more stable and help detect errors in the code base earlier in the process.

An example of a framework for microservices is Dropwizard. This Java-based framework provides a complete, reliable backbone for microservice development. It includes several libraries to handle database connections and other service connections. Its design is also simple enough to work with that the learning curve is low.
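As a sketch of what a Dropwizard-based microservice looks like, the application class below registers the hypothetical NotesResource from Section 2.2 with Dropwizard's embedded Jersey container (assuming the Dropwizard 2.x package layout); the class and service names are illustrative, not part of the Project EPIC code base.

```java
import io.dropwizard.Application;
import io.dropwizard.Configuration;
import io.dropwizard.setup.Environment;

// Hypothetical Dropwizard service that exposes the NotesResource sketched earlier.
public class NotesApplication extends Application<Configuration> {

    public static void main(String[] args) throws Exception {
        // Typically started with: java -jar notes.jar server config.yml
        new NotesApplication().run(args);
    }

    @Override
    public String getName() {
        return "notes";
    }

    @Override
    public void run(Configuration configuration, Environment environment) {
        // Register the JAX-RS resource with Dropwizard's embedded Jersey container.
        environment.jersey().register(new NotesResource());
    }
}
```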

5.1 Data Collection and Persistence

5.1.1 Previous infrastructure

Figure 5.1: Diagram highlighting the data collection parts in the old infrastructure.

Data collection in the previous infrastructure is mostly handled by the monolithic EPIC Collect. The system is designed to be composable within itself by using the Spring framework to perform dependency injection between different services. To collect tweets, it connects to Twitter’s Streaming API directly. It uses a separate thread to monitor the keywords needed and restarts the collection thread when the keywords are updated. Those keywords are updated through the EPIC Event Editor, a web application that allows users to create and stop events and their associated keywords. Updates via this web application change fields in the EPIC Collect database; these changes are then detected by the EPIC Collect thread, which restarts its collection to make use of the new keywords.

To process the tweets and separate them into events, the system keeps an in-memory queue that separates the ingestion part of the infrastructure from the classification component. The classification mechanism checks the events being tracked at the moment and assigns each tweet to the events it matches. Once finished, this process sends tweets to the persistence layer.

Data was stored first in MySQL and then in Cassandra. Project EPIC researchers were forced to switch to Cassandra after realizing that using MySQL created a bottleneck when storing tweets. In addition, to process data faster, data needs to be distributed across multiple machines to increase parallelization and allow real-time computation. Data is stored using the event id as the key, which makes retrieving tweets by event faster. The key also includes the day, so that time-slicing queries are fast. The design of the key also helps to balance data evenly across the Cassandra cluster [anderson2015design].

5.1.2 New infrastructure

Figure 5.2: Diagram highlighting the data collection parts in the new infrastructure.

In order to make collection faster, I designed the collection components to be stateless. That is, these components are designed to perform operations based only on the messages they receive. This makes the system more reliable as it allows the underlying container orchestration system to replace crashed services and scale up if needed.

Data collection from the API needs to be really reliable. It also needs to be able to ingest multiple events at the same time. One of the main concerns with this part of the infrastructure was how to avoid losing messages at peak times while avoiding having a huge infrastructure running all the time. For this, I designed the ingestion pipeline to use Kafka between stages so that it can buffer messages during peak times and process them once the rate of messages decreases. Kafka also acts as another reliability mechanism that can help crashed services recover unprocessed messages once they are back up. This acts similarly to the in-memory queue from the old infrastructure but is more reliable. I separated the pipeline into two services: ingestion and filtering.

For ingestion, there is the Twitter stream connector. Its function is to connect to the Twitter Streaming API using a defined set of keywords and send incoming messages to Kafka. The set of keywords is pulled from a ConfigMap in Kubernetes. The service periodically checks a local file with the keywords and restarts the collection if they change. Thanks to the microservice approach, I was able to use the Go language for this microservice instead of Java, because I needed increased reliability and good message processing capabilities; Go provides a good network interface and allows for quick message processing. Note that this service does not process tweets in any way; it only retrieves them and sends them down the pipeline to Kafka.

The next step in the pipeline is the filter service. This service is fully stateless; it relies only on the messages in Kafka. In addition, each service instance is event specific: the system runs one instance of the filter service for each event active at any given time. If no active event is being tracked, there are no filter instances running, which means that computing resources are only used when needed, leaving more resources available for analysis.

For reliability, when an instance goes down, it reconnects to the latest offset committed to Kafka. I use consumer groups from Kafka to distinguish between different filter services reading at the same time from the same topic. Once a tweet arrives at the filter service, the service decides whether or not it belongs to the current event. It does so by doing a simple string search on the tweet JSON. If any of the keywords from the event are in the JSON, that tweet is classified as pertaining to the event and buffered.
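A minimal sketch of such a filter loop using the Kafka Java client is shown below; the topic name, consumer group naming scheme, broker address, and keyword list are illustrative assumptions, and the compression and upload logic described in the next paragraphs is elided.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FilterServiceSketch {
    public static void main(String[] args) {
        String eventName = "winter-storm";                       // assumed event name
        List<String> keywords = List.of("snow", "blizzard");     // assumed keywords

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // One consumer group per event, so each filter reads the full stream
        // independently and resumes from its own committed offset after a crash.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "filter-" + eventName);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        List<String> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("tweets"));   // assumed topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String tweetJson = record.value();
                    // Simple string search on the raw JSON, as described above.
                    boolean matches = keywords.stream().anyMatch(tweetJson::contains);
                    if (matches) {
                        buffer.add(tweetJson);
                    }
                }
                // In the real service the buffer is compressed and uploaded to
                // object storage hourly or once it reaches 1000 tweets (described below).
            }
        }
    }
}
```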

To store tweets, I rely on a cloud object storage service. This service serves as an online file system, abstracting away data distribution and replication. This allows me to work under the cloud provider's reliability contract and forget about implementing reliability measures like data replication. In addition, thanks to the availability of long-term storage options, I can keep data storage costs low. I have configured the data to transition to cold storage after a year, allowing the infrastructure to retrieve the data quickly while it is being collected and reducing costs after a year has passed, since the need for that data is likely greatly reduced by that point.

The filter service is in charge of uploading tweets to the object storage service. Every hour, the service compresses all messages in its buffer and uploads them to the object storage service as a single file. To handle peak usage times while ensuring good performance for analysis, the service will also upload messages if the buffer fills to 1000 tweets.

To provide fast querying of tweets given an event and to support time slicing queries, I use the file system abstraction to store metadata in the filename and path. On one hand, I store each file in its corresponding event folder. This makes listing all files with tweets for a specific event fast, as it only requires listing the files under a folder. On the other hand, I store the date and the number of tweets in a file in its filename. Thanks to this design, I can later parse the filename and determine whether a file contains tweets I need. File names thus work similarly to an index. I keep the number of files low by setting the buffer size limit high. Thanks to this index, I can also create a visualization of the ingestion rate per hour without needing to access the files.
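Below is a minimal sketch of the upload step under these conventions, using the Google Cloud Storage Java client; the bucket name and the filename format (event/date/timestamp_count.json.gz) are illustrative assumptions rather than the exact layout used by the prototype.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.LocalDate;
import java.util.List;
import java.util.zip.GZIPOutputStream;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class BufferUploader {

    // Compresses the buffered tweets and writes them as a single object whose
    // path encodes the event, the day, and the tweet count (the "index").
    public static void upload(String eventName, List<String> bufferedTweets) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            for (String tweetJson : bufferedTweets) {
                gzip.write(tweetJson.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');                       // one JSON document per line
            }
        }

        // e.g. winter-storm/2020-02-01/1580580000_987.json.gz
        String objectName = String.format("%s/%s/%d_%d.json.gz",
                eventName, LocalDate.now(), Instant.now().getEpochSecond(), bufferedTweets.size());

        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of("epic-tweets", objectName))
                .setContentType("application/json")
                .setContentEncoding("gzip")
                .build();
        storage.create(blobInfo, bytes.toByteArray());
    }
}
```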

To organize all of this, I developed the Events API. This service performs CRUD operations on events. It uses a PostgreSQL database to store event data and status. It also keeps track of when an event collection is started or stopped, and which user is responsible for the action. This service is also in charge of updating the pipeline described above when needed: it interacts with Kubernetes to instantiate new filter services when an event is created or a collection is restarted, and it manages the ConfigMap that maintains the set of keywords to be collected at any given point in time. The Events API abstracts all event-related actions and exposes a REST interface used by the infrastructure's user interface.

5.2 Data Analysis

5.2.1 Previous infrastructure

Figure 5.3: Diagram highlighting the data analysis parts in the old infrastructure.

Previous work on data analysis at Project EPIC is presented across several papers. At first, most analysis work started as SQL queries when data was still stored in MySQL. With the switch to NoSQL technologies for storage, the technical difficulty of accessing collected data increased. This, in turn, made it complicated to filter data sets and explore them. To address this situation, software engineers had to interact with analysts to narrow down data sets and expose them in a format that would be easy to import into third-party tools. This was a highly manual process.

To improve that process, EPIC Analyze was created. This monolithic system used a model of data sets associated with events to load and visualize the data. Because Cassandra is not good at supporting random access and the main cluster needed to be available for peak ingestion times, analysis needed to happen on a separate cluster. EPIC Analyze had its own Cassandra cluster where data sets to be analyzed were stored, and the process of loading the data to be analyzed was performed manually. The main functionality available in EPIC Analyze was browsing and pagination, filtering, and the job framework.

For browsing and pagination, EPIC Analyze allowed exploring the data set at the tweet granularity level. It paginated the data using Redis indexes created when data was imported to the system. Each entry in the index pointed at the column and key where a tweet was in Cassandra. Thanks to Redis, pagination was fast. This allowed analysts to interact with each other by finding important tweets and sharing pages where interesting tweets were located.

Filtering was aimed at reducing the size of the data to be explored. EPIC Analyze allowed analysts to specify a search query based on various fields from a tweet. This took advantage of DataStax Enterprise's integration of Apache Solr with Cassandra. Solr built indexes when data was loaded into the analysis cluster, allowing for sub-second queries once the index was finished.

Finally, for other analysis techniques, EPIC Analyze had a job framework available to extend its capabilities. This framework used Resque to queue and perform jobs. Some of those jobs used the Apache Pig query language to query the data set for deeper insights. Output results were stored in Cassandra in the form of JSON; these results were accessed by EPIC Analyze and shown in the UI either in raw form or, when implemented, as visualizations.

A limitation of this design is that data needed to be transferred to a separate cluster for analysis; this made the job of analyzing a data set asynchronous to the event itself. This can be dangerous, as analysts are blind to what is happening in an event until the event is loaded into EPIC Analyze. In addition, even though DataStax does a great job of integrating Hadoop with Cassandra, performance is hugely influenced by what is happening in the Cassandra cluster. This means that two MapReduce jobs running on separate clusters will hit a bottleneck when they try to read from Cassandra at the same time, as the cluster will need to perform both reads simultaneously. This is similar to an issue I encountered with the infrastructure proposed in my undergraduate thesis: there, Spark had issues loading data into memory while Cassandra was collecting data, since Cassandra was running out of memory.

5.2.2 New infrastructure

Figure 5.4: Diagram highlighting the data analysis parts in the new infrastructure.

As previously discussed, there is a need to explore and analyze collected disaster data sets in a way that lets analysts understand what people were discussing during the event. On one hand, my infrastructure needs to allow users to scroll through an entire data set with ease, abstracting the underlying storage layer. On the other hand, I want to provide a way to query the data set with a high-level language like SQL for interactive exploration. This allows analysts to test theories and understand the data set at a deeper level than what browsing provides. In addition, my system needs to be able to do some analysis in batch mode after finishing a collection to help load an event's data more quickly into the user interface.

For data exploration, I built the Tweets API. This microservice abstracts the underlying data storage layer. It uses the filenames with tweet counts to create a pagination index, which is used to calculate which file is needed to retrieve particular tweets given a page number and the number of tweets per page. The index can be created from a listing of all the files under an event's directory. Because the Google Cloud Storage listing API becomes slow when the number of files to list grows large, I decided to cache the index in the memory of the Tweets API microservice for 10 minutes. This allows analysts to quickly explore data sets, loading any page in seconds. Thanks to this API, I abstract away the storage layer from the user interface. In addition, as the event name is passed as a variable, this API is completely independent of the Events API. This follows the software engineering rule of reducing dependencies as much as possible. It also allows performance patches to be deployed on the exploration part of the infrastructure independently from the collection-related components.
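The sketch below illustrates one way such a pagination index could work, using cumulative offsets over the per-file tweet counts parsed from filenames; the class and method names are assumptions for illustration, not the actual Tweets API code.

```java
import java.util.List;

public class PaginationIndex {

    // One entry per object in the event folder, in chronological order.
    public static class FileEntry {
        final String objectName;   // e.g. "winter-storm/2020-02-01/1580580000_987.json.gz"
        final int tweetCount;      // parsed from the filename
        int firstTweetOffset;      // filled in by the constructor

        FileEntry(String objectName, int tweetCount) {
            this.objectName = objectName;
            this.tweetCount = tweetCount;
        }
    }

    private final List<FileEntry> files;

    public PaginationIndex(List<FileEntry> filesInOrder) {
        // Cumulative offsets let us map a global tweet position to a file.
        int offset = 0;
        for (FileEntry f : filesInOrder) {
            f.firstTweetOffset = offset;
            offset += f.tweetCount;
        }
        this.files = filesInOrder;
    }

    // Returns the file that contains the first tweet of the requested page.
    public FileEntry fileForPage(int pageNumber, int tweetsPerPage) {
        int target = pageNumber * tweetsPerPage;
        for (FileEntry f : files) {
            if (target < f.firstTweetOffset + f.tweetCount) {
                return f;
            }
        }
        throw new IndexOutOfBoundsException("page " + pageNumber + " is past the end of the data set");
    }
}
```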

Similar to EPIC Analyze, I also added the ability to annotate tweets. Analysts can add arbitrary tags to tweets in a data set. These tags are color encoded using a hash function based on the text to enable fast visual exploration. These annotations are stored in PostgreSQL using the Events API.

Another important factor is time slicing during data exploration. Using the date in the filename, I can filter which part of the pagination index to use; this can be done in linear time. After the filtering has happened, I can again calculate which file to retrieve.

For filtering, I rely on the Google Cloud BigQuery service, which allows me to query the data set interactively in SQL. I use the Filter API to abstract away the details of the interaction with BigQuery. This API returns data in a format similar to the Tweets API; the difference is that it returns only the data needed to present an overview of a tweet. I did not implement all the filtering options that existed in EPIC Analyze, but the same functionality can be achieved over time. At the moment, I implemented only text search using the LIKE operator in BigQuery. I use temporary tables to paginate through the results. To access details of a tweet, I provide the filename and tweet id to the Tweets API, which can then retrieve the file and return the full tweet JSON to the user interface.

Thanks to BigQuery, I can also customize and export data into CSV format on demand. Using SQL, I can narrow down the data set and choose the fields I want to expose. Once that is done, I can export the results into a CSV file stored in Google Drive. This is especially useful for Project EPIC's interactions with external collaborators.

For additional analysis jobs, I leverage Google Cloud Dataproc and Spark. I created a workflow template on Dataproc which gets triggered every time an event collection is stopped by the Events API. This workflow can be modified independently of the rest of the infrastructure by adding new Spark jobs. Jobs receive the event name as a parameter. Spark jobs are written in Java and stored in Google Cloud Storage as jar files. This jar is loaded on a cluster created when the workflow is triggered. Creating clusters on demand avoids having unnecessary servers sitting around when there are no analysis jobs running.

The first Spark job added is the mentions extraction job. I believe that extracting the most mentioned users from a data set can help identify the most important users in an event. This information can later be used to extract their timelines and contextualize the collection further.
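A condensed sketch of what such a Spark job might look like in Java is shown below; the input and output paths and the assumption that mentions appear under entities.user_mentions.screen_name in the tweet JSON are illustrative, not the actual job's code.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MentionsExtractionJob {
    public static void main(String[] args) {
        String eventName = args[0];                                  // passed by the workflow

        SparkSession spark = SparkSession.builder()
                .appName("mentions-" + eventName)
                .getOrCreate();

        // Read the gzipped, newline-delimited tweet JSON for the event.
        Dataset<Row> tweets = spark.read()
                .json("gs://epic-tweets/" + eventName + "/*/*.json.gz");

        // Flatten the mentioned screen names and count how often each appears.
        Dataset<Row> mentions = tweets
                .select(explode(col("entities.user_mentions.screen_name")).as("mention"))
                .groupBy("mention")
                .count()
                .orderBy(col("count").desc());

        // Write the result back to object storage for the Mentions API to serve.
        mentions.write().json("gs://epic-tweets/" + eventName + "/analysis/mentions");

        spark.stop();
    }
}
```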

Developers must also create APIs to expose the data that Spark jobs extract. In this case, there is a Mentions API that knows how to access the resulting output from Spark and serve it in a paginated fashion.

Note that adding a new batch job does not require any interaction with the rest of the services. This allows new developers to work on extensions without any need to change existing services. The only place where they must interact with other code is the user interface; I will discuss that component in more detail below. In the mentions extraction case, the extension was developed alongside the rest of the infrastructure by a different developer working on his own, demonstrating that this is a good approach for parallel development. The only part of my system's code base he needed to change was the user interface. This provides a powerful way to extend the capabilities of the system in the future.

5.3 User Interface

5.3.1 Previous infrastructure

Figure 5.5: Diagram highlighting the user interface parts in the old infrastructure.

There have been several iterations of the user interface for EPIC. The latest was developed as part of EPIC Analyze. This monolithic app is built on Ruby on Rails. Coupled with Redis and Cassandra, it provided an interface for interactive data set exploration in the form of a web app. Because it was designed with Ruby on Rails views, the code was highly coupled to the underlying infrastructure. This meant that to add new capabilities to the user interface, developers needed to expand the existing application, creating a highly coupled system that increased in complexity over time.

Focusing on the web application, there is a list and detail view to explore tweets in a data set. Tweets can be explored page by page in the browser, and each tweet can be expanded to see all of its related metadata. At the top of the page there is a timeline that shows the volume of tweets over time for the current data set. The analyst can select a period of time by dragging on the timeline, allowing the data set to be time sliced. EPIC Analyze also provides a querying interface to perform complex queries with composed conditions. This functionality allows analysts to filter down data sets by different elements from the tweet metadata. These queries also update the count timeline to reflect the tweet volume of the new filtered data set.

Another interface element provided was the EPIC Event Editor, a simple user interface to associate keywords with events. It was plugged into EPIC Collect to determine which tweets to retrieve for each event. From this interface you could start and stop collections at any given time. This system was completely independent from EPIC Analyze, as the latter needed data to be loaded into it manually for analysis, as discussed above. This requirement was due to the design of the previous infrastructure.

The last element of the old user interface was Splunk, an application that allows queries to be written in a domain-specific language on top of Cassandra, enabling real-time queries. The main problem was that you needed to use the Splunk pipeline language to perform queries. Splunk was also completely independent from the EPIC Analyze and EPIC Event Editor interfaces.

On the user management side, EPIC Analyze exposed a set of abstractions to manage permissions and access to data sets. User affiliations restricted access to certain data sets, and user accounts needed to be created in the Postgres database to support this. Splunk had its own authorization mechanisms, as did the EPIC Event Editor, all independent from each other.

5.3.2 New infrastructure

Figure 5.6: Diagram highlighting the user interface parts in the new infrastructure.

For the new infrastructure, I looked for a front-end framework that would be able to modularize each user interface element independently. The main reason behind this decision was to allow for various developers to work on interfaces of their own APIs independently. There are many frameworks available to develop front end applications. My goal was to find one which is suitable for building real-time dashboards.

According to a study, the most popular front end frameworks are React JS, Vue JS, and Angular JS. Upon more research, in 2019, 78.1% of front end developers used React, 0.8% used Vue, and 21% used Angular. The clear winner in terms of the framework I wanted to use was React JS. React also breaks its pages down into components, each of which can be developed in parallel. When a React component receives new data, only the affected part of the page re-renders, which is exactly what I needed for my infrastructure's user interface. Each API in the dashboard can be treated as a smaller independent application. This separation required a framework for application state management, that is, a tool responsible for segregating functions and data. The state management framework that currently works most seamlessly with React is Redux.

This approach to the user interface design allows teams to work in parallel on different elements of the user interface and compose them all into a single application. With this approach, an API developer can decide on their own data visualizations and integrate them into the rest of the user interface separately. In addition, the interface is more robust as it is fully independent of the back-end code and only interacts with the JSON data representations exposed by the various APIs.

Figure 5.7: Most mentioned users in a dataset, ordered by number of mentions descending.

An example of such a component is the Mentions integration (see Figure 5.7). It integrates the mentions from the events in a new tab from the events page. It was implemented separately and only the event name is sent to it. This allowed for development of the user interface in parallel with other aspects of the infrastructure.

Figure 5.8: Tweet list user interface in the new infrastructure.

To explore events, an interface was created similar to the one described for EPIC Analyze [barrenechea2015getting]. There is a list of tweets with a timeline visualization above it for time slicing the data set (see Figure 5.8). Each tweet can be expanded to explore all the metadata associated with it, similar to the functionality in EPIC Analyze. In addition, thanks to its composability, the user interface can also integrate external APIs easily. An example of this is the incorporation of National Weather Service alerts: using their public REST API that provides all current alert information, I created a visualization of the active alerts. This way, the user interface can integrate with an external API without introducing new dependencies in the back-end.

To allow more technical analysts to perform deeper analysis, the system also points to the internal BigQuery table in Google Cloud Console. This allows analysts to explore the data set in a more interactive way using Google BigQuery directly.

Finally, in order to simplify interaction with each internal API, an ingress gateway was created. This gateway aggregates all APIs under the same IP, allowing the frontend to work as if there were a single API; in reality, each API is independent of the others. The gateway works by establishing rules for forwarding traffic to the corresponding API. The ingress is created by Kubernetes and associated with a static external IP to avoid IP changes between cluster restarts. I also added an SSL certificate to the gateway using the Kubernetes managed certificate service, which helps secure the application.

User interface diagram
Figure 5.9: EPIC Dashboard interface diagram within React

See Figure 5.9 for the class diagram depicting the Redux states and actions and their associated React components. I have segregated components into two types:

  • Containers: Containers are React components that are connected to Redux, i.e., they use the “connect” function from the react-redux package.

  • Components: These are often referred to as dumb components or simply components. They render the data based on the properties that are passed from their containers.

Class diagrams are not exactly designed for React-Redux classes. We have modified the classical class diagram by adopting the following conventions:

  • The class can be a container or a component. This is depicted in the first line within the notation.

  • Just below the same notation, the class name is specified.

  • The first block contains the variables from mapStateToProps and mapDispatchToProps. We distinguish between actions and state variables in the following way: action names are appended with parentheses, while state variables are retained as is.

Authorization/Authentication

Since most of the back-end systems of my infrastructure are Google products (Google Cloud, Google Cloud Storage, etc.), it only made sense to continue relying on Google for authentication. I started to look into Google Firebase for the security and authentication of the system. It provides React UI components out of the box, thereby reducing development time. For sign up and sign in, I decided to use Google as the authentication mechanism. Given that the University of Colorado Boulder provides a personal Google account to all students and affiliates, I know that any Project EPIC analyst will have access to a Google account.

Once a login is performed, Firebase issues an OAuth 2.0 credential in the form of a JWT. This token can be modified from the back-end to include arbitrary key-value pairs. In this case, I use these claims to decide whether a user can access the application: a simple key is stored on demand to give a user access. This is synchronized using Google Cloud Identity Platform and Firebase.

To manage this part, I designed the Auth API. This service queries the Google Cloud Identity Platform to list any users that have tried to sign in with a Google account. In addition, the service checks whether a user has the specific key-value pair in their authentication permissions; this is returned in the form of JSON. Another endpoint allows any signed-in, authorized user to revoke or enable access for any other user. This is a simple approach to user control and would not scale to a large group of users; however, Project EPIC does not have a large number of internal users.

For authorization, the front-end sends the JWT with each request to an API. The token is checked by the API using a Java library specific to Project EPIC. If the token is invalid or does not contain the expected key-value pair, the service does not allow the operation to proceed and returns an unauthorized error. This authorization process is injected into the application using the Dropwizard authorization library. In addition to simplifying the development of the APIs, the authorization library allows quick disabling through a simple boolean. This feature allows testing APIs without a valid token and then requiring the token when they are deployed in production.
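A minimal sketch of how such a check might look with the Firebase Admin SDK and Dropwizard's auth support is shown below; the custom claim name ("epicAccess") and the authenticator class are assumptions for illustration and do not reflect the actual Project EPIC library.

```java
import java.security.Principal;
import java.util.Optional;
import com.google.firebase.auth.FirebaseAuth;
import com.google.firebase.auth.FirebaseToken;
import io.dropwizard.auth.AuthenticationException;
import io.dropwizard.auth.Authenticator;

// Verifies the Firebase-issued JWT sent by the front-end and checks the
// custom claim that marks a user as allowed to use the application.
public class FirebaseAuthenticator implements Authenticator<String, Principal> {

    @Override
    public Optional<Principal> authenticate(String token) throws AuthenticationException {
        try {
            FirebaseToken decoded = FirebaseAuth.getInstance().verifyIdToken(token);
            // "epicAccess" is an assumed custom claim set from the back-end on demand.
            Object allowed = decoded.getClaims().get("epicAccess");
            if (Boolean.TRUE.equals(allowed)) {
                String email = decoded.getEmail();
                return Optional.of(() -> email);      // Principal exposing the user's email
            }
            return Optional.empty();                  // signed in, but not authorized
        } catch (Exception e) {
            return Optional.empty();                  // invalid or expired token
        }
    }
}
```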

6.1 Data Collection and Persistence

The main difference in data collection is the separation of concerns into separate services within the new infrastructure. The collection and filtering steps are split into two different services. To update the keywords being collected, the new infrastructure uses a similar approach to the previous one; this time, though, the collection service checks a local file that is updated by Kubernetes. The file is updated every time the collection needs to be restarted by the Events API.

On the persistence front, I switched from Cassandra to files in an object storage system. This is better for the long term as there is no need for a database to be running to analyze the data. The primary reason for this change is the existence of scalable analysis services for unstructured objects provided by cloud providers. The new system can still do data analysis while making use of a simple persistence mechanism like file storage, greatly simplifying the entire system and making it much more accessible to new Project EPIC developers in the future.

6.1.1 Extensibility

Thanks to the separation of stages in the collection pipeline, I can work on expanding the infrastructure's filter capabilities separately from the service that connects to the Twitter Streaming API. In addition, because I use Kafka to pass messages around, I can add additional real-time processing services that feed on the incoming stream of tweets. This can be done independently of the collection pipeline, which makes the new infrastructure easier to extend than the previous one.

Thanks to using a simple format for data storage, I can also add new analysis techniques on top of the collection service without having to worry about overwhelming a database. This also makes the process of data analysis simpler, as it no longer requires manually copying data from the collection cluster to the analysis cluster: the data is always in the same place. In the future, new analysis software can be used to process all of the data. Since the system stores the data in raw JSON format, it will be straightforward to connect new software with the existing data.

6.1.2 Reliability

Adding external message-passing software (Kafka) to use as a buffer increases reliability. If the filter service were to fail, Kafka will have a copy of all of the messages in its queue, so the filter service just needs to poll the buffer from Kafka and process those messages again. This ensures that the infrastructure does not lose messages due to filtering errors.

With respect to data storage, Google Cloud Storage's contract ensures our data will be durable and available 99.9% of the time. Thanks to this service guarantee, I avoid having to manually set up any process for data redundancy or replication. With the previous infrastructure, even with data duplication, data durability depended on the maintenance of our campus cluster, and since Project EPIC needed people to maintain that cluster, the project was in danger of losing the students who knew how to perform that maintenance. As students graduated, this risk became reality and the cluster fell into a state where it was not being properly maintained. Due to this lack of maintenance, it became uncertain whether all of Project EPIC's data would remain safely accessible. To mitigate this risk, I downloaded all of that data in early 2019 and stored it in Google Cloud Storage as a fail-safe. Thanks to this, all Project EPIC data sets are now stored in a similar pattern in Google Cloud Storage, and data storage reliability is no longer a concern.

6.1.3 Scalability

On the data collection side, the infrastructure can now scale different pieces of the pipeline horizontally and individually. Kubernetes uses its horizontal autoscaling service to automatically scale filter services when they have maxed out their CPU resources. To manage workload partitioning between instances of a filter service, the system relies on Kafka topic partitions, which internally use a round-robin mechanism to assign each message to a partition. This enables fast partitioning of data between nodes when scaling. In this fashion, the throughput of the ingestion pipeline can be increased automatically as needed.

In the previous infrastructure, the only way to scale the ingestion pipeline was by adding more CPUs to the server that was executing it. This type of scaling, known as vertical scaling, even if effective, can prove more expensive than horizontal scaling. CPU-intensive machines tend to be more specialized and require a more specific setup, and therefore tend to be more expensive to buy and maintain. In addition, such changes are expensive, and so scaling in this fashion never occurred with the previous infrastructure.

Finally, to store the data, the system no longer has a potential bottleneck in the database, as all of its data is stored as files. This allows for a significant increase in the ability to store messages in parallel, as multiple writes can occur to different files in the cloud storage system.

6.2 Data Analysis

With respect to data analysis, the main difference is the use of BigQuery instead of the combination of Solr, Pig, and Hive. BigQuery accepts SQL queries over the data sets without the need for a traditional database. In addition, maintenance costs can be reduced thanks to not having to maintain the cluster that was running Solr, Pig, and Hive.

Google BigQuery can run more powerful filter queries than the ones we had in Cassandra. It can also replicate most of the capabilities available when using Solr in the old infrastructure, as my prototype can express EPIC Analyze queries as SQL in the new infrastructure.

For the pagination and exploration of data sets, I was able to replicate similar results with the ones in EPIC Analyze by using filenames as the index to support pagination. Thanks to the index, I can also do quick time slicing of the data set similar to what is available in EPIC Analyze.

Finally, using Google Cloud Dataproc I have been able to replicate and indeed significantly outperform the capabilities of the simple job framework that was built for EPIC Analyze.

6.2.1 Extensibility

As mentioned before, thanks to having the collected data stored in a raw format, new analysis techniques can be added to the system with relative ease. In addition, I do not need to worry about causing a bottleneck on the database that could affect the collection pipeline.

Thanks to Google Cloud Dataproc workflows, any developer can add new analysis jobs that provide new insights after collecting an event. New Spark jobs that export insights to Google Cloud Storage can be added. Paired with a new service, any developer can expose the results externally and further improve the user interface. This process has been proven to work as a separate effort; an example is the Mentions API discussed previously.

6.2.2 Reliability

BigQuery is reliable based on the service contract that Google Cloud provides. By relying on this external service, I once again reduced maintenance costs, as I do not need to maintain a cluster for Hive or Pig as was done in the old infrastructure. This increases system reliability as I am outsourcing this part to a third party.

In the same way, having a single source for the data helps to avoid the problems of moving data between the collection system and the analysis system, as was done in the old infrastructure. This increases reliability in the long term as it reduces the number of steps that need to be performed before an analyst can analyze a data set.

6.2.3 Scalability

Because BigQuery is a serverless Google Cloud component, it scales each query to match the size of the data; this is done internally by Google. For example, I can get the top 50 users who tweeted in a data set with 32 million tweets in under 2 minutes. However, compared to Solr, I cannot perform filter queries as fast: the system gives up the performance of a prebuilt index to gain performance on interactive queries overall. I could have used internal BigQuery tables for increased performance, but given the number of queries launched over the life of a data set, it did not yet seem cost effective to do so.

Google Cloud Dataproc can also be configured to add more Spark nodes on demand, which allows the post-processing batch workflow to run faster if needed. For the moment, I have left it at two nodes, since most data sets can be analyzed quickly under this configuration.

Finally, thanks to the load balancer placed in front of the external services, they could be scaled up using Kubernetes. This would allow the system to serve more data and adapt to peak usage of the user interface. This has not been implemented, as the current configuration should be more than enough for Project EPIC's average usage.

6.3 User interface

The user interface is the component that has changed the most in the new infrastructure. I added a new abstraction layer, the service layer, which separates the user interface from the data representation of the internal storage layers.

In addition, I switched to a more modern approach to web application design by using a single-page application framework. This separates views and controllers completely, avoiding bad practices found in past web applications. The user interface runs entirely in the browser, while the service layer is in charge of abstracting the storage behind a REST API.
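A minimal sketch of what such a service-layer endpoint could look like is shown below, assuming Flask and the filename-paginated storage layout sketched earlier; the framework, route, and bucket name are illustrative assumptions, and the actual services in the repository may be implemented differently.

import json

from flask import Flask, jsonify, request
from google.cloud import storage

app = Flask(__name__)
BUCKET = "epic-collect"  # hypothetical bucket name


@app.route("/events/<event_name>/tweets")
def list_tweets(event_name):
    """Return one batch file's worth of tweets, paginated by filename."""
    client = storage.Client()
    page_token = request.args.get("page_token")
    iterator = client.list_blobs(BUCKET, prefix=event_name + "/",
                                 max_results=1, page_token=page_token)
    page = next(iterator.pages, None)
    blobs = list(page) if page is not None else []
    tweets = []
    if blobs:
        tweets = [json.loads(line)
                  for line in blobs[0].download_as_text().splitlines() if line]
    return jsonify({"tweets": tweets,
                    "next_page_token": iterator.next_page_token})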

6.3.1 Extensibility

Thanks to React and how the dashboard app is structured, the user interface is easy to extend. New services can be integrated into the existing app by creating a new component that serves as their user interface. Each component is in charge of managing its own API access, avoiding coupling between components.

To avoid retrieving data over and over, Redux is used to share state between components. This means that new components that use data already held in the store can work without having to worry about how to retrieve it from an API.

Finally, for each component, the styling lives in CSS files kept separate from the main JavaScript file. This separation allows people focused on designing interaction and visualization to work independently of the main logic, which can help in creating new data visualizations in the future.

Compared to the previous user interfaces, the new one gathers under one roof all of the steps needed to perform data collection and analysis. Previously, an analyst had to use separate applications to collect data and to analyze it; now everything can be done in a single app. This makes the user interface easier to extend and normalizes the data across the analysis and collection sides of the system.

6.3.2 Reliability

The user interface is deployed separately from the services, which makes it more reliable. It is served from Google's Firebase CDN, so it continues to work even if the internal services are having issues. This separation increases the reliability of the front end; previously, it was tied to whatever issues might occur in the Ruby on Rails application.

6.3.3 Scalability

Views are served statically from a content delivery network, as mentioned above. This means the user interface is delivered with low latency across the planet and loads quickly in the browser. Compared to the previous infrastructure, where the user interface was served by the same server that accessed the data internally, this is a significant improvement.

8.1 Data Collection and Storage for Crisis Informatics

At Project EPIC, there have been several papers describing how to collect tweets at scale. These relate to a system known as EPIC Collect, described in the earliest chapters of this thesis. In [anderson2011design], an early infrastructure based on MySQL for storage is proposed; it is built with Spring and scales vertically. Later, [schram2012mysql] describes a switch from MySQL to a NoSQL store (Cassandra) after a bottleneck was discovered in the storage layer. However, the only way to scale this system remains vertical scaling, as EPIC Collect needs a powerful machine to run its threads. A similar approach is presented in [kumar2011tweettracker], without going in depth into how the system was actually implemented.

The microservice approach in Project EPIC was first proposed in my undergraduate thesis [casas2017big]. The collection pipeline there is quite similar, the main change being the switch from Cassandra to object storage. A similar approach is adopted by [khaleq2018cloud], the main difference being that their system only allows a single event to be collected at a time.

On the filtering side, most work collects tweets using keywords. [kejriwal2019pipeline] and [khaleq2018cloud] propose different ways to automatically classify tweets as relevant to an event using machine learning. Other proposed systems suggest using user timelines to build a more contextual data set [anderson2019incorporating]; for this most recent paper, I participated by developing microservices for the collection pipeline.

Other work collects data after an event using services like GNIP [ashktorab2014tweedr]. It is important to note that, due to recent changes in the Twitter API usage policy, such services have become more expensive or simply unavailable. For this reason, live collection is important to keep costs down, even if analysis is only performed after the event.

8.2 Data Analysis for Crisis Informatics

Project EPIC has done significant research on supporting analysts in their work with large data sets. MongoDB was proposed in [anderson2013architectural], even though its limitations were acknowledged and queries were slow to resolve. EPIC Analyze [anderson2015design, barrenechea2015getting] addressed some of these issues by switching to an integration of Cassandra and Solr.

In [oussalah2013software], an approach for geospatial analysis is proposed, but it still limits the analyst's ability to explore the data set. In contrast, the infrastructure proposed in this thesis supports more generic querying while avoiding the kind of complex, hard-to-maintain system proposed in [oussalah2013software]. This approach was heavily influenced by the design of EPIC Analyze [barrenechea2015getting].

Another approach is to run Spark on top of tweets stored in Cassandra [casas2017big]. However, this limits the available fields to those defined at ingestion and causes the system to fail if the schema ever changes. In addition, Cassandra has been found to act as a bottleneck when used for ingestion and analysis at the same time.

Finally, [madhavilatha2016streaming] suggests interactive, programmatic approaches to streaming collections and their analysis. This work inspired me to open up BigQuery for analysts to use directly, as programmatic and flexible interfaces are sometimes required for more specialized analysts to be productive. In this case, I do not try to reinvent the wheel and simply allow analysts to view and work with the data using Google Cloud and BigQuery.

References

Appendix A Code

All the code and deployment instructions are available online in open-source repositories on GitHub. The code is separated into two repositories: one for the infrastructure and backend services, and one for the user interface. All code is under version control with Git.

a.1 Infrastructure and service code

Each service is organized into its own folder. Inside each folder are the code, the service documentation, and instructions for uploading the Docker image to Docker Hub.

This repository also includes the Dataproc workflow definition, where new analysis tasks can be added to execute once an event has stopped collecting. The workflow definition must be overwritten and committed to keep the file in the repository in sync with the deployed one. There is also a separate folder for each Spark job; each contains a pom.xml file that describes how to compile and package the job code into an executable JAR file.

Then there’s the kubernetes folder. In this folder, there’s all the YAML definitions needed to deploy the system into a Google Cloud Kubernetes cluster.

a.2 User interface code

All React code is contained in this repository. It is structured following the sidebar navigation: each navigation link in the sidebar has a corresponding folder under components. The repository follows a conventional structure for a React application.
