When I work on data science projects, it helps to imagine what the final product will look like. At the end of the rainbow, what’s my pot of gold? In most projects, the final product is defined for us; your boss wants a report, your software engineer friend wants a script, your advisor wants a paper, and so on. But we’ll forget these constraints here, and instead describe an idealized final product for a data science project, called a Continuously-Updated Data-Analysis System (CUDAS).
The CUDAS concept isn’t new, and you’ve probably seen it before. For example, an incredibly popular CUDAS was the 2016 election forecast by FiveThirtyEight. What makes this a CUDAS? I’ll explain next.
2 The CUDAS Concept
Broadly speaking, a CUDAS has three components: a data pipeline, data-analysis, and continuously-updated results.
The data pipeline processes the raw data so it’s ready for analysis, the data-analysis converts the processed data into the results we’re interested in, and we continuously update our results as new data comes in. These components don’t need to happen in a sequence. For example, we may need to update our data pipeline after realizing that our data-analysis is missing a covariate.
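To make the three components concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `Cudas` class, the `parse_polls` and `average_share` functions, and the "candidate,share" raw format are all hypothetical and do not come from any of the projects discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, float]  # (candidate, share); a hypothetical processed record


@dataclass
class Cudas:
    """Toy CUDAS: a data pipeline, a data-analysis, and continuously-updated results."""
    pipeline: Callable[[List[str]], List[Row]]        # raw data -> analysis-ready data
    analysis: Callable[[List[Row]], Dict[str, float]]  # processed data -> results
    raw: List[str] = field(default_factory=list)
    results: Dict[str, float] = field(default_factory=dict)

    def update(self, new_raw: List[str]) -> Dict[str, float]:
        """Re-run the pipeline and the analysis whenever new raw data arrives."""
        self.raw.extend(new_raw)
        self.results = self.analysis(self.pipeline(self.raw))
        return self.results


def parse_polls(raw: List[str]) -> List[Row]:
    """Data pipeline: parse 'candidate,share' lines and skip malformed ones."""
    rows: List[Row] = []
    for line in raw:
        parts = line.split(",")
        if len(parts) != 2:
            continue
        try:
            rows.append((parts[0].strip(), float(parts[1])))
        except ValueError:
            continue
    return rows


def average_share(rows: List[Row]) -> Dict[str, float]:
    """Data-analysis: average the share reported for each candidate."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for name, share in rows:
        totals[name] = totals.get(name, 0.0) + share
        counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


system = Cudas(pipeline=parse_polls, analysis=average_share)
system.update(["A,48", "B,52"])
latest = system.update(["A,50", "bad line"])  # results refresh as new data comes in
```

The point of the sketch is the shape rather than the details: the pipeline and the analysis are separate pieces, and `update` is the only entry point, so any change to either piece flows through to the results the next time data arrives.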
These three components closely follow definitions of data science described elsewhere (e.g. Silver (2014), Donoho (2017), and Wickham and Grolemund (2016)). For example, I adapted the definition given in Chapter 1 of Wickham and Grolemund (2016) by grouping the import and tidy boxes into a single component, called the data-pipeline. I’ve also grouped transform, visualize, and model into a single component, called data-analysis (Tukey (1962)). Finally, I changed communicate to continuously-updated results, because a CUDAS updates when new data becomes available.
This CUDAS definition also closely follows the definition of Greater Data Science given by Donoho (2017). In fact, I think about a CUDAS as an implementation of this framework, since each of Donoho’s six divisions, even science about data science, can be viewed in terms of their impact on a CUDAS. More on this later.
2.1 Three example CUDAS projects
I got the idea for a CUDAS by studying successful data science projects, and trying to abstract what they had in common. I’ll walk through my three favorite examples next.
2.1.1 FiveThirtyEight’s 2016 Election Forecast
The 2016 election forecast from FiveThirtyEight was (perhaps) the most popular CUDAS of all time. Their system collects polling data, uses this data to forecast the probability of each candidate winning, and continuously updates the forecast on a beautiful interactive web page. Figure 1 shows two key screenshots of the project.
2.1.2 The Global Burden of Disease
Another great CUDAS comes from the Global Burden of Disease (GBD) study, produced by the Institute for Health Metrics and Evaluation (Lopez and Murray (1998)). The GBD is an extremely ambitious project, with a goal of collecting and synthesizing all the world’s health data, and providing continuously-updated estimates of disease burden. The GBD is a scientific triumph, and the book Epic Measures by Smith (2015) chronicles the story from its beginnings.
Let’s think about the GBD in terms of a CUDAS. First, the GBD employs a team whose goal is collecting all the health data they can get their hands on, from surveys, to scientific literature, to vital registration systems, and more. Next, the GBD has a team of disease experts, statisticians, computer scientists, epidemiologists, etc. to model the burden of each individual disease. Then, the individual disease estimates are combined into a single metric, called the Disability-Adjusted Life Year (DALY, Murray and Acharya (1997)). Finally, the GBD provides spectacular interactive visualizations of their results, which they update annually, an example of which is shown in Figure 2.
2.1.3 WAR on Ice
Our final CUDAS example is the war-on-ice project, available online at:
Although the project is no longer active (the two creators are now employed by professional hockey teams), at its apex, war-on-ice provided advanced hockey statistics that updated after every night of games. Notably, their statistics and visualizations were fairly advanced (e.g. Thomas et al. (2013)), compared with statistics available on other websites.
An interesting twist on the war-on-ice CUDAS is that the authors made a critical piece of their data pipeline, the nhlscrapr R package, available. In later years, one of the authors has been behind a similar nflscrapR package for American football (Horowitz (2017)), which shows the early signs of a generalizable idea.
The 2016 election forecast, the GBD, and war-on-ice come from completely different contexts (politics, global health, and hockey). But when viewed from a CUDAS lens, the projects are similar. The next section provides more detail on the similarities between these three systems.
2.2 Elements of a CUDAS
What do the 2016 Election forecast, the GBD, and war-on-ice have in common? For starters, each project has a data pipeline, data-analysis, and continuously updated results. But each project also understood the dependencies between these components: the data-analysis is nothing without the data pipeline, and the data-analysis isn’t as valuable without the continuously-updated results. Let’s go into more detail on what made these three CUDAS systems stand out.
Multiple data sources are synthesized in a purposeful way. In each of our three examples, the data was available online, but the data wasn’t formatted for data-analysis. For example, the 2016 election forecast collects polls from many different sources, the GBD combines different data types from many different diseases, and war-on-ice collects play-by-play data, images, box score statistics, and more.
But these projects didn’t just collect the data, they also knew what to do with it. The data was rigorously extracted and transformed into the precise format required for the data-analysis. Nate Silver provides a detailed user guide to the 2016 election forecast (FiveThirtyEight (2016b)), in which he describes the critical steps of adjusting the polls, and combining the polls with other data sources, such as economic data. So, it’s not enough for a data pipeline to collect the data, a good pipeline must also know how the raw data needs to be processed in order to produce the results the CUDAS is ultimately interested in.
The results are interesting and interpretable. The data-analysis performed in our three examples isn’t extremely complicated, but it’s not trivial either. Each of our examples uses some sort of statistical model: FiveThirtyEight (2016a) uses a Bayesian approach to forecast who will win the election, the GBD uses many models (e.g. DisMod (Flaxman (2019))), and war-on-ice implements the adjusted plus-minus methodology described in Thomas et al. (2013).
In my view, these models are successful because they’re interesting and interpretable. By interesting, I mean that each model was able to gain a large following of users who wanted to know how the results changed as new data came in. By interpretable, I mean that the output of the model was easy to understand: FiveThirtyEight (2016a) gives each candidate a probability of winning, the GBD summarizes disease burden into a single metric (the DALY), and war-on-ice ranks players based on their contribution to winning. In each case, it doesn’t take a rocket scientist to understand the results.
The results are continuously updated in a highly intuitive display. Finally, and I think most important for their success, each example continuously updates its results. And they don’t just update their results, they display them in highly intuitive web applications, which gives users a simple way to stay up to date. If there’s one thing we’ve learned in the information age, it’s that people like checking their devices for updates (think: Facebook notifications).
So now we know what a CUDAS is, and we’ve analyzed three examples. Let’s use these insights to create some CUDAS systems of our own.
3 Building CUDAS systems
I’ve luckily been involved in developing several CUDAS systems. Full disclosure: I worked on the GBD project (Section 2.1.2) for a year. Since then, I’ve had a larger role in developing two other CUDAS projects: one for infectious disease modeling, and the other for ranking soccer players. In this section, I’ll go through the details of building them both.
3.1 Synthetic Populations and Ecosystems of the World
In The Signal and the Noise (Silver (2012)), Nate Silver overviews the state of disease forecasting. After discussing the limitations of compartment models, Silver discusses the potential of a new approach, called agent-based models. For disease modeling, agent-based models simulate the daily interactions of people (e.g. when they talk, when they’re in the same room), and track how a disease spreads based on these interactions. (Silver interviews a group of agent-based modelers from the University of Pittsburgh, who work on an infectious disease model called FRED: A Framework for Reconstructing Epidemic Dynamics (Grefenstette et al. (2013)). In the CUDAS described in this section, we essentially worked to create synthetic ecosystems for the FRED model.)
For agent-based models to work, they need a dataset with a record for each person in the population. The dataset should also include where each person lives, where they go to school, and other information relevant to disease modeling. Agent-based modelers refer to these datasets as synthetic ecosystems.
Synthetic ecosystems are tricky to build: you need data from different sources, you need to integrate these data sources together, you need to make sure the synthetic ecosystem represents the population, and you need a large computer. And because this is a tricky problem, there’s a demand in the agent-based modeling community for high quality synthetic ecosystems. That’s where we came in. As part of the MIDAS research network, a group of us were tasked with generating synthetic ecosystems. Our goal: build a CUDAS for synthetic ecosystems.
3.1.1 Data Pipeline
To create a synthetic ecosystem, we need to know:
How many people to create. For example, to create a synthetic ecosystem for Pittsburgh, we need to know how many people live in Pittsburgh.
Geography. Continuing the Pittsburgh example, we need to know how the neighborhoods are organized, where the roads are, where the schools are, and so on.
The characteristics of the people (age, gender, occupation, etc.).
All of this data is available online, but different pieces are available in different locations. So the first part of our data pipeline consisted of scripts to collect data and store them on our computing cluster, hosted by the Pittsburgh Supercomputing Center (Center (2016)). Next, we laboriously ensured that each data source shared a common geography. This is difficult because each data source partitions countries into smaller regions, and each source does so differently. For example, the left side of Figure 3 shows how the website GeoHive (Geohive (2016)) splits Italy into 20 regions, and the right side of Figure 3 shows how IPUMS (Center (2014)) splits Italy into 20 regions. While these two data sources are close, it still took a lot of work to make sure that both datasets had the same geographies. And Italy was easy compared with the rest of the countries.
Thus, a substantial element of our data pipeline involved matching the geographies of different data sources. We did this manually, but in the next section, I’ll discuss a more general solution to this problem.
3.1.2 Data-Analysis
Once we (finally) had the data ready, we needed a method to turn this data into synthetic ecosystems. For this, we developed the SPEW framework for synthetic ecosystems, which is described in Gallagher et al. (2018). The framework samples people, assigns these people to households, schools and workplaces, then assigns locations to the households, schools, and workplaces. We used intuitive algorithms for each of these tasks. For example, we sampled the characteristics of people using microdata, where microdata is simply a representative sample of the population. And we assigned people to schools based on the location of their household, the location of their school, and the size of each school.
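The two steps mentioned above (sampling people from microdata, then assigning them to schools) can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the algorithm implemented in SPEW: the function names, the record fields, and in particular the weighting rule (school size divided by one plus distance) are hypothetical stand-ins. Household and workplace assignment are left out.

```python
import math
import random
from typing import Dict, List, Tuple


def sample_people(microdata: List[Dict], n: int, rng: random.Random) -> List[Dict]:
    """Draw n synthetic people by sampling microdata records with replacement."""
    return [dict(rng.choice(microdata)) for _ in range(n)]


def assign_school(home: Tuple[float, float], schools: List[Dict],
                  rng: random.Random) -> str:
    """Pick a school with probability proportional to size / (1 + distance).

    The weighting rule is an assumption made for this sketch: larger and
    closer schools are more likely to be chosen.
    """
    weights = [
        school["size"] / (1.0 + math.dist(home, school["location"]))
        for school in schools
    ]
    return rng.choices(schools, weights=weights, k=1)[0]["name"]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical microdata: a small sample of person-level records.
microdata = [
    {"age": 8, "sex": "F"},
    {"age": 35, "sex": "M"},
    {"age": 12, "sex": "M"},
]
# Hypothetical schools with coordinates and enrollment sizes.
schools = [
    {"name": "North", "location": (0.0, 1.0), "size": 300},
    {"name": "South", "location": (0.0, -1.0), "size": 600},
]

people = sample_people(microdata, 5, rng)
for person in people:
    # Place each household uniformly at random in a toy region.
    person["home"] = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    if person["age"] < 18:
        person["school"] = assign_school(person["home"], schools, rng)
```

Copying each sampled record with `dict(...)` matters here: people are drawn with replacement, so two synthetic people may come from the same microdata record but still need their own household location and school.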
To implement the SPEW framework, we created the spew R package. We used this package to generate all of our synthetic ecosystems, and it was designed to work on the Pittsburgh Supercomputing Cluster, where our data was stored. The package is available online at:
3.1.3 Continuously-Updated Results
After generating the synthetic ecosystems, we needed to make them available to agent-based modelers. First things first, we created a website where users could simply download the complete ecosystems, which is available online at:
But these results weren’t very intuitive, so Shannon, a fellow PhD student on this project, wrote a general markdown script that produced summary reports for each synthetic ecosystem (see Figure 4 for an example). Not only did these reports help agent-based modelers understand their synthetic ecosystems, but they also helped us debug our software, and ensured that our synthetic ecosystems passed the intraocular (“hits you in the eyes”) test.
In terms of continuously-updated results, the idea was that whenever the Census released a new data sample, or GeoHive released new population counts, or when a new data source became available, a user would be able to pass the new data source through the SPEW framework, and obtain a synthetic ecosystem that accounted for the new data.
Although we released several versions of synthetic ecosystems, the newest of which used more recent data, we were never able to reliably and efficiently produce continuously-updated synthetic ecosystems in this idealized manner. But the dream lives on. As a silver lining, we developed a diagram that describes the SPEW framework, shown in Figure 5. And as the figure shows, this process cleanly decomposes into a CUDAS.
3.2 A CUDAS for Soccer Ratings
The second CUDAS we’ll walk through is for a recently developed soccer metric, called Augmented Adjusted Plus-Minus (AAPM, Matano et al. (2018)). The details are in the paper, but the basic idea is that AAPM combines two data sources: FIFA ratings and play-by-play data, and uses these data sources to rate each player. In Matano et al. (2018), it’s shown that AAPM predicts game outcomes better than other statistics.
Why is AAPM a good statistic for a CUDAS? Earlier, I noted that the data-analysis results for a CUDAS should be interesting and interpretable. In principle, the AAPM statistic should be interesting to soccer fans, especially since it’s tied to predictive accuracy. The AAPM statistic is also interpretable, since it ranks each player, and can be easily displayed in a table.
3.2.1 Data Pipeline
We need two data sources to compute AAPM: play-by-play information, and FIFA ratings from the beginning of each season. With these two sources, we need to produce a design matrix, which is the input required for our statistical model that computes AAPM. In the design matrix, each column represents a single player, and Figure 6 shows what the design matrix looks like. We also need to link each player (column in the design matrix) with a FIFA rating.
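As a rough illustration, here is how a design matrix of this kind could be assembled in Python. The encoding is an assumption on my part: I use one common convention from adjusted plus-minus models, where each row is a segment of play with a fixed lineup, and an entry is +1 for a home player on the field, -1 for an away player, and 0 otherwise. The exact layout in Figure 6 may differ. The player labels, the ratings, and the function name `build_design_matrix` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def build_design_matrix(segments: List[Dict[str, List[str]]]) -> Tuple[List[str], List[List[int]]]:
    """Return (players, X), where X has one row per segment and one column per player."""
    players = sorted({p for seg in segments for p in seg["home"] + seg["away"]})
    column = {player: j for j, player in enumerate(players)}
    X: List[List[int]] = []
    for seg in segments:
        row = [0] * len(players)
        for player in seg["home"]:
            row[column[player]] = 1   # home player on the field
        for player in seg["away"]:
            row[column[player]] = -1  # away player on the field
        X.append(row)
    return players, X


# Hypothetical play-by-play segments (made-up player labels).
segments = [
    {"home": ["h1", "h2"], "away": ["a1", "a2"]},
    {"home": ["h1", "h3"], "away": ["a1", "a2"]},  # h3 substituted in for h2
]
players, X = build_design_matrix(segments)

# Link each column to a FIFA rating; players missing from the ratings
# source are flagged so they can be matched by other means.
fifa_ratings: Dict[str, int] = {"a1": 80, "a2": 85, "h1": 78, "h2": 90}
prior_means: List[Optional[int]] = [fifa_ratings.get(p) for p in players]
unmatched = [p for p in players if p not in fifa_ratings]
```

In this toy example the linking step is a plain dictionary lookup, which only works when both sources spell every name identically.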
There are several complications to building the pipeline, such as:
Finding websites with the data, then writing scripts to extract it.
Matching the names of soccer players from multiple data sources.
Automating the process so that it works across seasons, leagues, etc.
You may have noticed that these challenges are the same we faced when we built our CUDAS for synthetic ecosystems. In each case, we needed to collect data from multiple sources, and match names across each source. For synthetic ecosystems, we matched geographic names, and for soccer ratings, we matched player names.
We overcame these challenges more effectively for the soccer CUDAS. Here, we developed two R packages: the first extracts the play-by-play and FIFA data, and the second matches player names using active record linkage. The data collection package is similar to the nhlscrapr package developed by war-on-ice, and we actually used a similar package, called fcscrapR, for some parts of the collection.
The name matching package, called arl, is a bit more interesting. The arl package automatically matches the names that are identical in both data sources, partially matches the names that were close, using probabilistic record linkage methods, and manually matches all the remaining names. In short: we automated as much as we could, and manually matched the rest. Given the differences between some data sources, sometimes this is the best you can do.
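The three tiers described above (exact match, close match, manual review) can be sketched in Python as follows. To be clear, this is not the arl package or its interface. I also swap in a simpler technique: a string-similarity score from the standard library's `difflib` stands in for the probabilistic record linkage step. The function names, the 0.85 threshold, and the example names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalize(name: str) -> str:
    """Lowercase a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def link_names(source: List[str], target: List[str],
               threshold: float = 0.85) -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    """Split source names into exact matches, close matches, and manual cases."""
    exact: Dict[str, str] = {}
    fuzzy: Dict[str, str] = {}
    manual: List[str] = []
    target_by_key = {normalize(t): t for t in target}

    for name in source:
        key = normalize(name)
        # Tier 1: identical after normalization.
        if key in target_by_key:
            exact[name] = target_by_key[key]
            continue
        # Tier 2: best-scoring candidate, accepted only above the threshold.
        best, best_score = None, 0.0
        for target_key, original in target_by_key.items():
            score = SequenceMatcher(None, key, target_key).ratio()
            if score > best_score:
                best, best_score = original, score
        if best is not None and best_score >= threshold:
            fuzzy[name] = best
        else:
            # Tier 3: leave for a person to resolve.
            manual.append(name)
    return exact, fuzzy, manual


# Hypothetical names from two sources.
source_names = ["Jon Smith", "Maria  Lopez", "Q. Zed"]
target_names = ["John Smith", "maria lopez", "Alex Brown"]
exact, fuzzy, manual = link_names(source_names, target_names)
```

The threshold controls the trade-off: raising it sends more names to the manual pile, while lowering it risks linking two different players with similar names.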
Similar to our CUDAS for synthetic ecosystems, the majority of the work was building the data pipeline. For seasoned data scientists, this is an obvious point.
3.2.2 Data-Analysis
Once our data pipeline produced the design matrix with linked FIFA ratings, we were ready to compute AAPM. The AAPM statistic is calculated with a Bayesian regression model, where FIFA ratings are the prior distribution for each player. We developed another R package to fit our model:
which relies on standard Bayesian software (Carpenter et al. (2017)). After various model checks and tweaks, we verified that:
Our results passed the intraocular test (the best/worst players made sense).
Our model predicted game outcomes better than baseline and comparison statistics.
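To see how a prior built from FIFA ratings can enter a regression like the one described above, consider a simplified Gaussian version. This is a sketch under my own assumptions, not the model specification from Matano et al. (2018): the symbols below, the known variances, and the conjugate normal form are introduced here only for illustration. Let y be the vector of outcomes for each segment of play, X the design matrix, and f_j the (possibly rescaled) FIFA rating of player j.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, & \varepsilon &\sim \mathcal{N}(0, \sigma^2 I), \\
\beta_j &\sim \mathcal{N}(f_j, \tau^2), & j &= 1, \dots, p, \\
\mathbb{E}[\beta \mid y] &= \left(X^\top X + \lambda I\right)^{-1}\left(X^\top y + \lambda f\right), & \lambda &= \sigma^2 / \tau^2 .
\end{aligned}
```

When λ is large (a tight prior), each estimate stays close to the player's FIFA rating. When λ is small, the play-by-play data dominate. In practice, a model like this would typically be fit with software such as Stan (Carpenter et al. (2017)) rather than in closed form.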
With our results in hand, the final step was producing the continuously-updated results.
3.2.3 Continuously Updated Results
To produce continuously updated results, we need to answer two questions:
What’s the best way to display our results?
How can we continuously-update them?
To display the results, we followed another successful CUDAS: ESPN’s Real Plus-Minus statistic (RPM, Ilardi and Engelmann (2019)). ESPN displays the RPM statistic in a simple table, which users can sort by offense, defense, or position. We created a similar table, which is available online at:
Finally, we need to make sure our results continuously update. Just as the 2016 election forecast updates after each poll, we want our AAPM statistic to update after each soccer match. I’m not an expert here, but here are three ways you can continuously update results:
Manually run a script every time you want new results.
Set up a cron job (Wikipedia contributors (2019)) to run every night.
Since our AAPM CUDAS is in an early stage, we simply run our scripts manually. But moving forward, we plan on switching to an automated workflow.
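For the cron option, a nightly update can be as simple as one script that runs the pipeline, runs the analysis, and writes timestamped results to a file that a web page reads. The Python sketch below shows that shape. Everything in it is hypothetical: the `run_update` function, the output file name, the toy data, and the script path in the crontab comment are placeholders, not part of our AAPM code.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List


def run_update(fetch: Callable[[], List[float]],
               analyze: Callable[[List[float]], Dict[str, float]],
               out_path: Path) -> Dict:
    """Run pipeline then analysis, and write timestamped results as JSON."""
    raw = fetch()           # data pipeline: collect the latest data
    results = analyze(raw)  # data-analysis: turn the data into results
    payload = {
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    out_path.write_text(json.dumps(payload, indent=2))
    return payload


# Hypothetical crontab entry that would run a script like this at 02:00 every night:
#   0 2 * * * /usr/bin/python3 /path/to/update_cudas.py

# Example run with toy stand-ins for the pipeline and the analysis.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "results.json"
    payload = run_update(
        fetch=lambda: [1.0, 2.0, 3.0],
        analyze=lambda raw: {"mean": sum(raw) / len(raw)},
        out_path=out_file,
    )
    written = json.loads(out_file.read_text())
```

Keeping the manual and automated workflows identical, apart from who triggers the script, makes the switch from one to the other a scheduling change rather than a rewrite.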
And that’s how our AAPM CUDAS works. To reiterate, we chose the AAPM statistic because it’s interesting and interpretable. Then, we built a data pipeline that retrieves data from the web, links together multiple data sources, and produces a design matrix linked to FIFA ratings. We computed AAPM for each player with a Bayesian model, and this provides a ranking of each player in our dataset. Finally, we displayed our results as sortable tables online, and showed how our results can be continuously updated.
I’ve described a CUDAS as my idealized final product for a data science project. A CUDAS includes a data pipeline, data-analysis, and continuously-updated results, and works for any context. I walked through three examples of successful CUDAS projects, then I described what I thought made them successful. I then described the creation of two CUDAS systems I’ve been involved with: one for synthetic ecosystems, and one for soccer ratings.
Now that I’ve explained what a CUDAS is, discussed several examples, and described how they can be built, I want to make the case for thinking about data science projects in terms of a CUDAS.
The key feature of a CUDAS is that it applies to any context. Let’s say you’re a statistician who is passionate about the economy. You notice that the GDP statistic is flawed, and you think of a clever way to improve it. What better way to communicate your metric than building a CUDAS?
Now take a more intricate example. In 2018, I gave a presentation where I proposed a CUDAS for my fantasy basketball team (Richardson (2018)). In fantasy basketball, you need to decide who to play, but I couldn’t find any high quality forecasts tailored to the specifics of my team. So I started making the forecasts myself, but every time I needed an update, I had to manually copy the data from my league into a spreadsheet, manually run the forecast, and only then could I make an informed decision on who to play. This took way too much time, which makes it a perfect opportunity for a CUDAS. In this case, the data pipeline would download my league’s data, the data-analysis would prepare the forecasts, and I could display the results in an easy-to-read web page.
While this took a lot of work, data science tools are quickly maturing, and higher quality tools should enable higher quality CUDAS systems. For example, the R package shiny has enabled users to create interactive web applications, while requiring zero knowledge of how the web works. And as I mentioned earlier, data pipeline tools have allowed data engineers to streamline and stress-test their data pipelines.
Further improvements should also come from increased collaboration between data scientists, engineers, computer scientists, web developers, database developers, psychologists, and more. For instance, most of the people I worked with on building our CUDAS systems were statisticians (data scientists?). Our backgrounds were great for developing the statistical models, but our skills were stretched when building the data pipeline, and developing the web applications to display our results. And as I went along, it became clear how valuable data modeling, data visualization, and web development expertise were to building high quality CUDAS systems. I came to see the projects as less about statistical modeling, and more about building an information system (Figure 8). In this way, the CUDAS concept provides a unifying framework for data-centered professionals with different skills to rally around.
As a final point, the rise of the Internet has profoundly changed the way we consume information. This has led to echo chambers, which Wikipedia describes as:
“a metaphorical description of a situation in which beliefs are amplified or reinforced by communication and repetition inside a closed system.”
In echo chambers, people lose access to a common set of facts, which makes communication difficult. Can a CUDAS help?
One hypothesis is that high quality CUDAS systems could constrain the public discourse around a common set of facts. If we all agreed that we want unemployment to be low, and GDP to be high, then we could build a CUDAS to track how well we’re doing.
Would this work? As a thought experiment, consider the effect of FiveThirtyEight’s CUDAS on Donald Trump’s approval rating (FiveThirtyEight (2018)). This shows that Trump’s approval has ranged between 36.4% and 47.8% over the course of his presidency. Now think, when is the last time you heard someone claim that Trump’s approval rating is either at 20% or 80%? And how much did this happen in the past? Viewed this way, a CUDAS is analogous to a scoreboard, since it provides political junkies of all stripes a way to monitor a common set of facts. In spirit, CUDAS systems could complement the ideas of Tetlock and Gardner (2016), who advocate for score keeping of forecasts in the public square.
One of my favorite parts of Donoho’s 50 Years of Data Science is the quote given by Cleveland (2001):
…[results in] data science should be judged by the extent to which they enable the analyst to learn from data.
It’s a great quote. But what if we replaced the analyst with a CUDAS:
…[results in] data science should be judged by the extent to which they improve a CUDAS.
I think this quote works just as well. To me, it’s hard to think of any data science research that wouldn’t, indirectly or directly, demonstrate its utility in a CUDAS.
Thanks to Francesca Matano and Taylor Pospisil for collaborating to build a CUDAS for augmented adjusted plus-minus. Thanks to Shannon Gallagher, Sam Ventura, Bill Eddy, Jeremy Espino, Shawn Brown, Jay Depasse, and everyone at the Pittsburgh Supercomputing Center who helped in creating SPEW synthetic ecosystems. Thanks to the sports reading and research group, in particular Sam Ventura and Ron Yurko, for their encouragement to initially present the concept.
- Apache (2019) Apache (2019). Apache Airflow. https://airflow.apache.org/.
- Bostock et al. (2011) Bostock, M., Ogievetsky, V., and Heer, J. (2011). D3 data-driven documents. IEEE transactions on visualization and computer graphics, 17(12):2301–2309.
- Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of statistical software, 76(1).
- Center (2014) Center, M. P. (2014). Integrated public use microdata series, international: Version 6.3. [Machine-readable database].
- Center (2016) Center, P. S. C. (2016). Olympus computing cluster.
- Cleveland (2001) Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International statistical review, 69(1):21–26.
- Donoho (2017) Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766.
- FiveThirtyEight (2016a) FiveThirtyEight (2016a). 2016 Election Forecast. https://projects.fivethirtyeight.com/2016-election-forecast/.
- FiveThirtyEight (2016b) FiveThirtyEight (2016b). A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast. https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/.
- FiveThirtyEight (2018) FiveThirtyEight (2018). How Popular is Donald Trump? https://projects.fivethirtyeight.com/trump-approval-ratings/.
- Flaxman (2019) Flaxman, A. (2019). An Integrative Metaregression Framework for Descriptive Epidemiology. https://github.com/ihmeuw/dismod_mr.
- Gallagher et al. (2018) Gallagher, S., Richardson, L. F., Ventura, S. L., and Eddy, W. F. (2018). Spew: Synthetic populations and ecosystems of the world. Journal of Computational and Graphical Statistics, 27(4):773–784.
- Geohive (2016) Geohive (2016). http://www.geohive.com/.
- Grefenstette et al. (2013) Grefenstette, J. J., Brown, S. T., Rosenfeld, R., DePasse, J., Stone, N. T., Cooley, P. C., Wheaton, W. D., Fyshe, A., Galloway, D. D., Sriram, A., et al. (2013). Fred (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations. BMC public health, 13(1):940.
- Horowitz (2017) Horowitz, M. (2017). nflscrapr: R package for scraping nfl data off their json api.
- Ilardi and Engelmann (2019) Ilardi, S. and Engelmann, J. (2019). NBA Real Plus-Minus. http://www.espn.com/nba/statistics/rpm.
- Lopez and Murray (1998) Lopez, A. D. and Murray, C. C. (1998). The global burden of disease, 1990–2020. Nature medicine, 4(11):1241.
- Matano et al. (2018) Matano, F., Richardson, L. F., Pospisil, T., Eubanks, C., and Qin, J. (2018). Augmenting adjusted plus-minus in soccer with fifa ratings. arXiv preprint arXiv:1810.08032.
- Murray and Acharya (1997) Murray, C. J. and Acharya, A. K. (1997). Understanding dalys. Journal of health economics, 16(6):703–730.
- Richardson (2018) Richardson, L. F. (2018). A continuously updated data-analysis system for fantasy basketball. Presentation in the CMU sports and statistics reading group. Given June 6, 2018.
- Silver (2012) Silver, N. (2012). The signal and the noise: why so many predictions fail–but some don’t. Penguin.
- Silver (2014) Silver, N. (2014). What the fox knows. FiveThirtyEight. http://fivethirtyeight.com/features/what-the-fox-knows.
- Simsion and Witt (2004) Simsion, G. and Witt, G. (2004). Data modeling essentials. Elsevier.
- Smith (2015) Smith, J. N. (2015). Epic measures: one doctor, seven billion patients. Harper Wave.
- Spotify (2019) Spotify (2019). Luigi. https://github.com/spotify/luigi.
- Tetlock and Gardner (2016) Tetlock, P. E. and Gardner, D. (2016). Superforecasting: The art and science of prediction. Random House.
- Thomas et al. (2013) Thomas, A., Ventura, S. L., Jensen, S. T., and Ma, S. (2013). Competing process hazard function models for player ratings in ice hockey. The Annals of Applied Statistics, pages 1497–1524.
- Tukey (1962) Tukey, J. W. (1962). The future of data analysis. The annals of mathematical statistics, 33(1):1–67.
- Wickham (2016) Wickham, H. (2016). Package ‘rvest’. URL: https://cran.r-project.org/web/packages/rvest/rvest.pdf.
- Wickham and Grolemund (2016) Wickham, H. and Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc.
- Wikipedia contributors (2019) Wikipedia contributors (2019). Cron — Wikipedia, the free encyclopedia. [Online; accessed 12-June-2019].