Analysis & updates | (The Perils of) Automatically Collecting COVID Data

The COVID Tracking Project began collecting data in March 2020, with a few volunteers entering three metrics for 50 states into a spreadsheet manually. It grew into a large organization collecting data across 35 metrics for 56 states and territories up until March 2021. For much of the past year, this data was the most complete COVID-19 dataset across all US states.

Because of the heterogeneous and ever-changing nature of states’ data reporting, manual data collection remained at the core of our process throughout the year. However, our data collection process did evolve to include significant automation, in two ways. One was to double-check human entry of numeric data: While our primary daily collection from each state relied on people, we used automation to verify the manually entered values where possible. The second automation win was when there was a lot of data that was changing frequently. This was the case for time series data when whole time series were continuously updated by the states. These updated time series were automatically fetched daily and used to augment daily collection checks, and for special research projects.

In this post, we’ll dig into the details of how the automation worked and the issues that we encountered, both from a technical and procedural perspective. We’ll first cover the actual collection of the data, and then the verification and application of metadata, how we kept the data and metadata fresh, and how best to use the data. Post-processing—such as anomaly detection—is outside of the scope of this post, but it is enabled by the creation of a unified view of data that is clearly and correctly tagged.

The data collection pipeline (or really… a series of tubes)

The process of collecting the data started with identifying the data sources. This was a manual process of looking for pandemic data on states’ data portals, dashboards, and pandemic information sites.

Many states maintain an open data portal (e.g., data.ca.gov). Data portals aim to provide well structured and annotated data, through APIs (i.e., CKAN, Socrata, and ArcGIS) that allow for complex querying and aggregations. However, the existence of data portals did not guarantee that COVID-19 data was available, and, if it was available, that it was well documented and tagged or even correct. For example, we found that some states’ data portals included placeholder data with random numbers, while other states served their public data from datasets that still had “test_this” in the name. Out of 56 states and jurisdictions, we found that only 9 states had COVID-19 data available through their data portals. (Those states were Alaska, California, Connecticut, Colorado, Florida, Indiana, Maryland, New York, and Vermont).

So, most states’ data portals were not useful for identifying data sources, but there was some good news: we were able to extract data from the (often undocumented) APIs and other data sources that backed state dashboards. This was especially easy for states that used ArcGIS to build their dashboards since ArcGIS natively exposes all underlying data through APIs. But we were also able to find underlying APIs or data sources for states that built their own dashboards (such as Delaware, Illinois, and New Mexico). The downside of these dashboard-backing data sources is that they were not documented, and while in some cases (especially ArcGIS) the API provided fields for data definitions, the states usually did not fill them in. Often the original meaning of a field changed over time (e.g., a metric called “tested” started as a metric for tested individuals and later became a record of test specimens).

To make matters worse, there were a couple dashboard solutions that stymied us: Tableau and PowerBI. These are quite popular with states, but they do not expose underlying APIs. Instead, the software seems designed to obfuscate the data. These dashboarding solutions do support some forms of data download (e.g. via CSV) but are not configured by default to allow downloads, and state dashboard administrators often did not enable those features. The end result was that data in Tableau and PowerBI was some of the least accessible to automation, even using great libraries such as TableauScraper and custom JavaScript clickers.

Finally, some states provided their data in PDF documents or plain web pages rather than dashboards. These formats were not designed for data extraction, and scraping them was quite susceptible to changes in the source. But because of their popularity, there are many tools available for scraping. In these cases, we wrote custom scrapers using libraries such as BeautifulSoup and pdfminer.six or pyPdf4 to extract the data.

In the end, this was the approximate breakdown of data sources:

Source Category	Examples	% of CTP Sources
Source CategoryAPI	ExamplesArcGIS, CKAN, Socrata, DKAN	% of CTP Sources65%
Source CategoryStructured Data	ExamplesCSV, JSON, XLSX	% of CTP Sources22%
Source CategoryRequire Custom Scrapers	ExamplesPDF, HTML, Tableau, PowerBI	% of CTP Sources13%

It’s worth noting that we reached out to states to ask them to share their data in more machine-readable formats, and had some success. However, the process was slow, and the results were limited.

After locating these data sources, we stored them in a YAML configuration file, with each source following this pattern:

url: https://data.ca.gov/api/3/action/datastore_search
params: {resource_id: b6648a0d-ff0a-4111-b80b-febda2ac9e09, 
limit: 1, sort: date desc}
type: ckan
description: PCR Testing (total)

We implemented a Python project that iterated over the sources, made the request, and parsed the responses. The aggregated and tagged responses were then uploaded to a Google Spreadsheet (to assist with manual data collection) and stored in a database (for other projects). It was scheduled with cron, and the frequency depended on the use case. For data that people reviewed interactively in order to catch updates and validate data entry, we needed the data very fresh so we decided on a six minute frequency. For data for special projects (i.e., fetching time series), we scheduled it to run a few times a day.

Validating, tagging, and storing

The previous section described the technical part of collecting the data, which is complicated, but at least could be naturally described in code, which makes it the easy part. Next was the mentally labor intensive and iterative process of making sure that the data we had was the right data, and putting it into a storage schema that allowed for analysis (e.g., for comparisons between states).

The central problem we were trying to solve was to align terminology and definitions used by each state. For example, when a state provided a number of “cases,” does that mean people with a positive PCR test? Or people showing symptoms and exposure to a known COVID-19 patient? This process required reviewing the sources, finding notes in data dictionaries, reading all dashboard clarification text, reaching out directly to government contacts, and then “tagging” each metric appropriately in our data to describe the real meaning in a consistent way. We found that the quality of the tagging was extremely important to the quality of the resulting dataset.

This tagging work was also a continuous process—with 56 independent entities, changes to the meaning of data points happened on a daily basis. The tags for metrics changed for many reasons, with or without the changes to the underlying source. For example, tags may have been updated because we found a new piece of information about the published numbers; or the state updated their data dictionary and added clarifications; or CSTE updated their guidelines; or because we refined our tagging (e.g., distinguishing between test specimens versus test encounters) and needed to make adjustments.

We found that it was important to store data in a form that was clearly tagged, and as close as possible to the original data reported by states. This way when tagging changes occured, we didn’t have to “unwind” the data to incorporate the change. For example, if we wanted to publish a value that we computed by summing two other values, we focused on storing those two values rather than the calculated number.

Serving the data

In addition to collecting and validating the data, we of course wanted to share it out for others to use, and thus we became a data provider. We did our best not to reproduce the same problems that we saw from the states when serving the data, aiming to make our data accessible, transparent, and clearly annotated.

First, we served the data in both CSV and JSON formats via a REST API. These formats are the most transparent and universally useful. JSON is particularly appropriate because it allows for metadata to sit alongside data at multiple levels.

Next, we were careful to provide detailed notes along with the data, describing the meaning of each datapoint for each state, and enumerating any changes that we made to the data.

Finally, we kept detailed logs of changes to both data and tagging, so that anyone could follow along with our changes. Manual data changes were logged as issues in GitHub, and changes to tagging were logged as part of the history of the YAML configuration file in GitHub.

Keeping data up to date

Our next concern was how to keep the data up to date. This was a repetitive process of finding and retiring data sources, (re)tagging them, and keeping an eye on updates and modifications to existing sources.

Very few states notified ahead of time about changes to their datasets. For example, Vermont and Colorado had good messaging about changes, but most other states made changes without notifications, so it was important to be able to detect the changes as fast as possible.

Some breaking changes could be detected automatically. For example, we kept a log of how many values were fetched for each state, and how many values fetched overall. Watching for changes in these numbers was a quick and simple way to be notified about values we failed to fetch.

But a bigger challenge we had was dealing with silent failures. For example, a data source would become stale but still available, leading us to keep pulling stale data without realizing it. Or the underlying meaning of the data would change but we had no way of knowing, and therefore didn’t know that we needed to update our tagging to match.

This is a big part of why we made our scraping logic as simple as possible. In most software development practices, it’s desirable to make the code robust by handling errors and changes. But in this case, we decided that failing fast and then fixing each issue after a manual review was the better approach.

Unfortunately, the net result was that in order to keep the automation working, we had to perform regular manual checks of the data source, looking for changes and updates as well as keep an eye out for better sources. This monitoring was largely incorporated into our manual data collection process. Absent that ongoing review, the data collection process would have broken within a week.

Lessons from a year of automation

To summarize, here are the key lessons we learned when trying to automate the collection of COVID-19 data from states’ websites. We hope that these lessons will be useful to others collecting COVID data from states. We believe many of these lessons are likely also applicable to automating data collection for any large set of heterogeneous sources.

Be prepared for the significant amount of manual work that is required to gather the data sources.
Be very flexible on formats. Waiting for states to expose APIs can be futile. Resorting to scraping is inevitable.
Contact your data providers as early as possible to ask for improvements to the formats. But don’t count on those improvements—look for workarounds instead.
Take your time to do the work required to understand the meaning of the data at each data source. Carefully analyze the meaning of the data—don’t just take the name of a field and assume a meaning. This must be an ongoing process.
Store the original data when possible (e.g., zip, json, csv) and within your own database. Store the data in a common format that’s as close to the original as practical, without any logic applied.
If you serve the aggregated data that you’ve collected so that others can use it, use a standard data format and include all the metadata that you have.
Reserve plenty of time and manual resources to keep the data and the metadata fresh.

Additional contributions from Michal Mart.

Theo Michel is a Project Manager and contributor to Data Quality efforts at the Covid Tracking Project.

@theotayo

Rebma is an infrastructure engineer who likes puzzles, solving problems, and who did the data scripting and automation for The Covid Tracking Project.

More “How We Made The COVID Tracking Project” posts

See all analysis & updates

(The Perils of) Automatically Collecting COVID Data

The data collection pipeline (or really… a series of tubes)

Validating, tagging, and storing

Serving the data

Keeping data up to date

Lessons from a year of automation

More “How We Made The COVID Tracking Project” posts

20,000 Hours of Data Entry: Why We Didn’t Automate Our Data Collection

Dating Data: How We Used Multiple Dating Schemes to Provide the Most Complete Picture of the Pandemic

How and Why The COVID Tracking Project Built a Screenshot System