As you might imagine, The COVID Tracking Project loves spreadsheets. Since we announced our plans to wind down and end our daily data collection, we’ve heard from many of our data users who have been importing our CSV file downloads into tools like Excel and Google Sheets to perform their own analysis of COVID data. We know that transitioning these spreadsheets to use federal data can be difficult because of the organization and sheer size of some federal datasets. Here are a few suggestions on the best ways to use federal data for our spreadsheet and API users.
First, the pandemic data landscape has dramatically improved since we first started publishing CSV files a year ago. Many of the questions that used to require our data downloads and API are now better answered using the resources available from the federal government and other organizations. We’re written a guide on where to find simple COVID-19 data for the US highlighting the CDC COVID Data Tracker, which allows you to view testing, case, hospitalization, and death data at both the state and county levels, allowing for more detailed local views of the data. The federal Community Profile Reports and State Profile Reports provide public access to the same daily and weekly data analysis used by policymakers. Data visualization tools from The New York Times and many other news organizations offer maps and charts.
But we know that sometimes there’s no substitute for a spreadsheet when you’re doing your own data analysis. We’ve written guides to using the federal testing, hospitalization, and case and death datasets, but using them all at once can be a challenge. To help, we’ve built a tool that combines all this data into one state-level time series, suitable for easy use in spreadsheets, no programming experience required.
Note: The COVID Tracking Project will not be maintaining this script, which may break if federal data pages meaningfully change. Please consider this post a set of pointers for interested data users.
How to get the data
Go to the Google Colab script page. This is a program we’ve built that collects COVID-19 data from many federal data sources and combines it all into a single file.
From the menu, choose “
Runtime -> Run all”
Run Anyway” (You will see a warning message, but rest assured that the script does not access any of your Google account data.)
Wait a moment. The script will download the data from federal sites and combine it into a single file.
A file download box will open. Save the CSV file to your computer. You can import this file into Microsoft Excel, Google Sheets, or another tool of your choice.
Advanced users familiar with programming using Python and pandas may find the code useful as a starting point for further analysis of federal COVID datasets.
Sources and definitions
Data is retrieved from three federal datasets. Please see the details for each dataset for data definitions and explanatory notes.
Testing data: Diagnostic Laboratory Testing (PCR Testing) Time Series. See our Testing Data 101 guide for a detailed description of this dataset.
Hospitalization data: Reported Patient Impact and Hospital Capacity by State, using both the timeseries (which updates weekly) and provisional daily data updates which extend beyond the most recent timeseries update. See our guide to working with hospitalization numbers.
Cases and deaths data: Cases and Deaths by State over Time. See our guides to case and death data.
What you can do with this data
Once you’ve downloaded this CSV using the script above, you can build visualizations in nearly identical fashion to the current COVID Tracking Project visualization gallery. Here are a few examples of what’s possible:
Notes and caveats
In the federal case and death dataset, New York and New York City are given as separate entities. We have summed them into one to match the other federal datasets. To disable this feature, set the
Falsein the code.
Federal testing data data is incomplete for the most recent days, and therefore usually shows a steep drop in testing. This is due to a backfill process which assigns the tests to the date at which they were performed (whenever possible). This is in contrast to CTP’s testing data, which was arranged by date of public report. Assigning tests to the date of test specimen collection is more epidemiologically correct but takes more time and effort to prepare and revise. Be cautious in interpreting data for the most recent 3-5 testing days, as it will contain a reporting artifact of steeply dropping test numbers due to the method of assigning tests to the date they were performed. You may also spot changes to the timeseries in prior months, as tests are precisely backfilled to their original collection dates when possible.
Users should keep an eye out for new notes on data and schema changes in the Community Profile Reports as these may change over time. In the "User Notes" tab, visit “High-Visibility Data Notes” and “Recent Changes”; in the "Date Notes" tab, visit the “Dynamic Data Notes” section.
The dating schemes for the various data points (tests, cases, etc.) can be subtly different. Please read our Federal 101 series for more detailed information.
Federal hospitalization data contains two distinct patient types: confirmed and suspected COVID-19 patients. Both metrics are important. However, our state-level data was a mix of these two types as states varied in their reporting: some reported only confirmed, others combined confirmed plus suspected, while still others used another definition of hospitalized COVID-19 patients. It is therefore difficult to create a one-to-one match between our currently hospitalized figures and those from the federal data. See our Hospitalization Data 101 post for a state-by-state comparison between the state metrics and the federal data.
Federal hospitalization data started to become much more complete and broader in scope after July 15, 2020 when the daily reporting system used by hospitals switched to HHS Protect and reporting of most fields became mandatory. Incomplete, voluntarily reported hospitalization data before July 15th does exist in the federal data file and comes from CDC’s National Healthcare Safety Network (NHSN) system. We advise caution when interpreting hospitalization trends from earlier in the pandemic when hospital reporting was less complete. See the HHS Protect Public Data Hub page on hospital reporting for granular details on the current completeness of hospitalization data down to the individual facility-level and how that reporting has improved over time. We have proactively removed all hospitalization data before July 15th from the above file, but you can modify the script to include all available data points.
Zach Lipton is a San Francisco software developer and Data Infrastructure co-lead for the COVID Tracking Project.
Dave Luo has an MD/MBA and is a Data Science and Data Viz lead at The COVID Tracking Project.
Peter Walker is Head of Marketing & Growth at PublicRelay and Data Viz Co-Lead at The COVID Tracking Project.
More “Federal COVID Data 101” posts
Federal COVID-19 Testing Data Is Getting Better
The federal government improved its state and county-level COVID-19 PCR testing data since we analyzed it in February. Here’s an update on those changes and what we hope to see next for the data.
Federal COVID Data 101: What We Know About Race and Ethnicity Data
Publicly available federal race and ethnicity COVID-19 data is currently usable and improving, although it shares many of the problems we’ve found in state-reported data.
How Not to Interpret COVID-19 Data
Beware of dating schemes, data dumps, weather events and other issues that can lead to mistakes that confuse the public.