Federal COVID Data 101: Working with Testing Data

At the start of the COVID-19 pandemic, the federal government had no system that could track how many tests for SARS-CoV-2 were being performed in the United States. In April of 2020, the government created a program to collect that data directly from states, and the CDC publicly released topline counts of how many tests had been conducted in each state in May. Now, multiple federal agencies post that testing data, in more detail than the CDC initially did—you can get full historical testing data for each state and even some county-level data.

This expansive public federal testing data is in some ways better than the state-provided data The COVID Tracking Project collects, but also comes with unique problems: On the one hand, federal data is more standardized—and more detailed—than the patchworked COVID Tracking Project dataset could ever be. On the other hand, there are large gaps between state-provided testing data and the federal government’s testing data, even after accounting for definitional differences.

Because of these different but serious problems with both the state-reported data we compile and the testing data reported by the federal government, neither dataset is a completely faithful reflection of the reality of testing in the US. But whereas COVID Tracking Project data could only ever be as good as the data states are willing to post on their dashboards, the federal data can improve substantially—and we think it will—as state and federal public health agencies work together to improve reporting.

We recently posted a detailed analysis of the discrepancies between state and federal data—and how we hope the federal data might get better. In this post, we’ll explain the basics: where federal testing data comes from, how to get it, and how you can use it as responsibly as possible.

Where federal testing data comes from

Though the federal government posts its testing data in several places, it all comes from the same source: The COVID Electronic Laboratory Reporting Program (CELR), which was started by the federal government in April to fill an infrastructural gap in federal laboratory reporting systems.

CELR rests primarily on longer-standing pipelines for collecting testing data, which exist at the state health department level. The program mostly aggregates data provided directly by state health departments to the federal government: all but five state health departments submit testing data directly to the program. Within that group, most states send line-level data to the federal government, meaning granular information on each test conducted in their state, such as its location and demographic information about the recipient. Two states, Ohio and Wyoming, instead send aggregate data: simple counts of how many tests have been conducted in their state and the results of those tests.

Five jurisdictions (Maine, Missouri, Oklahoma, Puerto Rico, and Washington) do not send any data to the federal government because of technical obstacles to submission. For those jurisdictions, the federal government instead relies on data submitted by laboratories directly to CELR, which captures only a portion of known testing volume.

A flowchart shows the path of testing data from laboratories to the federal government. In five jurisdictions (Maine, Missouri, Oklahoma, Puerto Rico, and Washington), the data flows directly from a portion of laboratories to the federal government. In all other jurisdictions, the data goes from laboratories to the state health department before making its way to the federal government.

Whether CELR data comes directly from laboratories or is submitted by states, it is all standardized to the same definition, unlike The COVID Tracking Project’s patchworked testing dataset:

Federal data contains PCR tests only. By contrast, The COVID Tracking Project data in some states also contains antigen tests.
Federal data uses units of tests, not people. While all but two states now post their data in units of tests, only a portion of those post historical data in those units. As a result, The COVID Tracking Project has been unable to switch many states’ main testing data from units of unique people to units of tests.
Historical federal data is organized by the date the test was administered or the date the test received a result. Because not all states make this data available, The COVID Tracking Project’s data is by date of report: the day a test was reported on a state’s dashboard. As a result, COVID Tracking Project data exhibits artificial spikes caused by data dumps, which the federal data smooths out.
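
To make the date-of-report issue concrete, here is a toy sketch in Python. Every record below is invented for illustration and is not real testing data. Each test carries two dates, and the same twelve tests are counted once by the date they were administered and once by the date they were reported. The late-reported backlog appears as a spike only in the report-date series.

```python
from collections import Counter

# Each tuple is (date the test was administered, date it appeared in a report).
# All values are invented for illustration; they are not real testing data.
records = (
    [("2021-01-01", "2021-01-01")] * 2
    + [("2021-01-01", "2021-01-03")] * 2  # backlog reported two days late
    + [("2021-01-02", "2021-01-02")] * 2
    + [("2021-01-02", "2021-01-03")] * 2  # backlog reported one day late
    + [("2021-01-03", "2021-01-03")] * 4
)

# Count the same tests two ways.
by_test_date = Counter(administered for administered, _ in records)
by_report_date = Counter(reported for _, reported in records)

for day in sorted(by_test_date):
    print(day, "by test date:", by_test_date[day],
          "| by report date:", by_report_date[day])
```

In this toy series, testing volume is flat at four tests per day by test date, while the report-date series shows two quiet days followed by a jump when the backlog is posted.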

Where to get federal testing data

CELR data powers five testing data products from the federal government, each of which offers slightly different data:

The HHS State-Level PCR Timeseries: This dataset is what most COVID Tracking Project data users will find most familiar. It provides machine-readable, historical data broken down by day and test result for every US state and territory (except American Samoa, which has no COVID-19 cases) and the Marshall Islands. In most states, you can get data back to March 1, 2020.
The HHS Community Profile Reports: These reports, which have guided the federal COVID-19 response but were first publicly released on December 18, 2020, feature detailed COVID-19 state and county data and risk assessments. They consist of two documents: A PDF report that shares data analysis based on testing data (and other indicators), and an Excel file that provides raw 7-day positive and total testing data by county, as well as other testing-related data like median test turnaround time per county. (You’ll see the downloadable files if you scroll down to the bottom of the page below “Attachments.”)
The HHS State Profile Reports: These weekly PDF reports feature testing data from CELR for each state, including a historical trends graph for testing at the state level, one of the only places to find this graph in federal data outputs. However, they do not feature any machine-readable data.
The CDC COVID Data Tracker: This interactive dashboard allows you to interface with the underlying data the HHS publishes in its time series and the community profile report. In the map views, you can select options to view tests and test positivity by state and county. Clicking on a state will also allow you to see testing trends.
The CMS Test Positivity Dataset: Since August, the Centers for Medicare and Medicaid Services has shared 14-day average county-level test positivity figures on their website. This data is meant to provide long-term-care facilities with an estimate of viral prevalence in their area in order to guide their testing surveillance programs. More recent reports also include the total number of tests for the past 14 days in each county.
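
As a rough sketch of working with a timeseries broken down by day and test result, the Python below computes daily test positivity from CSV text. The column names (state, date, overall_outcome, new_results_reported) and the outcome labels are assumptions for illustration, so check them against the actual file before relying on them. The sample rows are invented. Positivity is defined here as positive results divided by positive plus negative results, leaving out inconclusive results; that is one possible convention, not necessarily the one any federal product uses.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; column names and outcome labels are assumptions.
SAMPLE_CSV = """state,date,overall_outcome,new_results_reported
XX,2021-01-01,Positive,10
XX,2021-01-01,Negative,85
XX,2021-01-01,Inconclusive,5
XX,2021-01-02,Positive,20
XX,2021-01-02,Negative,180
"""


def daily_positivity(csv_text):
    """Return {(state, date): positive / (positive + negative)}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["state"], row["date"])
        counts[key][row["overall_outcome"]] += int(row["new_results_reported"])

    positivity = {}
    for key, by_outcome in counts.items():
        denominator = by_outcome["Positive"] + by_outcome["Negative"]
        # Skip days with no conclusive results rather than divide by zero.
        if denominator > 0:
            positivity[key] = by_outcome["Positive"] / denominator
    return positivity


rates = daily_positivity(SAMPLE_CSV)
for (state, date), rate in sorted(rates.items()):
    print(state, date, f"{rate:.1%}")
```

The same function can be pointed at the text of a downloaded file, provided the column names are adjusted to match it.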

What to watch out for

The biggest problem with federal testing data is revealed in its discrepancies with state data, which often point to incompleteness and quality problems with the federal data. First, data users should be wary of testing data coming from the five jurisdictions that do not submit testing data directly to the federal government: Maine, Missouri, Oklahoma, Puerto Rico, and Washington. Federal testing data from these states comes from only a portion of labs, undercounting their true testing volume by as much as 80 percent.

Two more states, Ohio and Wyoming, do not submit county-level data to the federal government, so their county-level data in the federal dataset is incomplete. Because the county-level testing totals come from a different, less comprehensive source, they sum to a lower number than the state-level total test count, which Ohio and Wyoming submit to the federal government directly.
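
One way to check for this kind of shortfall is to sum the county-level counts and compare them with the state-level total. The Python sketch below does that; the function name and all the numbers are hypothetical and only show the arithmetic.

```python
def county_coverage(state_total, county_totals):
    """Return the share of the state-level count covered by county rows."""
    if state_total <= 0:
        raise ValueError("state_total must be positive")
    return sum(county_totals.values()) / state_total


# Hypothetical counts: the county rows account for 750 of 1,000 tests.
coverage = county_coverage(1000, {"County A": 400, "County B": 350})
print(f"County rows cover {coverage:.0%} of the state-level total")
```

A coverage value well below 100 percent suggests that county-level figures for that state understate testing and should not be added up as a substitute for the state total.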

Finally, data users should also be wary of federal testing data in the following six jurisdictions: Guam, Mississippi, North Dakota, New Hampshire, South Dakota, and Utah. All of these jurisdictions have differences of greater than 20% between state and federal testing data that cannot be explained by differences between state and federal data definitions.
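
A comparison like this can be sketched in a few lines of Python. The sketch assumes that the state-reported count is the baseline for the percentage and that both counts have already been adjusted to the same definition (for example, PCR tests in units of tests). It does not perform that adjustment. The function names and the sample numbers are hypothetical.

```python
def relative_gap(state_count, federal_count):
    """Return the absolute gap as a fraction of the state-reported count."""
    if state_count <= 0:
        raise ValueError("state_count must be positive")
    return abs(federal_count - state_count) / state_count


def flag_large_gaps(counts, threshold=0.20):
    """Return the names whose state/federal gap exceeds the threshold."""
    return sorted(
        name
        for name, (state_count, federal_count) in counts.items()
        if relative_gap(state_count, federal_count) > threshold
    )


# Hypothetical (state-reported, federal) totals for three jurisdictions.
example = {
    "Jurisdiction A": (1000, 750),  # 25% gap
    "Jurisdiction B": (1000, 950),  # 5% gap
    "Jurisdiction C": (500, 650),   # 30% gap
}
flagged = flag_large_gaps(example)
print(flagged)
```

The gap is taken as an absolute value because the federal count can fall above or below the state count.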

Many of the above states provide downloadable historical data that can be used as an alternative to federal data:

Guam provides the data in its GIS layer, although not broken down by test type.
Maine’s data is available for download from its Tableau.
Missouri’s data is available for download from its Tableau.
North Dakota provides a Public Data Download CSV linked on its dashboard.
New Hampshire provides a full downloadable history of tests on its dashboard.
Utah provides a full history of tests in its data download at the top of its dashboard.

What we hope for from the federal government

The federal government is the only entity capable of creating a comprehensive, high-quality, standardized COVID-19 testing dataset for the United States. But it needs to improve both the context around its COVID-19 dataset and the data itself in order to get there.

To accomplish this, the government should first improve public documentation of its COVID-19 testing dataset. Right now, only the Community Profile Reports feature any documentation on data sources, incompleteness, or delayed reporting. As a result, data users who have not consulted those reports may be unaware that they’re looking at incomplete information in certain states.

Next, the federal government should bring the five jurisdictions that are not submitting data to CELR into the program as soon as possible. Data in these jurisdictions is woefully incomplete, and though they may face obstacles in transmitting line-level data, they should be able to submit at least aggregate counts to the CDC.

Third and finally, the government should invest in improving the quality of data that states submit to the program. Overstretched state health departments need guidance on how best to submit the testing data, federal support in managing the process, and programs of continual validation to ensure that their testing data remains high-quality.

This data is critically important to decision-making at every level of American society. The federal government directs school administrators and policymakers to consult its testing data in crucial decisions like school reopening, shares the data daily with state politicians, and uses it to guide its own COVID-19 response. Yet our analysis suggests that few resources have been devoted to ensuring and maintaining the quality and completeness of the data.

As a result, the confidence the government puts in the data in these contexts is unmerited for many states and counties—and may encourage misguided policy decisions. Without improvements to the data—beginning with its documentation—the federal government’s testing dataset cannot accurately inform an equitable, comprehensive pandemic response, and may in some cases work against it.

Although our data compilation work at The COVID Tracking Project is coming to an end on March 7, we will continue to monitor the federal testing data and make periodic comparisons to state-reported testing data over the coming weeks and months. We’ll post the results of our research here, and hope to be able to announce that the data has substantially improved before our analysis work is complete in May.

Graphics by Peter Walker