The COVID Tracking Project exists because every person, newsroom, and government agency in the United States deserves access to the most complete COVID-19 data that can be assembled.
Data is critical to understanding the COVID-19 outbreak. We track three related kinds of data that can help us understand the US outbreak and the disease itself: testing data, hospitalization and outcomes data, and race and ethnicity data.
Without testing data, we can’t understand the true scale of the outbreak or whether we’re testing enough people.
Testing data refers to the number of people tested, the number of positive tests, and, where available, the number of pending and negative tests.
Taken together, the number of people who tested positive (the “case count”) and the number of diagnostic tests performed give us a rough understanding of how many people in a given area are likely to be infected. A state that reports 2 cases of COVID-19 after testing 1,000 people is probably in a very different stage of its outbreak than a state that reports 2 cases after testing only 10 people—but if all you have is a case count, those states look exactly the same.
Dividing the case count by the total number of tests produces a positivity rate for the sample population. In our hypothetical first state, 0.2% of people tested were positive for COVID-19, but in the second, that figure was 20%. Without the total number of tests, we can’t tell whether low case counts mean fewer cases or inadequate testing.
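The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration of the division described above, not code from our data pipeline; the function name `positivity_rate` and the variables `state_a` and `state_b` are made up for the example, and the numbers are the two hypothetical states from the text.

```python
def positivity_rate(positive_tests: int, total_tests: int) -> float:
    """Return the share of tests that came back positive, from 0.0 to 1.0."""
    if total_tests <= 0:
        # Without a test total there is nothing to divide by.
        raise ValueError("total_tests must be greater than zero")
    if not 0 <= positive_tests <= total_tests:
        raise ValueError("positive_tests must be between 0 and total_tests")
    return positive_tests / total_tests


# The two hypothetical states: the same case count, very different testing.
state_a = positivity_rate(positive_tests=2, total_tests=1000)
state_b = positivity_rate(positive_tests=2, total_tests=10)

print(f"State A: {state_a:.1%}")  # State A: 0.2%
print(f"State B: {state_b:.1%}")  # State B: 20.0%
```

Both states report 2 cases, so only the test totals show that they are in different situations.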
Without hospitalization and outcomes data, we can’t understand how deadly the virus is proving to be in US communities, how bad outbreaks are in areas with insufficient testing, or where hospitals may be overwhelmed.
When we know how many patients are so ill they must be hospitalized (or admitted to the ICU, or put on a ventilator), and how many people have recovered or died from COVID-19, we get a different view of how big a local outbreak is. In New York and New Jersey, the epicenters of the US outbreak to date, testing capacity was overwhelmed, and the majority of people with COVID-19 symptoms were not tested for the virus. At the outbreak’s peak, hospitalization and death numbers provided a more accurate picture of the region’s outbreak than testing data.
Hospitalization and death data also provide vital measures of how deadly the virus actually is in US communities, particularly when we can combine them with demographic information.
Without complete race and ethnicity demographic data, we can’t understand who is most endangered by the outbreak, both nationally and in specific areas.
Many US states and territories provide at least partial demographic data listing race and ethnicity for COVID-19 deaths and case counts. A few states and territories provide none at all, and many states provide incomplete and infrequently updated data. This information is vital to discovering which communities within US regions are most affected by the outbreak.
The COVID Racial Data Tracker, a partnership between The COVID Tracking Project and the Center for Antiracist Research, is tracking race and ethnicity data from every state and territory that publishes it. We are also pushing for more transparency from the states and territories that don’t provide this vital data.
Note: The demographic information we’re most often asked for (age and sex) is not available from state public health authorities for most of the metrics we track, or is not published in consistent formats that would allow us to assemble national statistics. Each state and territory makes its own decision about how to divide ages into ranges, and those ranges vary from state to state, making a national summary impossible. The CDC provides a “Provisional COVID-19 Death Counts by Sex, Age, and State” dataset that includes relatively up-to-date age and sex statistics for COVID-19 deaths, which is perhaps the best available surrogate for having this information for all COVID metrics.
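To see why mismatched age ranges block a national summary, consider a rough sketch in Python. The two binning schemes below are hypothetical and do not describe any real state's reporting; the helper `shared_boundaries` is likewise invented for this illustration, and it assumes each scheme is a contiguous list of ranges starting at age 0.

```python
def shared_boundaries(bins_a, bins_b):
    """Return the age cut points that two binning schemes have in common.

    Each scheme is a list of (low, high) inclusive age ranges, where
    high=None marks an open-ended top range such as "65 and over".
    Counts can only be summed across states along shared cut points.
    """
    def cut_points(bins):
        # The lower bound of each range marks where a new bucket starts.
        return {low for low, _ in bins}

    return sorted(cut_points(bins_a) & cut_points(bins_b))


# Hypothetical schemes, for illustration only.
state_x = [(0, 17), (18, 44), (45, 64), (65, None)]
state_y = [(0, 19), (20, 39), (40, 59), (60, 79), (80, None)]

print(shared_boundaries(state_x, state_y))  # [0]
```

The only cut point the two schemes share is age 0, so the only figure that can be added together is the all-ages total. Any finer breakdown would require guessing how people are distributed within each state's ranges.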
Tracking these three areas of data has given our team a unique perspective on the progress of the pandemic through the United States.
In the early stages of the COVID-19 pandemic, some wealthy countries tested widely enough to pursue a successful containment strategy. Others, including the US, were much slower to implement mass testing. As has been documented elsewhere, the US testing effort started late and rolled out slowly and unevenly.
At the same time, federal public health authorities have elected not to publish complete testing data. From March through mid-May, the CDC published a count of lab-confirmed COVID-19 cases. However, that count significantly lagged behind other sources of this data, like the gold-standard Johns Hopkins University tracker.
The week of May 9, 2020, the CDC began publishing case counts, deaths, and basic testing data in a new dashboard. Our team compared the CDC’s data with what we were compiling from state sources. We found that although case and death counts were very similar in the two datasets, there were substantial mismatches in the testing data. The same week, an investigation by two of our founders at The Atlantic revealed that not only were several states mixing viral (diagnostic) and antibody (past infection indicator) test numbers in their public data reporting, the CDC was also mixing these test numbers, while labeling the data “viral tests.”
How Could the CDC Make That Mistake? by Alexis Madrigal and Robinson Meyer
We continue to monitor and archive the CDC’s case, death, and testing data, and we expect that the CDC’s national and state-level testing data will eventually come closer to matching the data reported by states. In the interim (and probably for some time after the two datasets are brought into closer agreement), we will continue daily data collection from states and territories.
Hospitalization data for US COVID-19 cases remains another area in which no other public source appears to be compiling information from the states and territories. For a total of 99 counties in 14 states, the CDC provides detailed hospitalization (and demographic) data through COVID-NET. However, the CDC does not publicly report state- or county-level hospitalization data for the rest of the United States.
Because we have not had a complete official account of COVID-19 testing data in the US, we collect the data directly from the public health authority in each US state and territory (and the District of Columbia). Each of these authorities reports its data in its own way, through online dashboards, data tables, PDFs, press conferences, tweets, and Facebook posts. And while many states and territories have slowly moved toward more standard ways of reporting, the actual categories of information are still in flux.
Our data team uses website scrapers and trackers to alert us to changes, but the actual updates to our datasets are done manually by devoted volunteers who watch press conferences, follow social media tips about changes in data definitions, double-check each change, and extensively annotate areas of ambiguity.
When we started the project, building on two independently created reporting spreadsheets, we expected to be updating the data for a few days—maybe a week—until complete federal data emerged. It never did, so we’re still here.
We also recognize that part of our work is the creation and maintenance of a historical record of the US outbreak and the government’s response. Since the first week of March, we’ve been here building the most accurate record we can of what actually happened, day by day, state by state.