Federal COVID Data 101: What We Know About Race and Ethnicity Data

For the past year, our project compiled scattered, unstandardized COVID-19 data from 56 US states and territories, each of which made its own decisions about which data to report. Given funding, clear standards, and political will, federal public health agencies have the power to produce something we never could: truly comparable, nationwide data. Now that our data compilation has wound down, we’re looking at a series of federal datasets that in some cases match very well with what we’ve compiled (as for cases, deaths, and hospitalizations), and in other cases either diverge from our datasets for complex reasons (as with testing data) or cover only a portion of what our compiled datasets attempted to include (as with long-term-care data).

Federal race and ethnicity COVID-19 data is not comprehensive enough to represent people’s experience of the pandemic in the United States. As detailed below, the CDC does present information about cases and deaths at the national level. Geographic information (including state) is provided in both the public and restricted-access line-level data, which are usable only by people with the ability to analyze 21 million data points. Otherwise, state-level figures are available for deaths only, in the National Center for Health Statistics’ provisional death counts. No federal testing data and very little federal hospitalization data includes race and ethnicity information, and federal vaccine data includes race and ethnicity information only at the national level, in an incomplete and ill-documented form.

The federal data available today is not sufficient. It’s not possible to tell from outside the data-producing agencies how much of the race and ethnicity information we are seeking is internally available but not published, and how much simply hasn’t been collected, so in this post, we will simply describe what has been made public (or semi-public).

Case and death data

The CDC reports aggregate data for race and ethnicity for cases and deaths for the entire country on the CDC COVID Data Tracker. But this data is both incomplete and insufficiently contextualized.

The topline CDC data is accompanied by this somewhat confounding note: “These data only represent the geographic areas that contributed data on race and ethnicity. Every geographic area has a different racial and ethnic composition. These data are not generalizable to the entire US population.” But no information is given about which geographic areas contribute data to the aggregate figures, nor is the information broken down by state or territory.

Two bar charts from the CDC site, one showing cases by race/ethnicity, the other deaths by race/ethnicity. Each demographic group has information about the total number of cases or deaths, as well as the percentage of the total with known data.

The source of this data appears to be the line-level datasets discussed below, but this is not completely clear, and the difference in dating schemes between the aggregate and line-level data prevents any meaningful comparison. A footnote on the page where the aggregate data is presented reads, in part:

Demographic data for COVID-19 cases and deaths is based on a subset of individuals where case-level data are reported by state and territorial jurisdictions to the Centers for Disease Control and Prevention (CDC) since January 21, 2020. Demographic data have varying degrees of missing data and are not generalizable to the entire population of individuals with COVID-19.

As of March 16, the CDC COVID Tracker includes race and ethnicity data for 11,974,497 (53 percent) of 22,434,545 cases included in the demographic trends area of the tracker and for 296,435 (75 percent) of 393,272 deaths included in the demographic trends area of the tracker. The topline national case and death counts in the tracker are considerably higher, and should not be used in comparisons with the tracker’s demographic information.
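The completeness percentages above are simple ratios of records with known race and ethnicity to all records in the demographic trends area. As a minimal sketch of that arithmetic (the function name `known_share` is our own, purely illustrative, and not part of any CDC tool):

```python
def known_share(known: int, total: int) -> float:
    """Return the percentage of records that have race/ethnicity recorded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * known / total


# Figures quoted above for March 16 (demographic trends area of the tracker).
cases_pct = known_share(11_974_497, 22_434_545)
deaths_pct = known_share(296_435, 393_272)

print(f"cases with race/ethnicity:  {cases_pct:.0f}%")
print(f"deaths with race/ethnicity: {deaths_pct:.0f}%")
```

The denominators here are the counts in the demographic trends area, not the higher topline national counts, which is why the two should not be mixed in a single calculation.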

Charts of national cases and deaths per 100,000 by race or ethnicity since March 1, 2020 are now available, though the underlying data for the charts is unfortunately not published.

A line chart from the CDC site, showing COVID-19 weekly deaths per 100,000 population by race/ethnicity in the United States. The chart gives options for viewing cases instead, and for seeing the chart by sex or age rather than race/ethnicity. The chart shows reported data for March 1, 2020 through March 15, 2021.

Detailed information about cases and deaths is available from the COVID-19 Case Surveillance Public Use Data, the COVID-19 Case Surveillance Public Use Data with Geography, and from the more complete COVID-19 Case Surveillance Restricted Access Detailed Data.

Note: All work that refers to information in the restricted access dataset—including this post—is required to make the following disclaimer:

The CDC does not take responsibility for the scientific validity or accuracy of methodology, results, statistical analyses, or conclusions presented.

These three datasets offer line-level data, which we’ve also discussed in our introduction to federal case data. The public versions of the dataset include a line for every known case of COVID-19 in the United States. Available fields include sex, age, and other information about each case, as well as hospitalization and death status, though not all cases have information in every available field. The COVID-19 Case Surveillance Public Use Data with Geography reports whether the case is laboratory confirmed or probable, whether the person had underlying conditions, and both race and ethnicity. The COVID-19 Case Surveillance Public Use Data and Case Surveillance Restricted Access Detailed Data report a combined race and ethnicity field. Additional fields in the restricted dataset include symptoms and healthcare worker status.

As noted in our introduction to federal death data, information about deaths by race and ethnicity is provided by the National Center for Health Statistics (NCHS), which reports this data from death certificates. NCHS data is presented in multiple ways on the CDC website. As with nearly all race and ethnicity data available from the US government, the categories reported are inconsistent between datasets, as shown in this table. Of the data from NCHS, the Provisional Death Counts for Coronavirus Disease (COVID-19): Distribution of Deaths by Race and Hispanic Origin is the most comprehensive starting point, as it provides state-level cumulative deaths for each group.

Federal data should be comprehensive and consistent, and with sufficient commitment and resources can help us better understand and respond to the pandemic.

Testing data

The federal government does not provide any race and ethnicity data related to COVID-19 testing. Good testing data by race and ethnicity is crucial to understanding testing accessibility and availability. Without demographic details on daily tests, we also risk underestimating the true burden of COVID-19 among people of specific races or ethnicities.

The COVID Tracking Project collected data about the race and ethnicity of people tested from the nine states reporting it.

Hospitalization data

The CDC's line-level case surveillance dataset includes fields indicating whether an individual has been hospitalized, or whether that information is not available.

CDC’s COVID-NET collates and publicly shares hospitalization data by race and ethnicity. This data comes from specific counties in 14 states, representing approximately 10 percent of the US population, so it is not descriptive of the national situation, nor does it provide complete information about what is happening in the states it does cover. Seventeen of the 24 states publicly reporting race and ethnicity data on hospitalizations on their own dashboards, for example, are not represented at all in this dataset.

The public version of the HHS Hospitalization dataset, which is generally excellent, includes no race and ethnicity information.

The COVID Tracking Project collected race and ethnicity data on hospitalizations from the 24 states reporting it. (Our downloadable race and ethnicity dataset includes this hospitalization data.) Two federal data sources offer some race and ethnicity information: the line-level dataset and COVID-NET.

Vaccination data

The COVID Tracking Project doesn’t compile data on vaccine doses, but we do collect vaccination metadata—which means that we track what metrics each state reports, including their race and ethnicity categories.

The CDC provides a bar chart and accompanying table for its vaccine data, with the race and ethnicity of people with one or more doses administered, and people with two doses administered.

Two bar charts from the CDC site, one showing the race/ethnicity of people with 1 or more vaccine doses administered, the other showing the race/ethnicity of people with 2 doses administered.

Both metrics have race and ethnicity data for just over half of all people included in the vaccination dataset.

Unfortunately, this data does not include any geographic information, so there is no way to know what jurisdictions are included in the data, nor whether those jurisdictions are the same as those reporting race and ethnicity data publicly.

How the federal data could be (much) better

Based on our eleven months of research, we have four recommendations on how the federal data could be improved:

Collect and publish more comprehensive race and ethnicity data: The CDC’s Data Tracker notes that the CDC is working with states to obtain more demographic data about cases and deaths. The same should be true for testing, hospitalization, and vaccine data.

Present data in clear, accessible ways: Much of the data is currently so general as to be nearly useless (the nationwide situation since the beginning of the pandemic in a single bar graph) or so specific as to be unwieldy and inaccessible (20,565,345 records which must be downloaded in three parts and combined into a nearly 5GB file). Presenting the case surveillance data by state will help people understand what is happening in specific areas with differing population characteristics. Including graphical presentations of both the NCHS data and the aggregated case surveillance data will help researchers and members of the general public understand the knowable facts about race and ethnicity in the US pandemic to date.

Report race and ethnicity categories in a consistent way: The data reported and collected should include the same categories as are used for other federal data, including the Census. Consistency between datasets allows for analysis of how vaccination efforts are affecting cases, for example. Using the categories for which we have population data allows for comparisons between groups and understanding what is happening per capita.

Be transparent about data sources and contexts: The federal government should clearly state the source of all demographic data collected in any federal dataset and should state whether line-level data is supplemented by data from any other source. When data is presented as a national summary, the federal government should note which jurisdictions are represented in the data—and should provide additional context about the percentage of cases and deaths for which race and ethnicity data is available in each state and territory represented in the data.
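To illustrate the kind of per-state context recommended above, here is a minimal sketch of how the share of records with known race and ethnicity could be tallied by state from a line-level file. It streams rows one at a time rather than loading a multi-gigabyte file into memory. The column names (`res_state`, `race_ethnicity_combined`) and the set of "unknown" markers are assumptions for illustration and should be checked against the dataset's own documentation; the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Assumed column names and "unknown" markers; verify against the data dictionary.
STATE_COL = "res_state"
RACE_COL = "race_ethnicity_combined"
UNKNOWN = {"", "unknown", "missing", "na"}


def completeness_by_state(rows):
    """Stream rows and return {state: (known, total, percent_known)}."""
    known = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        state = row[STATE_COL]
        total[state] += 1
        if row[RACE_COL].strip().lower() not in UNKNOWN:
            known[state] += 1
    return {s: (known[s], total[s], 100 * known[s] / total[s]) for s in total}


# Tiny made-up sample standing in for the real download.
sample = io.StringIO(
    "res_state,race_ethnicity_combined\n"
    "NY,Unknown\n"
    "NY,\"Asian, Non-Hispanic\"\n"
    "TX,Hispanic/Latino\n"
    "TX,Missing\n"
    "TX,\"White, Non-Hispanic\"\n"
    "TX,NA\n"
)
result = completeness_by_state(csv.DictReader(sample))
for state, (k, t, pct) in sorted(result.items()):
    print(f"{state}: {k}/{t} known ({pct:.0f}%)")
```

For a real file, the `io.StringIO` sample would be replaced by an opened file handle (or several, chained together if the download comes in parts), since `csv.DictReader` reads lazily from whatever it is given.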

The data we have collected from states and territories has always been incomplete, and it has always shown inequities in who is being affected by the pandemic and to what degree. Both of these characteristics are also true of the currently available federal COVID-19 data on race and ethnicity. The charter of the newly assembled federal COVID-19 Health Equity Task Force states the following:

To address the data shortfalls identified in Section 1 of E.O. 13995, and consistent with applicable law, the Task Force shall:
(1) collaborate with the heads of relevant agencies, consistent with the Executive Order entitled "Ensuring a Data-Driven Response to COVID-19 and Future High-Consequence Public Health Threats," to develop recommendations for expediting data collection for communities of color and other underserved populations and identifying data sources, proxies, or indices that would enable development of short-term targets for pandemic-related actions for such communities and populations;
(2) develop, in collaboration with the heads of relevant agencies, a set of longer-term recommendations to address these data shortfalls and other foundational data challenges, including those relating to data intersectionality, that must be tackled in order to better prepare and respond to future pandemics; and
(3) submit the recommendations described in this subsection to the President, through the COVID-19 Response Coordinator.

We hope that the task force’s identification of immediate and long-term actions to improve data collection will soon manifest in more complete, and better presented, data at every level of government.

Charlotte Minsky contributed additional research for this article.

Note: This post has been updated to include the release of the CDC’s COVID-19 Case Surveillance Public Use Data with Geography.

Alice Goldfarb leads The COVID Tracking Project’s part in The COVID Racial Data Tracker, and is a Nieman Visiting Fellow.

@afgoldfarb