In the US, health data infrastructure has always been siloed. Fifty-six states and territories maintain pipelines to collect infectious disease data, each built differently and subject to different limitations. The unique constraints of these uncoordinated state data systems, combined with an absence of federal guidance on how states should report public data, made it difficult to assemble a national picture of COVID-19’s spread in the US: Each state has made its own choices about which metrics to report and how to define them, or has even had its hand forced by technical limitations.
Those decisions have affected The COVID Tracking Project’s data, assembled from states’ public data releases, and they still affect the CDC’s data, which mostly comes from submissions from state and territorial health departments. They have also had real consequences for the numbers: A state’s data definitions might be the difference between the state appearing to have 5% versus 20% test positivity, between labeling a COVID-19 case as active versus recovered, or between counting and not counting large numbers of COVID-19 cases and deaths at all.
Because state definitions affect the data we collect, COVID Tracking Project researchers have needed to maintain structured, detailed records on how states define all the testing and outcomes data points we capture in our API (and a few we don’t). Internally, we call this constantly evolving body of knowledge “annotations.” Today, we are for the first time publishing our complete collection of annotations.
How can I use the annotations?
You can find all our annotations and documentation on how to use them in our new annotations reference page.
Since we are winding down data collection this week, we are releasing the annotations as a one-time snapshot of our research into state and territorial definitions as of March 3, 2021, rather than a constantly updating source of information. As you use or build on our work, remember that state COVID-19 information changes quickly, so some of our information will have already fallen out of date. And unusually for The COVID Tracking Project, we’ve chosen to release some information in the annotations that hasn’t been double-checked by experienced contributors, so it may contain classification mistakes. Given the timing of our shutdown, we decided that providing a comprehensive look at our annotation structures was more important than releasing only a subset of information that we were sure was 100% accurate.
We hope that this full view of our annotations will be of use to researchers and data users aiming to understand state-level COVID-19 data and the methodologies we have used to collect and analyze it.
What’s coming next?
One of the most important roles the annotations have played is supporting our analysis of state and federal reporting practices. We’ll be publishing many more findings about COVID-19 data stemming from our research over the coming months as we wind down.
We also plan to share more about the processes that we’ve used to create and maintain the annotations, for any future data compilation efforts that need to solve problems similar to those we faced tracking state COVID-19 data.
Finally, since annotations shaped the choices we made about what data to report and how, we plan to compile resources that explain how these annotations, along with other data and metadata compilation efforts, guided our decision making. Look out for these resources on our site in the coming months.
The COVID Tracking Project’s annotations required thousands of person-hours to build and maintain. We would like to thank the CTP contributors who made this meta-dataset possible.
Data quality researchers: Jesse Anderson, Joseph Bensimon, Madhavi Bharadwaj, Kathleen Birch, Jennifer Clyde, John Downey, Elizabeth Eads, Amanda French, Rebecca Glassman, Jonathan Gilmour, Lauran Hazan, Matt Hilliard, Hannah Hoffman, Marta Jacenyik, Noah Kim, Nicole King, Betsy Ladyzhets, Camille Le, Brian Li, Daniel Lin, Michal Mart, Barb Mattscheck, Theo Michel, Quang Nguyen, Daria Orlowska, Brandon Park, Kara Schechtman, Anna Schmidt, Sara Simon, Erika Thomson, Nadia Zonis
State outreach: Laura Bult, Artis Curiskis, Elizabeth Eads, Jaclyn Jeffrey-Wilensky, Glen Johnson, Ryan Kailath, Erin Kissane, Alexis Madrigal, Rowan Moore Gerety, Kara Oehler, Judith Oppenheim, Isha Pasumarthi, Sara Simon
Website: Kevin Miller, Andrew Schwartz
The COVID Tracking Project is a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States.