Each day for a year, The COVID Tracking Project gathered the latest testing, case, and other related metrics from the COVID-19 dashboards of 56 states and territories. Over time, these daily numbers formed a time series that helped reveal the progression of the pandemic.
Time series data can be easily visualized in a line or bar chart. In this bar chart, for example, the height of each bar represents the number of new cases reported each day.
How a time series can get tricky
Unfortunately, there is always some delay between when an event occurs (like when a new case is identified) and when that event makes its way into the numbers reported on a state’s website. In fact, it can take days or weeks for some types of data to flow from sources like doctors’ offices, testing centers, or hospitals to a state website. As long as the delay is reasonably consistent, it is still very useful to capture and report the information when it becomes available, since the overall trend of the data will track with the trend of the pandemic.
For example, COVID-19 testing data might take three days to be reported, which means we might see Monday's tests on Thursday, Tuesday's tests on Friday, and Wednesday's tests on Saturday, but we will still get a sense of how many tests took place recently, albeit with a slight delay.
A more serious issue occurs when the flow of data to a state’s website becomes uneven, with large batches of data delayed longer than normal. This can happen for several reasons, but technical issues, holidays, and disruptive weather events were probably the most common. Delayed reporting can make it appear that the pandemic is improving or stabilizing, even when it isn't. And when delayed data is eventually delivered, it can seem like disaster has struck, when the reality is that there were more gradual changes over the preceding days or weeks.
One method for mitigating the effects of data delays and evening out the trends is to include a “rolling average” line, calculated for each day as the average value over the previous 7 or 14 days. This helps make clear that reality does not have the dips and spikes that the raw reported numbers do.
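To make the calculation concrete, here is a minimal sketch of a 7-day rolling average in Python using pandas. The column names and case counts are hypothetical, not drawn from the project’s actual data.

```python
# A minimal sketch: 7-day rolling average over daily new-case counts.
# The data and column names here are hypothetical.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "new_cases": [120, 95, 0, 0, 310, 140, 130, 125, 0, 260],  # zero-report days followed by catch-up spikes
})

# Average each day together with the six preceding days; the earliest days
# use however much history exists so far.
daily["cases_7day_avg"] = daily["new_cases"].rolling(window=7, min_periods=1).mean()
print(daily)
```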
How states presented time series
Many states avoided showing these gaps and spikes on their dashboards by creating their own time series that assigned each test, case, and outcome to the date it occurred rather than the date it was reported. States were able to do this when they had access to the full details (such as date of occurrence) for each case, test, and outcome.
For example, below are two time series showing data from Connecticut. The first is The COVID Tracking Project’s time series, which was based on the data that Connecticut reported each weekday. The second is a time series that Connecticut published, based on the day that each case’s test specimen was collected. Although these two graphs represent the same total number of cases (note the different y-axis ranges), they show remarkably different patterns of cases over time, partly due to the lack of weekend reporting.
We refer to these different assignments of events to days in a time series as dating schemes. When we build a time series using the most recent value reported each day by the state on their dashboard, we say that data is dated by report date, or that it uses a report date dating scheme. On the other hand, a state-built time series that assigns cases to days based on when each positive sample was collected would be described as using a specimen collection date dating scheme.
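As a rough illustration, the sketch below counts the same handful of made-up case records under both dating schemes; the field names and dates are invented for the example.

```python
# A minimal sketch: the same hypothetical case records grouped two ways,
# by report date and by specimen collection date.
from collections import Counter

cases = [
    {"specimen_collected": "2021-01-04", "reported": "2021-01-07"},
    {"specimen_collected": "2021-01-04", "reported": "2021-01-08"},
    {"specimen_collected": "2021-01-05", "reported": "2021-01-08"},
    {"specimen_collected": "2021-01-06", "reported": "2021-01-11"},  # weekend reporting delay
]

by_report_date = Counter(c["reported"] for c in cases)
by_collection_date = Counter(c["specimen_collected"] for c in cases)

print(sorted(by_report_date.items()))      # same total number of cases...
print(sorted(by_collection_date.items()))  # ...assigned to different days
```

The totals match, but the shape of each series differs, which is the same divergence visible in the Connecticut charts above.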
Dating schemes like specimen collection date are useful and important for historical analysis of the pandemic, but unfortunately not all states use the same dating schemes for the same metrics on their dashboards, making it challenging to compare these time series between states. This is part of the reason that we primarily use the report date dating scheme: it’s available from all states.
Beware the taper
Of course, these dating schemes that states use on their own time series still suffer from the underlying delays in gathering the data. For example, if you’re building a time series of positive tests by date of specimen collection, you at least have to wait for a test result to come back before you can associate it with the day that its specimen was collected. So, if the state publishes a time series leading up to the current day, it will usually have incomplete data for the last few days or weeks.
In the example below, Washington’s case counts appear to be falling at the end of the chart, which would be great news if it were true—but in this case they are probably falling because of incomplete data for those days and not a change in the pandemic. Washington has made this clear by “graying out” the incomplete days, but not all states did that for all metrics, leading to confusion when interpreting time series.
Time series like this are continuously updated by the states as the latest data comes in. We call this updating of past data a “backfill.” Backfills often occur as a result of reporting delays, but they can also happen when a state finds errors in its data and takes steps to correct them. The result is that the completeness of the data on any given day increases over time. The chart below illustrates this: the height of each bar represents the number of tests performed on each day.
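The sketch below shows the same idea in code, using invented numbers: across successive daily snapshots of a hypothetical by-test-date series, the count for a single day keeps rising as delayed records arrive.

```python
# A minimal sketch of backfill: successive snapshots of a state's time series
# (all numbers hypothetical) show one day's test count filling in over time.
snapshots = {
    "2021-03-10": {"2021-03-08": 4200, "2021-03-09": 1500},
    "2021-03-11": {"2021-03-08": 5100, "2021-03-09": 4300, "2021-03-10": 1800},
    "2021-03-12": {"2021-03-08": 5300, "2021-03-09": 5000, "2021-03-10": 4100},
}

# The reported count of tests performed on 2021-03-08 rises with each snapshot.
for snapshot_date, series in snapshots.items():
    print(snapshot_date, "->", series["2021-03-08"])
```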
How we addressed dating scheme challenges in our own data
As we’ve discussed above, The COVID Tracking Project primarily collected data and maintained time series based on a dating scheme of “date of report,” whereas states maintained time series using various other dating schemes and revised their data continuously through backfills.
This resulted in the shape of our time series graphs diverging from the shape of the states’ regularly backfilled graphs, even though our total numbers came from the states every day. At first, we were tempted to make our time series smoother by replacing them with state time series. But we were concerned about presenting data with a misleading “taper” appearing on recent days, leading people to conclude that things were getting better even when they were not. We ultimately concluded that charting data by date of report was the best and clearest way to communicate trends in the pandemic.
However, there were some situations where we found that the benefits of backfilling our data using a state’s time series outweighed the downsides.
The most common situation was the introduction of a new metric that had not previously been reported. For example, a state might start reporting probable cases in addition to the previously reported confirmed cases. In these situations, states often provided a time series going back to the beginning of the pandemic. (They had been collecting the data all along, just not reporting it.) We used these time series to backfill our data for the new metric, and from that point forward, we collected the new metric using our normal method of taking the latest numbers from the dashboard each day.
Another common situation occurred when states improved their reporting to separate out subtypes of a specific metric. For example, a state might have initially lumped antigen tests in with PCR tests, and then later began reporting these tests separately, providing separate antigen and PCR-only time series. In this type of situation, we separated the two metrics into their own fields, and backfilled these fields using time series from the state. So, in the antigen/PCR tests example, we backfilled our field for antigen tests using the new antigen time series from the state; and we also backfilled our PCR tests field, overwriting our previously collected data (which had included antigen tests) with the state’s new PCR-only time series.
The final situation that required us to backfill was when metrics depended on one another, such that the backfill of one metric required that we backfill another metric. The most common example of this was the relationship between total tests and positive tests. While we repeatedly cautioned about the unreliability of test positivity, we found our data users, public health departments, and many newsrooms often wanted to calculate it anyway. As such, it was important that both total tests and positive test metrics used the same dating scheme. So if we needed to backfill positive tests, we also backfilled total tests, to keep them in sync.
When we did backfills, we used the state’s entire time series for the metric, from the beginning of their data up to the day of the backfill. This meant that each of our resulting time series only contained one transition from the state’s dating scheme to our report date dating scheme and only one period where the data tapered off due to incompleteness. While any mixing of this type was not ideal, we found that the overall trends were still correct, and that it was a reasonable compromise.
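A minimal sketch of how such a splice could work is below, assuming simple {date: value} series and hypothetical numbers; it illustrates the approach rather than our actual pipeline code.

```python
# A minimal sketch of a backfill splice: the state's time series is used for
# every day up to the backfill day, and our report-date values continue after it.
# All names, dates, and values are hypothetical.
from datetime import date

def splice(state_series, report_date_series, backfill_day):
    """Merge two {date: value} series, preferring state data up to backfill_day."""
    merged = {d: v for d, v in state_series.items() if d <= backfill_day}
    merged.update({d: v for d, v in report_date_series.items() if d > backfill_day})
    return dict(sorted(merged.items()))

state_series = {date(2021, 2, 1): 100, date(2021, 2, 2): 140, date(2021, 2, 3): 90}
report_dated = {date(2021, 2, 3): 0, date(2021, 2, 4): 210, date(2021, 2, 5): 180}
print(splice(state_series, report_dated, backfill_day=date(2021, 2, 3)))
```

This yields a single transition point, and a single taper, per metric, matching the approach described above.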
Finally, we were careful to note these backfills in our public notes, and often with notes directly on charts, highlighting the point of transition from one dating scheme to another.
What we learned
This mixing of dating schemes was something that we avoided when possible, given the potential for confusion from the taper and the fact that it makes the data set less consistent. But where we did mix them, we believe it was a net positive, allowing us to provide an up-to-date, unified view of the pandemic across 56 states and territories.
As an extra layer of research, in parallel with maintaining our report-date time series (with some mixing of states’ dating schemes), we also undertook an effort to automatically collect full time series from as many states as possible on a daily basis, to allow for broader historical analysis based on other dating schemes. We were able to collect useful time series data from 47 states, and have made that data available on GitHub.
In an ideal world, all states and territories would consistently release full time series with consistent dating schemes in this way, allowing for all analysis to be done using appropriate dating schemes for each metric. But in the meantime, we believe that date of report remains the best dating scheme to use for comparison of all states and territories.
Additional research from Michal Mart.
Theo Michel is a Project Manager and contributor to Data Quality efforts at The COVID Tracking Project.
Rebma is an infrastructure engineer who likes puzzles and solving problems, and who did the data scripting and automation for The COVID Tracking Project.