Dating Data: How We Used Multiple Dating Schemes to Provide the Most Complete Picture of the Pandemic

Each day for a year, The COVID Tracking Project gathered the latest testing, case, and other related metrics from the COVID-19 dashboards of 56 states and territories. Over time, these daily numbers formed a time series that helped reveal the progression of the pandemic.

Time series data can be easily visualized in a line or bar chart. In this bar chart, for example, the height of each bar represents the number of new cases reported each day.

A bar graph with 7-day average line overlaid, showing the three major waves of the pandemic in Maryland, which peaked in May 2020, August 2020, and January 2021.

Time series showing daily new cases in Maryland from March 2020 to March 2021, with the solid line representing the 7-day average, based on The COVID Tracking Project’s data collected daily from Maryland’s dashboard.

How a time series can get tricky

Unfortunately, there is always some delay between when an event occurs (like when a new case is identified) and when that event makes its way into the numbers reported on a state’s website. In fact, it can take days or weeks for some types of data to flow from sources like doctors’ offices, testing centers, or hospitals to a state website. As long as the delay is reasonably consistent, it is still very useful to capture and report the information when it becomes available, since the overall trend of the data will track with the trend of the pandemic.

For example, COVID-19 testing data might take three days to be reported, which means we might see Monday's tests on Thursday, Tuesday's tests on Friday, and Wednesday's tests on Saturday, but we will still get a sense of how many tests took place recently, albeit with a slight delay.

A more serious issue occurs when the flow of data to a state’s website becomes uneven, with large batches of data delayed longer than normal. This can happen for several reasons, but technical issues, holidays, and disruptive weather events were probably the most common. Delayed reporting can make it appear that the pandemic is improving or stabilizing, even when it isn't. And when delayed data is eventually delivered, it can seem like disaster has struck, when the reality is that there were more gradual changes over the preceding days or weeks.

One way to mitigate the effects of data delays and even out the trends is to overlay a “rolling average” line. For each day, the line shows the average value over the previous 7 or 14 days. This helps drive home the fact that reality does not have the dips and spikes that many time series have.

A daily bar graph with 7-day average line overlaid showing data from Nov 1 to December 31. Several days are missing data, and those are followed by much taller bars. The missing days are November 1st, 11th, 13th through 15th, 26th through 28th, and December 25th. The bar values on the days following those gaps are 10000, 17000, 28000, 25000, and 13000, which are all significantly higher than the daily average, which is around 7000.

Time series showing spikes in tests reported for New Hampshire, the result of holiday and weekend reporting delays.
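The rolling average described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: the function name, the sample numbers, and the choice to count the current day as part of the 7-day window (and to use a shorter window for the first few days) are all assumptions.

```python
from statistics import mean


def trailing_average(values, window=7):
    """Average each day's value with up to `window - 1` preceding days.

    Assumption: the window includes the current day, and the first few
    days use whatever shorter history is available.
    """
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        averages.append(mean(values[start:i + 1]))
    return averages


# Hypothetical daily test counts: two days with nothing reported,
# followed by a catch-up batch that contains the delayed results.
daily_tests = [7000, 7000, 7000, 0, 0, 21000, 7000, 7000, 7000]
smoothed = trailing_average(daily_tests)
```

In this made-up series the raw bars swing between 0 and 21,000, while the averaged line dips only as far as 4,200 and returns to 7,000 once the delayed batch arrives.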

How states presented time series

Many states avoided showing these gaps and spikes on their dashboards by creating their own time series that assigned each test, case, and outcome to the date it occurred rather than the date it was reported. States were able to do this when they had access to the full details (such as date of occurrence) for each case, test, and outcome.

For example, below are two time series showing data from Connecticut. The upper chart is The COVID Tracking Project’s time series, which was based on the data that Connecticut reported each weekday. The lower chart is a time series that Connecticut published, based on the day that each case’s test specimen was collected. Although these two graphs represent the same total number of cases (note the different y-axis ranges), they show remarkably different patterns of cases over time, partly due to the lack of weekend reporting.

Graphic showing two bar charts. The date of report chart has zero data on weekends and huge spikes on Mondays. The highest spike on the chart by date of report is over 8000 new cases, where the highest value on the chart by specimen collection date is around 3200 cases.

Charts showing cases in Connecticut by date of report (above) and by date of specimen collection (below). The date of report chart makes it appear as if there were no new cases over the weekends, followed by a spike on Mondays, but this is a reporting artifact.

We refer to these different assignments of events to days in a time series as dating schemes. When we build a time series using the most recent value reported each day by the state on their dashboard, we say that data is dated by report date, or that it uses a report date dating scheme. On the other hand, a time series that a state built which assigned cases to days based on the positive samples’ collection date would be described as using a specimen collection date dating scheme.
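To make the distinction concrete, here is a minimal sketch that counts the same set of cases under both dating schemes. The records, field names, and helper function are hypothetical and chosen only for illustration. They assume a state that does not report on weekends, so specimens collected Friday through Sunday all appear in Monday's report.

```python
from collections import Counter
from datetime import date

# Hypothetical case records. Each case has the date its specimen was
# collected and the date it first appeared on the state's dashboard.
cases = [
    {"specimen": date(2021, 1, 8), "reported": date(2021, 1, 11)},   # Friday
    {"specimen": date(2021, 1, 8), "reported": date(2021, 1, 11)},
    {"specimen": date(2021, 1, 9), "reported": date(2021, 1, 11)},   # Saturday
    {"specimen": date(2021, 1, 10), "reported": date(2021, 1, 11)},  # Sunday
    {"specimen": date(2021, 1, 10), "reported": date(2021, 1, 11)},
    {"specimen": date(2021, 1, 11), "reported": date(2021, 1, 12)},  # Monday
]


def daily_counts(records, date_field):
    """Build a time series by assigning each record to one of its dates."""
    return Counter(record[date_field] for record in records)


by_report_date = daily_counts(cases, "reported")
by_specimen_date = daily_counts(cases, "specimen")
```

Both series add up to the same six cases, but the report date series puts five of them on Monday, while the specimen collection date series spreads them across four days.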

Dating schemes like specimen collection date are useful and important for historical analysis of the pandemic, but unfortunately not all states use the same dating schemes for the same metrics on their dashboards, making it challenging to compare these time series between states. This is part of the reason that we primarily use the report date dating scheme: it’s available from all states.

Beware the taper

Of course, these dating schemes that states use on their own time series still suffer from the underlying delays in gathering the data. For example, if you’re building a time series of positive tests by date of specimen collection, you at least have to wait for a test result to come back before you can associate it with the day that its specimen was collected. So, if the state publishes a time series leading up to the current day, it will usually have incomplete data for the last few days or weeks.

In the example below, Washington’s case counts appear to be falling at the end of the chart, which would be great news if it were true—but in this case they are probably falling because of incomplete data for those days and not a change in the pandemic. Washington has made this clear by “graying out” the incomplete days, but not all states did that for all metrics, leading to confusion when interpreting time series.

Screenshot of a chart from the Washington State COVID-19 dashboard showing data from January through May 2021. The last several days of the chart show a steep decline in cases, but are also colored gray to indicate incomplete data, while the previous days are colored blue.

Time series of the state of Washington’s case counts, organized by date of specimen collection. The taper at the end is a result of incomplete data for recent dates and does not indicate a change in the direction of the pandemic.
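One way to guard against misreading the taper is to flag recent days as provisional, much like the "graying out" on Washington's chart. The sketch below is a hypothetical illustration: the 7-day lag is an assumed threshold (the real delay varies by state and metric), and the function name and sample counts are invented.

```python
from datetime import date, timedelta


def flag_incomplete(series, as_of, lag_days=7):
    """Pair each day's count with a flag saying whether it is likely incomplete.

    Assumption: any day within `lag_days` of the `as_of` date may still
    receive late results, so it is marked as provisional.
    """
    cutoff = as_of - timedelta(days=lag_days)
    return {day: (count, day > cutoff) for day, count in series.items()}


# Hypothetical cases by specimen collection date, as published on May 20.
cases_by_specimen_date = {
    date(2021, 5, 10): 1200,
    date(2021, 5, 13): 1150,
    date(2021, 5, 17): 600,
    date(2021, 5, 19): 150,
}
flagged = flag_incomplete(cases_by_specimen_date, as_of=date(2021, 5, 20))
```

A chart built from `flagged` could draw the provisional days in a different color, so the apparent drop on May 17 and May 19 is not mistaken for a real decline.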

Time series like this are continuously updated by the states as the latest data comes in. We call this updating of past data a “backfill.” Backfills often occur as a result of reporting delays, but they can also happen when a state finds errors in their data and takes steps to correct those errors. The result is that the completeness of the data on any given day increases over time. The chart below illustrates this, with the height of each bar representing the number of tests performed on each day.

A daily bar chart, where each bar is composed of segments stacked on top of each other, with each segment having a different shade of blue. There are 5 shades, representing 5 dates on which data was collected by CTP: February 9th, 16th, and 23rd, March 2nd, and April 9th.

Days in early February are the most interesting, as they have segments representing the full range of color shading, indicating that they were updated for a month after the initial reporting. On these days, the segments corresponding to the February 9th data collection comprise about half of each bar, with the Feb 23rd data comprising most of the remainder. However, even for these early days, some data was added as late as April 5th.

Time series of tests in Washington state, for the month of February 2021, with data captured from the beginning of February to end of April 2021. The colors represent different updates that the state made to the data, with the darker colors indicating older updates and lighter colors indicating recent ones.
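The stacked segments in a chart like this can be derived by comparing successive snapshots of the same time series. The following is a minimal sketch under stated assumptions: the snapshot labels, day labels, and test counts are invented, and the function is hypothetical rather than the code used to produce the chart.

```python
def backfill_segments(snapshots):
    """Split each day's total into the amounts added by each snapshot.

    `snapshots` is a list of (capture_label, {day: total_for_that_day}),
    ordered from oldest capture to newest. Returns {day: [(label, added)]}.
    A downward correction by the state would show up as a negative amount.
    """
    segments = {}
    latest_total = {}
    for label, series in snapshots:
        for day, total in series.items():
            added = total - latest_total.get(day, 0)
            if added:
                segments.setdefault(day, []).append((label, added))
            latest_total[day] = total
    return segments


# Hypothetical snapshots of one state's tests-by-day time series.
snapshots = [
    ("captured Feb 9", {"Feb 1": 10000, "Feb 2": 9000}),
    ("captured Feb 23", {"Feb 1": 19000, "Feb 2": 18500}),
    ("captured Apr 9", {"Feb 1": 20000, "Feb 2": 18500}),
]
segments = backfill_segments(snapshots)
```

Here the first capture accounts for about half of the final total for February 1, the second capture adds most of the rest, and a small amount arrives in the last capture.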

How we addressed dating scheme challenges in our own data

As we’ve discussed above, The COVID Tracking Project primarily collected data and maintained time series based on a dating scheme of “date of report,” whereas states maintained time series using various other dating schemes, and they revised their data continuously through backfills.

This resulted in the shape of our time series graphs diverging from the shape of the states’ regularly backfilled graphs, even though our total numbers came from the states every day. At first, we were tempted to make our time series smoother by replacing them with state time series. But we were concerned about presenting data with a misleading “taper” appearing on recent days, leading people to conclude that things were getting better even when they were not. We ultimately concluded that charting data by date of report was the best and clearest way to communicate trends in the pandemic.

However, there were some situations where we found that the benefits of backfilling our data using a state’s time series outweighed the downsides.

The most common situation was the introduction of a new metric that had not previously been reported. For example, a state might start reporting probable cases in addition to the previously reported confirmed cases. In these situations, states often provided a time series going back to the beginning of the pandemic. (They had been collecting the data all along, just not reporting it.) We used these time series to backfill our data for the new metric, and from that point forward, we collected the new metric using our normal method of taking the latest numbers from the dashboard each day.

Another common situation occurred when states improved their reporting to separate out subtypes of a specific metric. For example, a state might have initially lumped antigen tests in with PCR tests, and then later began reporting these tests separately, providing separate antigen and PCR-only time series. In this type of situation, we separated the two metrics into their own fields, and backfilled these fields using time series from the state. So, in the antigen/PCR tests example, we backfilled our field for antigen tests using the new antigen time series from the state; and we also backfilled our PCR tests field, overwriting our previously collected data (which had included antigen tests) with the state’s new PCR-only time series.

The final situation that required us to backfill was when metrics depended on one another, such that the backfill of one metric required that we backfill another metric. The most common example of this was the relationship between total tests and positive tests. While we repeatedly cautioned about the unreliability of test positivity, we found our data users, public health departments, and many newsrooms often wanted to calculate it anyway. As such, it was important that both total tests and positive test metrics used the same dating scheme. So if we needed to backfill positive tests, we also backfilled total tests, to keep them in sync.
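A small sketch shows why the two metrics need to share a dating scheme before anyone divides one by the other. All of the numbers, day labels, and the function itself are hypothetical. The example assumes a state with no weekend reporting, so totals dated by report date pile up on Monday.

```python
def positivity(positives, totals):
    """Daily share of tests that were positive.

    Returns None for days with no tests recorded, since the ratio is undefined.
    """
    return {
        day: (positives.get(day, 0) / totals[day] if totals[day] else None)
        for day in totals
    }


# Hypothetical data: 100 positives out of 1,000 tests on each of three days.
positives_by_specimen_date = {"Sat": 100, "Sun": 100, "Mon": 100}
totals_by_specimen_date = {"Sat": 1000, "Sun": 1000, "Mon": 1000}
# The same 3,000 tests dated by report date, with no weekend reporting.
totals_by_report_date = {"Sat": 0, "Sun": 0, "Mon": 3000}

matched = positivity(positives_by_specimen_date, totals_by_specimen_date)
mismatched = positivity(positives_by_specimen_date, totals_by_report_date)
```

With matching schemes the ratio is a steady 10% each day. With mismatched schemes it is undefined on the weekend and drops to roughly 3% on Monday, even though nothing about the underlying tests has changed.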

When we did backfills, we used the state’s entire time series for the metric, from the beginning of their data up to the day of the backfill. This meant that each of our resulting time series only contained one transition from the state’s dating scheme to our report date dating scheme and only one period where the data tapered off due to incompleteness. While any mixing of this type was not ideal, we found that the overall trends were still correct, and that it was a reasonable compromise.

This bar chart spans from July 01 to October 01 2020. It is broken into three sections, for backfill (July 1st through August 20th), taper (August 21st through 26th), and daily collection (August 27th through October 1st). The data in the backfill section appears the most regular, with a clear weekly cadence (showing lower values on weekends) and daily maximums holding steady around 17000 tests. The data in the taper section falls off to zero new tests very quickly. The data in the daily collection section is similar to the data in the backfill section, but much less regular, with two large values going as high as 40000 tests, and four days with zero tests.

Time series showing Washington’s daily tests with a period of backfilled data (dated by specimen collection date), a period of taper near the day of our backfill, and the subsequent days of normal CTP data collection (dated by report date).
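The splice described above, with the state's series up to the day of the backfill and report date collection afterward, might look like the sketch below. It is a simplified illustration, not the project's actual code: it assumes both series are stored as daily values keyed by date, and the function name and sample figures are hypothetical.

```python
from datetime import date


def apply_backfill(ctp_series, state_series, backfill_day):
    """Combine a state's time series with report date data.

    Days up to and including `backfill_day` come from the state's series,
    replacing anything previously collected for those days. Later days keep
    the values collected by report date, so there is a single transition.
    """
    merged = {day: v for day, v in state_series.items() if day <= backfill_day}
    merged.update({day: v for day, v in ctp_series.items() if day > backfill_day})
    return dict(sorted(merged.items()))


# Hypothetical daily test counts around a backfill performed on August 26.
ctp_by_report_date = {
    date(2020, 8, 25): 30000,
    date(2020, 8, 26): 0,
    date(2020, 8, 27): 15000,
    date(2020, 8, 28): 16000,
}
state_by_specimen_date = {
    date(2020, 8, 25): 17000,
    date(2020, 8, 26): 4000,  # still incomplete on the day of the backfill
    date(2020, 8, 27): 500,   # ignored: falls after the backfill day
}
combined = apply_backfill(
    ctp_by_report_date, state_by_specimen_date, backfill_day=date(2020, 8, 26)
)
```

The low value on August 26 is the taper: it comes from the state's series, which was still incomplete for that day when the backfill was done.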

Finally, we were careful to note these backfills in our public notes, and often with notes directly on charts, highlighting the point of transition from one dating scheme to another.

What we learned

This mixing of dating schemes was something that we avoided when possible, given the potential for confusion from the taper and the fact that it makes the data set less consistent. But we believe that it was a net positive, allowing us to provide an up-to-date, unified view of the pandemic across 56 states and territories.

As an extra layer of research, in parallel with maintaining our report date time series (with some mixing of states’ dating schemes), we also undertook an effort to automatically collect full time series from as many states as possible on a daily basis, to allow for broader historical analysis based on other dating schemes. We were able to collect useful time series data from 47 states, and have made that data available on GitHub.

In an ideal world, all states and territories would consistently release full time series with consistent dating schemes in this way, allowing for all analysis to be done using appropriate dating schemes for each metric. But in the meantime, we believe that date of report remains the best dating scheme to use for comparison of all states and territories.

Additional research from Michal Mart.

Theo Michel is a Project Manager and contributor to Data Quality efforts at The COVID Tracking Project.

@theotayo

Rebma is an infrastructure engineer who likes puzzles and solving problems, and who did the data scripting and automation for The COVID Tracking Project.
