Like most people, I have never experienced anything like the COVID19 pandemic. I wanted to be able to draw my own conclusions about the effectiveness of the mitigations we (in the United States) are taking, so I decided to take the available data and run some basic linear regressions on it.
With businesses pushing to reopen, I am wondering whether relaxing the stay-at-home orders around the country will cause a second wave of infections. With neither a vaccine nor existing herd immunity, it would seem that cases would return to pre-stay-at-home levels if people went back to socializing as they did before. With many people still following some level of social distancing, I would expect the number of cases to increase, but not necessarily to the levels we saw earlier.
So why not try to answer this by monitoring the daily reported data on COVID19 cases around the country? We should be able to see whether the cases are increasing, decreasing, or staying about the same.
Let me set your expectations for how rigorous this analysis will be: "Statistics for Dummies" meets duct tape is about the right vibe.
LET'S GET SOME DATA
When data sets started becoming available, I ran across two that I have been playing with in spreadsheets:
- "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" https://github.com/CSSEGISandData/COVID-19
- "Coronavirus (Covid-19) Data in the United States from The New York Times" https://github.com/nytimes/covid-19-data
Both are on GitHub and are updated daily. I decided to use the two data sets as a sanity check on each other. This does bring up the "a man with two clocks never knows what time it is" issue, but it should let me know if there is a problem with data collection. So far, the data sets more or less match up, which is encouraging.
The thinking goes something like this:
- Choose a baseline two-week period of cases to use before stay-at-home orders were relaxed
- Perform linear regressions on the cases and deaths for each state
- Plot the line along with reported data points from that two-week baseline period forward
- Inspect each chart to see if any trends stand out visually
- After two to three weeks following the baseline period, evaluate each state and compare to their mitigation efforts
Is this the best approach? It certainly is not rigorous, but it should be possible to draw some conclusions (incorrect or otherwise).
The baseline two-week period chosen is 2020-04-17 through 2020-04-30. I wanted a time period that showed "as good as it gets" under stay-at-home guidance. Since the federal stay-at-home recommendation was lifted at the end of April, this seemed like a good choice. It makes sense, however, that the rates evaluated for that period actually reflect infections that happened in the two or three weeks prior to April 17th. But stay-at-home orders and other precautions had already been in place earlier in April, so the period seems representative enough for my purposes.
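As a minimal sketch of how that window could be expressed in code, the snippet below builds the list of baseline dates and a helper that decides whether a given day is used in the regression. The names `BASELINE_START`, `BASELINE_END`, and `in_baseline` are my own placeholders for illustration, not taken from the actual scripts.

```python
from datetime import date, timedelta

# Baseline window from the text: 2020-04-17 through 2020-04-30, inclusive.
BASELINE_START = date(2020, 4, 17)
BASELINE_END = date(2020, 4, 30)


def in_baseline(day: date) -> bool:
    """Return True if the day falls inside the two-week baseline window."""
    return BASELINE_START <= day <= BASELINE_END


# Every calendar day in the window, in order.
baseline_days = [
    BASELINE_START + timedelta(days=offset)
    for offset in range((BASELINE_END - BASELINE_START).days + 1)
]

if __name__ == "__main__":
    print(len(baseline_days), "days,", baseline_days[0], "to", baseline_days[-1])
```

Both endpoints are included, which gives the fourteen days that end up being fed to the regression.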
The data used is the cumulative total of reported cases (and deaths) up to, and including, that day for each state. After being transformed, the input data for each state looks like this:
date,state,cases,deaths
2020-04-28,Louisiana,27286,1758
2020-04-29,Louisiana,27660,1802
2020-04-30,Louisiana,28001,1862
2020-05-01,Louisiana,28711,1927
2020-05-02,Louisiana,29140,1950
2020-05-03,Louisiana,29340,1969
2020-05-04,Louisiana,29673,1991
2020-05-05,Louisiana,29996,2042
2020-05-06,Louisiana,30399,2094
2020-05-07,Louisiana,30652,2135
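Reading rows in this shape and grouping them by state takes only the standard library. The following is a rough sketch, not the actual processing code: the function name `load_by_state` is hypothetical, and the inline sample is just the first three Louisiana rows from above rather than a real download.

```python
import csv
import io
from collections import defaultdict

# First three rows of the transformed input shown above.
SAMPLE = """date,state,cases,deaths
2020-04-28,Louisiana,27286,1758
2020-04-29,Louisiana,27660,1802
2020-04-30,Louisiana,28001,1862
"""


def load_by_state(text: str) -> dict:
    """Group (date, cases, deaths) tuples by state, sorted by date."""
    series = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        series[row["state"]].append(
            (row["date"], int(row["cases"]), int(row["deaths"]))
        )
    for rows in series.values():
        rows.sort()  # ISO dates sort correctly as plain strings
    return dict(series)


by_state = load_by_state(SAMPLE)

if __name__ == "__main__":
    for day, cases, deaths in by_state["Louisiana"]:
        print(day, cases, deaths)
```

Keeping the counts as cumulative totals, as the data sets report them, means no differencing is needed before fitting a line.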
The data for the desired dates is processed for each state, as well as for the United States, once per data set (JHU and NYTimes). Each data set produces a PNG chart and a report, so in this case we get two PNGs and two reports. Here are examples of the charts for Louisiana:
The reports give some information about the linear regression, including the correlation coefficient. They also provide some CSV data to be used in the future, if needed. The dates extend two weeks into the future, with "predictions" based on the linear regression. An example of one of the reports looks like this:
Louisiana Regression Info:
Cases
m = 373.96483516483516
b = -8846.032967032963
r = 0.9956088290733547
Deaths
m = 55.30989010989012
b = -3547.0219780219795
r = 0.9882421218338948
--------------------------------
date,predicted_cases,predicted_deaths,reported_cases,reported_deaths,diff_cases,diff_deaths,used_in_regression
2020-04-17,23314,1209,23118,1213,-196,4,True
2020-04-18,23688,1264,23580,1267,-108,3,True
2020-04-19,24062,1320,23928,1296,-134,-24,True
2020-04-20,24436,1375,24523,1328,87,-47,True
2020-04-21,24810,1430,24854,1405,44,-25,True
2020-04-22,25184,1486,25258,1473,74,-13,True
2020-04-23,25558,1541,25739,1599,181,58,True
2020-04-24,25932,1596,26140,1660,208,64,True
2020-04-25,26306,1652,26512,1707,206,55,True
2020-04-26,26680,1707,26773,1729,93,22,True
2020-04-27,27054,1762,27068,1740,14,-22,True
2020-04-28,27428,1818,27286,1801,-142,-17,True
2020-04-29,27802,1873,27660,1845,-142,-28,True
2020-04-30,28176,1928,28001,1905,-175,-23,True
2020-05-01,28550,1983,28711,1970,161,-13,False
2020-05-02,28924,2039,29140,1993,216,-46,False
2020-05-03,29298,2094,29340,2012,42,-82,False
2020-05-04,29672,2149,29673,2064,1,-85,False
2020-05-05,30046,2205,29996,2115,-50,-90,False
2020-05-06,30420,2260,30399,2167,-21,-93,False
2020-05-07,30794,2315,30652,2208,-142,-107,False
2020-05-08,31168,2371,-1,-1,-31169,-2372,False
2020-05-09,31542,2426,-1,-1,-31543,-2427,False
2020-05-10,31916,2481,-1,-1,-31917,-2482,False
2020-05-11,32290,2537,-1,-1,-32291,-2538,False
2020-05-12,32664,2592,-1,-1,-32665,-2593,False
2020-05-13,33038,2647,-1,-1,-33039,-2648,False
2020-05-14,33411,2702,-1,-1,-33412,-2703,False
2020-05-15,33785,2758,-1,-1,-33786,-2759,False
2020-05-16,34159,2813,-1,-1,-34160,-2814,False
2020-05-17,34533,2868,-1,-1,-34534,-2869,False
2020-05-18,34907,2924,-1,-1,-34908,-2925,False
2020-05-19,35281,2979,-1,-1,-35282,-2980,False
2020-05-20,35655,3034,-1,-1,-35656,-3035,False
2020-05-21,36029,3090,-1,-1,-36030,-3091,False
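For anyone who wants to check the numbers, here is a small sketch of an ordinary least-squares fit over the fourteen baseline values of reported_cases from the report above. A few things in it are my inferences from the report rather than confirmed details of the scripts: the x value is the number of days since 2020-01-22 (the `EPOCH` constant), predictions are truncated to whole numbers, and -1 stands in for days with no reported data yet. With those assumptions the sketch reproduces the m and b shown for Louisiana cases. The names `linear_fit` and `report_row` are placeholders.

```python
from datetime import date, timedelta
from math import sqrt

EPOCH = date(2020, 1, 22)          # assumed day zero (inferred, not confirmed)
BASELINE_START = date(2020, 4, 17)

# reported_cases for 2020-04-17 through 2020-04-30, copied from the report.
REPORTED_CASES = [23118, 23580, 23928, 24523, 24854, 25258, 25739,
                  26140, 26512, 26773, 27068, 27286, 27660, 28001]


def linear_fit(xs, ys):
    """Least-squares slope m, intercept b, and correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r = sxy / sqrt(sxx * syy)
    return m, b, r


days = [BASELINE_START + timedelta(days=i) for i in range(len(REPORTED_CASES))]
xs = [(d - EPOCH).days for d in days]
m, b, r = linear_fit(xs, REPORTED_CASES)


def report_row(day, reported=None):
    """Build (date, predicted, reported, diff) for one day of the report."""
    predicted = int(m * (day - EPOCH).days + b)  # truncation is an assumption
    if reported is None:
        reported = -1  # sentinel for days without reported data
    return (day.isoformat(), predicted, reported, reported - predicted)


if __name__ == "__main__":
    print("m =", m, "b =", b, "r =", r)
    print(report_row(date(2020, 4, 17), 23118))
    print(report_row(date(2020, 5, 8)))
```

The choice of day zero only shifts the intercept; the slope and the correlation coefficient come out the same for any origin. The -1 sentinel also explains the large negative diff values on the future dates: they are just -1 minus the prediction, not a real shortfall.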
In the next post, I will look at the results from 2020-05-07.