In this post we will look at the number of cases and number of deaths due to Covid-19 in England, and we will use these numbers to estimate a few things:

  • The approximate number of cases that actually occured during the first wave (Winter and Spring of 2020)
  • The mortality rate (or rather, the number of people dying vs the number of positive cases)
  • The number death rate for the next few weeks based on the number of new cases over the last couple of weeks.

Importing data

We will grab the data on the number of cases and deaths for each English region and also do some cleaning and feature engineering.

from uk_covid19 import Cov19API

import pandas as pd
import altair as alt
import numpy as np 

#collapse
filter_all_regions = [
    "areaType=region"
]
structure_deaths = {
    "date": "date",
    "areaName": "areaName",
    "newCases": "newCasesByPublishDate",
    "newDeaths": "newDeathsByDeathDate"
}

eng_deaths = Cov19API(filters=filter_all_regions, structure=structure_deaths).get_dataframe().fillna(0)

eng_deaths['date'] = pd.to_datetime(eng_deaths['date'], format='%Y-%m-%d')
eng_deaths.sort_values(['areaName', 'date'], inplace=True)
eng_deaths.reset_index(drop=True,inplace=True)

eng_deaths['weeklyDeaths'] = eng_deaths.groupby(by='areaName')['newDeaths'].rolling(7).sum().reset_index(drop=True).fillna(0)
eng_deaths['weeklyCases'] = eng_deaths.groupby(by='areaName')['newCases'].rolling(7).sum().reset_index(drop=True).fillna(0)
eng_deaths['mortalityEstimated'] = 100 *(eng_deaths.groupby(by='areaName')['weeklyDeaths'].shift(-14))/eng_deaths['weeklyCases']

Next we do the same for the whole of England.

filter_england = [
    "areaType=nation",
    "areaName=England"
]
full_eng_deaths = Cov19API(filters=filter_england, structure=structure_deaths).get_dataframe().fillna(0)

full_eng_deaths['date'] = pd.to_datetime(full_eng_deaths['date'], format='%Y-%m-%d')
full_eng_deaths.sort_values(['areaName', 'date'], inplace=True)
full_eng_deaths.reset_index(drop=True,inplace=True)
full_eng_deaths['newDeaths'].iloc[-1] = np.nan
full_eng_deaths['newDeaths'].iloc[-2] = np.nan
full_eng_deaths['newDeaths'].iloc[-3] = np.nan
full_eng_deaths['laggedNewDeaths'] = full_eng_deaths['newDeaths'].shift(-7)
full_eng_deaths['estimateCasesFromDeaths'] = full_eng_deaths['laggedNewDeaths'] * 50
full_eng_deaths['estimateDeathsFromCases'] = full_eng_deaths['newCases'] * 0.02
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/pandas/core/indexing.py:1637: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)

Plotting the data

We first start by looking at the number of deaths in each English region since the beginning of march, then estimate the mortality rate (rate of deaths per positive case) in each region. We then move on to look at the data for the whole of England.

Plotting: English regions

bars = alt.Chart(eng_deaths.query("date >= '2020-03-01'")).mark_bar().encode(
    x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title='Date')),
    y=alt.Y("weeklyDeaths:Q", axis=alt.Axis(title='Weekly number of deaths')),
    tooltip="newDeaths:Q"
).properties(width=700)

bars.facet(alt.Column('areaName', title='Region'), columns=1).properties(title='Weekly number of deaths in each region')

bars = alt.Chart(eng_deaths.query("date >= '2020-07-07'")).mark_bar().encode(
    x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title='Date')),
    y=alt.Y("mortalityEstimated:Q", axis=alt.Axis(title='Implied estimated mortality')),
    tooltip="mortalityEstimated:Q"
).properties(width=800)

bars.facet(alt.Column('areaName', title='Region'), columns=1).properties(title='Number of deaths as a percentage of number of cases')

For each region in England, we plot the ration of new deaths (lagged by 7 days) over the number of new cases as a percentage. We only look at the dates in the second half of 2020 because the lack of testing capacity skewed the numbers too much (the mortality rate is certainly not above 10%...). Note that we still see an overestimate of the mortality (in particular in the North East and the South East) over the summer, likely due to low levels of testing capacity. In the end, the implied mortality rate seems to have settled down to around 2%.

Plotting: All of England

The first chart we look at has both the number of new cases and the number of new deaths in England. The blue line corresponds to the axis on the right hand side, the number of new deaths. The green bars are the number of new cases (axis on the left hand side).

base = alt.Chart(full_eng_deaths.query("date >= '2020-03-01'")).encode(x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title=None))).properties(title='Number of new cases and the number of new deaths in England',width=700)

bars = base.mark_bar(color='#57A44C').encode(
    y=alt.Y("newCases:Q", axis=alt.Axis(title='Number of new cases', titleColor='#57A44C')),
    tooltip="newCases:Q"
)

line = base.mark_line(stroke='#5276A7').encode(y=alt.Y("newDeaths:Q", axis=alt.Axis(title='Number of new deaths', titleColor='#5276A7')))

alt.layer(bars, line).resolve_scale(y='independent')

base = alt.Chart(full_eng_deaths.query("date >= '2020-03-01'")).encode(x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title=None))).properties(width=700, title='Estimating the number of deaths from the number of cases vs actual number of deaths')

bars = base.mark_bar(color='#57A44C').encode(
    y=alt.Y("estimateDeathsFromCases:Q", axis=alt.Axis(title='Estimated number of deaths', titleColor='#57A44C')),
    tooltip="newCases:Q"
)

line = base.mark_line(stroke='#5276A7').encode(y=alt.Y("laggedNewDeaths:Q", axis=alt.Axis(title='Actual number of new deaths', titleColor='#5276A7')))

alt.layer(bars, line).resolve_scale(y='independent')

In the above chart, the blue line (axis on the right hand side) indicates the number of deaths recorded lagged by 7 days. The green bars (with axis on the left hand side) are what the number of deaths would be if the mortality rate was 2% and if the number of positive cases exaclty reflects the true number of cases. Of course this is not the case (especially in the first half of 2020, before mass testing was available). However this rough estimate seems to line up pretty well from September onwards, so we will use this estimate. Note that we have lagged the time series for the number of deaths by 7 days, so that we can visualise the interaction between the number of new cases and number of new deaths. This also allows us to see a prediction for the number of deaths that might occur based on the number of new cases that have been seen in the last week.

base = alt.Chart(full_eng_deaths.query("date >= '2020-03-01'")).encode(x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title=None))).properties(width=700, title='Estimating the number of cases from the number of deaths, based on 2% mortality rate')

bars = base.mark_bar(color='#57A44C').encode(
    y=alt.Y("newCases:Q", axis=alt.Axis(title='Reported number of new cases', titleColor='#57A44C')),
    tooltip="newCases:Q"
)

line = base.mark_line(stroke='#5276A7').encode(y=alt.Y("estimateCasesFromDeaths:Q", axis=alt.Axis(title='Estimated number of cases from the deaths', titleColor='#5276A7')))

alt.layer(bars, line).resolve_scale(y='independent')

In the above chart, we have reversed the prediction to try and visualise how many cases there might have been based on the number of deaths that we saw - again, using a mortality rate of 2%.

We can compare this to other attempts at predicting the number of cases during the first peak, for example the following chart from ourworldindata.org

Predicted number of cases according to various models

Note that these numbers are for the whole of the UK - if we further assume that the cases were distributed evenly accross the UK, then we can just adjust those numbers from ourworldindata by ~80%. With these adjustments in mind, our prediction of ~50,000 cases at the peak of the first wave seems to fit in pretty well.

Spanish Influenza of 1918

Finally I saw worrying graph showing how the influenza pandemic of 1918 developed on the CDC's website. The axis aren't labeled so I'm not sure what the actual numbers were during the 3 waves, but it is estimated the a third of the world's population was infected with the flu virus, which is approximately 600 million (the world population was about 1.8 Billion then) and it is estimated the around 50 million died. This would make the mortality rate to be around 8%.

The three waves of the 1918 pandemic