Covid19 exploration
An EDA of Covid-19 data using the UK government's Python API and Altair for graphics
In this post we will explore the data available through the UK government's Covid API (the Python version), which can be found on their website. Let's import the Python module first and check out the main class.
from uk_covid19 import Cov19API
import numpy as np  # used later when smoothing the Welsh case counts
import pandas as pd
import altair as alt
The output of the following cell shows the help documentation for Cov19API. It tells us that it is a class with the parameters filters (a list of strings), structure (a dictionary with a str key and a dict or str value) and latest_by (a str or None). The class also has a method called get_dataframe(), which returns the data as a Pandas DataFrame.
help(Cov19API)
So now we need to define two things: the filters and the structure.
Data Filters
The filter tells the API what kind of area we would like data about. Valid values for the filters are:
List of valid filters
areaType - Area type as string
areaName - Area name as string
areaCode - Area code as string
date - Date as string [YYYY-MM-DD]
We must specify the areaType, so we will set it to nation. This gives us data at the country level, that is, separate totals for Wales, Scotland, Northern Ireland and England.
filter_all_nations = [
    "areaType=nation"
]
filter_all_uk = [
    "areaType=overview"
]
Other options for areaType are:
- overview: Overview data for the UK
- region: Region data (regions for England only)
- nhsregion: NHS region data (England only)
- utla: Upper-tier local authority data (again, England only)
- ltla: Lower-tier local authority data (England only)
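Since each filter is just a "key=value" string, the list can be assembled programmatically. The following is a minimal sketch: build_filters is a hypothetical helper written for this post (it is not part of uk_covid19), and it only checks names against the lists given above.

```python
# Hypothetical helper (not part of uk_covid19): builds the list of
# "key=value" strings that the filters parameter expects.
VALID_FILTER_KEYS = {"areaType", "areaName", "areaCode", "date"}
VALID_AREA_TYPES = {"overview", "nation", "region", "nhsregion", "utla", "ltla"}


def build_filters(**kwargs):
    """Return a filters list such as ["areaType=nation", "areaName=Wales"]."""
    if "areaType" not in kwargs:
        raise ValueError("areaType must be specified")
    if kwargs["areaType"] not in VALID_AREA_TYPES:
        raise ValueError(f"unknown areaType: {kwargs['areaType']}")
    unknown = set(kwargs) - VALID_FILTER_KEYS
    if unknown:
        raise ValueError(f"unknown filter keys: {sorted(unknown)}")
    # Keyword arguments keep their order, so areaType comes first if passed first
    return [f"{key}={value}" for key, value in kwargs.items()]


wales_filter = build_filters(areaType="nation", areaName="Wales")
print(wales_filter)
```

The resulting list could then be passed as the filters argument in the same way as filter_all_nations.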
Data Structure
The structure parameter describes which metrics we want the data to contain. There are a lot of them, but the main ones are areaName, date and newCasesByPublishDate. The full list of valid metrics is given below.
See a list of valid metrics for structure
areaType - Area type as string
areaName - Area name as string
areaCode - Area code as string
date - Date as string [YYYY-MM-DD]
hash - Unique ID as string
newCasesByPublishDate - New cases by publish date
cumCasesByPublishDate - Cumulative cases by publish date
cumCasesByPublishDateRate - Rate of cumulative cases by publish date per 100k resident population
newCasesBySpecimenDate - New cases by specimen date
cumCasesBySpecimenDateRate - Rate of cumulative cases by specimen date per 100k resident population
cumCasesBySpecimenDate - Cumulative cases by specimen date
maleCases - Male cases (by age)
femaleCases - Female cases (by age)
newPillarOneTestsByPublishDate - New pillar one tests by publish date
cumPillarOneTestsByPublishDate - Cumulative pillar one tests by publish date
newPillarTwoTestsByPublishDate - New pillar two tests by publish date
cumPillarTwoTestsByPublishDate - Cumulative pillar two tests by publish date
newPillarThreeTestsByPublishDate - New pillar three tests by publish date
cumPillarThreeTestsByPublishDate - Cumulative pillar three tests by publish date
newPillarFourTestsByPublishDate - New pillar four tests by publish date
cumPillarFourTestsByPublishDate - Cumulative pillar four tests by publish date
newAdmissions - New admissions
cumAdmissions - Cumulative number of admissions
cumAdmissionsByAge - Cumulative admissions by age
cumTestsByPublishDate - Cumulative tests by publish date
newTestsByPublishDate - New tests by publish date
covidOccupiedMVBeds - COVID-19 occupied beds with mechanical ventilators
hospitalCases - Hospital cases
plannedCapacityByPublishDate - Planned capacity by publish date
newDeaths28DaysByPublishDate - Deaths within 28 days of positive test
cumDeaths28DaysByPublishDate - Cumulative deaths within 28 days of positive test
cumDeaths28DaysByPublishDateRate - Rate of cumulative deaths within 28 days of positive test per 100k resident population
newDeaths28DaysByDeathDate - Deaths within 28 days of positive test by death date
cumDeaths28DaysByDeathDate - Cumulative deaths within 28 days of positive test by death date
cumDeaths28DaysByDeathDateRate - Rate of cumulative deaths within 28 days of positive test by death date per 100k resident population
We will look at new cases by publish date and new deaths by death date, together with cumulative cases by specimen date and the corresponding rate per 100k, so the structure looks like this:
structure_cases_death = {
    "date": "date",
    "areaName": "areaName",
    "newCases": "newCasesByPublishDate",
    "cumCases": "cumCasesBySpecimenDate",
    "cumCasesRate": "cumCasesBySpecimenDateRate",
    "newDeaths": "newDeathsByDeathDate"
}
Now we create an instance of the class and get the DataFrame from it. We also use fillna(0) to replace any NaN entries, because NaN is the default when a value is missing.
uk_cases = Cov19API(filters=filter_all_nations,
structure=structure_cases_death).get_dataframe().fillna(0)
uk_cases['date'] = pd.to_datetime(uk_cases['date'], format='%Y-%m-%d')
uk_cases.sort_values(['areaName', 'date'], inplace=True)
uk_cases.reset_index(drop=True, inplace=True)
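To see what these preprocessing steps do without calling the API, here is a small sketch on made-up rows. The values are invented for illustration and are not real case counts.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the API response (not real data)
raw = pd.DataFrame({
    "date": ["2020-12-02", "2020-12-01", "2020-12-02", "2020-12-01"],
    "areaName": ["Wales", "Wales", "England", "England"],
    "newCases": [np.nan, 10.0, 30.0, 20.0],
})

tidy = raw.fillna(0)  # missing values become 0
tidy["date"] = pd.to_datetime(tidy["date"], format="%Y-%m-%d")
# Sort by nation, then chronologically, and renumber the rows from 0
tidy = tidy.sort_values(["areaName", "date"]).reset_index(drop=True)
print(tidy)
```

After this, the rows for each nation are contiguous and in date order, which the later per-nation calculations rely on.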
Note that the Welsh Government announced that 11,000 cases were missing from the period between the 9th and 15th of December. This explains the large spike after the 17th of December, and also the decrease in cases before that. See this BBC article and the relevant announcement by Public Health Wales about how they are changing the way they report cases.
In the data from the COVID19 API, all 11,000 cases are allocated to the 17th of December. To smooth this out, we replace the Welsh figures for the 13th to the 17th of December with evenly spaced values. This may not be the most accurate way of doing it, but it will result in the cleanest picture when it comes to plotting the graphs.
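date_list = ['2020-12-13', '2020-12-14',
             '2020-12-15', '2020-12-16', '2020-12-17']
# Five evenly spaced values strictly between 2494 and 2801, largest first
step = int((2801 - 2494) / 6)
smoothed = np.flip(np.arange(2494 + step, 2801 - step, step))
# Index labels of the Welsh rows on the affected dates
wales_rows = uk_cases.query("areaName=='Wales'").query("date==@date_list").index
uk_cases.loc[wales_rows, 'newCases'] = smoothed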
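The code above writes evenly spaced values into the affected rows. A different option, sketched here on made-up numbers, is to replace each day in the window with the window's mean, which keeps the total number of cases unchanged. The function spread_evenly is a hypothetical helper, not the method used in this post.

```python
import numpy as np


def spread_evenly(values, start, stop):
    """Replace values[start:stop] with their mean, so the window total is unchanged."""
    smoothed = np.asarray(values, dtype=float).copy()
    smoothed[start:stop] = smoothed[start:stop].mean()
    return smoothed


# Made-up daily counts with a reporting backlog dumped on the last day
daily = [100, 120, 90, 110, 1080]
smoothed_daily = spread_evenly(daily, 0, 5)
print(smoothed_daily)
```

This trades the shape of the curve inside the window for a correct total, so which approach is preferable depends on what the plot is meant to show.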
Finally we add a column to the dataframe called casesChange, which tracks whether the number of new cases has gone up or down compared with the previous day.
grouped_df = uk_cases.groupby('areaName')
uk_cases['casesChange'] = grouped_df.apply(
lambda x: x['newCases'] - x['newCases'].shift(1).fillna(0)).reset_index(drop=True)
uk_cases.sample(5, random_state=40) # a random sample of rows
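The same per-nation difference can also be written with groupby and diff. This is a sketch on made-up numbers and assumes the rows are already sorted by areaName and date; filling the first day of each group with its own count mirrors what shift(1).fillna(0) does above.

```python
import pandas as pd

# Made-up numbers, already sorted by areaName then date
df = pd.DataFrame({
    "areaName": ["England", "England", "England", "Wales", "Wales", "Wales"],
    "newCases": [20, 30, 25, 10, 5, 15],
})

# diff() restarts within each group; the first row of a group has no
# previous day, so fall back to that day's own count
change = df.groupby("areaName")["newCases"].diff()
df["casesChange"] = change.fillna(df["newCases"])
print(df)
```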
Notice that only deaths in England have been counted in the newDeaths column. I prefer to look at the number of cases per 100k population, but to do this with the newCases column, we would need to grab population data for each country. Alternatively, we can estimate the population from the cumulative cases per 100k column, since the cumulative rate per 100k is given by

$$\text{cumCasesRate} = 100000 \times \frac{\text{cumCases}}{\text{population}},$$

which rearranges to population = 100000 × cumCases / cumCasesRate.
We will take numbers from the latest available day (just to make sure there are no zeros). For Wales:
wales_pop = round(100000 * uk_cases.query("areaName == 'Wales'").cumCases.max() /
uk_cases.query("areaName == 'Wales'").cumCasesRate.max())
print(f'Wales population: {wales_pop}')
This is about right (the population was 3,152,879 in 2019). And for the rest of the countries:
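As a quick check of the arithmetic, the population estimate can be wrapped in a small function. The name estimate_population and the inputs below are made up for illustration.

```python
def estimate_population(cum_cases, cum_cases_rate):
    """Invert rate = 100000 * cases / population to recover the population."""
    if cum_cases_rate <= 0:
        raise ValueError("rate must be positive")
    return round(100000 * cum_cases / cum_cases_rate)


# Made-up inputs: 50,000 cumulative cases at a rate of 2,500 per 100k
example_population = estimate_population(50_000, 2_500.0)
print(example_population)
```

The guard against a non-positive rate is the reason for taking the latest available day: early rows can have a rate of zero, which would make the division fail.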
countries = ['Wales', 'Scotland', 'Northern Ireland', 'England']
countries_population = dict()
for country in countries:
    country_rows = uk_cases.query("areaName == @country")
    countries_population[country] = round(
        100000 * country_rows.cumCases.max() / country_rows.cumCasesRate.max())
# Only join the population column once, so the cell can be re-run safely
if 'population' not in uk_cases.columns:
    countries_pop_df = pd.DataFrame.from_dict(
        countries_population, orient='index', columns=['population'])
    uk_cases = uk_cases.join(countries_pop_df, on='areaName')
uk_cases['newCasesRate'] = 100000 * uk_cases.newCases / uk_cases.population
uk_cases['casesChangeRate'] = 100000 * uk_cases.casesChange / uk_cases.population
We will take the 7-day rolling sum of the new cases rate (i.e. new cases per 100k population) within each country, and fill the missing values with zeros.
uk_cases['weeklyCasesRate'] = uk_cases.groupby(by='areaName')['newCasesRate'].rolling(7).sum().reset_index(drop=True).fillna(0)
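The line above relies on uk_cases being sorted by areaName, so that the result of reset_index lines up with the original rows. A variant that does not depend on the row order is sketched below on made-up data, using transform, which returns values aligned to the original index.

```python
import pandas as pd

# Made-up rates; Wales deliberately comes before England to show that
# the result does not depend on the groups being in sorted order
df = pd.DataFrame({
    "areaName": ["Wales"] * 8 + ["England"] * 8,
    "newCasesRate": list(range(1, 9)) + [10] * 8,
})

# transform() keeps the original index, so the rolling sums land on the
# rows they were computed from; the first 6 days of each group are NaN
df["weeklyCasesRate"] = (
    df.groupby("areaName")["newCasesRate"]
      .transform(lambda s: s.rolling(7).sum())
      .fillna(0)
)
print(df)
```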
Overview of UK cases
For the plotting we will also take the total cases for the UK. We could do this by grouping by date in the uk_cases dataframe and summing the new cases; however, we will just run another query with the Cov19API and apply the same preprocessing as above, this time cleaned up into a function.
overview_cases = Cov19API(filters=filter_all_uk, structure=structure_cases_death).get_dataframe().fillna(0)


def preprocess_dataframe(df):
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df.sort_values('date', inplace=True)
    df.reset_index(drop=True, inplace=True)
    # shift(1) is the previous day once sorted by date, matching the per-nation calculation
    df['casesChange'] = df['newCases'] - df['newCases'].shift(1).fillna(0)
    population = round(100000 * df.cumCases.max() /
                       df.cumCasesRate.max())
    df['newCasesRate'] = 100000 * df.newCases / population
    df['casesChangeRate'] = 100000 * df.casesChange / population
    df['weeklyCasesRate'] = df['newCasesRate'].rolling(7).sum().fillna(0)
    return df


preprocess_dataframe(overview_cases)
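For comparison, the alternative mentioned earlier (grouping the per-nation data by date and summing) might look like the sketch below. It uses made-up numbers rather than the real uk_cases dataframe.

```python
import pandas as pd

# Made-up per-nation rows (not real data)
nations = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02"]),
    "areaName": ["England", "Wales", "England", "Wales"],
    "newCases": [20, 10, 30, 5],
})

# One row per date, with the new cases summed across nations
uk_total = nations.groupby("date", as_index=False)["newCases"].sum()
print(uk_total)
```

Rates per 100k cannot be summed in the same way, since each nation has a different population, which is one reason a separate overview query is convenient.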
We will use the Python library Altair for visualising the data; see the Altair docs for more information.
First we have a graph which shows the daily change in the number of new cases for each country. This number jumps up and down a lot, which is likely due to delays in reporting new cases over the weekend. Another interesting thing is that the daily cases in Wales appear to have had a much shorter period of calm over the summer (calm in the sense of daily cases not jumping up and down).
The orange bars are days when the number of new cases (per 100k population) was higher than on the previous day, while the blue bars are days when the number of new cases dropped. The red line is the 7-day moving average.
When the moving average line is below 0, it means that there is a consistent drop in new cases. We can see this clearly happening around the times that lockdowns were introduced (though to varying degrees). I will update the graphs soon with a marker of when each lockdown started.
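To make the two encodings concrete, here is a pandas sketch of the same quantities on made-up values: the colour assigned from the sign of the daily change, and a trailing 7-day mean. It only illustrates the logic and is not how Altair computes them internally.

```python
import pandas as pd

# Made-up daily changes in the case rate
change = pd.Series([3.0, -1.0, 2.0, -4.0, 1.0, 0.5, -2.0, 5.0])

# Orange when cases rose compared with the previous day, blue otherwise
colour = change.gt(0).map({True: "orange", False: "blue"})

# Mean over the current day and up to 6 preceding days
trailing_mean = change.rolling(7, min_periods=1).mean()

print(pd.DataFrame({"change": change, "colour": colour, "mean": trailing_mean}))
```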
bars = alt.Chart(uk_cases).mark_bar().encode(
x="yearmonthdate(date):T",
y="casesChangeRate:Q",
tooltip='casesChange',
color=alt.condition(
alt.datum.casesChangeRate > 0,
alt.value("orange"), # The positive color
alt.value("blue") # The negative color
)
).properties(title='Daily change in number of new cases with 7 day rolling mean',width=800).interactive()
line = alt.Chart(uk_cases).mark_line(
color='red',
size=2,
opacity=0.6
).transform_window(
rolling_mean='mean(casesChangeRate)',
frame=[-6, 0],  # current day plus the 6 preceding days
groupby=['areaName']
).encode(
x='yearmonthdate(date):T',
y='rolling_mean:Q'
)
alt.layer(bars, line, data=uk_cases).facet(alt.Column(
'areaName'), columns=1).resolve_scale(y='independent')
Next is a bar chart of the number of new cases in each country (per 100k population), with the 7-day moving average of cases. Again we see that Wales saw a longer period of rising and falling cases compared to the other countries.
After a period of falling cases, each nation is now seeing a rise in the number of cases, especially Wales.
bars = alt.Chart(uk_cases).mark_bar().encode(
x="yearmonthdate(date):T",
y="newCasesRate:Q",
tooltip='newCasesRate',
color=alt.condition(
alt.datum.casesChange > 0,
alt.value("orange"), # The positive color
alt.value("blue") # The negative color
)
).properties(title='New cases per 100k population with rolling 7 day average', width=800).interactive()
line = alt.Chart(uk_cases).mark_line(
color='red',
size=2,
).transform_window(
rolling_mean='mean(newCasesRate)',
frame=[-6, 0],  # current day plus the 6 preceding days
groupby=['areaName']
).encode(
x='yearmonthdate(date):T',
y='rolling_mean:Q'
)
alt.layer(line, bars, data=uk_cases).facet(alt.Row('areaName'), columns=1)
countries = sorted(uk_cases['areaName'].unique())
selection = alt.selection_single(
name='Select',
fields=['areaName'],
init={'areaName': 'Wales'},
bind={'areaName': alt.binding_select(options=countries)}
)
# bar chart, modify opacity based on selection
bars = alt.Chart(uk_cases).mark_bar().add_selection(
selection
).encode(
x=alt.X("yearmonthdate(date):T", axis=alt.Axis(title='Date')),
y=alt.Y("weeklyCasesRate:Q", axis=alt.Axis(title='Incidence rate')),
tooltip='weeklyCasesRate:Q',
opacity=alt.condition(selection, alt.value(1), alt.value(0))
).properties(title='7 day incidence rate of individual countries vs rolling mean across the UK', width=800)
line = alt.Chart(overview_cases).mark_line(
color='red',
size=2,
).transform_window(
rolling_mean='mean(weeklyCasesRate)',
frame=[-6, 0]  # current day plus the 6 preceding days
).encode(
x='yearmonthdate(date):T',
y='rolling_mean:Q'
)
alt.layer(bars, line)