Background
Homework 4.
Please turn into Github your Python notebook by September 30th, end of day to a new Github repository called Homework4. The page should have a full notebook as a Readme. Hint, export as a readme.md
, unzip, and put the images & the readme.md
into Github.
Practice Exercise To Help With Homework below
First, we will go through a practice exercise is to get us used to using APIs and joins. In this exercise we explore whether there is obvious evidence for a correlation b/t number of hospital beds and COVID mortality. Intuitively, one might think that more hospital beds improve mortality within a region. Here, we want to complete a very quick cursory view of this premise.
For this exercise we will start by using the following API. The API Docs are here: https://covid-19-apis.postman.com/). Here is an example unit test where you can see a JSON result. https://api.covid19api.com/
How do we determine countries? Well we read the API manual and see: https://api.covid19api.com/countries
Homework Instructions
Your homework will look something like this with some deviations.
Documentation: https://documenter.getpostman.com/view/8854915/SzS8rjHv?version=latest
Homework: Dataset COVID
Our goal is to correlate beds per 10,000 populations to mortality for COVID. Lets bring in another data set that tells us about the number of beds per sample. This table has one-line per hospital. We will need to aggregate or sum by county in order to get the total number of beds per county. First the API.
Loading from API
Call API: https://www.communitybenefitinsight.org/api/get_hospitals.php
API Documents: https://www.communitybenefitinsight.org/?page=info.data_api
hospitals = pd.DataFrame(req) hospitals.head() |
CREATE HospitalInfoPerCounty Table
The field we want is hospital_bed_count
summed for each county
. One thing to realize is that county
is not unique in the US. There are more than one Jefferson Counties. So lets create a new column called `state_county`
hospitals[ 'state_county' ] = hospitals[ 'state' ] + '_' + hospitals[ 'county' ] hospitals.head() |
HospitalInfoPerCounty = hospitals.groupby([ 'state_county' ], as_index = False )[ 'hospital_bed_count' ].agg( 'sum' ) HospitalInfoPerCounty.head() |
That’s not right!
First, lets see the types.
Ok – there clearly aren’t 49622 in AK_Kenai Peninsula Borough. Lets inspect further.
hospitals[hospitals['state_county']=="AK_Kenai Peninsula Borough"]['hospital_bed_size'] |
*added 9/28*
We can see the issue
40 <100 beds 42 <100 beds 47 <100 beds Name: hospital_bed_size, dtype: objectBasically
<100 bed
is not a number but a string.
Turning a string into a number in hospitals.hospital_bed_count
**NEED COMMAND HERE TO CONVERT `hospitals.hospital_bed_count`**
CREATE HospitalInfoPerCounty
and summing for hospital_bed_count
We need to create a dataframe that is unique by state_county
, and we need to sum across entries on hospital_bed_count
. This is because the state and county may not be unique (e.g. multiple entries for different days). Its always important to know what tables are unique for a table.
HospitalInfoPerCounty = hospitals.groupby([ 'state_county' ], as_index = False )[ 'hospital_bed_count' ].agg( 'sum' ) HospitalInfoPerCounty.head() |
Substitution and curating
Sometimes data is not fully structured, as is case here.. This is really sub-optimal, but there are times when you want an approximate solution to understand the scope of the problem. In this case, lets just focus on those that end in County
and remove that text, noting we won’t get all matches, but we should get a majority.
HospitalInfoPerCounty[ 'state_county' ] = HospitalInfoPerCounty[ 'state_county' ]. str .replace( ' County' ,'') HospitalInfoPerCounty.head( 5 ) |
CREATE CovidPerCounty
DATAFRAME
We need to work on covid
, and curate it for use in order to compare with HospitalInfoPerCounty
import requests import pandas as pd import json import numpy as np req = requests.get( "https://webhooks.mongodb-stitch.com/api/client/v2.0/app/covid-19-qppza/service/REST-API/incoming_webhook/us_only?min_date=2021-01-15T00:00:00.000Z&max_date=2021-01-15T00:00:00.000Z" ).json() covid = pd.DataFrame(req) |
On your own examine the resulting table, e.g. covid.head(). What do you see? (Not shown below). We see that states is not a two letter code – how do we convert it?
Converting states to a two letter code using a dictionary
We are also going to need to convert states to a two letter code. Lets use a dictionary
us_state_to_abbrev = { "Alabama" : "AL" , "Alaska" : "AK" , "Arizona" : "AZ" , "Arkansas" : "AR" , "California" : "CA" , "Colorado" : "CO" , "Connecticut" : "CT" , "Delaware" : "DE" , "Florida" : "FL" , "Georgia" : "GA" , "Hawaii" : "HI" , "Idaho" : "ID" , "Illinois" : "IL" , "Indiana" : "IN" , "Iowa" : "IA" , "Kansas" : "KS" , "Kentucky" : "KY" , "Louisiana" : "LA" , "Maine" : "ME" , "Maryland" : "MD" , "Massachusetts" : "MA" , "Michigan" : "MI" , "Minnesota" : "MN" , "Mississippi" : "MS" , "Missouri" : "MO" , "Montana" : "MT" , "Nebraska" : "NE" , "Nevada" : "NV" , "New Hampshire" : "NH" , "New Jersey" : "NJ" , "New Mexico" : "NM" , "New York" : "NY" , "North Carolina" : "NC" , "North Dakota" : "ND" , "Ohio" : "OH" , "Oklahoma" : "OK" , "Oregon" : "OR" , "Pennsylvania" : "PA" , "Rhode Island" : "RI" , "South Carolina" : "SC" , "South Dakota" : "SD" , "Tennessee" : "TN" , "Texas" : "TX" , "Utah" : "UT" , "Vermont" : "VT" , "Virginia" : "VA" , "Washington" : "WA" , "West Virginia" : "WV" , "Wisconsin" : "WI" , "Wyoming" : "WY" , "District of Columbia" : "DC" , "American Samoa" : "AS" , "Guam" : "GU" , "Northern Mariana Islands" : "MP" , "Puerto Rico" : "PR" , "United States Minor Outlying Islands" : "UM" , "U.S. Virgin Islands" : "VI" , } covid[ 'state' ].replace(us_state_to_abbrev, inplace = True ) covid.head( 5 ) |
Creating a key to join our tables
The dataframe covid
may not be fully unique per state
or county
. Lets create a new column that we can merge on which concatenates both.
covid[ 'state_county' ] = covid[ 'state' ] + '_' + covid[ 'county' ] covid.head() |
Creating a dataframe with unique CovidPerCount
We will group and sum for deaths
, population
, and confirmed
CovidPerCounty = covid.groupby([ 'state_county' ], as_index = False ).agg({ 'deaths' : 'sum' , 'confirmed' : 'sum' , 'population' : 'sum' , 'confirmed_daily' : 'mean' , 'deaths_daily' : 'mean' }) CovidPerCounty.head( 15 ) |
Create mortality within CovidPerCounty
CovidPerCounty['mortality']=CovidPerCounty['deaths']/CovidPerCounty['confirmed'] CovidPerCounty.head(5) |
JOINING CovidPerCounty
and HospitalInfoPerCounty
Next Create An Inner Join HospitalInfoPerCounty
and CovidPerCounty
by state_county
.
Hospital_Covid_PerCounty = pd.merge(HospitalInfoPerCounty,CovidPerCounty,on = 'state_county' ) Hospital_Covid_PerCounty = Hospital_Covid_PerCounty[Hospital_Covid_PerCounty[ 'population' ]> 50000 ] Hospital_Covid_PerCounty = Hospital_Covid_PerCounty[Hospital_Covid_PerCounty[ 'population' ]< 300000 ] Hospital_Covid_PerCounty = Hospital_Covid_PerCounty[Hospital_Covid_PerCounty[ 'confirmed' ]> 10 ] Hospital_Covid_PerCounty.head() |
FINAL PLOT:
PLEASE FILL IN THE COMMAND TO GET THE PLOT BELOW
sns.relplot(MISSING PLEASE FILL IN TO MAKE PLOT BELOW ) |
PEOPLE PER HOSPITAL BED VS. MORTALITY
No Correlation Is Obvious! Surprise – buying more hospital beds is not obviously improving mortality
Turning in:
Please turn into github as Homework4. Your Github Readme should look like the Notebook for the assignment. The best way to do this is to “save and export notebook” as a “markdown”. The image of the graph must be rendered in your markdown.