Scraping a public health web page to get data

During the 2020 COVID-19 pandemic in Canada I wanted to get confirmed COVID-19 case count data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.

I wanted to be able to use this data in this Leaflet choropleth map visualization. If you are interested, there are more details on that visualization in another blog post.

There are many ways to get data from web pages. At first I did it manually, by copying and pasting into Excel, which is fine for a one-time analysis. You can use Excel Power Query's web query feature to automate this a bit more. However, to fully automate getting data from a web page you need web scraping techniques.

Note that the code described below is available in this GitHub repository: https://github.com/sitrucp/covid_montreal_scrape_data.

Initial data retrieval and transformation using Excel
To get the web page data, at first I simply copied and pasted it manually into an Excel workbook. This was easy to do because the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do more complex data transformations, I switched to Excel Power Query's web query feature to retrieve the data and Power Query transformations to shape it for the choropleth map visualization.

Full automation and scheduling using Python, a cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated, with the data retrieval and transformation process running on a schedule.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by URL.

As the choropleth map visualization is hosted on GitHub Pages (github.io), I could have used Git on the web server to do an automated, scheduled push of new data from the web server to the GitHub repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple URL for each data file.

Everything from scraping the website to uploading the data to AWS S3 was done in Python. The Python code is run on a schedule using a cron job on a web server. The cron job runs a few times each evening, around when the data is usually updated on the website.
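
For reference, the scheduling itself is just a crontab entry per run time. A hypothetical entry (the script path, interpreter and run times here are placeholders, not the actual ones used) might look like this:

# run the scrape and upload script at 8, 9 and 10 pm each evening
0 20,21,22 * * * /usr/bin/python /home/scripts/scrape_health_montreal.py >> /home/scripts/cron.log 2>&1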

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to the AWS S3 bucket and made available to the choropleth map visualization.

The Python module Boto was used to upload the data from the web server to the AWS S3 bucket.

Let's go through the code.

BeautifulSoup4 was used to get the web page and find the specific table that holds the tabular data, as below. The table with the counts by neighbourhood was the 4th table in the web page:

import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = 'https://santemontreal.qc.ca/en/public/coronavirus-covid-19/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# get all tables on webpage
tables = soup.find_all('table')

# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and add more readable, consistent column headers, as below:

import pandas as pd

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]
# rename columns to readable, consistent names
df_table_data_all_cols.columns = ['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']
# keep only the columns needed for the map
df_table_data = df_table_data_all_cols[['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The web page table dataframe was then merged with a "lookup" dataframe. This merge is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood and two columns: one for the Health Montreal website neighbourhood name and one for the Leaflet map's GeoJSON geographical region boundary name. This is required because the Health Montreal website neighbourhood names were not identical to the map's geographical region boundary names.

Of course, I could have modified the map's geographical region boundary names to match the Health Montreal naming convention, but creating a "lookup" table was easier and provided flexibility in case the Health Montreal table's names changed (which they did, in fact, several times!).
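
For illustration, the lookup dataframe is just a small two-column table. A minimal sketch is below; the neighbourhood name values are hypothetical examples, not the actual values from the Health Montreal site:

import pandas as pd

# hypothetical example of the website-name to geojson-name lookup
df_montreal_regions_lookup = pd.DataFrame([
    {'website_name': 'Ahuntsic-Cartierville', 'geojson_name': 'Ahuntsic–Cartierville'},
    {'website_name': 'Cote-des-Neiges-Notre-Dame-de-Grace', 'geojson_name': 'Côte-des-Neiges–Notre-Dame-de-Grâce'},
])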

The Python code checks whether the current data on the web page is the same as the previously retrieved data, since I only wanted to upload new data to AWS S3 when necessary. This check is done by comparing the total case count on the web page to the previous total case count:

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    # create scrape result string to print to cron log
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev

    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')

    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)

    # write today's date to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)

    upload_to_aws()

If the counts are the same, the code stops. If the new count is different from the previous count, the code creates new data files and uploads them to the AWS S3 bucket.

A scrape_result string is also created and written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

from boto.s3.connection import S3Connection

## create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a public policy so anyone can read the data files. Each file in the bucket has a public URL, so the map visualization can simply reference these URLs to get the data.
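
The exact policy I used is not shown here, but a standard public-read bucket policy looks something like the following. The bucket name matches the one in the code above; the rest is just the generic AWS policy format, shown for illustration:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::canada-covid-data/*"
        }
    ]
}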

However, authentication is required in order to transfer the data from the web server to the S3 bucket, so there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects that do the authentication.

An IAM user policy was created to allow that user to write, read and delete on that bucket. The user has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course, the key and secret should not be exposed, so they are imported into the Python code from a non-public location on the web server.
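
As a sketch of that approach (the directory and module name here are placeholders), the key and secret can live in a small Python module in a non-public location and be imported at runtime, the same pattern used in the Twitter scripts later in this post:

import sys
from boto.s3.connection import S3Connection

# add the non-public directory on the web server to the import path
sys.path.insert(0, '/home/secrets/')

# hypothetical module holding the AWS key and secret in a dictionary
from aws_keys import canada_covid_aws_keys

conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])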

Once the connection is made, the Python code deletes the existing files in the S3 bucket before uploading the new files:

  
from os import listdir
from os.path import isfile, join
from boto.s3.key import Key

# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    bucket.delete_key(key)

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the updated data the next time it is viewed or the browser page is refreshed.

AWS S3 csv file as D3 report data source

This is an example of how to read a csv file retrieved from an AWS S3 bucket as a data source for a D3 JavaScript visualization.

The D3 visualization would be an HTML document hosted on a web server. 

You will use the AWS SDK for JavaScript to get the csv file from the S3 bucket, so you need to have an AWS key and secret with access to the bucket, but I won't cover setting that up in this post.

The key point of this post is that the data returned by the bucket.getObject function is read into D3 using d3.csv.parse(data.Body.toString());

Another note is that d3.csv.parse is for D3 version 3. Newer versions (v4 and later) use d3.csvParse.

Once implemented, whenever the web page is refreshed it retrieves the latest csv file from the S3 bucket and the D3 visualization is updated.

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.6.3.min.js"></script>

<script type="text/javascript">

// aws key and secret (note these should be retrieved from server not put as plain text into html code)
AWS.config.accessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.secretAccessKey = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.region = 'us-east-1';

// create the AWS.S3 service object
var bucket = new AWS.S3();

// use AWS SDK getObject to retrieve csv file
bucket.getObject({
    Bucket: 'my-S3-bucket',
    Key: 'myfile.csv'
},

// callback function to use the retrieved data
function awsDataFile(error, data) {
    if (error) {
        return console.log(error);
    }

    // this is where the magic happens: d3.csv.parse reads data.Body.toString()
    var myCSVdata = d3.csv.parse(data.Body.toString());

    // now loop through data and get fields desired for visualization
    var counter = 0;
    myCSVdata.forEach(function(d) {
        d.field1 = +d.field1;
        d.field2 = +d.field2;
        countLoop = counter++;
    });

    // now you can create the rest of the D3 visualization here
    // for example like this example https://gist.github.com/d3noob/4414436

    // ... my D3 visualization code here ...

// this closes bucket.getObject
});

</script>

AWS RDS Postgres database external secure connection

Require an SSL Connection to a PostgreSQL DB Instance? Here are some notes from my investigations.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.SSL

There are not many great examples there, but take a look at this other documentation for a secure connection to a Redshift database, which has working examples:
http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html

Go to EC2 Security Groups and modify the inbound PostgreSQL rule's source from the server's security group (e.g. something like "sg-c2a232b8") to either your current IP or to Anywhere (e.g. 0.0.0.0/0).

Setting it to Anywhere means both you and your EC2 application can connect to the database, but so can the rest of the world. Setting it to just your current IP means only you and anyone else on that IP (if it is a work or university IP) can connect to the database. Of course, they would still need the key, username and password, so it is reasonably secure.
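
On the client side, requiring SSL from Python is mostly a matter of connection parameters. Here is a minimal psycopg2 sketch, assuming you have downloaded the RDS CA certificate bundle from AWS; the endpoint, credentials and file path are placeholders:

import psycopg2

# connect to the RDS PostgreSQL instance over SSL
# sslmode='verify-full' also verifies the server certificate against the RDS CA bundle
conn = psycopg2.connect(
    host='mydbinstance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com',
    dbname='mydb',
    user='myuser',
    password='mypassword',
    sslmode='verify-full',
    sslrootcert='/home/secrets/rds-combined-ca-bundle.pem'
)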

Twitter Followers and List Membership Tracking

I have created my own historical reporting of my Twitter account followers and list memberships. I have hosted the reporting on this open webpage.

Most people know what followers are, but list memberships aren't as well known, so here are definitions of both.

Twitter Followers – Twitter followers are other Twitter accounts who follow my account. A Twitter Follower may follow my account for a few days and then unfollow me at any time afterwards.

Twitter List Memberships – Twitter accounts can also create their own Lists and then add other Twitter accounts to the lists. For example a Twitter account may create a List called “Tweet about Big Data” to track Twitter accounts they believe “Tweet about Big Data”. Twitter accounts may add my account to their List and remove it at any time afterwards.

The Twitter data retrieval, writing to database, and data querying are all done on a web server.

In order to record changes in these counts you need daily historical data. The Twitter API doesn't provide historical data, so you need to create it yourself by retrieving and saving the data daily somewhere.

Three Python scripts using the Twitter API, the Tweepy Python library and the AWS SDK (Boto) are scheduled to run daily using cron jobs.

Two scripts retrieve followers and list memberships and insert the data into a PostgreSQL database. This daily retrieval builds the daily history.
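
The follower table itself is simple. The actual DDL is not shown in this post, but a hypothetical definition, inferred from the columns in the insert statement in the code below, would look something like this:

-- hypothetical table definition, matching the insert columns used below
CREATE TABLE twitter_followers (
    date date,
    follower_id text,
    name text,
    screen_name text
);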

Another script queries the database tables to create reporting datasets of new, active and dropped followers and list memberships, which are exported as csv files to an AWS S3 folder that also holds the files for an AWS S3 hosted static website.
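
The actual reporting queries are stubbed out in the code below ("select something here"), but as a purely hypothetical illustration, new and dropped followers can be derived by comparing one day's snapshot of follower_id values with the previous day's:

-- hypothetical example only: followers present today but not in yesterday's snapshot are "new"
SELECT f_today.follower_id
FROM twitter_followers f_today
LEFT JOIN twitter_followers f_prev
    ON f_prev.follower_id = f_today.follower_id
   AND f_prev.date = f_today.date - 1
WHERE f_today.date = current_date
  AND f_prev.follower_id IS NULL;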

The AWS S3 hosted static website uses the Chart.js and D3.js JavaScript charting libraries to read and visualize the data. This post does not describe how to read the csv files, but I have written another post, AWS S3 csv file as D3 report data source, that describes this.

A screenshot of one of the visualizations, showing active followers by day with counts of new follows and unfollows by day, is shown below.

[Screenshot: twitter-follower-chart]

A second screenshot of another visualization, showing active list memberships by day with counts of new listings and un-listings by day, is shown below.

[Screenshot: twitter-list-membership-chart]

The Python code to retrieve the Twitter data, transform it, create the csv files and upload them to AWS S3 is below.

The code to retrieve Twitter followers:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get secrets
sys.path.insert(0, '/home/secrets/')
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_followers():

    twitter_followers = []
    
    for user in tweepy.Cursor(api.followers, screen_name="twitterusername").items():
        twitter_followers.append(user)
        
    for user in twitter_followers:
        query = "INSERT INTO twitter_followers \
                (date, \
                follower_id, \
                name, \
                screen_name) \
                VALUES \
                (%s, %s, %s, %s)";
        data = (today,
                str(user.id).strip(),
                str(user.name.encode('utf8','ignore')).replace(',','').strip(),
                str(user.screen_name.encode('utf8','ignore')).strip()
                )
        cursor.execute(query, data)
        conn.commit()

    conn.close()
    
    ## print job status to log
    print str(datetime.now()) + ' twitter followers'    
            
if __name__ == '__main__':
	get_followers()

The code to retrieve Twitter list memberships:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get database creds
sys.path.insert(0, '/home/secrets/') 
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_list_memberships():

    twitter_list_memberships = []
    
    for list in tweepy.Cursor(api.lists_memberships, screen_name="twitterusername").items():
        twitter_list_memberships.append(list)
        #print list.full_name
        
    for list in twitter_list_memberships:

        query = "INSERT INTO twitter_list_memberships \
                (date, \
                id_str, \
                name, \
                full_name, \
                member_count, \
                subscriber_count, \
                mode, \
                following, \
                user_screen_name) \
                VALUES \
                (%s, %s, %s, %s, %s, %s, %s, %s, %s)";
        data = (today,
                list.id_str.encode('utf8','ignore'),
                list.name.encode('utf8','ignore'),
                list.full_name.encode('utf8','ignore'),
                list.member_count,
                list.subscriber_count,
                list.mode,
                list.following,
                list.user.screen_name.encode('utf8','ignore'))
        cursor.execute(query, data)
        conn.commit()
        
    conn.close()
    
    ## print status for log
    print str(datetime.now()) + ' twitter list memberships'    
            
if __name__ == '__main__':
	get_list_memberships()

The code to create csv files and upload to AWS S3 bucket:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from datetime import datetime
from os import listdir
from os.path import isfile, join
import sys
import csv
import psycopg2
import re
from collections import Counter

upload_path = '/home/twitter/aws_s3_site/'

sys.path.insert(0, '/home/secret/')
from aws_keys import aws_keys
from secrets import secrets

## create aws S3 connection
conn = S3Connection(aws_keys['AWS_KEY'], aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('twitter-bucket', validate=False)

## create db connection
conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cur = conn.cursor()

## get data sets from db

## followers
cur.execute ("select something here;")
followers = list(cur.fetchall())

## lists
cur.execute ("select something here;")
lists = list(cur.fetchall())

conn.close()

## write db data to csv files, save in upload folder

## followers
with open(upload_path + 'followers.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in followers:
        writer.writerow(row)

## lists
with open(upload_path + 'lists.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in lists:
        writer.writerow(row)
            	
## upload csv files to S3 twitter-bucket
    
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing bucket keys to reset
for key in bucket.list():
    if '.csv' in key.name:
        bucket.delete_key(key)
  
# complete uploading to AWS
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)
    
## write success to cron log
print str(datetime.now()) + ' aws s3 twitter upload'