Plotly Express Python remove legend title 4.5, Plotly Express no longer puts the = in trace names, because legends support titles (source).

Prior to 4.5, I had used this ‘hover_data’ trick to remove the ‘=’ from the legend trace names.

hover_data=['gain_loss']).for_each_trace(lambda t: t.update("=")[0])

However now with 4.5, I want to remove the legend title. The new trick to do that is to enter empty string in the new legend title in the fig.update.layout section.

    'legend_title_text': ''

This is much cleaner look for legends where the trace names, and your chart title, are sufficiently explanatory and a legend title would be superfluous.

Scraping public health web page using Python, BeautifulSoup4, Requests and Pandas

During the 2020 COVID-19 pandemic in Canada I wanted to get COVID-19 confirmed cases counts data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.  EDIT: Health Montreal stopped publishing data so the link no longer works!

I wanted to be able to use this data for this Leaflet choropleth map visualization. If interested, read more details on this visualization in another blog post.

There are many ways to get data from web pages. First I did it manually by copy and paste into Excel. This is ok for a one time analysis. You can even use Excel Power Query web feature to automate this a bit more. However, if you want to fully automate getting data from a web page you should use web scraping techniques.

Note that the code described below is available in this Github repository

Initial data retrieval and transformation done using Excel
To get the web page data, at first, I simply manually copied and pasted into an Excel workbook. This was quite easy to do as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do some more complex data transformations I switched to using Excel Power Query’s web query feature and Power Query to transform the data for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis so it needed to fully automated and the data retrieval and transformation process to be run on a scheduled basis.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

As the choropleth map visualization is hosted on I could have used Git on the web server to do an automated, scheduled push of new data from web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I choose to upload the data to public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading data to AWS S3 was done in Python. The Python code is run on scheduled basis using a cron job on a web server. The cron job runs a few times each evening when the data is usually updated on the website.

Python, BeautifulSoup4, Requests and Pandas to retrieve and transform the web page data and create a JSON file that could be uploaded to AWS S3 bucket and made available to the choropleth map visualization.

Python module Boto was used to upload the data from web server to an AWS S3 bucket.

Let go through the code.

BeautifulSoup4 was used to get web page and find the specific table that holds the tabular data as below. The table with the counts by neighbourhood was the 4th table in the web page:

# get health montreal webpage html
 url = ''
 page = requests.get(url)
 soup = BeautifulSoup(page.content, 'html.parser')
 # get all tables on webpage
 tables = soup.find_all('table')
 # select 4th table in list of tables on webpage
 table = tables[3]

Then Pandas is used to read that table into a dataframe and then add more readable consistent column headers as below:

  # read table into pandas dataframe
    df_table_data_all_cols = pd.read_html(str(table))[0]
    # rename columns 
    df_table_data_all_cols.columns = ['region_name', 'case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']
    df_table_data = df_table_data_all_cols[['region_name','case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']]

The web page table dataframe was merged with the “lookup” dataframe. This merging is basically equivalent to a SQL JOIN::

   # join lookup table to scrape data to get geojson_name field to use on map
    df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
    df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.

Of course, I could have modified the map’s geographical region boundary names to reflect Health Canada naming convention but creating a “lookup” table was easier and provided flexibility in case Health Montreal table’s names changed (which they did, in fact several times!).

The Python code does a check to see if current data on the web page is same as previously retrieved. I only wanted to upload new data to AWS S3 if it was necessary. This check is done by comparing the total case count on web page to previous case count:

   # if new is diff from prev, update files and upload to aws
    if str_total_case_prev == str_total_case_new:
        scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
        # create scrape result string to print to cron log
        scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev
        # transform pandas dataframe into dictionary to write as json
        json_table = df_table_data_final.to_dict('records')
        # write new montreal covid_data to json file for map to use
        with open('uploads/montreal_covid_data.json', 'w') as f:
            f.write('var covid_data = \n')
            json.dump(json_table, f, ensure_ascii=True)
            # write today's date to use in index page as last updated date
        with open('uploads/last_update_date.json', 'w') as f:
            f.write('var last_update_date = \n')
            json.dump(todays_date, f)

If the counts are the same then code stops. If the new count is different than the previous count the code create new data that is uploaded to the AWS S3 bucket.

A scrape_results string is also created that is written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

  ## create aws S3 connection
    conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
    bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a Public Policy so anyone can read the data files. Each bucket file has a public url so the map visualization can simply reference these to get the data.

However, authentication is required in order transfer the data from the web server to the S3 bucket. So there is some behind the scenes setup work to do on the AWS side. First, to create and configure the bucket and second to create and configure the IAM objects to do authentication.

An IAM User Policy was created to allow that User to write, read and delete on that bucket. The User has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course the key and secret should not be exposed and are imported into the Python code from another non-public location on the web server.

Once connection is made, then the Python code deletes the existing files on S3 before uploading new files:

    # identify files to be uploaded to aws
    upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

    # delete existing files from bucket
    for key in bucket.list():

    # write new files to bucket 
    for file in upload_files:
        k = Key(bucket)
        k.key = file
        k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the new updated data next time it is viewed or browser page refreshed.

Choropleth map of Canada COVID-19 cases by health region using Leaflet and D3.js

During the early days of the 2020 COVID-19 pandemic in Canada, I wanted to get better understanding of the geographical distribution of COVID-19 cases across Canada.

At the time, government or news agencies were only mapping case counts by province. However Canadian provinces are so big compared to population centers that it doesn’t accurately reflect actual geographic distribution of cases. It would be better to use the provincial “health regions” which correspond much better to population centers.

So I set about to create for myself a choropleth map visualization by health regions.

View the finished choropleth map at the following link. The source data is updated daily each evening.

I used Leaflet.js open-source JavaScript mapping library to create the interactive choropleth map, D3.js to retrieve and transform the csv format data, and Javascript to retrieve the JSON geographic boundary files and also to manipulate and present the data with HTML and CSS.

The COVID-19 case count data are obtained as csv file format from the “COVID-19 Canada Open Data Working Group” who are an amazing group of volunteers.  They have been tirelessly collating data from the various provincial and territory government agencies daily since early March. This group saves the collated and cleaned data as csv files in a Github repository

The health region geographical boundary descriptions are from Statistics Canada’s Statscan ArcGIS Health region boundary Canada dataset. These had very detailed boundaries so I simplified them using QGIS which also dramatically reduced the dataset size.

However, there were some data issues that needed to be addressed first. The Statscan health regions shape file boundary names are different than those used by the provincial and territory government agencies reporting the data.

The Statscan seems to have full-form ,”official” health region names, while the provincial and territory government agency names are common, more familar, short-hand names. Also, names may change over time.

Provinces may also add or remove health regions from time to time due to administrative changes or population changes etc. So either set may have health regions that the other doesn’t have.

From a data governance perspective, in a perfect world, everyone uses a single set of health region boundary names. COVID-19 reporting has made a lot of people aware of this issue which is a silver lining in the COVID-19 dark cloud!

Addressing these name differences was actually quite simple, requiring creation of a lookup table with two columns, one for each dataset, to match the names in the boundary data files to the names in the counts data file. The lookup table can then be used dynamically when getting data each time the map is refreshed. This is described in more detail in Github repository README linked below.

Code for this project is maintained in Github:

I also created a separate choropleth map for Montreal, where I was living at the time, which was Canada’s COVID-19 “hotspot” with about 25-30% of Canada’s total COVID-19 cases. However, the Montreal data source has since been discontinued so the map is archived now.

View archived Montreal map here:

How to use Python as a datasource for Power BI

Step 1 – Download latest version of Power BI to get new features (version that came out in early August 2018)

Step  2 – Enable Python in the “Options – Preview Features” check the “Python support” option. Close and restart Power BI.

Step 3 – Click “Get Data”, select “Other” and “Python script” and click “Connect” button.

Step 4 – Write or paste Python script in the popup Python script window and click Ok button.

I chose to use Pandas to get csv file:

import numpy as np
import pandas as pd
data = pd.read_csv(‘C:/mydatafile.csv’, delimiter=’,’)

Note if you are using Windows as I am, you have to write file path so Windows can recognize it otherwise you will get an error. I used foreslashes but you could also escape the backslashes.

You could also write any Python script that gets or generates data from any data source eg a database, from AWS S3 buckets, etc. Of course this also includes transforming data. The important thing to note is your script must output a named dataset(s). The named dataset(s) will be thing Power BI will use when you are finished coding.

Step 5 – After clicking ok, you get the familiar Power Query data source selection window. Select Python and you will see your named data set(s). Just check the one(s) you want and click ok.

Now just continue working with Power BI as normal! You can always go back to edit the Python script by select “Source” in query editor.