Retrieve and process Environment Canada hydrometric and climate data using Python

I recently needed to get flow and level data for a watercourse hydrological station, as well as regional precipitation data for relevant locations upstream within the station's watershed.

The objective was to combine two decades of watercourse flow and level data with regional watershed precipitation data into a single analysis and report.

Environment Canada has all of this data and provides it in a variety of ways.

Flow and level data

For many users the Water Level and Flow web portal at wateroffice.ec.gc.ca is sufficient. However, I needed multiple years' data, so I looked around for alternatives.

I found that the Hydat National Water Data Archive, which is provided as a SQLite database, was a good source. It contains all Canadian watercourse flow and level data up until about a month or so prior to the current date, and it was a quick and easy download and a convenient way to get the required data.
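
As a minimal sketch of querying the downloaded archive (the file name, table name and station number below are assumptions based on my copy of Hydat, not something to rely on verbatim):

import sqlite3
import pandas as pd

# connect to the downloaded Hydat SQLite file (file name is an assumption)
conn = sqlite3.connect('Hydat.sqlite3')

# DLY_FLOWS holds daily flows, one row per station-year-month;
# the station number here is a placeholder
df_flows = pd.read_sql_query(
    "SELECT * FROM DLY_FLOWS WHERE STATION_NUMBER = ?",
    conn,
    params=('02OA016',),
)
conn.close()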

Real-time / more recent flow data is available using a nice search interface in the same wateroffice.ec.gc.ca portal.

In addition, this same real-time / more recent data can also be obtained directly from an FTP site, reached by clicking the ‘Datamart’ link on wateroffice.ec.gc.ca.

More details about the FTP site content are available here.

Both of these methods result in downloading csv files.

The website portal allows you to search by station name or reference id. The FTP site files are named with the station reference id, so you need to know the reference id to get the correct csv file.
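
Once you know the reference id, a minimal sketch of pulling one station's csv with Requests might look like this (the path pattern, province code and station id below are assumptions based on the datamart layout documentation):

import io
import requests
import pandas as pd

# datamart csv files are organized by province and named by station reference id;
# this exact url is a placeholder example
url = ('https://dd.weather.gc.ca/hydrometric/csv/ON/daily/'
       'ON_02GA010_daily_hydrometric.csv')
resp = requests.get(url)
df_station = pd.read_csv(io.StringIO(resp.text))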

This hydrological flow and level data is also available through the MSC GeoMet API. The API was a bit too complex for a quick data pull and analysis, so I passed on it, but it looks very powerful and well documented.

Precipitation data

Precipitation data is available for stations across Canada, and this website https://climate.weather.gc.ca/historical_data/search_historic_data_e.html allows searching by station name, reference id or geographical location.

However, since this only allows downloading a single month of data at a time, I needed to find another method to more quickly get multiple years and months of data.

When you use the website search and follow the links to the download page, you will also see a link to get more historical data. This link brings you to a Google Drive folder, which documents how to use their ‘wget’ method to download files. However, I was using Windows and didn’t want to mess around with Cygwin or the Windows Subsystem for Linux just to be able to use wget.

It turned out to be relatively simple to replicate the wget process using Python Requests and then loop through the csv files to process them. Python’s io.StringIO was used to stream each response’s content into an in-memory ‘fake csv’ file, and the results were aggregated into a Python list that was converted into a Pandas dataframe for processing.
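
A minimal sketch of that loop, assuming the bulk-data url pattern documented in the Google Drive folder; the station id and year range below are placeholder assumptions:

import io
import requests
import pandas as pd

# bulk-data url pattern from Environment Canada's documented wget method;
# stationID, years and timeframe below are placeholders
base_url = ('https://climate.weather.gc.ca/climate_data/bulk_data_e.html'
            '?format=csv&stationID={station}&Year={year}&Month={month}'
            '&Day=14&timeframe=1&submit=Download+Data')

frames = []
for year in range(2000, 2021):
    for month in range(1, 13):
        resp = requests.get(base_url.format(station=5415, year=year, month=month))
        # stream the response text into an in-memory 'fake csv' file for pandas
        frames.append(pd.read_csv(io.StringIO(resp.text)))

# aggregate the list of monthly dataframes into a single dataframe
df_precip = pd.concat(frames, ignore_index=True)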

The Python code is in my Github ‘Environment-Canada-Precip-and-Flow-Data’ Repository.

Plotly Express Python remove legend title

As of Plotly.py 4.5, Plotly Express no longer puts the ‘=’ in trace names, because legends now support titles (source).

Prior to Plotly.py 4.5, I had used this for_each_trace trick, chained onto the Plotly Express call, to remove the ‘=’ from the legend trace names (the dataframe and column names below are placeholders):

fig = px.line(df, x='date', y='value', color='category', hover_data=['gain_loss'])
fig.for_each_trace(lambda t: t.update(name=t.name.split("=")[1]))

However, now with Plotly.py 4.5, I want to remove the legend title. The new trick to do that is to set the legend title to an empty string in fig.update_layout:

fig.update_layout({
    'legend_title_text': ''
})
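
Put together, a minimal runnable sketch (the dataframe and column names are made-up examples):

import pandas as pd
import plotly.express as px

# toy dataframe purely for illustration
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02'] * 2,
    'value': [1, 2, 3, 4],
    'category': ['a', 'a', 'b', 'b'],
})

fig = px.line(df, x='date', y='value', color='category')
# remove the legend title entirely
fig.update_layout(legend_title_text='')
fig.show()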

This is a much cleaner look for legends where the trace names, and your chart title, are sufficiently explanatory and a legend title would be superfluous.

Scraping public health web page using Python, BeautifulSoup4, Requests and Pandas

During the 2020 COVID-19 pandemic in Canada I wanted to get COVID-19 confirmed case count data for the city of Montreal.

The data I wanted was made freely available on the Quebec Government’s Health Montreal website in a tabular format that was updated regularly. EDIT: Health Montreal stopped publishing the data, so the link no longer works!

I wanted to be able to use this data for this Leaflet choropleth map visualization. If interested, read more details on this visualization in another blog post.

There are many ways to get data from web pages. At first I did it manually by copying and pasting into Excel. This is fine for a one-time analysis. You can even use Excel Power Query’s web feature to automate this a bit more. However, if you want to fully automate getting data from a web page, you should use web scraping techniques.

Note that the code described below is available in this Github repository https://github.com/sitrucp/covid_montreal_scrape_data.

Initial data retrieval and transformation done using Excel
To get the web page data, at first I simply copied and pasted it manually into an Excel workbook. This was quite easy to do, as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more, and to do some more complex data transformations, I switched to using Excel Power Query’s web query feature and used Power Query to transform the data for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated, with the data retrieval and transformation process run on a scheduled basis.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

As the choropleth map visualization is hosted on Github.io I could have used Git on the web server to do an automated, scheduled push of new data from web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading data to AWS S3 was done in Python. The Python code is run on scheduled basis using a cron job on a web server. The cron job runs a few times each evening when the data is usually updated on the website.

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to the AWS S3 bucket and made available to the choropleth map visualization.

The Python module Boto was used to upload the data from the web server to the AWS S3 bucket.

Let’s go through the code.

BeautifulSoup4 was used to get the web page and find the specific table that holds the tabular data, as below. The table with the counts by neighbourhood was the 4th table on the web page:

import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = 'https://santemontreal.qc.ca/en/public/coronavirus-covid-19/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# get all tables on webpage
tables = soup.find_all('table')

# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and add more readable, consistent column headers, as below:

import pandas as pd

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]

# rename columns
df_table_data_all_cols.columns = [
    'region_name', 'case_count', 'case_percent',
    'case_per_100k', 'mort_count', 'mort_per_100k'
]
df_table_data = df_table_data_all_cols[[
    'region_name', 'case_count', 'case_percent',
    'case_per_100k', 'mort_count', 'mort_per_100k'
]]

The web page table dataframe was then merged with the “lookup” dataframe. This merging is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(
    df_montreal_regions_lookup, df_table_data,
    left_on='website_name', right_on='region_name', how='left'
)
df_table_data_final = df_table_data_w_lookup[[
    'website_name', 'region_name', 'geojson_name', 'case_count',
    'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k'
]]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.
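
For illustration, the lookup dataframe might be constructed like this (the neighbourhood names shown are made-up examples, not the actual lookup values):

import pandas as pd

# one row per neighbourhood; names below are illustrative placeholders
df_montreal_regions_lookup = pd.DataFrame([
    {'website_name': 'Ahuntsic-Cartierville',
     'geojson_name': 'Ahuntsic - Cartierville'},
    {'website_name': 'Villeray-St-Michel-Parc-Extension',
     'geojson_name': 'Villeray - Saint-Michel - Parc-Extension'},
])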

Of course, I could have modified the map’s geographical region boundary names to reflect the Health Montreal naming convention, but creating a “lookup” table was easier and provided flexibility in case the Health Montreal table’s names changed (which they did, in fact, several times!).

The Python code checks whether the current data on the web page is the same as the previously retrieved data, because I only wanted to upload new data to AWS S3 when necessary. This check is done by comparing the total case count on the web page to the previous total case count:

import json

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    # create scrape result string to print to cron log
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev

    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')

    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)

    # write today's date to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)

    upload_to_aws()

If the counts are the same, the code stops. If the new count is different from the previous count, the code creates new data files that are uploaded to the AWS S3 bucket.

A scrape_results string is also created that is written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

from boto.s3.connection import S3Connection

# create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a Public Policy so anyone can read the data files. Each bucket file has a public url so the map visualization can simply reference these to get the data.

However, authentication is required in order to transfer the data from the web server to the S3 bucket, so there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects to do authentication.

An IAM user policy was created to allow that user to write, read and delete on that bucket. The user has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course, the key and secret should not be exposed, so they are imported into the Python code from another non-public location on the web server.
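
For example, a minimal sketch of importing the keys from outside the web root (the path, module name and file contents below are assumptions; only the canada_covid_aws_keys variable appears in the actual code):

import sys

# config directory lives outside the public web root (path is a placeholder)
sys.path.insert(0, '/home/user/private_config')

# aws_keys.py defines: canada_covid_aws_keys = {'AWS_KEY': '...', 'AWS_SECRET': '...'}
from aws_keys import canada_covid_aws_keys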

Once the connection is made, the Python code deletes the existing files on S3 before uploading the new files:

  
from os import listdir
from os.path import isfile, join
from boto.s3.key import Key

# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    bucket.delete_key(key)

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the new updated data the next time it is viewed or the browser page is refreshed.

Choropleth map of Canada COVID-19 cases by health region using Leaflet and D3.js

During the early days of the 2020 COVID-19 pandemic in Canada, I wanted to get a better understanding of the geographical distribution of COVID-19 cases across Canada.

At the time, government and news agencies were only mapping case counts by province. However, Canadian provinces are so big compared to their population centers that province-level maps don’t accurately reflect the actual geographic distribution of cases. It is better to use the provincial “health regions”, which correspond much more closely to population centers.

So I set about creating my own choropleth map visualization by health region.

View the finished choropleth map at the following link. The source data is updated daily each evening.
https://sitrucp.github.io/canada_covid_health_regions/index.html

I used the Leaflet.js open-source JavaScript mapping library to create the interactive choropleth map, D3.js to retrieve and transform the csv data, and plain JavaScript to retrieve the JSON geographic boundary files and to manipulate and present the data with HTML and CSS.

The COVID-19 case count data are obtained as csv files from the “COVID-19 Canada Open Data Working Group”, who are an amazing group of volunteers. They have been tirelessly collating data from the various provincial and territory government agencies daily since early March. This group saves the collated and cleaned data as csv files in a Github repository https://github.com/ishaberry/Covid19Canada.

The health region geographical boundary descriptions are from Statistics Canada’s (Statscan) ArcGIS health region boundary dataset for Canada. These had very detailed boundaries, so I simplified them using QGIS, which also dramatically reduced the dataset size.

However, there were some data issues that needed to be addressed first. The Statscan health region shape file boundary names are different from those used by the provincial and territory government agencies reporting the data.

Statscan seems to use full-form “official” health region names, while the provincial and territory government agencies use common, more familiar, short-hand names. Also, some names appear to have changed since they were recorded in the Statscan data.

Provinces may also add or remove health regions from time to time due to administrative changes or population changes etc. So either set may have health regions that the other doesn’t have.

From a data governance perspective, in a perfect world, everyone uses a single set of health region boundary names. COVID-19 reporting has made a lot of people aware of this issue which is a silver lining in the COVID-19 dark cloud!

Addressing these name differences was actually quite simple. It required creating a lookup table with two columns, one for each dataset, to match the names in the boundary data files to the names in the counts data file. The lookup table can then be used dynamically when getting data each time the map is refreshed. This is described in more detail in the Github repository README linked below.

Code for this project is maintained in Github:  github.com/sitrucp/canada_covid_health_regions.

I also created a separate choropleth map for Montreal, where I was living at the time, which was Canada’s COVID-19 “hotspot” with about 25-30% of Canada’s total COVID-19 cases. However, the Montreal data source has since been discontinued so the map is archived now.

View archived Montreal map here:
https://sitrucp.github.io/canada_covid_health_regions/montreal/index.html