Canadian Government First Nations long-term water advisory data

Since November 2015, the Government of Canada has been working with First Nations communities to end long-term drinking water advisories, meaning advisories that have been in effect for more than 12 months. This includes projects to build or renew drinking water infrastructure in these communities.

First Nations drinking water advisories were a big part of the 2021 Canadian federal election campaigns. The current government had promised in 2015 to resolve all of the advisories, and the opposition criticized them for not having achieved that goal to date. However, there was poor communication about the actual number and status of the advisory projects, so I wanted to find some data and learn the actual status for myself.

TLDR: the results can be seen on a Github.io hosted page; a screenshot is provided below. The code to retrieve and transform the data and create that web page is in my Github repository https://github.com/sitrucp/first_nations_water.

Finding the data I needed took only a few minutes of Google searching, which led me to an Indigenous Services Canada website that mapped the water advisories by First Nations community, project status, advisory dates, and project type.

The web page also included a link to download the map data. However, after a bit of scraping and inspection of the page, I learned that the downloadable map data was created by the page's Javascript from another, larger data source referenced in the map page code: https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt.

This larger data source is a text file containing JSON format records for all Canadian First Nations communities, amounting to many thousands of records. As a side note, this larger file appears to be an official Canadian government dataset used on other government websites. The map data is limited to only the roughly 160 First Nations communities with drinking water issues.

So rather than download the 160-record map page data, I retrieved the larger JSON format text file and used logic similar to the map web page code to extract the 160 records. This was done in two steps.

Step 1: retrieve the JSON format text file and save it locally using the Python script water_map_data.py. I may yet automate this with a scheduled task so the map gets regularly updated data as advisory statuses change over time.
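As an illustration, a minimal version of that retrieval step might look like the sketch below; the local output filename is an assumption (the actual script is water_map_data.py in the repository).

```python
# Minimal sketch of the retrieval step; the output filename is an assumption.
import requests

DATA_URL = (
    "https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/"
    "texte-text/lTDWA_map_data_1572010201618_eng.txt"
)

response = requests.get(DATA_URL, timeout=30)
response.raise_for_status()

# The file is JSON despite the .txt extension; save the raw text locally
# so the Javascript in step 2 (or further Python processing) can use it.
with open("lTDWA_map_data_eng.txt", "w", encoding="utf-8") as f:
    f.write(response.text)
```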

Step 2: process the saved data file and present it on an HTML web page as Plotly.js charts and an HTML table, using Javascript in the file first_nations_water.js.

Finally, I also created an Excel file with a Pivot Table and Chart that you can download from the Github repository and use to do your own analysis. The file contains an Excel Power Query link to the larger JSON text file described above, so you can simply refresh the query to get the latest data from the Indigenous Services Canada website.

COVID-19 Data Analysis and Visualization Summary

This is a list of Canadian COVID-19 related data analyses and visualizations that I created during the 2020/21 pandemic.

Canada COVID-19 Case Map – COVID-19 new and total cases and mortalities by Canadian provincial health regions.
https://sitrucp.github.io/canada_covid_health_regions

Canada COVID-19 Vaccination Schedule – Canada's current and historical vaccine doses administered and doses distributed. Also includes two distribution forecasts: 1) based on Canadian government vaccine planning and 2) based on the Sep 30, 2021 goal to vaccinate all 16+ Canadians. Uses COVID-19 Canada Open Data Working Group data.
https://sitrucp.github.io/covid_canada_vaccinations

Canada COVID-19 Vaccination vs World – Canada's current and historical ranking vs all countries in the Our World in Data coronavirus data by total doses, daily doses and total people vaccinated.
https://sitrucp.github.io/covid_global_vaccinations

Global COVID-19 Vaccination Ranking – Ranking of all countries in the Our World in Data coronavirus data by daily vaccine dose administration. Includes small visualizations of all-time totals, population, vaccines used and trend, and can be sorted by these measures.
https://sitrucp.github.io/covid_world_vaccinations

COVID-19 New Case Rate by Canadian Health Regions Animation – SVG animation of new cases visualized as a daily rate for each Canadian provincial health region. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_canada

COVID-19 New Case Rate by Country Animation – SVG animation of new cases visualized as daily rate for each country in the Our World in Data dataset. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_world

COVID-19 New Case Rate by US State Animation – SVG animation of new cases visualized as daily rate for each US state. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_us

Apple Mobility Trends Reports – Canadian Regions Data – Apple cell phone mobility tracking data used to create heat map visualizations of activity over time.
https://sitrucp.github.io/covid_canada_mobility_apple

WHO Draft landscape of COVID-19 Candidate Vaccines – AWS Textract used to extract tabular data from WHO pdf file. Python and Javascript code then used to create webpages from extracted data.
https://sitrucp.github.io/covid_who_vaccine_landscape

Montreal Confirmed COVID-19 Cases By City Neighbourhoods – Code and process used to scrape Quebec Health Montreal website to get COVID-19 case data for Montreal city boroughs.
https://github.com/sitrucp/covid_montreal_scrape_data

Use Excel Power Query to get data from Our World In Data – How to use Excel's Power Query to automatically get Our World in Data Github csv files and update them with a simple refresh.
https://009co.com/?p=1491

 

TendiesTown.com – WallStreetBets gain and loss analysis

For a 16-month period, from February 2020 to June 2021, TendiesTown.com used the Reddit API in an automated process to retrieve r/wallstreetbets posts tagged with "Gain" or "Loss" flair. The subreddit r/wallstreetbets is a stock and option trading discussion board that gained popularity during the 2020/2021 "meme stock" media frenzy.

The retrieved posts were automatically categorized by gain or loss, post date, Reddit username, trade verification image url, and post url.
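As an illustration only, a flair-filtered retrieval of this kind could be sketched with the PRAW library as below; the credentials, search approach, and field names are assumptions, not TendiesTown's actual implementation.

```python
# Hedged sketch of flair-filtered retrieval with PRAW; credentials and the
# exact fields saved are assumptions, not TendiesTown's actual code.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="wsb-gain-loss-scraper",
)

posts = []
for flair in ("Gain", "Loss"):
    # Reddit search supports filtering posts by link flair.
    for submission in reddit.subreddit("wallstreetbets").search(
        f'flair:"{flair}"', sort="new", limit=None
    ):
        posts.append({
            "category": flair,
            "post_date": submission.created_utc,
            "username": str(submission.author),
            "image_url": submission.url,
            "post_url": f"https://reddit.com{submission.permalink}",
        })
```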

These posts were then manually processed to identify the gain or loss amount and reported stock ticker(s).

As of June 9, 2021, there were 4,582 trades with 2,585 gains and 1,997 losses recorded. Gain amounts totalled USD $389 million and losses USD $117 million.

While 6,844 posts were retrieved from the subreddit, 2,262 (33%) of these were rejected because they were duplicate posts, were not actual gain or loss trades (per r/wallstreetbets flair guidelines), or the amount could not be identified.

The trades and more detailed analysis are available on the tendiestown.com website. As of June 9, 2021, the manual processing is no longer being done, so no new data will appear on the site.

“Gain” and “Loss” flair are described in more detail in the community flair guidelines but essentially “Gain” = trade made money and “Loss”= trade lost money.

Table 1: Summary statistics

        Gains         Losses
Count   2,585         1,997
Sum     $389,329,571  $117,347,985
Avg     $150,611      $58,762
Median  $24,138       $12,120
Min     $138          $100
Max     $8,000,000    $14,776,725

The 4,582 trades were made by 3,903 unique Reddit users. 492 of these Reddit users have more than one trade as shown in table 2 below.

Table 2: Trade counts by unique Reddit users

# trades  Count of Reddit users
1         3,411
2         378
3         73
4         22
5         11
6         5
7         1
8         2

 

Charts 1 & 2: Daily and cumulative counts

(Bar chart: daily counts of gains and losses)

(Bar chart: cumulative counts of gains and losses)

Charts 3 & 4: Daily and cumulative amounts

(Bar chart: daily amounts of gains and losses)

(Bar chart: cumulative amounts of gains and losses)

Notable observations

Early 2021 increase in trades

The count and amount of trades rose significantly in the early months of 2021. This can be explained by the surge of new users to the r/wallstreetbets subreddit, driven by the huge increase in popular media reporting surrounding the GME and AMC meme stock craze.

Gains greater than losses

Gains consistently led losses over the 16-month period. Rather than simply concluding that r/wallstreetbets traders gain more than they lose, it is suspected that the difference can be explained by the fact that it is easier to tell the world that you won, and harder to admit that you lost.

@SirPatStew #ASonnetADay dashboard

Sir Patrick Stewart @SirPatStew was doing Shakespeare Sonnet readings from his home during the COVID-19 lockdown and they were really good.

I wanted to track the sonnet reading tweets’ like and retweet counts over time and show this to other @SirPatStew fans. I also thought it would be very helpful to provide an automatically updated list of links to the sonnet tweets.

So I created a daily automated job that retrieved Sir Patrick's Twitter data and saved it in the cloud. I also created a web page with a visualization showing the #ASonnetADay hashtag tweets' like and retweet counts (screenshot below), along with a table listing the tweets in chronological order with their like and retweet counts and a link to each tweet.

@SirPatStew has long since finished posting new #ASonnetADay tweets, but his tweets continue to get visitors and their like and retweet counts continue increasing. The automated daily job is still running and the visualization continues to be updated.

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Data was retrieved from Twitter using the Twitter API and Tweepy, and the visualization was created as a Plotly.js dot plot hosted on Github.io.
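As a rough sketch, the daily retrieval could look something like the following, assuming the Twitter v1.1 API via Tweepy; the credential placeholders and the CSV output are assumptions rather than the exact code in the repository.

```python
# Hedged sketch of the daily tweet retrieval; credentials and output file
# are placeholders.
import csv
import tweepy

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
for tweet in tweepy.Cursor(
    api.user_timeline, screen_name="SirPatStew", tweet_mode="extended"
).items():
    if "#ASonnetADay" in tweet.full_text:
        rows.append({
            "created_at": tweet.created_at.isoformat(),
            "likes": tweet.favorite_count,
            "retweets": tweet.retweet_count,
            "url": f"https://twitter.com/SirPatStew/status/{tweet.id}",
        })

if rows:
    with open("sonnet_tweets.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```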

The Tweet data was used to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with links to each tweet to allow others to easily find and watch them.

 

Amazon AWS Transcribe to get 2020 presidential debate #1 speaker segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio, including timestamps for each word, and used the result to create the following speaker timeline visualization (created using a Plotly timeline chart). Click the image to view the full size visualization.

After watching US 2020 Presidential Debate #1, I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their allotted 2 minutes.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking, identifying overlaps where one speaker was already talking and a second speaker starts during that talk 'segment'.

Unfortunately Amazon Transcribe purposefully shifts all of the transcribed words' start times to eliminate overlapping word time periods. Therefore, it wasn't possible to get the data I wanted to satisfy my curiosity using Amazon Transcribe.

It may be possible to infer overlapping speech and interruptions from multiple small interleaving speaker segments, but in the Amazon Transcribe results that would be hard to distinguish from two people simply having a conversation. I might investigate alternative automated transcription methods in a new post. TBD.

Here is a link to the Github repository containing the code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate on YouTube. The audio produced by youtube-dl was an mp3 file. I used Libav to trim off the beginning ~30 minute portion of the audio, as it was not the actual debate but pre-debate stage prep.
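The exact commands are not listed in the post, but a rough equivalent of these two steps is sketched below; the video URL and the 30-minute trim offset are placeholders.

```python
# Rough equivalent of the two audio-prep steps; the URL and trim offset are
# placeholders, not the exact commands used.
import subprocess

VIDEO_URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # CSPAN recording (placeholder)

# Extract the audio track as an mp3 file with youtube-dl.
subprocess.run(
    ["youtube-dl", "-x", "--audio-format", "mp3",
     "-o", "debate_full.%(ext)s", VIDEO_URL],
    check=True,
)

# Trim roughly the first 30 minutes of pre-debate stage prep with Libav's avconv.
subprocess.run(
    ["avconv", "-i", "debate_full.mp3", "-ss", "00:30:00",
     "-acodec", "copy", "debate.mp3"],
    check=True,
)
```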

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and the AWS Boto3 SDK.
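A minimal sketch of those two steps with Boto3 might look like the following; the bucket, file, and job names, as well as the expected speaker count, are assumptions.

```python
# Hedged sketch of the S3 upload and Transcribe job submission with Boto3;
# bucket, key, job name and speaker count are placeholders.
import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

bucket = "my-debate-audio-bucket"  # placeholder bucket name
s3.upload_file("debate.mp3", bucket, "debate.mp3")

transcribe.start_transcription_job(
    TranscriptionJobName="debate1-transcription",
    Media={"MediaFileUri": f"s3://{bucket}/debate.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    # Speaker identification must be enabled and given an expected speaker
    # count (assumed 3 here: two debaters plus the moderator).
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 3},
)
```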

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for the transcription.
    • accountId – Amazon account or IAM account?
    • results – contains the elements below:
      • transcripts – complete text of the audio transcription.
      • speaker_labels – contains the elements below:
        • speakers – the number of speakers specified for the transcription.
        • segments – one or more time-based segments by speaker. Each has a start and end time and a speaker label.
          • items – each segment has one or more time-based items. Each has a start and end time and a speaker label, but does not include the word.
      • items – a separate section with one item for each word or inferred punctuation. Includes the word along with alternatives and a confidence value for each word. Has a start and end time, but no speaker label.
    • status – status of the transcription job, e.g. in-progress, failed or completed.

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segments’] elements, which identified the speaker and had segment start and end times. The x-axis was populated by the segment timestamps and the y-axis was populated by categorical values of speaker names.
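A minimal sketch of loading those segment elements into a dataframe, assuming a locally saved copy of the Transcribe output JSON, might look like this:

```python
# Sketch of reading the speaker segments into a dataframe; the file and
# column names are assumptions.
import json
import pandas as pd

with open("transcribe_output.json") as f:
    output = json.load(f)

segments = output["results"]["speaker_labels"]["segments"]
df = pd.DataFrame([
    {
        "speaker": seg["speaker_label"],
        "start_seconds": float(seg["start_time"]),
        "end_seconds": float(seg["end_time"]),
    }
    for seg in segments
])
```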

An important data transformation was needed because a Plotly timeline chart requires datetimes for the period start and end x-axis values, while the Transcribe output JSON file only has start and end times expressed as elapsed seconds from the beginning of the audio.

Therefore the elapsed seconds were transformed into "fake" datetimes by combining an arbitrary date (in this case "1970-01-01") with an "HH:mm:ss" value created from the JSON file's seconds values.

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.
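Putting the last two steps together, a sketch of the "fake date" transformation and the timeline chart, continuing from the dataframe sketch above, might look like this:

```python
# Sketch of converting elapsed seconds to "fake" datetimes and drawing the
# timeline chart; continues from the dataframe sketch above.
import pandas as pd
import plotly.express as px

base = pd.Timestamp("1970-01-01")
df["start"] = base + pd.to_timedelta(df["start_seconds"], unit="s")
df["end"] = base + pd.to_timedelta(df["end_seconds"], unit="s")

fig = px.timeline(df, x_start="start", x_end="end", y="speaker")
fig.update_xaxes(tickformat="%H:%M:%S")  # show elapsed time rather than the fake date
fig.show()
```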