COVID-19 Data Analysis and Visualization Summary

This is a list of Canadian COVID-19 related data analysis and visualization that I created during 2020/21 pandemic.

Canada COVID-19 Case Map – COVID-19 new and total cases and mortalities by Canadian provincial health regions.
https://sitrucp.github.io/canada_covid_health_regions

Canada COVID-19 Vaccination Schedule – Canada’s current and historical vaccine doses administered and doses distributed. Also includes two distribution forecasts: 1) based on Canadian government vaccine planning and 2) based on Sep 30 2021 goal to vaccinate all 16+ Canadians. Uses COVID-19 Canada Open Data Working Group data.
https://sitrucp.github.io/covid_canada_vaccinations

Canada COVID-19 Vaccination vs World – Canada’s current and historical ranking vs all countries in the Our World in Data coronavirus data  by total doses, daily doses and total people vaccines adminstered.
https://sitrucp.github.io/covid_global_vaccinations

Global COVID-19 Vaccination Ranking – Ranking of all countries in the Our World in Data coronavirus data by daily vaccine dose administration. Includes small visualization of all time, population, vaccines used and trend. Can sort by these measures.
https://sitrucp.github.io/covid_world_vaccinations

COVID-19 New Case Rate by Canadian Health Regions Animation – SVG animation of new cases visualized as daily rate for each Canadian provincial health regions. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_canada

COVID-19 New Case Rate by Country Animation – SVG animation of new cases visualized as daily rate for each country in the Our World in Data dataset. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_world

COVID-19 New Case Rate by US State Animation – SVG animation of new cases visualized as daily rate for each US state. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_us

Apple Mobility Trends Reports – Canadian Regions Data – Apple cell phone mobility data tracking data used to create heat map visualizations of activity over time.
https://sitrucp.github.io/covid_canada_mobility_apple

WHO Draft landscape of COVID-19 Candidate Vaccines – AWS Textract used to extract tabular data from WHO pdf file. Python and Javascript code then used to create webpages from extracted data.
https://sitrucp.github.io/covid_who_vaccine_landscape

Montreal Confirmed COVID-19 Cases By City Neighbourhoods – Code and process used to scrape Quebec Health Montreal website to get COVID-19 case data for Montreal city boroughs.
https://github.com/sitrucp/covid_montreal_scrape_data

Use Excel Power Query to get data from Our World In Data – How to use Excel’s Power Query to get Our World in Data Github csv files automatically and update with simple refresh.
https://009co.com/?p=1491

 

TendiesTown.com – WallStreetBets gain and loss analysis

For a 16 month period, from February 2020 to June 2021, TendiesTown.com used the Reddit API in an automated process to retrieve r/wallstreetbets posts tagged with “Gain” or “Loss” flair. The subReddit r/wallstreetbets is a stock and option trading discussion board that gained popularity during the 2020/2021 “meme stock” media frenzy.

The retrieved posts were automatically categorized by gain or loss, post date, Reddit username, trade verification image url, and post url.

These posts were then manually processed to identify the gain or loss amount and reported stock ticker(s).

As of June 9, 2021, there were 4,582 trades with 2,585 gains and 1,997 losses recorded. Gain amounts totalled USD $389 million and losses USD $117 million.

While 6,844 posts were retrieved from the subreddit, 2,262 (33%) of these were rejected because they were either duplicate posts, not actual gain or loss trades (per r/wallstreetbets flair guidelines) or it was not possible to identify the amount.

The trades and more detailed analysis are available on the tendiestown.com website. As of June 9, 2021, the manual processing is no longer being done, so no new data will appear on the site.

“Gain” and “Loss” flair are described in more detail in the community flair guidelines but essentially “Gain” = trade made money and “Loss”= trade lost money.

Table 1: Summary statistics

Gains Losses
Count 2,585 1,997
Sum $389,329,571 $117,347,985
Avg $150,611 $58,762
Median $24,138 $12,120
Min $138 $100
Max $8,000,000 $14,776,725

The 4,582 trades were made by 3,903 unique Reddit users. 492 of these Reddit users have more than one trade as shown in table 2 below.

Table 2: Trade counts by unique Reddit users

# trades count of Reddit users
1 3,411
2 378
3 73
4 22
5 11
6 5
7 1
8

2

 

Charts 1 & 2: Daily and cumulative counts

Bar charts daily count of gain and loss

Bar charts cumulative count of gain and loss

Charts 3 & 4: Daily and cumulative amounts

Bar charts daily amount of gain and loss

Bar charts cumulative amount of gain and loss

Notable observations

Early 2021 increase in trades

The count and amount of trades rose significantly in the early months of 2021.  This can be explained by the surge of new users to the r/wallstreetbets subreddit due to huge increase in popular media reporting surrounding the GME, AMC meme stock craze.

Gains greater than losses

Gains consistently lead losses over the 16 month period.  Rather than simply concluding that r/wallstreetbets traders gain more than they lose, it suspected that the variance can be explained due to fact that it is easier to tell the world that you won, and harder to say you have lost.

@SirPatStew #ASonnetADay Dashboard

Sir Patrick Stewart @SirPatStew was doing Shakespeare Sonnet readings from his home during the COVID-19 lockdown and they were really good.

I wanted to track the sonnet reading tweets’ like and retweet counts over time and show this to other @SirPatStew fans. I also thought it would be very helpful to provide an automatically updated list of links to the sonnet tweets.

So I created a daily automated job that retrieved Sir Patrick’s Twitter data, saved it in the cloud. I also created a web page that included a visualization showing the #ASonnetADay hashtag tweets like and retweet counts (screenshot below) along with a table listing the tweets in chronological order  with like and retweet counts and as well as a link to the tweet.

@SirPatStew finished has long finished posting new #ASonnetADay tweets and his tweets continue to get visitors and like and retweet counts continue increasing.  The automated daily job is still ongoing and the visualization continues to be updated.

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Data was retrieved from Twitter using the Twitter API and Tweepy and the visualization was created using Plotly.js dot plot and is hosted on Github.io

The Tweet data was used to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with links to tweet to allow others to easily find and watch them.

 

Amazon AWS Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio that included timestamps for each word, to create the following speaker timeline visualization (created using a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1  I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their 2 minutes allotted time.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking and identifying overlaps where one speaker was talking first and second speaker starts during that talk ‘segment’.

Unfortunately Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiousity using Amazon Transcribe.

It may be possible to infer overlapping speaker talking and interruptions with multiple small interleaving speaker segments but that would be hard to distinguish from two people having a conversation with the Amazon Transcribe results. Might investigate alternative automated transcription methods and make new post. TBD.

Here is link to Github repository containing code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate which was on YouTube. The audio produced youtube-dl was an mp3 file. I used Libav to trim off the beginning ~30 minute portion of the audio as it was not the actual debate but pre-debate stage prep.

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for transcription.
    • accountId – Amazon account or IAM account?
    • results– contains elements below:
      • transcripts – complete text of audio transcription.
      • speaker_labels – contains elements below:
        • speakers – the number of speakers specified for transcription.
        • segments – one or more time based segments by speaker. Has start and end time, and speaker label.
          • items – segments have one or more time based items. Has start and end time, and speaker label. Does not include word.
      • items – separate section with more than one item, one for each word or inferred punctuation. Includes word along with alternatives with confidence value for each word. Has start and end time, but does not have speaker label.
    • status – of transcription job eg in-process, failed or completed.

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segment’] elements which identified the speaker and had segment start and end times. The x-axis was populated by the segment timestamps and the y-axis was populated by categorical values of speaker names.

An important data transformation was done because a Plotly timeline chart requires datetimes for period start and end and x-axis values. However the Transcribe output JSON file only has start and end times that are elapsed seconds from beginning of audio.

Therefore the elapsed seconds were transformed into “fake” dates by adding an arbitrary date (in this case “1970-01-01”) to a “HH:mm:ss” value created from the JSON file seconds values.

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.

Periodic chart elements by origin from SVG using Python

This cool periodic chart of the elements shows source / origin of the chemical elements. Source: Wikipedia created by Cmglee

It was really interesting to learn that elements may be created from more than one source/origin which are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

After learning this, I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging, it turned out that the author of the SVG file shown above had embedded the data I wanted along with Python code necessary to create the SVG file inside the file, which is very cool!

With some minor modification to the SVG file Python code I was able to extract the data into a csv data file and then use that as data source for the visualizations of counts of elements by origin below.

Read more about the SVG and the Python code modifications that I made in this Github repository:  https://github.com/sitrucp/periodic_elements.

The first chart below shows the counts of elements by their source/origin which answers my original question.

It was very interesting to learn that only 4 elements were created by the Big Bang and that all of the rest of the elements were created afterwards, by source/origin that came into being after the Big Bang.

The second chart shows counts of elements by number of source/origin. It was also very interesting to also learn that some elements have more than one source/origin, and 136 elements have more than one source/origin.