Use Excel Power Query to get data from Our World In Data

Our World In Data (OWID) has been doing a hero’s job of collating the world’s covid vaccine distribution and administration data. They generate additional metrics and allow site visitors to analyse them in awesome interactive visualizations.

However, I wanted to do some analysis that went beyond what was possible on the site. No problem, because the OWID team also provides the data as csv files in their Github repository.

These csv files are easily manually downloaded from Github to be used. But as the OWID vaccine data is updated regularly, often twice per day, you will probably want to automate the retrieval of updated csv files from Github.

There are many tools and methods to automate this but in this blog post I want to quickly highlight how Excel users can use Excel’s Power Query Get Data->From Other Sources->From Web feature to link to the csv files in Github. (Read my other blog post to learn more about Power Query.)

Once implemented you only need to click refresh to get latest csv data from OWID’s Github repository.

First, you need to get a url for the csv file’s in the OWID Github repository.

The easiest way to get correct url is to open the csv file in Github in Raw format by clicking on the file name and then clicking on the Raw button on top right corner of page. This will open the csv file in native “raw” format. Then you can simply copy the url from browser location bar.

You can also manually create a url to any Github repository file as follows below. Let’s use the vaccinations.csv file in OWID’s Github repository as an example.

https://raw.githubusercontent.com/owid/COVID-19-data/master/public/data/vaccinations/vaccinations.csv

The vaccinations.csv Github repository file url has the following parts:

    • Base url: https://raw.githubusercontent.com
    • Repository owner: owid
    • Repository name: COVID-19-data
    • Repository branch: master
    • Repository base element: public
    • Repository folders: data/vaccinations
    • Repository file name: vaccinations.csv

Based on this example you should be able to manually create a url for any file in any Github repository.

Once you have your url, you can test if it is correct by simply copy pasting it into a browser location bar and verifying the url is good, and retrieves the data you are looking for.

After verifying the url is correct, go to your Excel file, find the Excel Power Query Get Data feature and select From Other SourcesFrom Web which will show input field for a url. Paste your url into the input field and simply click Load, which will load the csv file data into an Excel worksheet.

Now instead of manually downloading data from Github, you can simply click refresh and automatically retrieve the most recent csv file from the OWID Github repository.

@SirPatStew #ASonnetADay Dashboard

Sir Patrick Stewart @SirPatStew was doing Shakespeare Sonnet readings from his home during the COVID-19 lockdown and they were really good.

I retrieved Sir Patrick’s Twitter data and created a visualization of his #ASonnetADay hashtag tweets like and retweet counts to show their popularity (screenshot below).

The visualization is accompanied by a table listing each tweet in chronological order  with like and retweet counts and a link to the tweet.

The data is still being collected even though he has long stopped posting these #ASonnetADay tweets but the like and retweet counts continue to grow.

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Data was retrieved from Twitter using the Twitter API and Tweepy and the visualization was created using Plotly.js dot plot and is hosted on Github.io

The Tweet data was used to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with links to tweet to allow others to easily find and watch them.

 

The human radio-broadcast bubble

“Space… is big. Really big. You just won’t believe how vastly, hugely, mindbogglingly big it is” (quote from The Hitchhiker’s Guide to the Galaxy)

One measure of  space is the size of a “bubble” created by radio waves radiating outwards from the Earth over the past 139 years since they were first intentionally produced by humans in the 1880’s. Wikipedia: History of radio

The image below, created by science blogger Adam Grossman, illustrates this “bubble” of radio waves as a tiny blue dot on an artist visualization of the Milky Way galaxy.

The human radio-broadcast bubble compared to Milky Way galaxy.

The visualization has an inset in the bottom-right with a close-up of the Earth’s region of the Milky Way galaxy to make the blue dot more readily visible. This shows that the blue dot is very small compared to the Milky Way.

However, that tiny blue dot is actually an enormous sphere with a radius of 139 light-years Wikipedia: light-year and the Earth at its center and the sphere’s grows another light-year every year. For the non-science people, those radio waves are travelling at the speed of light!

Tiny blue dot is actually an enormous sphere with a radius of 139 light-years and the Earth at its center!

The blue dot is very small compared to the Milky Way and even smaller when compared to the universe.

The distance from Earth to the edge of the observable universe is 46 billion light-years Wikipedia: Universe and the observable universe contains about 200 billion galaxies according to latest Hubble observations Nasa: Hubble.

Feeling small yet? Even though that blue dot is arguably the biggest thing humans have created, its 139 light-year radius about 331 million times smaller than the 46 billion light-year distance to the edge of the observable universe.

Also as a final note, it is not likely that the radio-waves are actually detectable, at least not by our current technology, after travelling 139 light-years. But, perhaps somewhere out there, aliens with some magic technology are entertaining themselves by listening to human radio and tv broadcasts, air traffic controllers, police and EMT calls, and cell phone conversations. Nice thought .. scary thought?

Also this image has been used often without attribution to its creators. This planetary.org blog post gives more history of the creation of this image in 2011 by Adam Grossman and Nick Risinger. The original blog post by Adam Grossman can be found archived on github.io and it was featured on YCombinator’s Hacker News here.

 

 

 

Amazon AWS Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio that included timestamps for each word, to create the following speaker timeline visualization (created using a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1  I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their 2 minutes allotted time.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking and identifying overlaps where one speaker was talking first and second speaker starts during that talk ‘segment’.

Unfortunately Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiousity using Amazon Transcribe.

It may be possible to infer overlapping speaker talking and interruptions with multiple small interleaving speaker segments but that would be hard to distinguish from two people having a conversation with the Amazon Transcribe results. Might investigate alternative automated transcription methods and make new post. TBD.

Here is link to Github repository containing code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate which was on YouTube. The audio produced youtube-dl was an mp3 file. I used Libav to trim off the beginning ~30 minute portion of the audio as it was not the actual debate but pre-debate stage prep.

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for transcription.
    • accountId – Amazon account or IAM account?
    • results– contains elements below:
      • transcripts – complete text of audio transcription.
      • speaker_labels – contains elements below:
        • speakers – the number of speakers specified for transcription.
        • segments – one or more time based segments by speaker. Has start and end time, and speaker label.
          • items – segments have one or more time based items. Has start and end time, and speaker label. Does not include word.
      • items – separate section with more than one item, one for each word or inferred punctuation. Includes word along with alternatives with confidence value for each word. Has start and end time, but does not have speaker label.
    • status – of transcription job eg in-process, failed or completed.

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segment’] elements which identified the speaker and had segment start and end times. The x-axis was populated by the segment timestamps and the y-axis was populated by categorical values of speaker names.

An important data transformation was done because a Plotly timeline chart requires datetimes for period start and end and x-axis values. However the Transcribe output JSON file only has start and end times that are elapsed seconds from beginning of audio.

Therefore the elapsed seconds were transformed into “fake” dates by adding an arbitrary date (in this case “1970-01-01”) to a “HH:mm:ss” value created from the JSON file seconds values.

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.

D3.js SVG animation – COVID-19 rate “race” visualization

This visualization shows COVID-19 new cases as a “race” of dots moving from left to right.

The dot’s “speed” or how long it takes to move from left to right is based on the number of cases per day.

If a country has one case per day, it will take an entire day for the dot to move from left to right. Some countries have many 1000’s of new cases daily and the dot moves from left to right in minutes or seconds.

There are three  visualizations for following geographical regions. Click “viz” to view the visualization and “github code” to view the code for the visualization:

The screenshot below shows countries of world. Some countries have not had any new cases over past 7 days so show as gray. Those that have had new cases over past 7 days are shown as white circle (no change from prev 7 days), red (increase from prev 7 days) or green (decrease from prev 7 days).

The visualization is sorted by country by default but can change sorting by average new cases. In addition, you can toggle between showing new cases as actual count or new cases per million (population).

The visualization uses D3.js SVG to create a canvas for each location, the location name text & counts, and circle shape, and transitions, and to retrieve csv file and process data, including filtering to most recent 7 days, group by location to get case count means.

The most important aspect for this visualization was how to use D3.js to animate the movement of the white circle across the canvas, and how to repeat the movement in an ‘endless’ loop.

The code block below hightlights use of a function that uses D3.js .on(“end”, repeat);  to loop through repeat function ‘endlessly’ so that shape is moved across canvas, and then back to original position, to move across canvas again and again. See bl.ocks.org ‘Looping a transition in v5’ example.

The duration() value is the proxy for rate in this visualization and is calculated in another function separately for each location SVG. I also added a counter that would increment an SVG text value to show each loop’s count on canvas.

// repeat transition endless loop
function repeat() {
    svgShape
    .attr("cx", 150)
    .transition()
    .duration(cycleDuration)
    .ease(d3.easeLinear)
    .attr("cx", 600)
    .transition()
    .duration(1)
    .attr("cx", 150)
    .on("end", repeat);
    
    svgTextMetric
    .text(counter + ' / ' + metric);
    counter++;
  };

This visualization was inspired by Jan Willem Tulp’s COVID-19 spreading rates and Dr James O’Donoghue’s  relative rotation periods of planets, and uses same data as Tulp’s spreading rates.