@SirPatStew #ASonnetADay Dashboard

Sir Patrick Stewart @SirPatStew was doing Shakespeare Sonnet readings from his home during the COVID-19 lockdown and they were really good.

So, using the Twitter API and Tweepy, I retrieved his Twitter data and created a Github.io hosted Plotly.js dot plot visualization of his Shakespeare Sonnet reading tweets. Screenshot below:

The Tweet data was used to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with a link to each tweet so that others can easily find and watch them.
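The retrieval step could be sketched roughly as follows. This is a minimal sketch, not the code from the repository: the function names `fetch_statuses` and `sonnet_rows` are hypothetical, the credentials are placeholders you would supply yourself, and the sample tweets at the bottom are made-up values used only to show the shape of the output.

```python
from types import SimpleNamespace  # only used for the offline sample below

SCREEN_NAME = "SirPatStew"
HASHTAG = "#asonnetaday"  # compared in lower case


def fetch_statuses(consumer_key, consumer_secret, access_token, access_secret):
    """Return an iterator over the user's timeline (needs real API credentials)."""
    import tweepy  # imported lazily so the rest of the sketch runs without it

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)
    cursor = tweepy.Cursor(
        api.user_timeline, screen_name=SCREEN_NAME, tweet_mode="extended"
    )
    return cursor.items()


def sonnet_rows(statuses):
    """Keep only #ASonnetADay tweets and pull out the fields used in the charts."""
    rows = []
    for status in statuses:
        text = status.full_text
        if HASHTAG not in text.lower():
            continue
        rows.append({
            "created_at": status.created_at,
            "text": text,
            "likes": status.favorite_count,
            "retweets": status.retweet_count,
            "url": f"https://twitter.com/{SCREEN_NAME}/status/{status.id}",
        })
    return rows


# Offline example with made-up tweets standing in for Tweepy Status objects.
sample = [
    SimpleNamespace(id=1, created_at="2020-03-22", full_text="Sonnet 1 #ASonnetADay",
                    favorite_count=10, retweet_count=2),
    SimpleNamespace(id=2, created_at="2020-03-23", full_text="An unrelated tweet",
                    favorite_count=5, retweet_count=1),
]
rows = sonnet_rows(sample)
```

The resulting list of dictionaries can feed both the dot plot (likes and retweets) and the table of tweet links.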

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Amazon AWS Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio, with timestamps for each word, and used the result to create the following speaker timeline visualization (a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1, I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their allotted 2 minutes.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking and to identify overlaps, where one speaker is talking and a second speaker starts talking during that ‘segment’.
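That overlap check could be sketched as follows. This is only an illustration of the idea: it assumes segments whose timestamps are allowed to overlap, the function name `find_interruptions` and the dictionary keys are hypothetical, and the sample segments are made up.

```python
def find_interruptions(segments):
    """Return (interrupted_speaker, interrupting_speaker, start_time) tuples.

    A segment counts as an interruption when it starts before the segment
    with the latest end time so far has finished, and the speakers differ.
    """
    interruptions = []
    current = None  # segment with the latest end time seen so far
    for seg in sorted(segments, key=lambda s: s["start"]):
        if (current is not None
                and seg["start"] < current["end"]
                and seg["speaker"] != current["speaker"]):
            interruptions.append((current["speaker"], seg["speaker"], seg["start"]))
        if current is None or seg["end"] > current["end"]:
            current = seg
    return interruptions


# Made-up example: speaker B cuts in 45 seconds into speaker A's two minutes.
segments = [
    {"speaker": "A", "start": 0.0, "end": 120.0},
    {"speaker": "B", "start": 45.0, "end": 50.0},
    {"speaker": "B", "start": 120.0, "end": 240.0},
]
interruptions = find_interruptions(segments)
```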

Unfortunately, Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiosity using Amazon Transcribe.

It may be possible to infer overlapping speech and interruptions from runs of small interleaved speaker segments, but in the Amazon Transcribe results those would be hard to distinguish from two people simply having a conversation. I might investigate alternative automated transcription methods and write a new post. TBD.

Here is a link to the Github repository containing the code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate on YouTube. The audio produced by youtube-dl was an mp3 file. I used Libav to trim off the first ~30 minutes of the audio, as that portion was pre-debate stage prep rather than the actual debate.

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.
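A rough sketch of that upload-and-transcribe step is shown below. It is not the code from the repository: the function names, bucket, key and job names are hypothetical, and the choice of three speakers (two candidates plus the moderator) is my assumption. The Boto3 calls used are `upload_file`, `start_transcription_job` and `get_transcription_job`.

```python
def build_transcription_request(job_name, bucket, key, max_speakers=3):
    """Build the keyword arguments for start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": "mp3",
        "LanguageCode": "en-US",
        # Speaker identification is what produces the speaker segments.
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": max_speakers},
    }


def upload_and_transcribe(local_path, bucket, key, job_name):
    """Upload the audio to S3 and start the job (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder runs without it

    boto3.client("s3").upload_file(local_path, bucket, key)
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        **build_transcription_request(job_name, bucket, key)
    )
    # The job runs asynchronously; poll this call until it reports completion.
    return transcribe.get_transcription_job(TranscriptionJobName=job_name)


# Hypothetical names, shown only to illustrate the request shape.
request = build_transcription_request("debate-1", "my-debate-bucket", "debate.mp3")
```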

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for transcription.
    • accountId – the AWS account ID under which the transcription job ran.
    • results – contains elements below:
      • transcripts – complete text of audio transcription.
      • speaker_labels – contains elements below:
        • speakers – the number of speakers specified for transcription.
        • segments – one or more time based segments by speaker. Has start and end time, and speaker label.
          • items – segments have one or more time based items. Has start and end time, and speaker label. Does not include word.
      • items – separate section with more than one item, one for each word or inferred punctuation. Includes word along with alternatives with confidence value for each word. Has start and end time, but does not have speaker label.
    • status – status of the transcription job, e.g. in progress, failed or completed.
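To make the structure above concrete, here is a small sketch that pulls the speaker segments out of a Transcribe-style JSON document. The sample JSON is a made-up, heavily shortened stand-in for the real output (it assumes the times are stored as strings), and `segment_rows` is a hypothetical helper name.

```python
import json

# Made-up, shortened sample that follows the structure described above.
sample_json = """
{
  "jobName": "debate-1",
  "accountId": "123456789012",
  "results": {
    "transcripts": [{"transcript": "Good evening. Thank you."}],
    "speaker_labels": {
      "speakers": 2,
      "segments": [
        {"start_time": "0.04", "end_time": "1.50", "speaker_label": "spk_0",
         "items": [{"start_time": "0.04", "end_time": "0.60", "speaker_label": "spk_0"}]},
        {"start_time": "1.62", "end_time": "2.90", "speaker_label": "spk_1",
         "items": [{"start_time": "1.62", "end_time": "2.10", "speaker_label": "spk_1"}]}
      ]
    },
    "items": []
  },
  "status": "COMPLETED"
}
"""


def segment_rows(transcribe_output):
    """Flatten the speaker segments into one row per segment."""
    segments = transcribe_output["results"]["speaker_labels"]["segments"]
    return [
        {
            "speaker": seg["speaker_label"],
            "start": float(seg["start_time"]),  # elapsed seconds from start of audio
            "end": float(seg["end_time"]),
        }
        for seg in segments
    ]


rows = segment_rows(json.loads(sample_json))
```

A list of rows like this can be passed straight to `pandas.DataFrame(rows)`.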

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segments’] elements, which identify the speaker and have segment start and end times. The x-axis was populated by the segment timestamps and the y-axis by categorical values of speaker names.

An important data transformation was needed because a Plotly timeline chart requires datetimes for the period start, period end and x-axis values. However, the Transcribe output JSON file only has start and end times expressed as elapsed seconds from the beginning of the audio.

Therefore the elapsed seconds were transformed into “fake” dates by adding an arbitrary date (in this case “1970-01-01”) to a “HH:mm:ss” value created from the JSON file seconds values.
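A minimal sketch of this conversion is shown below. It uses `timedelta` arithmetic rather than building an “HH:mm:ss” string, which is a swapped-in technique that gives the same “fake” datetime; the helper name `seconds_to_fake_datetime` is hypothetical.

```python
from datetime import datetime, timedelta

BASE_DATE = datetime(1970, 1, 1)  # arbitrary date; only the time of day matters


def seconds_to_fake_datetime(elapsed_seconds):
    """Turn elapsed seconds (number or numeric string) into a datetime on BASE_DATE."""
    return BASE_DATE + timedelta(seconds=float(elapsed_seconds))


# 3725.5 seconds is 1 hour, 2 minutes and 5.5 seconds into the audio.
start = seconds_to_fake_datetime("3725.5")
label = start.strftime("%Y-%m-%d %H:%M:%S")  # "1970-01-01 01:02:05"
```

The resulting datetimes can be used as the start and end columns of the timeline chart, and the axis tick format can hide the date part so only the elapsed time is shown.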

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.

Periodic chart elements by origin from SVG using Python

This cool periodic chart of the elements shows the source / origin of the chemical elements. Source: Wikipedia, created by Cmglee.

It was really interesting to learn that elements may be created from more than one source/origin which are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

After learning this, I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging, it turned out that the author of the SVG file shown above had embedded the data I wanted along with Python code necessary to create the SVG file inside the file, which is very cool!

With some minor modifications to the Python code in the SVG file, I was able to extract the data into a csv data file and then use that as the data source for the visualizations of counts of elements by origin below.
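Once the data is in a csv file, both counts can be computed with a few lines of Python. The sketch below assumes a hypothetical layout with one row per element/origin pair and columns named `element` and `origin`; the rows are made-up placeholders (E1, E2, E3), not the real data.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows; the real file has one row per element/origin pair.
sample_csv = """element,origin
E1,Big Bang fusion
E2,Big Bang fusion
E2,Dying low-mass stars
E3,Exploding massive stars
E3,Merging neutron stars
E3,Dying low-mass stars
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# First chart: how many elements each source/origin contributes to.
elements_per_origin = Counter(row["origin"] for row in rows)

# Second chart: how many elements have 1, 2, 3, ... sources/origins.
origins_per_element = Counter(row["element"] for row in rows)
elements_by_origin_count = Counter(origins_per_element.values())
```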

Read more about the SVG and the Python code modifications that I made in this Github repository:  https://github.com/sitrucp/periodic_elements.

The first chart below shows the counts of elements by their source/origin which answers my original question.

It was very interesting to learn that only 4 elements were created by the Big Bang and that all of the rest of the elements were created afterwards, by sources/origins that came into being after the Big Bang.

The second chart shows counts of elements by number of sources/origins. It was also very interesting to learn that some elements have more than one source/origin, and 136 elements have more than one source/origin.

Plotly Express Python remove legend title

As of Plotly.py 4.5, Plotly Express no longer puts the = in trace names, because legends support titles (source).

Prior to Plotly.py 4.5, I had used this ‘hover_data’ trick to remove the ‘=’ from the legend trace names.

hover_data=['gain_loss']).for_each_trace(lambda t: t.update(name=t.name.split("=")[1]))

However, now with Plotly.py 4.5, I want to remove the legend title. The new trick to do that is to set the new legend title to an empty string in the fig.update_layout section.

    'legend_title_text': ''

This gives a much cleaner look for legends where the trace names and your chart title are sufficiently explanatory and a legend title would be superfluous.
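For context, here is a minimal sketch of where that setting goes in a complete Plotly Express chart. The iris sample data and the column choices are just stand-ins for illustration, and `build_figure` is a hypothetical helper name; it assumes Plotly.py 4.5 or later.

```python
# Layout settings passed to fig.update_layout; an empty string hides the legend title.
LAYOUT_UPDATES = {"legend_title_text": ""}


def build_figure():
    """Build a scatter chart with a legend but no legend title."""
    import plotly.express as px  # imported lazily; requires plotly to be installed

    df = px.data.iris()  # sample data bundled with Plotly Express
    fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
    fig.update_layout(**LAYOUT_UPDATES)
    return fig
```

Calling `build_figure().show()` should display the chart with the species names in the legend and no title above them.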