@SirPatStew #ASonnetADay Dashboard

Github.io hosted Plotly.js dot plot visualization of @SirPatStew’s Shakespeare Sonnet reading Twitter posts.

Sir Patrick Stewart did Shakespeare sonnet readings from his home during the COVID-19 lockdown, and they are really good.

So, using the Twitter API and Tweepy, I retrieved his tweet data to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with a link to each tweet so others can easily find and watch them.
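
The tabular list can be sketched roughly as below. This is a minimal sketch and not the code from the repository: it assumes the tweets were already retrieved (for example with Tweepy) and reduced to plain dictionaries, the field names `id`, `text`, `likes` and `retweets` are hypothetical, and the sample tweets are made up.

```python
HASHTAG = "#asonnetaday"


def sonnet_rows(tweets, screen_name="SirPatStew"):
    """Keep only #ASonnetADay tweets and add a link to each one."""
    rows = []
    for tweet in tweets:
        # Case-insensitive hashtag match on the tweet text
        if HASHTAG not in tweet["text"].lower():
            continue
        rows.append({
            "text": tweet["text"],
            "likes": tweet["likes"],
            "retweets": tweet["retweets"],
            "url": f"https://twitter.com/{screen_name}/status/{tweet['id']}",
        })
    # Most-liked tweets first
    return sorted(rows, key=lambda r: r["likes"], reverse=True)


# Made-up sample data, only to show the shape of the input
sample = [
    {"id": 1, "text": "Sonnet 1 #ASonnetADay", "likes": 10, "retweets": 2},
    {"id": 2, "text": "Unrelated tweet", "likes": 99, "retweets": 9},
    {"id": 3, "text": "Sonnet 2 #ASonnetADay", "likes": 25, "retweets": 4},
]

for row in sonnet_rows(sample):
    print(row["likes"], row["retweets"], row["url"])
```

The same rows could feed both the table and the dot plot, since each row carries the like and retweet counts.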

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Screenshot of visualization below:

Amazon AWS Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TLDR: I used Amazon Transcribe to transcribe the audio of the first presidential debate, with timestamps for each word, and used the results to create the following speaker timeline visualization (created using a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1, I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their allotted 2 minutes.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking, and to identify overlaps where a second speaker starts talking during the first speaker’s talk ‘segment’.

Unfortunately, Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiosity using Amazon Transcribe.

It may be possible to infer overlapping talk and interruptions from multiple small interleaved speaker segments, but with the Amazon Transcribe results that would be hard to distinguish from two people simply having a conversation. I might investigate alternative automated transcription methods and make a new post. TBD.

Here is a link to the Github repository containing the code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate which was on YouTube. The audio produced by youtube-dl was an mp3 file. I used Libav to trim off the first ~30 minutes of the audio, as that portion was not the actual debate but pre-debate stage prep.
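
The download and trim steps could look something like the sketch below. It only builds the command lines rather than reproducing the exact commands I ran: the video URL is a placeholder, the file names are hypothetical, and the 1800 second offset is an approximation of the "~30 minutes" mentioned above. It assumes `youtube-dl` and Libav's `avconv` are installed and on the PATH.

```python
import subprocess


def build_download_cmd(video_url, out_template="debate.%(ext)s"):
    """youtube-dl command that keeps only the audio, converted to mp3."""
    # -x extracts audio only; --audio-format converts it to mp3
    return ["youtube-dl", "-x", "--audio-format", "mp3",
            "-o", out_template, video_url]


def build_trim_cmd(src, dst, skip_seconds):
    """avconv command that drops the first skip_seconds of the audio."""
    # -ss before -i seeks to the offset; -c copy avoids re-encoding
    return ["avconv", "-ss", str(skip_seconds), "-i", src,
            "-c", "copy", dst]


def run(cmd):
    """Run a command and raise if it exits with an error."""
    subprocess.run(cmd, check=True)


# Placeholder URL and hypothetical file names, printed rather than executed
print(build_download_cmd("https://www.youtube.com/watch?v=VIDEO_ID"))
print(build_trim_cmd("debate.mp3", "debate_trimmed.mp3", 30 * 60))
```

Passing either list to `run()` would execute the command.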

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.
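
A minimal sketch of the upload and transcription job with Boto3 is below. The bucket, key and job names are hypothetical, and the settings (English, mp3, speaker labels with up to 3 speakers for two debaters plus the moderator) are assumptions rather than the exact values used in my code.

```python
def transcribe_job_args(job_name, bucket, key, max_speakers=3):
    """Build the keyword arguments for start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": "mp3",
        "LanguageCode": "en-US",
        # Speaker labels are what produce the per-speaker segments
        "Settings": {
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": max_speakers,
        },
    }


def upload_and_transcribe(local_path, bucket, key, job_name):
    """Upload the audio to S3, then start the Transcribe job."""
    # Imported here so the helper above works without boto3 installed
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(
        **transcribe_job_args(job_name, bucket, key)
    )


# Hypothetical names, shown only to illustrate the request shape
print(transcribe_job_args("debate-1", "my-debate-bucket", "debate_trimmed.mp3"))
```

Calling `upload_and_transcribe()` needs AWS credentials configured and will incur Transcribe charges.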

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for transcription.
    • accountId – Amazon account or IAM account?
    • results – contains elements below:
      • transcripts – complete text of audio transcription.
      • speaker_labels – contains elements below:
        • speakers – the number of speakers specified for transcription.
        • segments – one or more time based segments by speaker. Has start and end time, and speaker label.
          • items – segments have one or more time based items. Has start and end time, and speaker label. Does not include word.
      • items – separate section with more than one item, one for each word or inferred punctuation. Includes word along with alternatives with confidence value for each word. Has start and end time, but does not have speaker label.
    • status – status of the transcription job, e.g. in progress, failed or completed.
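
Given that structure, pulling the speaker segments out of the JSON can be sketched as below. The sample dictionary is a tiny made-up stand-in for a real output file, and the field names `speaker_label`, `start_time` and `end_time` (with times stored as strings) are my reading of the Transcribe output, so treat them as assumptions.

```python
def speaker_segments(transcript):
    """Return one row per speaker segment with numeric start and end."""
    segments = transcript["results"]["speaker_labels"]["segments"]
    return [
        {
            "speaker": seg["speaker_label"],
            # Times are elapsed seconds from the start of the audio
            "start": float(seg["start_time"]),
            "end": float(seg["end_time"]),
        }
        for seg in segments
    ]


# Made-up sample that mirrors the structure listed above
sample = {
    "jobName": "example-job",
    "status": "COMPLETED",
    "results": {
        "transcripts": [{"transcript": "..."}],
        "speaker_labels": {
            "speakers": 2,
            "segments": [
                {"speaker_label": "spk_0", "start_time": "0.0",
                 "end_time": "4.5", "items": []},
                {"speaker_label": "spk_1", "start_time": "4.5",
                 "end_time": "7.25", "items": []},
            ],
        },
        "items": [],
    },
}

for row in speaker_segments(sample):
    print(row)
```

With a real file, `json.load()` on the downloaded output would supply the `transcript` argument.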

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segments’] elements which identified the speaker and had segment start and end times. The x-axis was populated by the segment timestamps and the y-axis was populated by categorical values of speaker names.

An important data transformation was done because a Plotly timeline chart requires datetimes for period start and end and x-axis values. However, the Transcribe output JSON file only has start and end times that are elapsed seconds from the beginning of the audio.

Therefore the elapsed seconds were transformed into “fake” dates by adding an arbitrary date (in this case “1970-01-01”) to a “HH:mm:ss” value created from the JSON file seconds values.
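
The "fake date" idea can be sketched as below. Note that this sketch swaps in a `timedelta` added to the arbitrary base date instead of building an “HH:mm:ss” string as described above; the resulting datetimes should be equivalent, but it is not the exact code I used.

```python
from datetime import datetime, timedelta

# Arbitrary base date; only the time-of-day part matters for the chart
BASE_DATE = datetime(1970, 1, 1)


def seconds_to_fake_datetime(elapsed_seconds):
    """Turn elapsed seconds from the start of audio into a datetime."""
    return BASE_DATE + timedelta(seconds=float(elapsed_seconds))


# A segment running from 3725.5 to 3790 seconds into the audio
start = seconds_to_fake_datetime("3725.5")
end = seconds_to_fake_datetime(3790)
print(start.isoformat(), end.isoformat())
```

The resulting values can be used directly as the period start and end columns of the timeline chart data.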

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.

Amazon AWS Textract – WHO “Draft landscape of COVID-19 candidate vaccines” – convert PDF to csv

TLDR: I extracted text from the WHO’s vaccine candidate PDF file using AWS Textract and made the text into a set of interactive web pages. View the AWS Textract PDF extract output csv files in this Github repository and view and interact with the web pages here.

The World Health Organization (WHO) maintains a regularly updated PDF document named Draft landscape of COVID-19 candidate vaccines which contains all COVID-19 vaccine candidates and treatments currently being developed and their status.

2020-01-03 EDIT: Note that the WHO is now providing an Excel file which contains the same data previously contained in the PDF file referred to in this post.

The main content in this PDF document is a tabular format with one vaccine or treatment per row. There is also some non-tabular text content including introduction text and footer notes.

I wanted a machine readable format version of this PDF document’s table data so I could do some analysis. This meant I needed to do PDF text extraction. There are lots of solutions. I ended up using Amazon Textract to extract the PDF into csv file format.

“Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables”

Using Textract

You need an AWS account to use Textract and it does cost to use the service (see costs for this analysis at bottom of post).

The Textract service has a UI that you can use to upload and process documents manually, i.e. without using code. However, it is also included in the AWS Boto3 SDK so you can code or use command line automation with Textract too.

I used the manual UI for this one-time processing of the PDF. However, if I were to automate this to regularly extract data from the PDF I would use Python and the Boto3 SDK. The Boto3 SDK Textract documentation and example code are here.
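
For reference, an automated version might start like the sketch below. I did not run this for the post, so it is only an illustration: the bucket and file names are hypothetical, and it assumes the PDF has already been uploaded to S3.

```python
def textract_job_args(bucket, key):
    """Build the keyword arguments for start_document_analysis."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Only table extraction is requested in this sketch
        "FeatureTypes": ["TABLES"],
    }


def start_table_extraction(bucket, key):
    """Start an asynchronous Textract job and return its job id."""
    # Imported here so the helper above works without boto3 installed
    import boto3

    client = boto3.client("textract")
    response = client.start_document_analysis(**textract_job_args(bucket, key))
    return response["JobId"]


# Hypothetical names, shown only to illustrate the request shape
print(textract_job_args("my-textract-bucket", "who-vaccine-landscape.pdf"))
```

The job id would then be passed to `get_document_analysis` to fetch the resulting blocks once the job has finished.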

Use this link to go directly to the AWS Textract UI once you are logged into your AWS console.

The Textract UI is quite intuitive and easy to use. You manually upload your PDF file and Textract processes it and shows you the interpreted content, which is described in “blocks” of text, tables and images. You can then select which of these to extract from the document.

During this process, the manual Textract UI asks permission to create a new S3 folder in your account where it uploads the PDF before processing it. This is because Textract will only accept documents from S3.

Screenshot of Textract UI

The Textract PDF extract output is a zip file that is automatically downloaded to your computer. The zip file contained the files listed below.

These 3 files appear to be standard information for any AWS Textract job.

    • apiResponse.json
    • keyValues.csv
    • rawText.txt

The rest of the AWS Textract output will vary depending on your document. In this case it returned a file for each table in the document.

    • table-1.csv
    • table-2.csv
    • table-3.csv
    • table-4.csv
    • table-5.csv
    • table-6.csv
    • table-7.csv
    • table-8.csv
    • table-9.csv

To process the AWS Textract table csv files, I imported them into a Pandas dataframe. The Python code used is in Github.

There was some minor clean-up of the OCR/interpreted tabular data, which included stripping trailing white spaces from all text and removing a few blank table rows. In addition, the PDF tables had two header rows that were removed and manually replaced with a single header row. There were also some minor OCR mistakes, for example some zeros were rendered as capital letter ‘O’ and some words were missing their last letter.
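
The clean-up steps can be sketched as below. This uses the standard library `csv` module rather than Pandas to keep the example self-contained, and the sample text and header names are made up; it is not the code from the repository.

```python
import csv
import io


def clean_table(csv_text, header, header_rows=2):
    """Strip whitespace, drop blank rows and replace the header rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    cleaned = []
    # Skip the original header rows; they are replaced by `header`
    for row in rows[header_rows:]:
        # strip() removes leading as well as trailing whitespace
        cells = [cell.strip() for cell in row]
        # Keep the row only if at least one cell has content
        if any(cells):
            cleaned.append(cells)
    return [header] + cleaned


# Made-up sample with two header rows, trailing spaces and a blank row
raw = (
    "Developer ,Platform \n"
    "name,type\n"
    "Example Lab  ,RNA \n"
    ",\n"
    "Other Lab,DNA  \n"
)

for row in clean_table(raw, ["developer", "platform"]):
    print(row)
```

OCR mistakes such as ‘O’ for zero are not handled here, since fixing them safely depends on knowing which columns should be numeric.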

The table columns in the *WHO Draft landscape of COVID-19 candidate vaccines* PDF document tables are shown below. Textract did a good job of capturing these columns.

Vaccine columns:

    • COVID-19 Vaccine developer or manufacturer
    • Vaccine platform
    • Type of candidate vaccine
    • Number of doses
    • Timing of doses
    • Route of administration
    • Stage – Phase 1
    • Stage – Phase 1/2
    • Stage – Phase 2
    • Stage – Phase 3

Treatment columns:

    • Platform
    • Type of candidate vaccine
    • Developer
    • Coronavirus target
    • Current stage of clinical evaluation/regulatory -Coronavirus candidate
    • Same platform for non-Coronavirus candidates

Cost for processing 9 page PDF file 3 times:

I copied the costs from AWS Billing below to give readers some idea of what Textract costs.

Amazon Textract USE1-AsyncFormsPagesProcessed $1.35
AsyncPagesProcessed: 0-1M pages of AnalyzeDocument Forms, $50 USD per 1000 pages, 27.000 pages, $1.35

Amazon Textract USE1-AsyncTablesPagesProcessed $0.41
AsyncPagesProcessed: 0-1M pages of AnalyzeDocument Tables, $15 USD per 1000 pages, 27.000 pages, $0.41

Amazon Textract USE1-SyncFormsPagesProcessed $0.30
SyncPagesProcessed: 0-1M pages of AnalyzeDocument Forms, $50 USD per 1000 pages, 6.000 pages, $0.30

Amazon Textract USE1-SyncTablesPagesProcessed $0.09
SyncPagesProcessed: 0-1M pages of AnalyzeDocument Tables, $15 USD per 1000 pages, 6.000 pages, $0.09
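
The billing arithmetic above is simply pages multiplied by the per-1000-page rate. A small sketch that reproduces the four line items is below; rounding half up to the cent is an assumption that happens to match the billed amounts.

```python
from decimal import Decimal, ROUND_HALF_UP


def textract_cost(pages, usd_per_1000_pages):
    """Cost in USD for a page count at a per-1000-pages rate."""
    cost = Decimal(pages) * Decimal(usd_per_1000_pages) / Decimal(1000)
    # Assumed rounding rule: half up to the nearest cent
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (label, pages, USD per 1000 pages) taken from the billing lines above
line_items = [
    ("Async forms", 27, 50),
    ("Async tables", 27, 15),
    ("Sync forms", 6, 50),
    ("Sync tables", 6, 15),
]

for label, pages, rate in line_items:
    print(label, textract_cost(pages, rate))

total = sum(textract_cost(pages, rate) for _, pages, rate in line_items)
print("Total", total)
```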

HTML tabular presentation of Textract output data

View it here.


Periodic chart elements by origin from SVG using Python

This cool periodic chart of the elements shows the source / origin of the chemical elements. Source: Wikipedia, created by Cmglee.

It was really interesting to learn that elements may be created from more than one source/origin which are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

After learning this, I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging, it turned out that the author of the SVG file shown above had embedded the data I wanted along with Python code necessary to create the SVG file inside the file, which is very cool!

With some minor modifications to the SVG file’s Python code I was able to extract the data into a csv data file and then use that as the data source for the visualizations of counts of elements by origin below.

Read more about the SVG and the Python code modifications that I made in this Github repository: https://github.com/sitrucp/periodic_elements.

The first chart below shows the counts of elements by their source/origin which answers my original question.

It was very interesting to learn that only 4 elements were created by the Big Bang and that all of the rest of the elements were created afterwards, by sources/origins that came into being after the Big Bang.

The second chart shows counts of elements by number of sources/origins. It was also very interesting to learn that some elements have more than one source/origin, and 136 elements have more than one source/origin.
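
The two counts behind these charts can be sketched as below. The csv layout (a `symbol` column and an `origins` column separated by semicolons) and the placeholder rows are hypothetical and are not the actual extracted data; see the repository for the real file.

```python
import csv
import io
from collections import Counter


def count_by_origin(rows):
    """Number of elements attributed to each source/origin."""
    counts = Counter()
    for row in rows:
        # An element with several origins is counted once per origin
        for origin in row["origins"].split(";"):
            counts[origin.strip()] += 1
    return counts


def count_by_number_of_origins(rows):
    """Number of elements having 1, 2, 3, ... sources/origins."""
    return Counter(len(row["origins"].split(";")) for row in rows)


# Placeholder rows with made-up symbols, only to show the layout
sample_csv = (
    "symbol,origins\n"
    "Xa,Big Bang fusion\n"
    "Xb,Big Bang fusion;Dying low-mass stars\n"
    "Xc,Human synthesis\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(count_by_origin(rows))
print(count_by_number_of_origins(rows))
```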

Retrieve and Process Environment Canada Hydrometric and Climate Data using Python

I recently needed to get flow and level data for a watercourse hydrological station as well as regional precipitation data for relevant location(s) upstream within the watershed of the station.

The objective was to combine two decades of watercourse flow and level with regional watershed precipitation data into a single set of analysis and reporting.

Environment Canada has all of this data and provides it in a variety of different ways.

Flow and level data

For many users, the Water Level and Flow web portal at wateroffice.ec.gc.ca is sufficient. However, I needed multiple years’ data, so I looked around for alternatives.

I found that the Hydat National Water Data Archive, which is provided as a SQLite database, was a good source. It contains all Canadian watercourse flow and level data up until about a month or so prior to the current date, and it was a quick and easy download and a convenient way to get the required data.
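
Querying the Hydat SQLite file can be sketched as below. The table and column names (`DLY_FLOWS`, `STATION_NUMBER`, `YEAR`, `MONTH`, `FLOW1` to `FLOW31`) are my understanding of the Hydat schema and should be treated as assumptions; the example runs against a tiny in-memory stand-in table with a made-up station id rather than the real database.

```python
import sqlite3


def daily_flows(conn, station_number, year):
    """Return {(month, day): flow} for one station and year."""
    cur = conn.execute(
        "SELECT * FROM DLY_FLOWS "
        "WHERE STATION_NUMBER = ? AND YEAR = ? ORDER BY MONTH",
        (station_number, year),
    )
    columns = [col[0] for col in cur.description]
    flows = {}
    for record in cur:
        row = dict(zip(columns, record))
        # Each row is one month, with one FLOWn column per day
        for day in range(1, 32):
            value = row.get(f"FLOW{day}")
            if value is not None:
                flows[(row["MONTH"], day)] = value
    return flows


# In-memory stand-in with only two day columns and a made-up station
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DLY_FLOWS "
    "(STATION_NUMBER TEXT, YEAR INTEGER, MONTH INTEGER, "
    "FLOW1 REAL, FLOW2 REAL)"
)
conn.execute("INSERT INTO DLY_FLOWS VALUES ('00XX000', 2019, 1, 1.5, NULL)")
conn.execute("INSERT INTO DLY_FLOWS VALUES ('00XX000', 2019, 2, 2.0, 2.5)")

print(daily_flows(conn, "00XX000", 2019))
```

With the real download, `sqlite3.connect()` would be pointed at the Hydat file path instead of `:memory:`.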

Real-time / more recent flow data is available using a nice search interface in the same wateroffice.ec.gc.ca portal.

In addition, this same real-time / more recent data can also be obtained by going directly to an FTP site by clicking the ‘Datamart’ link on wateroffice.ec.gc.ca.

More details about the FTP site content are available here.

Both of these methods result in downloading csv files.

The website portal allows you to search by station name or reference id. The FTP site files are named with station reference id so you need to know the reference id to get the correct csv file.

This hydrological flow and level data is also available through the MSC GeoMet API. The API was a bit too complex to use to get data and do analysis, so I passed on it, but it looks very powerful and well documented.

Precipitation data

Precipitation data is available for stations across Canada and this website https://climate.weather.gc.ca/historical_data/search_historic_data_e.html allows search by name, reference id or geographical location.

However, since this only allows download by single month of data, I needed to find another method to more quickly get multiple years and months of data.

When you use the website search and follow the links to the download page you will also see a link to get more historical data. This link brings you to a Google Drive folder. This folder documents how to use their ‘wget’ method to download files. However, I was using Windows and didn’t want to mess around with Cygwin or Linux on Windows to be able to use wget.

It turned out to be relatively simple to replicate the wget process using Python Requests and then loop through the csv files to process them. Python io.StringIO was also used to stream the request content into a ‘fake csv’ file in each loop iteration; these were aggregated into a Python list that was converted into a Pandas dataframe so Pandas could be used to process the data.
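
The loop can be sketched as below. The URL pattern and query parameters are my reading of the documented wget command and should be treated as assumptions, the station id in the usage line is made up, and the rows are collected as plain dictionaries rather than going straight into Pandas. See the repository for the actual code.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed bulk download endpoint, based on the documented wget method
BASE_URL = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html"


def bulk_data_url(station_id, year, month, timeframe=2):
    """Build the csv download URL for one station, year and month."""
    params = {
        "format": "csv",
        "stationID": station_id,
        "Year": year,
        "Month": month,
        "Day": 14,
        "timeframe": timeframe,  # assumed: 2 requests daily data
        "submit": "Download Data",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def http_get_text(url):
    """Fetch a URL with Requests and return the response body as text."""
    # Imported here so the rest works without requests installed
    import requests

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.text


def collect_rows(station_id, years, months=range(1, 13), fetch=http_get_text):
    """Download each csv and aggregate all rows into one list."""
    rows = []
    for year in years:
        for month in months:
            text = fetch(bulk_data_url(station_id, year, month))
            # io.StringIO turns the response text into a 'fake csv' file
            rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


# Made-up station id, printed only to show the URL shape
print(bulk_data_url(12345, 2019, 1))
```

The list returned by `collect_rows()` can be passed to `pandas.DataFrame()` for further processing, and the `fetch` argument makes it easy to try the loop with canned text before hitting the real site.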

The Python code is in my Github ‘Environment-Canada-Precip-and-Flow-Data’ Repository.