D3.js SVG animation – COVID-19 rate visualization

This is a visualization that shows the relative rate of increase in new COVID-19 cases over the past 7 days.

The visualization uses D3.js to create an SVG canvas for each location with the location name, count text, a circle shape, and transitions. D3.js is also used to retrieve the csv file and process the data, including filtering to the most recent 7 days and grouping by location to get mean case counts.

The most important aspect of this visualization was figuring out how to use D3.js to animate the movement of the white circle across the canvas, and how to repeat the movement in an ‘endless’ loop.

The code block below highlights the use of a function that uses D3.js’s .on("end", repeat); to call the repeat function ‘endlessly’, so that the shape moves across the canvas, snaps back to its original position, and moves across the canvas again and again. See the bl.ocks.org ‘Looping a transition in v5’ example.

The duration() value is the proxy for rate in this visualization and is calculated in another function separately for each location’s SVG. I also added a counter that increments an SVG text value to show each loop’s count on the canvas.

// Repeat transition in an 'endless' loop.
// svgShape, svgTextMetric, cycleDuration, metric and counter are
// defined elsewhere, separately for each location's SVG.
function repeat() {
    // Reset the circle to its start position, animate it across the
    // canvas over cycleDuration ms, snap it back in 1 ms, then call
    // repeat() again when the transition ends.
    svgShape
        .attr("cx", 150)
        .transition()
        .duration(cycleDuration)
        .ease(d3.easeLinear)
        .attr("cx", 600)
        .transition()
        .duration(1)
        .attr("cx", 150)
        .on("end", repeat);

    // Update the on-canvas loop counter text.
    svgTextMetric
        .text(counter + ' / ' + metric);
    counter++;
}

This visualization was inspired by Jan Willem Tulp’s COVID-19 spreading rates and Dr James O’Donoghue’s relative rotation periods of planets, and uses the same data as Tulp’s spreading rates.

AWS Textract – WHO “Draft landscape of COVID-19 candidate vaccines” – convert PDF to csv

TLDR: View the AWS Textract PDF extract output csv files in this Github repository, and interact with an HTML tabular presentation of the resulting data here.

The World Health Organization (WHO) maintains a regularly updated PDF document named Draft landscape of COVID-19 candidate vaccines which contains all COVID-19 vaccine candidates and treatments currently being developed and their status.

The main content in this PDF document is in tabular format, with one vaccine or treatment per row. There is also some non-tabular text content, including introductory text and footnotes.

I wanted a machine-readable version of this PDF document’s table data so I could do some analysis. That meant PDF text extraction, and there are lots of solutions. I ended up using Amazon Textract to extract the PDF’s tables into csv file format.

“Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.”

Using Textract

You need an AWS account to use Textract, and the service is not free (see the costs for this analysis at the bottom of the post).

The Textract service has a UI that you can use to upload and process documents manually, i.e. without code. However, Textract is also included in the AWS Boto3 SDK, so you can automate it with code or from the command line too.

I used the manual UI for this one-time processing of the PDF. However, if I were to automate this to regularly extract data from the PDF, I would use Python and the Boto3 SDK. The Boto3 SDK Textract documentation and example code are here.
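If I did automate it, a minimal sketch with Boto3 might look like the block below. To be clear, this is an illustration, not the code I ran: the S3 bucket and file names are hypothetical, and because multi-page PDFs require Textract’s asynchronous API, it uses start_document_analysis with simple polling (production code would use the SNS notification channel instead).

import time
import boto3

# Hypothetical S3 location - Textract reads documents from S3.
BUCKET = "my-textract-input"
DOCUMENT = "who-vaccine-landscape.pdf"

textract = boto3.client("textract")

# Start an asynchronous analysis job asking for table and form blocks.
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": BUCKET, "Name": DOCUMENT}},
    FeatureTypes=["TABLES", "FORMS"],
)

# Poll until the job finishes.
while True:
    result = textract.get_document_analysis(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# The output is 'blocks' (PAGE, LINE, WORD, TABLE, CELL, ...); this
# looks only at the first page of results - follow NextToken for the rest.
tables = [b for b in result["Blocks"] if b["BlockType"] == "TABLE"]
print(f"Found {len(tables)} table blocks")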

Use this link to go directly to the AWS Textract UI once you are logged into your AWS console.

The Textract UI is quite intuitive and easy to use. You manually upload your PDF file, Textract processes it and shows you the interpreted content, described as “blocks” of text, tables, and images, and you can then select which blocks to extract from the document.

During this process, Textract asks permission to create a new S3 folder in your account, where it uploads the PDF before processing it. This is because Textract will only accept documents from S3.

Screenshot of Textract UI

The Textract PDF extract output is a zip file containing a bunch of files, which is automatically downloaded to your computer. The zip file contained the files listed below.

These 3 files appear to be standard information for any AWS Textract job.

    • apiResponse.json
    • keyValues.csv
    • rawText.txt

The rest of the AWS Textract output will vary depending on your document. In this case it returned a file for each table in the document.

    • table-1.csv
    • table-2.csv
    • table-3.csv
    • table-4.csv
    • table-5.csv
    • table-6.csv
    • table-7.csv
    • table-8.csv
    • table-9.csv

To process the AWS Textract table csv files, I imported them into a Pandas dataframe. The Python code used is in Github.

There was some minor clean-up of OCR/interpreted tabular data which included stripping trailing white spaces from all text and removing a few blank table rows. In addition the PDF tables had two header rows that were removed and manually replaced with single header row. Also there were some minor OCR mistakes for example some zeros were rendered as capital letter ‘O’ and some words were missing last letter.
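The real clean-up code is in the Github repository linked above; this sketch just illustrates the steps described, with hypothetical file locations and column names.

from pathlib import Path
import pandas as pd

# Read each Textract table csv, skipping the two original header rows
# and supplying a single clean header (column names are illustrative).
columns = ["developer", "platform", "type", "doses", "timing", "route"]

frames = []
for path in sorted(Path("textract_output").glob("table-*.csv")):
    frames.append(pd.read_csv(path, skiprows=2, names=columns))

df = pd.concat(frames, ignore_index=True)

# Strip leading/trailing white space from all text cells.
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Drop rows that are entirely blank.
df = df.dropna(how="all")

# Fix a common OCR mistake: capital 'O' where a zero was intended.
df["doses"] = df["doses"].str.replace("O", "0", regex=False)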

The columns in the WHO Draft landscape of COVID-19 candidate vaccines PDF tables are shown below. Textract did a good job of capturing them.

Vaccine columns:

    • COVID-19 Vaccine developer or manufacturer
    • Vaccine platform
    • Type of candidate vaccine
    • Number of doses
    • Timing of doses
    • Route of administration
    • Stage – Phase 1
    • Stage – Phase 1/2
    • Stage – Phase 2
    • Stage – Phase 3

Treatment columns:

    • Platform
    • Type of candidate vaccine
    • Developer
    • Coronavirus target
    • Current stage of clinical evaluation/regulatory – Coronavirus candidate
    • Same platform for non-Coronavirus candidates

Cost for processing the 9-page PDF file 3 times:

I copied the costs from AWS Billing below to give readers some idea of what Textract costs.

    • USE1-AsyncFormsPagesProcessed: AnalyzeDocument Forms (0-1M page tier), $50 USD per 1,000 pages × 27 pages = $1.35
    • USE1-AsyncTablesPagesProcessed: AnalyzeDocument Tables (0-1M page tier), $15 USD per 1,000 pages × 27 pages = $0.41
    • USE1-SyncFormsPagesProcessed: AnalyzeDocument Forms (0-1M page tier), $50 USD per 1,000 pages × 6 pages = $0.30
    • USE1-SyncTablesPagesProcessed: AnalyzeDocument Tables (0-1M page tier), $15 USD per 1,000 pages × 6 pages = $0.09
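For example, the async forms line item works out to 27 pages × ($50 / 1,000 pages) = $1.35, and the four line items together total $2.15.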

HTML tabular presentation of Textract output data

View it here.

Periodic chart elements by origin

This cool periodic chart of the elements shows the source/origin of each element. Source: Wikipedia, created by Cmglee.

An element may come from more than one of the sources listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging it turned out that the author had embedded the data and the Python code used to create the SVG inside the SVG file itself, which is very cool.

With some minor modifications to the Python code I was able to extract the data into a csv file and then use that as the data source for the visualizations of counts of elements by origin below (a sketch of the counting step follows).
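A minimal sketch of the counting step, assuming a csv with one row per element and a comma-separated origin column; the file and column names here are hypothetical, and the real extraction code is in the repository below.

import pandas as pd

# Hypothetical layout: one row per element, 'origin' holds a
# comma-separated list of that element's sources.
df = pd.read_csv("periodic_elements.csv")

# Explode the comma-separated origins into one row per (element, origin).
origins = (
    df.assign(origin=df["origin"].str.split(","))
      .explode("origin")
      .assign(origin=lambda d: d["origin"].str.strip())
)

# Chart 1: counts of elements by origin.
print(origins["origin"].value_counts())

# Chart 2: counts of elements by number of origins.
print(origins.groupby("element")["origin"].nunique().value_counts())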

Read more about the SVG and the Python code modifications here https://github.com/sitrucp/periodic_elements.

The first chart below shows the counts of elements by their origin, which answers my question.

It was very interesting to learn that only 4 elements were created by the Big Bang, and that all of the rest were created by various processes, probably quite some time after the Big Bang.

The second chart shows counts of elements by number of origins. It was also interesting to learn that elements can have multiple origins: 136 elements have more than one source.

Get Environment Canada Hydrometric and Climate Data

I recently needed to get flow and level data for a watercourse hydrological station, as well as regional precipitation data for relevant location(s) upstream within the station’s watershed.

The objective was to combine two decades of watercourse flow and level data with regional watershed precipitation data into a single analysis and report.

Environment Canada has all of this data and provides it in a variety of different ways.

Flow and level data

For many users, the Water Level and Flow web portal at wateroffice.ec.gc.ca is sufficient. However, I needed multiple years of data, so I looked around for alternatives.

I found the Hydat National Water Data Archive, which is provided as a SQLite database, to be a good source. It contains all Canadian watercourse flow and level data up until about a month or so before the current date, and it was a quick and easy download and a convenient way to get the required data (a query sketch is below).
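As an illustration, daily flows can be pulled out of the Hydat SQLite file with Python’s standard sqlite3 module plus Pandas. The station number is hypothetical, and the table and column names (STATIONS, DLY_FLOWS, FLOW1..FLOW31) are from my reading of the Hydat schema, so verify them against the archive’s documentation.

import sqlite3
import pandas as pd

STATION = "02GA010"  # hypothetical station reference id

conn = sqlite3.connect("Hydat.sqlite3")

# Daily flows are stored one row per station-year-month in DLY_FLOWS,
# with one FLOW column per day of the month.
flows = pd.read_sql(
    "SELECT * FROM DLY_FLOWS WHERE STATION_NUMBER = ?",
    conn,
    params=(STATION,),
)
conn.close()

# Unpivot the per-day columns (FLOW1..FLOW31) into one row per day.
day_cols = [c for c in flows.columns if c.startswith("FLOW") and c[4:].isdigit()]
tidy = flows.melt(
    id_vars=["STATION_NUMBER", "YEAR", "MONTH"],
    value_vars=day_cols,
    var_name="day",
    value_name="flow",
)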

Real-time / more recent flow data is available using a nice search interface in the same wateroffice.ec.gc.ca portal.

In addition, this same real-time / more recent data can also be obtained by going directly to an FTP site by clicking the ‘Datamart’ link on wateroffice.ec.gc.ca.

More details about the FTP site content are available here.

Both of these methods result in downloading csv files.

The website portal allows you to search by station name or reference id. The FTP site files are named with station reference id so you need to know the reference id to get the correct csv file.

This hydrological flow and level data is also available through the MSC GeoMet API. The API was a bit too complex for simply getting data to analyze, so I passed on it, but it looks very powerful and well documented.

Precipitation data

Precipitation data is available for stations across Canada, and this website https://climate.weather.gc.ca/historical_data/search_historic_data_e.html allows searching by station name, reference id, or geographical location.

However, since this only allows downloading a single month of data at a time, I needed to find another method to get multiple years and months of data more quickly.

When you use the website search and follow the links to the download page, you will also see a link to get more historical data. This link brings you to a Google Drive folder that documents how to use their ‘wget’ method to download files. However, I was using Windows and didn’t want to mess around with Cygwin or the Windows Subsystem for Linux just to be able to use wget.

I wanted to use Python, and it turned out to be relatively simple to replicate the wget process using Python Requests, looping through a list of years and months to get each csv file. I also used Python’s io.StringIO to stream the request content into a ‘fake csv’ file in each loop; these were aggregated into a list and converted into a Pandas dataframe afterwards. A sketch of the loop is below.
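This sketch assumes the bulk-data endpoint and query parameters from Environment Canada’s wget instructions as I understood them; the station id and date range are hypothetical, and the full working code is in the repository below.

import io
import requests
import pandas as pd

# Endpoint and parameters per Environment Canada's documented wget method.
BASE_URL = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html"
STATION_ID = 51442  # hypothetical station reference id

frames = []
for year in range(2000, 2021):
    for month in range(1, 13):
        params = {
            "format": "csv",
            "stationID": STATION_ID,
            "Year": year,
            "Month": month,
            "Day": 1,
            "timeframe": 2,  # 1 = hourly, 2 = daily per the wget docs
            "submit": "Download Data",
        }
        resp = requests.get(BASE_URL, params=params)
        resp.raise_for_status()
        # Stream the response text into an in-memory 'fake csv' file;
        # depending on the csv layout you may need skiprows here.
        frames.append(pd.read_csv(io.StringIO(resp.text)))

df = pd.concat(frames, ignore_index=True)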

The Python code is in my Github ‘Environment-Canada-Precip-and-Flow-Data’ Repository.

Plotly Express Python remove legend title

As of Plotly.py 4.5, Plotly Express no longer puts the ‘=’ in trace names, because legends now support titles (source).

Prior to Plotly.py 4.5, I had used this ‘hover_data’ trick to remove the ‘=’ from the legend trace names.

hover_data=['gain_loss']).for_each_trace(lambda t: t.update(name=t.name.split("=")[0]))

However, now with Plotly.py 4.5, I want to remove the legend title. The new trick is to set the legend title to an empty string via fig.update_layout().

fig.update_layout({
    'legend_title_text': ''
})
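Equivalently, Plotly’s ‘magic underscore’ keyword form does the same thing without the dict:

fig.update_layout(legend_title_text='')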

This is a much cleaner look for legends where the trace names and your chart title are sufficiently explanatory and a legend title would be superfluous.