The human radio-broadcast bubble

“Space… is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is.” (Douglas Adams, The Hitchhiker’s Guide to the Galaxy)

One measure of space is the size of a “bubble” created by radio waves radiating outwards from the Earth over the past 139 years, since they were first intentionally produced by humans in the 1880s. Wikipedia: History of radio

The image below, created by science blogger Adam Grossman, illustrates this “bubble” of radio waves as a tiny blue dot on an artist’s visualization of the Milky Way galaxy.

The visualization has an inset in the bottom-right with a close-up of the Earth’s region of the Milky Way galaxy to make the blue dot more readily visible. Even in the close-up, the blue dot is very small compared to the Milky Way.

The human radio-broadcast bubble compared to the Milky Way galaxy.


However, that tiny blue dot is actually an enormous sphere with a radius of 139 light-years (Wikipedia: light-year) with the Earth at its center, and the sphere grows by another light-year every year. For the non-science people: those radio waves are travelling at the speed of light!

The tiny blue dot is actually an enormous sphere with a radius of 139 light-years, with the Earth at its center!

The blue dot is very small compared to the Milky Way and even smaller when compared to the universe.

The distance from Earth to the edge of the observable universe is 46 billion light-years (Wikipedia: Universe), and the observable universe contains about 200 billion galaxies according to the latest Hubble observations (NASA: Hubble).

Feeling small yet? Even though that blue dot is arguably the biggest thing humans have ever created, its 139 light-year radius is about 331 million times smaller than the 46 billion light-year distance to the edge of the observable universe (46,000,000,000 ÷ 139 ≈ 331,000,000).

As a final note, it is not likely that the radio waves are actually detectable, at least not by our current technology, after travelling 139 light-years. But perhaps somewhere out there, aliens with some magic technology are entertaining themselves by listening to human radio and TV broadcasts, air traffic controllers, police and EMT calls, and cell phone conversations. Nice thought… or scary thought?

Finally, this image has often been used without attribution to its creators. This planetary.org blog post gives more history of the creation of this image in 2011 by Adam Grossman and Nick Risinger. The original blog post by Adam Grossman can be found archived on github.io, and it was featured on Y Combinator’s Hacker News here.


Using Amazon Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TL;DR: I used Amazon Transcribe to transcribe the first presidential debate audio, including timestamps for each word, to create the following speaker timeline visualization (created using a Plotly timeline chart). Click the image to view the full-size visualization.

After watching the US 2020 Presidential Debate #1, I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their allotted two minutes.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking, identifying overlaps where one speaker was already talking and a second speaker started during that talking ‘segment’.

Unfortunately, Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiosity using Amazon Transcribe.

It may be possible to infer overlapping speech and interruptions from multiple small interleaving speaker segments, but in the Amazon Transcribe results that would be hard to distinguish from two people simply having a conversation. I might investigate alternative automated transcription methods and make a new post. TBD.

Here is a link to the GitHub repository containing the code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate on YouTube. I then used Libav to trim off the first ~30 minutes of the audio, which was recorded debate stage prep before the debate actually started.
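
For reference, the download and trim steps look roughly like the sketch below, run here via Python’s subprocess (the YouTube URL, file names and exact 30-minute offset are placeholders, not the actual values I used):

import subprocess

# Download audio-only mp3 from the YouTube video (URL is a placeholder)
subprocess.run([
    "youtube-dl",
    "--extract-audio",
    "--audio-format", "mp3",
    "--output", "debate_full.%(ext)s",
    "https://www.youtube.com/watch?v=VIDEO_ID",
], check=True)

# Trim off the first ~30 minutes (1800 seconds) of stage prep with
# Libav's avconv; the equivalent ffmpeg command works the same way
subprocess.run([
    "avconv",
    "-ss", "1800",
    "-i", "debate_full.mp3",
    "-acodec", "copy",    # copy the stream, no re-encode
    "debate_trimmed.mp3",
], check=True)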

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

The transcription contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio, though the transcription itself was not perfect.

Amazon Transcribe requires the audio file to be in an S3 bucket; the audio was an mp3 file.

Uploading the file and running the Amazon Transcribe job were done using Python and the AWS Boto3 SDK.
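
A minimal sketch of those two steps with Boto3 is below (the bucket name, job name and speaker count are illustrative assumptions):

import time
import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

bucket = "my-transcribe-bucket"    # assumed bucket name
audio_key = "debate_trimmed.mp3"

# Transcribe only reads media from S3, so upload the mp3 first
s3.upload_file(audio_key, bucket, audio_key)

# Start the transcription job with speaker identification enabled
transcribe.start_transcription_job(
    TranscriptionJobName="debate-1",
    Media={"MediaFileUri": f"s3://{bucket}/{audio_key}"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName=bucket,
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 3,    # moderator plus two debaters
    },
)

# Poll until the job finishes; the JSON result lands in the S3 bucket
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="debate-1")
    if job["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)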

The transcription process produces a JSON file that is saved in the S3 bucket. The JSON file contains the following content and high-level structure:

    • jobName – the job name specified for the transcription.
    • accountId – Amazon account or IAM account?
    • results – contains the elements below:
      • transcripts – the complete text of the audio transcription.
      • speaker_labels – contains the elements below:
        • speakers – the number of speakers specified for the transcription.
        • segments – one or more time-based segments by speaker. Each has start and end times and a speaker label.
          • items – each segment has one or more time-based items. Each has start and end times and a speaker label, but does not include the word.
      • items – a separate section with one item for each word or inferred punctuation mark. Each includes the word along with alternatives with a confidence value for each, and has start and end times, but no speaker label.
    • status – the status of the transcription job, e.g. in-progress, failed or completed.

Processing Transcription

The JSON file was read into a Pandas dataframe that was processed and used as the data source for a Plotly timeline chart.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segments’] elements, which identify the speaker and have segment start and end times.

A Plotly timeline chart requires datetimes for its period start and end x-axis values. However, the transcription start and end times are elapsed seconds since the audio starts.

So “fake” start and end datetimes were created by adding an arbitrary date (1970-01-01) to an “HH:mm:ss” value created from the seconds values.
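
Condensed, the processing looks something like the sketch below (the local JSON file name is an assumption; the segment keys match the structure listed above):

import json
import pandas as pd
import plotly.express as px

with open("debate-1.json") as f:
    transcription = json.load(f)

# One row per speaker segment, with elapsed-seconds start/end times
segments = transcription["results"]["speaker_labels"]["segments"]
df = pd.DataFrame(
    {
        "speaker": seg["speaker_label"],
        "start": float(seg["start_time"]),
        "end": float(seg["end_time"]),
    }
    for seg in segments
)

# Plotly timelines need datetimes; converting seconds with unit="s"
# anchors them to the arbitrary 1970-01-01 date for free
df["start_dt"] = pd.to_datetime(df["start"], unit="s")
df["end_dt"] = pd.to_datetime(df["end"], unit="s")

fig = px.timeline(df, x_start="start_dt", x_end="end_dt", y="speaker")
fig.show()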

D3.js SVG animation – COVID-19 rate visualization

This is a visualization that shows the relative rate of increase in new COVID-19 cases over the past 7 days.

Click the links to view the visualizations and GitHub code for the following geographical regions:

The screenshot below shows countries of the world by average new cases over the past 7 days, visualized as an animated rate over one day. Some countries have not had any new cases over the past 7 days, so they show as gray. Those that have had new cases over the past 7 days are shown as a white circle (no change from the previous 7 days), red (increase from the previous 7 days) or green (decrease from the previous 7 days).

The visualization is sorted by country by default, but the sorting can be changed to average new cases. In addition, you can toggle between showing new cases as an actual count or as new cases per million (population).

The visualization uses D3.js to create an SVG canvas for each location with the location name text, counts, circle shape and transitions, and to retrieve and process the csv file data, including filtering to the most recent 7 days and grouping by location to get mean case counts.

The most important aspect of this visualization was working out how to use D3.js to animate the movement of the white circle across the canvas, and how to repeat the movement in an ‘endless’ loop.

The code block below highlights the use of a function that uses D3.js .on("end", repeat); to call the repeat function ‘endlessly’, so that the shape moves across the canvas, returns to its original position, and then moves across the canvas again and again. See the bl.ocks.org ‘Looping a transition in v5’ example.

The duration() value is the proxy for rate in this visualization and is calculated in another function separately for each location’s SVG. I also added a counter that increments an SVG text value to show each loop’s count on the canvas.

// repeat transition endless loop
function repeat() {
  // start at the left, then transition the circle across the canvas
  svgShape
    .attr("cx", 150)
    .transition()
    .duration(cycleDuration)   // per-location duration is the rate proxy
    .ease(d3.easeLinear)
    .attr("cx", 600)
    // snap back to the start almost instantly, then schedule the next loop
    .transition()
    .duration(1)
    .attr("cx", 150)
    .on("end", repeat);

  // update the loop counter text shown on the canvas
  svgTextMetric
    .text(counter + ' / ' + metric);
  counter++;
}

This visualization was inspired by Jan Willem Tulp’s COVID-19 spreading rates and Dr James O’Donoghue’s relative rotation periods of planets, and uses the same data as Tulp’s spreading rates.

AWS Textract – WHO “Draft landscape of COVID-19 candidate vaccines” – convert PDF to csv

TL;DR: I extracted text from the WHO’s vaccine candidate PDF file using AWS Textract and turned the text into a set of interactive web pages. View the AWS Textract PDF extract output csv files in this GitHub repository, and view and interact with the web pages here.

The World Health Organization (WHO) maintains a regularly updated PDF document named Draft landscape of COVID-19 candidate vaccines which contains all COVID-19 vaccine candidates and treatments currently being developed and their status.

The main content in this PDF document is in a tabular format, with one vaccine or treatment per row. There is also some non-tabular text content, including introduction text and footer notes.

I wanted a machine-readable version of this PDF document’s table data so I could do some analysis. This meant I needed to do PDF text extraction. There are lots of solutions; I ended up using Amazon Textract to extract the PDF into csv file format.

“Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.”

Using Textract

You need an AWS account to use Textract, and the service costs money to use (see the costs for this analysis at the bottom of the post).

The Textract service has a UI that you can use to upload and process documents manually, i.e. without code. However, it is also included in the AWS Boto3 SDK, so you can automate Textract with code or from the command line too.

I used the manual UI for this one-time processing of the PDF. However, if I were to automate this to regularly extract data from the PDF, I would use Python and the Boto3 SDK. The Boto3 SDK Textract documentation and example code are here.
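
For anyone going that route, a minimal Boto3 sketch is below (the bucket and document names are placeholders):

import time
import boto3

textract = boto3.client("textract")

# Start asynchronous analysis of a PDF already in S3; FeatureTypes asks
# Textract to detect tables and form key/value pairs
job = textract.start_document_analysis(
    DocumentLocation={
        "S3Object": {"Bucket": "my-textract-bucket", "Name": "who-landscape.pdf"}
    },
    FeatureTypes=["TABLES", "FORMS"],
)

# Poll for completion, then read the detected "blocks"
while True:
    result = textract.get_document_analysis(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

# Blocks describe pages, lines, words, tables, cells, etc.; large results
# are paged via NextToken (omitted here for brevity)
tables = [b for b in result["Blocks"] if b["BlockType"] == "TABLE"]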

Use this link to go directly to the AWS Textract UI once you are logged into your AWS console.

The Textract UI is quite intuitive and easy to use. You manually upload your PDF file, it processes the file and shows you the interpreted content, described in “blocks” of text, tables and images, and then you select which blocks to extract from the document.

During this process, the manual Textract process asks permission to create a new S3 folder in your account, where it uploads the PDF before processing it. This is because Textract will only accept documents from S3.

Screenshot of Textract UI

The Textract PDF extract output is a zip file containing a bunch of files, which is automatically downloaded to your computer. The zip file contained the files listed below.

These 3 files appear to be standard information for any AWS Textract job.

    • apiResponse.json
    • keyValues.csv
    • rawText.txt

The rest of the AWS Textract output will vary depending on your document. In this case it returned a file for each table in the document.

    • table-1.csv
    • table-2.csv
    • table-3.csv
    • table-4.csv
    • table-5.csv
    • table-6.csv
    • table-7.csv
    • table-8.csv
    • table-9.csv

To process the AWS Textract table csv files, I imported them into a Pandas dataframe. The Python code used is on GitHub.

There was some minor clean-up of the OCR/interpreted tabular data, which included stripping trailing white space from all text and removing a few blank table rows. In addition, the PDF tables had two header rows that were removed and manually replaced with a single header row. There were also some minor OCR mistakes, for example some zeros were rendered as the capital letter ‘O’ and some words were missing their last letter.
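
The clean-up amounted to something like the sketch below (the file name, header row count and column names are illustrative; the actual code is in the GitHub repository):

import pandas as pd

# Read a Textract table csv with no usable header row
df = pd.read_csv("table-1.csv", header=None, dtype=str)

# Drop the two PDF header rows and set a single header row manually
df = df.iloc[2:].reset_index(drop=True)
df.columns = ["developer", "platform", "type", "doses", "timing",
              "route", "phase_1", "phase_1_2", "phase_2", "phase_3"]

# Strip leading/trailing white space from all text cells
df = df.apply(lambda col: col.str.strip())

# Remove rows where every cell is blank
df = df.replace("", pd.NA).dropna(how="all")

# Fix an OCR mistake: capital letter 'O' rendered instead of zero
df["doses"] = df["doses"].str.replace("O", "0")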

The table columns in the WHO “Draft landscape of COVID-19 candidate vaccines” PDF document tables are shown below. Textract did a good job of capturing these columns.

Vaccine columns:

    • COVID-19 Vaccine developer or manufacturer
    • Vaccine platform
    • Type of candidate vaccine
    • Number of doses
    • Timing of doses
    • Route of administration
    • Stage – Phase 1
    • Stage – Phase 1/2
    • Stage – Phase 2
    • Stage – Phase 3

Treatment columns:

    • Platform
    • Type of candidate vaccine
    • Developer
    • Coronavirus target
    • Current stage of clinical evaluation/regulatory – Coronavirus candidate
    • Same platform for non-Coronavirus candidates

Cost for processing the 9-page PDF file 3 times:

I copied the costs from AWS Billing below to give readers some idea of what Textract costs.

Amazon Textract USE1-AsyncFormsPagesProcessed – $1.35
AnalyzeDocument Forms (0–1M page tier), $50 USD per 1,000 pages × 27 pages = $1.35

Amazon Textract USE1-AsyncTablesPagesProcessed – $0.41
AnalyzeDocument Tables (0–1M page tier), $15 USD per 1,000 pages × 27 pages = $0.41

Amazon Textract USE1-SyncFormsPagesProcessed – $0.30
AnalyzeDocument Forms (0–1M page tier), $50 USD per 1,000 pages × 6 pages = $0.30

Amazon Textract USE1-SyncTablesPagesProcessed – $0.09
AnalyzeDocument Tables (0–1M page tier), $15 USD per 1,000 pages × 6 pages = $0.09

Total: $2.15

HTML tabular presentation of Textract output data

View it here.


Periodic chart elements by origin

This cool periodic chart of the elements shows the source/origin of each element. Source: Wikipedia, created by Cmglee.

Some elements come from more than one of the sources, which are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging, it turned out that the author of the SVG file had embedded the data, and the Python code necessary to create the SVG, inside the file itself, which is very cool.

With some minor modifications to the Python code, I was able to extract the data into a csv file and then use that as the data source for the visualizations of counts of elements by origin below.

Read more about the SVG and the Python code modifications here: https://github.com/sitrucp/periodic_elements.
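
For example, once the data is in csv form, the counts only take a couple of pandas groupbys (the file and column names here are assumptions about the extracted file’s layout):

import pandas as pd

# Assumed layout: one row per (element, origin) pair, e.g.
# element,origin
# H,Big Bang fusion
# Fe,Exploding white dwarfs
df = pd.read_csv("elements_by_origin.csv")

# Chart 1: number of elements attributed to each origin
counts_by_origin = df.groupby("origin")["element"].nunique()

# Chart 2: number of elements having 1, 2, 3... distinct origins
origins_per_element = df.groupby("element")["origin"].nunique()
counts_by_origin_count = origins_per_element.value_counts().sort_index()

print(counts_by_origin)
print(counts_by_origin_count)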

The first chart below shows the counts of elements by their origin which answers my question.

It was very interesting to learn that only 4 elements were created by the Big Bang, and that all of the rest were created by various processes, probably quite some time after the Big Bang.

The second chart shows counts of elements by their number of origins. It was also interesting to learn that many elements have more than one origin/source.