Using Amazon Transcribe to get 2020 Presidential Debate #1 Speaker Segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio that included timestamps for each word, to create the following speaker timeline visualization (created using a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1  I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their 2 minutes allotted time.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking and identifying overlaps where one speaker was talking first and second speaker starts during that talk ‘segment’.

Unfortunately Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiousity using Amazon Transcribe.

It may be possible to infer overlapping speaker talking and interruptions with multiple small interleaving speaker segments but that would be hard to distinguish from two people having a conversation with the Amazon Transcribe results. Might investigate alternative automated transcription methods and make new post. TBD.

Here is link to Github repository containing code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate which was on YouTube. I used Libav to trim off the beginning ~30 minute portion of the audio which was recorded debate stage prep before the debate actually started.

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

The transcription contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

Amazon Transcribe requires an audio file in an S3 bucket. The audio was an mp3 file.

Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.

The transcription process produces a JSON file that is saved in the S3 bucket. The JSON file contains the following content and high-level structure:

    • jobName – job name specified for transcription.
    • accountId – Amazon account or IAM account?
    • results– contains elements below:
      • transcripts – complete text of audio transcription.
      • speaker_labels – contains elements below:
        • speakers – the number of speakers specified for transcription.
        • segments – one or more time based segments by speaker. Has start and end time, and speaker label.
          • items – segments have one or more time based items. Has start and end time, and speaker label. Does not include word.
      • items – separate section with more than one item, one for each word or inferred punctuation. Includes word along with alternatives with confidence value for each word. Has start and end time, but does not have speaker label.
    • status – of transcription job eg in-process, failed or completed.

Processing Transcription

The JSON file read into a Pandas dataframe that was processed and used as data source for a Plotly timeline chart.

The timeline chart data was created from the [‘results’][‘speaker_labels’][‘segment’] elements which identified the speaker and had segment start and end times.

A Plotly timeline chart requires datetimes for period start and end and x-axis values. However the transcription has start and end times that are elapsed seconds since audio starts.

So “fake” start and end dates were created by adding arbitrary date (1970-01-01) to a “HH:mm:ss” value created from the seconds values.

Periodic chart elements by origin

This cool periodic chart of the elements shows source / origin of the elements. Source: Wikipedia created by Cmglee

Some elements may come from more than one source which are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging it turned out that the author of the SVG file had embedded the data and Python code necessary to create the SVG file inside the file which is very cool.

With some minor modification to the Python code I was able to extract the data into a csv data file and then use that as data source for the visualizations of counts of elements by origin below.

Read more about the SVG and the Python code modifications here

The first chart below shows the counts of elements by their origin which answers my question.

Very interesting to learn that only 4 elements were created by the Big Bang and that all of the rest are created from various processes probably quite some time after the Big Bang.

The second chart shows counts of elements by number of origins. Interesting to also learn that some elements have more than one origin and 136 elements have more than one source.

Plotly Express Python remove legend title 4.5, Plotly Express no longer puts the = in trace names, because legends support titles (source).

Prior to 4.5, I had used this ‘hover_data’ trick to remove the ‘=’ from the legend trace names.

hover_data=['gain_loss']).for_each_trace(lambda t: t.update("=")[0])

However now with 4.5, I want to remove the legend title. The new trick to do that is to enter empty string in the new legend title in the fig.update.layout section.

    'legend_title_text': ''

This is much cleaner look for legends where the trace names, and your chart title, are sufficiently explanatory and a legend title would be superfluous.