CBC news comments data scraping and word cloud visualization

Articles on the CBC (Canadian Broadcasting Corporation) news website often have a comments section. It would be interesting to see the interactions between comments and replies, and to learn who comments the most and which words and phrases are used most frequently.

See the results: https://sitrucp.github.io/cbc_comments/image_grid.html

The comments section is at the end of the story.

Unfortunately, the comment delivery method makes it very difficult to read all of the comments because it uses the “endless scrolling” format.

This requires clicking a “SHOW MORE” button at the bottom of the comments again and again to show more comments.

In addition, longer comments require clicking a “» more” link to reveal hidden text, and comments with multiple replies require clicking a “SHOW 2 OLDER REPLIES” button to show more replies.

In order to see all of the comments and their complete text, we need a process that effectively clicks through all of the buttons above until all of the comments and their content are displayed on the webpage.

Once all of the content is visible on the webpage, it can be saved locally, and Python BeautifulSoup can be used to extract all comments and their content and save them in a tabular data format.

Using Chrome browser’s “Inspect”, “View page source” (Ctrl-U) and “Developer tools” (Ctrl-Shift-I) quickly revealed the relevant HTML tags behind the buttons identified above. These are the things that need to be “clicked” again and again until all the comments and their content are displayed on the webpage.

Relevant code is provided below and can be found in this Github repository.

View complete list of CBC comments visualizations here.

// SHOW MORE COMMENTS
// div tag will have style="display: none;" if there are no more comments otherwise it is displayed
<div class="vf-load-more-con" style="display: none;">
<a href="#" class="vf-load-more vf-text-small vf-strong">Show More</a>
</div>

// SHOW REPLIES
// div tag will have class "hidden" if there are no older replies to show otherwise it is displayed
<div class="vf-comment-replies hidden">
<a class="vf-replies-button vf-strong vf-text-small" href="#">Show <span class="vf-replies">0</span> older replies</a>
</div>

// SHOW MORE COMMENT TEXT
// tag is displayed only when comment has hidden text otherwise the tag is not present
<a href="#" class="vf-show-more" data-action="more">» more</a>

The button clicking was partly automated using the Javascript below, executed in the Developer tools console. The process currently requires pasting the code into the console and running it manually. Step 1 requires some babysitting to ensure it runs to completion.

The workflow to show all comments and their content is as follows:

    • Step 1: Run “STEP 1 – Show more comments” javascript in browser console.
    • Step 2: Run “STEP 2 – Show replies” javascript in browser console.
    • Step 3: Run “STEP 3 – Show more comment text” javascript in browser console.

At this point, all the comments and their content are displayed on the webpage.

    • Step 4: Save webpage locally.
    • Step 5: Run Python script to scrape local webpage and save data as csv file.
    • Step 6: Open csv in Excel or analyse using your favourite data visualization tool.

//STEP 1 - Show more comments - pages with 1000's of comments get slower and the show button can take longer than 5000 ms to reappear, so a manual rerun of the script may be required

var timer = setInterval(getMore, 5000);
function getMore() {
    moreDiv = document.getElementsByClassName('vf-load-more-con')[0];
    if(moreDiv.style.display === "none") {
        console.log('vf-load-more comments finished');
        clearInterval(timer);
        return;
    }
    console.log('More comments');
    moreDiv.childNodes[0].nextElementSibling.click();
}

//STEP 2 - Show replies - loops to auto show all comments' replies

var buttons = document.getElementsByClassName('vf-replies-button');
console.log(buttons.length, 'vf-replies-button')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i ,'of', buttons.length) 
}
console.log('vf-replies-button finished');

//STEP 3 - Show more comment text - loops to show all comments' text

var buttons = document.getElementsByClassName('vf-show-more');
console.log(buttons.length, 'vf-show-more buttons')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i, 'of',buttons.length) 
}
console.log('vf-show-more comment text finished');

Once all the comments and their content are displayed on the webpage, Step 4 is to save the webpage locally. You need to save it as a complete HTML page so that the Javascript is saved too; otherwise the saved page will be blank.

Then Step 5 is to run the following Python code to extract the comment data into a csv file.

This uses Python BeautifulSoup to extract HTML tag data into a Pandas dataframe which is then saved locally as a csv file.

import sys, os
import csv
import re
from datetime import datetime, timedelta
from bs4 import BeautifulSoup 
import pandas as pd

file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'

file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'

file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_') + '.html'

soup = BeautifulSoup(open(file_path_html + file_name, encoding='utf8').read(), 'html.parser')

publish_date_raw = soup.find('time', class_='timeStamp')['datetime'][:-5]
publish_date = datetime.strptime(str(publish_date_raw), '%Y-%m-%dT%H:%M:%S')
vf_comments = soup.find('div', class_='vf-comments')
vf_comment_threads = soup.find_all('div', class_='vf-comment-container')
vf_usernames = soup.find_all('button', class_='vf-username')

# create comment data list of lists
comment_data = []
replies = []

for thread in vf_comment_threads:
    # children = data_ids.findChildren()
    # div_data_id = soup.find('div', class_='vf-comment')
    data_id = thread['data-id']
    username = thread.find('button', class_='vf-username').get_text()
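    comment_time_str = thread.find('span', class_='vf-date').get_text().replace('s ago', '')
    # pull the number out of relative times such as "5 minutes ago" or "2 hours ago"
    comment_time_digits = re.sub('[^0-9]', '', comment_time_str)
    comment_time_int = int(comment_time_digits) if comment_time_digits else 0
    if 'hour' in comment_time_str:
        elapsed_minutes = comment_time_int * 60
    elif 'minute' in comment_time_str:
        elapsed_minutes = comment_time_int
    else:
        # other units are not handled; use 0 so the previous comment's value is not reused
        elapsed_minutes = 0
    comment_text_raw = thread.find('span', class_='vf-comment-html-content').get_text()
    comment_time = publish_date - timedelta(minutes=elapsed_minutes)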
    if 'Reply to @' in comment_text_raw:
        comment_type = 'reply'
        replied_to_user = comment_text_raw.split(": ",1)[0].replace('Reply to @', '').strip()
        try:
            comment_text = comment_text_raw.split(": ",1)[1].strip()
        except IndexError:
            # reply prefix present but no ": " separator, so there is no reply text
            comment_text = 'no text'
    else:
        comment_type = 'parent'
        replied_to_user = ''
        comment_text = comment_text_raw.strip()

    comment_data.append((
        data_id, 
        publish_date, 
        comment_time,
        username, 
        comment_type, 
        replied_to_user, 
        comment_text, 
        file_name.replace('.html', ''), 
        file_url))

df_comment_data = pd.DataFrame(
    list(comment_data), 
    columns=[
    'data_id', 
    'publish time', 
    'comment_time', 
    'comment_user', 
    'comment_type', 
    'replied_to_user', 
    'comment_text', 
    'file_name',
    'file_url'])

df_comment_data.to_csv(
    file_path_csv + file_name.replace('.html', '.csv'), 
    encoding='utf-8', 
    index=False)
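
The two parsing rules in the script above, relative timestamps and the "Reply to @user:" prefix, can be pulled out into small helpers that are easier to check on their own. The sketch below is one possible arrangement, not the code from the repository. The function names elapsed_minutes and split_reply are hypothetical, and the handling of "day" values is an added assumption that the original script does not include.

```python
import re

def elapsed_minutes(relative_time):
    """Convert text such as '5 minutes ago' or '2 hours ago' to minutes."""
    match = re.search(r'(\d+)\s*(minute|hour|day)', relative_time)
    if match is None:
        return 0  # unrecognized format, for example 'just now'
    value, unit = int(match.group(1)), match.group(2)
    minutes_per_unit = {'minute': 1, 'hour': 60, 'day': 1440}
    return value * minutes_per_unit[unit]

def split_reply(comment_text_raw):
    """Return (comment_type, replied_to_user, comment_text)."""
    prefix = 'Reply to @'
    if prefix not in comment_text_raw:
        return 'parent', '', comment_text_raw.strip()
    # Everything before the first ': ' holds the user name, the rest is the text.
    head, _, body = comment_text_raw.partition(': ')
    user = head.replace(prefix, '').strip()
    return 'reply', user, body.strip() or 'no text'

print(elapsed_minutes('2 hours ago'))               # 120
print(split_reply('Reply to @Jane Doe: I agree.'))  # ('reply', 'Jane Doe', 'I agree.')
```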

Now that you have a nice tabular format csv data file you can do Step 6 and open the csv in Excel/Google Sheets or analyse the data using your favourite data visualization tool.
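
As a quick alternative to a spreadsheet, the question of who comments the most can be answered with a few lines of standard-library Python. This is a minimal sketch that assumes the csv has the comment_user column written by the script above. The function name top_commenters and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

def top_commenters(csv_file, n=10):
    """Count rows per comment_user and return the n most frequent users."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row['comment_user'] for row in reader)
    return counts.most_common(n)

# Tiny in-memory sample standing in for the real csv file (made-up rows).
sample_csv = io.StringIO(
    'comment_user,comment_type,comment_text\n'
    'user_a,parent,First comment\n'
    'user_b,reply,A reply\n'
    'user_a,reply,Another reply\n'
)
print(top_commenters(sample_csv))  # [('user_a', 2), ('user_b', 1)]
```

To run it against a real file, open the csv with open(path, newline='', encoding='utf-8') and pass the file object to top_commenters.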

Comments Word Cloud

One of the visualizations I created was a comment word cloud. This used the csv file that was created above as a data source.

The Python NLTK (Natural Language Toolkit) library was used to tokenize the comment text, remove stop words and punctuation, and lemmatize the remaining words. The Python WordCloud library was used to create the word cloud chart.

import csv
import string
from string import punctuation
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
lemma = nltk.wordnet.WordNetLemmatizer()

# get paths and files
file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'
file_path_image = 'C:/cbc_comments/image/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_')

# read csv into df
df = pd.read_csv(file_path_csv + file_name + '.csv')

# Drop null comment text df records
df.dropna(subset=['comment_text'], how='all', inplace=True)

# Combine comment_text into list of comments
text_list = df['comment_text'].tolist()

# Combine all comment text into one huge text
text = ' '.join(comment.lower() for comment in df.comment_text)

# clean up comment text data
stop_words = set(stopwords.words('english'))
punctuation = list(punctuation)

tokens = word_tokenize(text)
filtered_text1 = [token for token in tokens if token not in stop_words]
filtered_text2 = [idx for idx in filtered_text1 if not any(punc in idx for punc in string.punctuation)]
filtered_text3 = [item for item in filtered_text2 if len(item)>1]
filtered_text4 = [x for x in filtered_text3 if not x.isdigit()]  # drop purely numeric tokens (tokens are strings, so an int check never matches)
filtered_text = [lemma.lemmatize(x) for x in filtered_text4]

# Create wordcloud
wordcloud = WordCloud(
    width=800, 
    height=450,
    font_path='c:/windows/fonts/ARLRDBD.TTF',
    colormap='rainbow',
    max_font_size=80,
    collocations=False, 
    max_words=1000, 
    background_color='black'
    ).generate(' '.join(filtered_text))

plt.figure(
    figsize=(20,10), 
    facecolor='k'
    )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# Save the image in the image folder:
wordcloud.to_file(file_path_image + file_name + '.png')

The word cloud for this story’s comments looks like this.
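
A plain word-frequency table can complement the cloud when exact counts matter. The sketch below is a simplified, standard-library stand-in for the NLTK cleaning steps, not a replacement for them: STOP_WORDS is a tiny hypothetical list (the script above uses NLTK's English stop words), tokens are split with a regular expression instead of word_tokenize, and no lemmatization is done.

```python
import re
from collections import Counter

# Small stand-in stop-word list; the script above uses NLTK's English list.
STOP_WORDS = {'the', 'a', 'an', 'is', 'it', 'to', 'and', 'of', 'in', 'on'}

def word_counts(comments):
    """Lower-case, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    counts = Counter()
    for comment in comments:
        for token in re.findall(r'[a-z]+', comment.lower()):
            if len(token) > 1 and token not in STOP_WORDS:
                counts[token] += 1
    return counts

# Made-up sample comments for illustration.
sample = ['The tax is too high.', 'Tax the carbon, not the people!']
counts = word_counts(sample)
print(counts['tax'])     # 2
print('the' in counts)   # False
```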

Canadian Government First Nations long term water advisory data

The Government of Canada is working with Canadian First Nations communities to end long-term drinking water advisories which have been in effect for more than 12 months since November 2015. This includes projects to build or renew drinking water infrastructure in these communities.

The First Nations drinking water advisories were a big part of the 2021 Canadian federal election campaigns. In 2015 the current government made big promises to fix all of the advisories. The opposition criticized it for not having achieved that goal to date. However, there was poor communication about the actual number and status of the advisory projects, so I wanted to find some data to learn the actual status for myself.

TL;DR: the results can be seen on a Github.io hosted page. A screenshot is provided below. The code to retrieve and transform the data to create that web page is in my Github repository https://github.com/sitrucp/first_nations_water.

Finding the data I needed only took a few minutes of Google searching. It came from an Indigenous Services Canada website that maps the water advisories by count of advisories per First Nations community, project status, advisory dates, and project type.

While the web page also included a link to download the map data, after a bit of web page scraping and inspection, I learned that the downloadable map data was created by web page Javascript from another larger data source referenced in the map page code https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt.

This larger data source is a text file containing JSON format records for all Canadian First Nations communities, many thousands of records in total. As a side note, this larger file appears to be an official Canadian government dataset used in other government websites. The map data is limited to only about 160 First Nations communities with drinking water issues.

So rather than download the 160-record map page data, I retrieved the larger JSON format text file and used logic similar to that in the map web page code to get the 160 records. This was done in two steps.

Step 1: a Python script, water_map_data.py, retrieves the JSON format text file and saves it locally. I may yet automate this using a scheduled task so the map gets regularly updated data as advisory status changes over time.

Step 2: process the saved data file and present it in an HTML web page as Plotly.js charts and an HTML table, using the Javascript in the file first_nations_water.js.
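
Step 1 above could look roughly like the sketch below. This is not the water_map_data.py script from the repository, only a minimal standard-library illustration of "retrieve and save". The function names, the local file path in the usage comment, the UTF-8 encoding, and the assumption that the whole file parses as a single JSON document are all mine. Only the URL comes from the map page.

```python
import json
import urllib.request
from pathlib import Path

# URL referenced in the map page code (taken from the text above).
DATA_URL = ('https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/'
            'texte-text/lTDWA_map_data_1572010201618_eng.txt')

def fetch_text(url, timeout=30):
    """Download a text resource and return it as a string (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8-sig')

def save_text(text, path):
    """Write the text to a local file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return path

def load_records(path):
    """Parse the saved file as JSON (assumes the file is one JSON document)."""
    return json.loads(Path(path).read_text(encoding='utf-8'))

# Usage (performs a network request, so it is left commented out;
# the local path is a hypothetical example):
# save_text(fetch_text(DATA_URL), 'data/water_map_data.txt')
```

Keeping the download and the parsing separate means a scheduled task only needs to call fetch_text and save_text, while the web page keeps reading the saved file.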

Finally, I also separately created an Excel file with a Pivot Table and Chart that you can download and use to do your own analysis. Download this Excel file from the Github repository. The file contains an Excel Power Query link to the larger JSON text file described above. You can simply refresh the query to get the latest data from the Indigenous Services Canada website.

D3.js SVG animation – COVID-19 rate “race” visualization

This visualization shows COVID-19 new cases as a “race” of dots moving from left to right.

The dot’s “speed”, or how long it takes to move from left to right, is based on the number of cases per day.

If a country has one case per day, it will take an entire day for the dot to move from left to right. Some countries have many thousands of new cases daily, and the dot moves from left to right in minutes or seconds.
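
The relationship between daily cases and crossing time is simple arithmetic: one day divided by the number of cases. The sketch below works through it in Python. The function name cycle_duration_ms is hypothetical, and returning None for zero cases is my own choice rather than something taken from the visualization code.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day

def cycle_duration_ms(cases_per_day):
    """Milliseconds for one left-to-right crossing at the given daily rate."""
    if cases_per_day <= 0:
        return None  # no new cases, so there is no crossing to time
    return MS_PER_DAY / cases_per_day

print(cycle_duration_ms(1))      # 86400000.0 (one full day)
print(cycle_duration_ms(1440))   # 60000.0 (one minute)
print(cycle_duration_ms(86400))  # 1000.0 (one second)
```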

There are three visualizations for the following geographical regions. Click “viz” to view the visualization and “github code” to view the code for the visualization:

The screenshot below shows the countries of the world. Countries that have not had any new cases over the past 7 days are shown in gray. Those that have had new cases over the past 7 days are shown as a white circle (no change from the previous 7 days), red (increase from the previous 7 days) or green (decrease from the previous 7 days).

The visualization is sorted by country by default, but the sorting can be changed to average new cases. In addition, you can toggle between showing new cases as an actual count or as new cases per million population.
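
The color rule described above can be written as a small function. This is only a sketch of the rule as stated in the text, in Python rather than the visualization's Javascript. The function and parameter names are hypothetical.

```python
def dot_color(avg_last_7, avg_prev_7):
    """Pick a dot color from the last 7 days versus the previous 7 days."""
    if avg_last_7 == 0:
        return 'gray'   # no new cases over the past 7 days
    if avg_last_7 > avg_prev_7:
        return 'red'    # increase from the previous 7 days
    if avg_last_7 < avg_prev_7:
        return 'green'  # decrease from the previous 7 days
    return 'white'      # no change from the previous 7 days

print(dot_color(0, 12))   # gray
print(dot_color(20, 12))  # red
print(dot_color(5, 12))   # green
print(dot_color(12, 12))  # white
```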

The visualization uses D3.js SVG to create a canvas for each location, with the location name text and counts, a circle shape, and transitions. D3.js is also used to retrieve the csv file and process the data, including filtering to the most recent 7 days and grouping by location to get mean case counts.
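
The data preparation step (keep the most recent 7 days, then average per location) can be sketched outside D3.js as follows. This is an illustration in standard-library Python, not the visualization's own code. The tuple layout, the function name, and the sample numbers are assumptions for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

def mean_new_cases_last_7_days(rows):
    """rows: iterable of (location, date, new_cases) tuples."""
    rows = list(rows)
    latest = max(day for _, day, _ in rows)
    cutoff = latest - timedelta(days=6)  # 7 days including the latest date
    by_location = defaultdict(list)
    for location, day, new_cases in rows:
        if day >= cutoff:
            by_location[location].append(new_cases)
    return {loc: sum(vals) / len(vals) for loc, vals in by_location.items()}

# Made-up sample: location A reports 1..10 cases on Jan 1..10, B reports once.
sample = [('A', date(2021, 1, d), d) for d in range(1, 11)]
sample.append(('B', date(2021, 1, 10), 3))
print(mean_new_cases_last_7_days(sample))  # {'A': 7.0, 'B': 3.0}
```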

The most important aspect for this visualization was how to use D3.js to animate the movement of the white circle across the canvas, and how to repeat the movement in an ‘endless’ loop.

The code block below highlights a function that uses D3.js .on("end", repeat); to call the repeat function ‘endlessly’, so that the shape is moved across the canvas, then back to its original position, to move across the canvas again and again. See the bl.ocks.org ‘Looping a transition in v5’ example.

The duration() value is the proxy for rate in this visualization and is calculated in another function separately for each location SVG. I also added a counter that would increment an SVG text value to show each loop’s count on canvas.

// repeat transition endless loop
function repeat() {
    svgShape
    .attr("cx", 150)
    .transition()
    .duration(cycleDuration)
    .ease(d3.easeLinear)
    .attr("cx", 600)
    .transition()
    .duration(1)
    .attr("cx", 150)
    .on("end", repeat);
    
    svgTextMetric
    .text(counter + ' / ' + metric);
    counter++;
  };

This visualization was inspired by Jan Willem Tulp’s COVID-19 spreading rates and Dr James O’Donoghue’s relative rotation periods of planets, and uses the same data as Tulp’s spreading rates.

Legend and polygon colors for Leaflet choropleth using Chroma.js

A Leaflet tutorial uses the following hard-coded getColor function to return colors.

// get color 
function getColor(n) {
    return n > 35 ? '#b10026'
           : n > 30 ? '#e31a1c' 
           : n > 25 ? '#fc4e2a' 
           : n > 20 ? '#fd8d3c'
           : n > 15  ? '#feb24c'
           : n > 10  ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';
}

However, I wanted to use Chroma.js to generate the legend colors dynamically. So I needed a new getColor function.

Chroma.js has a variety of methods to return colors. The one I chose uses scale and classes. These can then be sent as variables to a getColor function to return colors to use in the legend and map.

scale can be single value or an array of two colors (either as hex values or color words). In my case, the first is a light blue and the second is a darker blue. Chroma.js will then return gradients between these two colors. See colorHex variable below.

classes is an array of legend ‘breaks’ for the color gradients. For example they could be the numerical values from the Leaflet tutorial getColor function above (eg 10, 20, 50, etc). See classBreaks variable below.

The new getColor function is shown below:

var classBreaks = [1,50,100,250,500,1000,2000,3000,6000,9000];
var colorHex = ['#deebf7','#08306b'];

function getColor(n,classBreaks,colorHex) {
    var mapScale = chroma.scale(colorHex).classes(classBreaks);
    if (n === 0) {
        var regionColor = '#ffffff';
    } else { 
        var regionColor = mapScale(n).hex();
    }
    return regionColor
}

This getColor function can then be used as described in the Leaflet tutorial to set choropleth polygon fill colors. It can also be used similarly to create the legend by looping through the classes to get a color for each legend entry.

However, there is an important consideration when creating the legend. Using scale and classes, Chroma.js only returns one fewer color than the number of class breaks. For example, the classBreaks array with 10 elements will only return 9 colors. To work around this, I push a dummy element (999) to the array so that Chroma.js returns 10 colors, and then ignore the dummy element when creating the legend.

The legend code below includes a hard-coded zero (0) value set to the color white (#ffffff). It then loops through classBreaks, using the getColor function each time to return the legend color for that break value.
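
The "one fewer color" behaviour follows from the fact that break points delimit intervals: 10 break points enclose 9 intervals. The sketch below shows that counting in plain Python. It is interval arithmetic only, not Chroma.js internals, and clamping out-of-range values into the first or last interval is my own assumption for the example. The name class_index is hypothetical.

```python
from bisect import bisect_right

def class_index(value, breaks):
    """Index of the interval [breaks[i], breaks[i+1]) that contains value."""
    i = bisect_right(breaks, value) - 1
    return min(max(i, 0), len(breaks) - 2)  # clamp out-of-range values

class_breaks = [1, 50, 100, 250, 500, 1000, 2000, 3000, 6000, 9000]
print(len(class_breaks) - 1)            # 9 intervals from 10 break points
print(class_index(75, class_breaks))    # 1, the interval from 50 to 100
print(class_index(9500, class_breaks))  # 8, clamped into the last interval
```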

var legend = L.control({position: 'topright'});

legend.onAdd = function (map) {
    var div = L.DomUtil.create('div', 'legend');
    div.innerHTML += '<i style="background: #ffffff;"></i>0<br>';
    classBreaks.push(999); // add dummy class to extend to get last class color, chroma only returns class.length - 1 colors
    for (var i = 0; i < classBreaks.length; i++) {
        if (i+2 === classBreaks.length) {
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
            classBreaks[i] + '+';
            break
        } else {
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
            classBreaks[i] + '–' + classBreaks[i+1] + '<br>';
        }
    }
    return div;
};
legend.addTo(map);

The final map legend looks like this:

Heat maps of Canadian activity changes due to COVID-19 using Google Community Mobility Reports

During the 2020 COVID-19 pandemic in Canada I wanted to get a better understanding of the geographical distribution of COVID-19 related activity changes across Canada.

Google has helpfully provided freely available global “Community Mobility Reporting”, which shows the change in Google location history compared to a baseline, by country and country sub-region. It provides changes in activity by location category: Workplace, Retail & Recreation, Transit Stations, Grocery & Pharmacy, Parks, and Residential. For Canada the data is available by province. As of April 19, the data contained daily values from Feb 15 to Apr 11.

The Community Mobility Reporting data is available as a single csv file for all countries at the Google Community Mobility Report site. In addition, Google provides a feature to filter for a specific country or country sub-region (e.g. state or province) and download the result in PDF format.

As the COVID-19 lockdowns occurred across Canada you would expect that people were less likely to be in public spaces and more likely to be at home. The Community Mobility Reporting location history allows us to get some insight into whether or not this happened, and if it did, to what degree and how this changed over time.

I used the Community Mobility Report data to create a D3.js heat map visualization which is described in more detail below and in this Github repository.

I also created an Excel version of this heat map visualization using Pivot Table & Chart plus conditional formatting. This Excel file, described in more detail below, is available in the Github repository.

More detail and screenshots of the visualizations are provided below:

Heatmaps
Heatmaps are grids where columns represent dates and rows represent provinces/territories. Each heatmap represents a single mobility report category. The grid cell colors represent the percent change value, which can be positive or negative. Changes can be observed as lockdowns occurred: activity at public locations decreased relative to baseline, while residential activity increased relative to baseline as people sheltered in place at home.
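
The grid layout described above (one row per province/territory, one column per date, one value per cell) amounts to a pivot of long-format records. The sketch below shows that reshaping in standard-library Python as an illustration only. The actual transformation was done in Excel Power Query, and the function name and sample values here are made up.

```python
def to_grid(rows):
    """rows: iterable of (province, iso_date, percent_change) tuples.

    Returns (provinces, dates, grid) where grid[r][c] is the value for
    provinces[r] on dates[c], or None when no value exists.
    """
    rows = list(rows)
    dates = sorted({d for _, d, _ in rows})      # ISO dates sort chronologically
    provinces = sorted({p for p, _, _ in rows})
    lookup = {(p, d): v for p, d, v in rows}
    grid = [[lookup.get((p, d)) for d in dates] for p in provinces]
    return provinces, dates, grid

# Made-up sample values for one category (percent change from baseline).
sample = [
    ('Ontario', '2020-03-15', -20),
    ('Ontario', '2020-03-16', -35),
    ('Alberta', '2020-03-16', -30),
]
provinces, dates, grid = to_grid(sample)
print(provinces)  # ['Alberta', 'Ontario']
print(dates)      # ['2020-03-15', '2020-03-16']
print(grid)       # [[None, -30], [-20, -35]]
```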

1) Heatmap created using Excel / Power Query: For this heatmap visualization the global csv data was transformed using Excel Power Query. The Excel file has two Pivot Table and Chart combos. The Excel files and Power Query M Code are available in the Github repository.

2) Heatmap created using D3.js: For this heatmap visualization the global csv data was transformed using Excel Power Query. The heatmap visualization was created using slightly modified code from ONSvisual.

Bar charts
These were created using Excel / Power Query to visualize percent change by province/territory and location category. They allow comparison between provinces by date and category. This Excel / Power Query file can be used for analytical purposes to slice and dice the global data by date, country, sub region 1 & 2 and category. Excel files are available in the Github repository.