CBC news comments data scraping and word cloud visualization

The CBC (Canadian Broadcasting Corporation) news website stories often have a comments section following the news content.

Unfortunately, the comment delivery method makes it very difficult to read all of the comments because it uses the “endless scrolling” format.

This requires clicking a “SHOW MORE” button at the bottom of the comments again and again to show more comments.

In addition, longer comments require clicking a “» more” link to reveal hidden text

and comments with multiple replies requires clicking a “SHOW 2 OLDER REPLIES”  to show more replies.

In order to see all of the comments and their complete text we would need a process that would effectively click through all of the buttons above until all of the comments and their content was displayed on the webpage.

Once all of the content was visible on the webpage it could be saved locally and Python BeautifulSoup could be used to extract all comments and their content and save it in a tabular data format.

Using Chrome browser’s  “Inspect”, “View pge source” (Ctrl-U) and “Developer tools” (Ctrl-Shift-i ) quickly revealed the relevant HTML tags behind the buttons identified above. These are the things that need to be “clicked” again and again until all the comments and their content are displayed on the webpage.

// SHOW MORE COMMENTS
// div tag will have style="display: none;" if there are no more comments otherwise it is displayed
<div class="vf-load-more-con" style="display: none;">
<a href="#" class="vf-load-more vf-text-small vf-strong">Show More</a>
</div>

// SHOW REPLIES
// div tag will have style="display: none;" if there are no more comments otherwise it is displayed
<div class="vf-comment-replies hidden">
<a class="vf-replies-button vf-strong vf-text-small" href="#">Show <span class="vf-replies">0</span> older replies</a>
</div>

// SHOW MORE COMMENT TEXT
// tag is displayed only when comment has hidden text otherwise the tag is not present
<a href="#" class="vf-show-more" data-action="more">» more</a>

The button clicking was somewhat automated using the Javascript below executed in the Developer tools console. The process currently requires pasting the code into the console and manually executing it. Step 1 required some babysitting to ensure it runs to completion satisfactorily.

The workflow to show all comments and their content is as follows:

    • Step 1: Run “STEP 1 – Show more comments” javascript in browser console.
    • Step 2: Run “STEP 2 – Show replies” javascript in browser console.
    • Step 3: Run “STEP 3 – Show more comment text” javascript in browser console.

At this point, all the comments and their content are displayed on the webpage.

    • Step 4: Save webpage locally.
    • Step 5: Run Python script to scape local webpage and save data as csv file.
    • Step 6: Open csv in Excel or analyse using your favourite data visualization tool.
//STEP 1 - Show more comments - pages with 1000's of comments gets slower and show button exceeds 5000 ms so requires manual rerun of script

var timer = setInterval(getMore, 5000);
function getMore() {
    moreDiv = document.getElementsByClassName('vf-load-more-con')[0];
    if(moreDiv.style.display === "none") {
        console.log('vf-load-more comments finished');
        clearInterval(timer);
        return;
    }
    console.log('More comments');
    moreDiv.childNodes[0].nextElementSibling.click();
}

//STEP 2 - Show replies - loops to auto show all comments' replies

var buttons = document.getElementsByClassName('vf-replies-button');
console.log(buttons.length, 'vf-replies-button')
for(var i = 0; i <= buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i ,'of', buttons.length) 
}
console.log('vf-rreplies-button finished');

//STEP 3 - Show more comment text - loops to show all commments' text

var buttons = document.getElementsByClassName('vf-show-more');
console.log(buttons.length, 'vf-show-more buttons')
for(var i = 0; i <= buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i, 'of',buttons.length) 
}
console.log('vf-show-more comment text finished');

Once all the comments and their content are displayed on the webpage, Step 4 is to save the webpage locally. You need to save as complete html page to save the javascript otherwise the page will be blank.

Then Step 5 is to run the following Python code to extract comment data into csv file.

This uses Python BeautifulSoup to extract HTML tag data into a Pandas dataframe which is then saved locally as a csv file.

import sys, os
import csv
import re
from datetime import datetime, timedelta
from bs4 import BeautifulSoup 
import pandas as pd

file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'

file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'

file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_') + '.html'

soup = BeautifulSoup(open(file_path_html + file_name, encoding='utf8').read(), 'html.parser')

publish_date_raw = soup.find('time', class_='timeStamp')['datetime'][:-5]
publish_date = datetime.strptime(str(publish_date_raw), '%Y-%m-%dT%H:%M:%S')
vf_comments = soup.find('div', class_='vf-comments')
vf_comment_threads = soup.find_all('div', class_='vf-comment-container')
vf_usernames = soup.find_all('button', class_='vf-username')

# create comment data list of lists
comment_data = []
replies = []

for thread in vf_comment_threads:
    # children = data_ids.findChildren()
    # div_data_id = soup.find('div', class_='vf-comment')
    data_id = thread['data-id']
    username = thread.find('button', class_='vf-username').get_text()
    comment_time_str = thread.find('span', class_='vf-date').get_text().replace('s ago', '')
    comment_time_int = int(re.sub('[^0-9]', '', comment_time_str))
    if 'minute' in comment_time_str:
        elapsed_minutes = comment_time_int
    if 'hour' in comment_time_str:
        elapsed_minutes = comment_time_int * 60
    comment_text_raw = thread.find('span', class_='vf-comment-html-content').get_text()
    comment_time = publish_date - timedelta(minutes=elapsed_minutes)
    if 'Reply to @' in comment_text_raw:
        comment_type = 'reply'
        replied_to_user = comment_text_raw.split(": ",1)[0].replace('Reply to @', '').strip()
        try:
            comment_text = comment_text_raw.split(": ",1)[1].strip()
        except:
            comment_text = 'no text'
    else:
        comment_type = 'parent'
        replied_to_user = ''
        comment_text = comment_text_raw.strip()

    comment_data.append((
        data_id, 
        publish_date, 
        comment_time,
        username, 
        comment_type, 
        replied_to_user, 
        comment_text, 
        file_name.replace('.html', ''), 
        file_url))

df_comment_data = pd.DataFrame(
    list(comment_data), 
    columns=[
    'data_id', 
    'publish time', 
    'comment_time', 
    'comment_user', 
    'comment_type', 
    'replied_to_user', 
    'comment_text', 
    'file_name',
    'file_url'])

df_comment_data.to_csv(
    file_path_csv + file_name.replace('.html', '.csv'), 
    encoding='utf-8', 
    index=False)

Now that you have a nice tabular format csv data file you can do Step 6 and open the csv in Excel/Google Sheets or analyse the data using your favourite data visualization tool.

Comments Word Cloud

One of the visualizations I created was a comment word cloud. This used the csv file that was created above as a data source.

The Python NLTK  (Natural Language Toolkit) was used to remove stop words and punctuation, tokenize the comment text, and Python WordCloud was used to create the word cloud chart.

import csv
import string
from string import punctuation
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
lemma = nltk.wordnet.WordNetLemmatizer()

# get paths and files
file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'
file_path_image = 'C:/cbc_comments/image/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_')

# read csv into df
df = pd.read_csv(file_path_csv + file_name + '.csv')

# Drop null comment text df records
df.dropna(subset=['comment_text'], how='all', inplace=True)

# Combine comment_text into list of comments
text_list = df['comment_text'].tolist()

# Combine all comment text into one huge text
text = ' '.join(comment.lower() for comment in df.comment_text)

# clean up comment text data
stop_words = set(stopwords.words('english'))
punctuation = list(punctuation)

tokens = word_tokenize(text)
filtered_text1 = [token for token in tokens if token not in stop_words]
filtered_text2 = [idx for idx in filtered_text1 if not any(punc in idx for punc in string.punctuation)]
filtered_text3 = [item for item in filtered_text2 if len(item)>1]
filtered_text4 = [x for x in filtered_text3 if not isinstance(x, int)]
filtered_text = [lemma.lemmatize(x) for x in filtered_text4]

# Create wordcloud
wordcloud = WordCloud(
    width=800, 
    height=450,
    font_path='c:/windows/font/ARLRDBD.TTF',
    colormap='rainbow',
    max_font_size=80,
    collocations=False, 
    max_words=1000, 
    background_color='black'
    ).generate(' '.join(filtered_text))

plt.figure(
    figsize=(20,10), 
    facecolor='k'
    )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# Save the image in the img folder:
wordcloud.to_file(file_path_image + file_name + '.png')

The word cloud for this story’s comments looks like this.