The CBC (Canadian Broadcasting Corporation) news website articles often have a comments section. It would be interesting to see the interactions between comments and replies, to find out who comments most often, and to identify frequently used words and phrases.
See the results: https://sitrucp.github.io/cbc_comments/image_grid.html
The comments section is at the end of the story.
Unfortunately, the comment delivery method makes it very difficult to read all of the comments because it uses the “endless scrolling” format.
This requires clicking a “SHOW MORE” button at the bottom of the comments again and again to show more comments.
In addition, longer comments require clicking a “» more” link to reveal hidden text, and comments with multiple replies require clicking a “SHOW 2 OLDER REPLIES” button to show the remaining replies.
In order to see all of the comments and their complete text, we need a process that effectively clicks through all of the buttons above until every comment and its content is displayed on the webpage.
Once all of the content is visible on the webpage, it can be saved locally, and Python BeautifulSoup can be used to extract the comments and their content and save them in a tabular data format.
Using the Chrome browser’s “Inspect”, “View page source” (Ctrl-U) and “Developer tools” (Ctrl-Shift-i) quickly revealed the relevant HTML tags behind the buttons identified above. These are the elements that need to be “clicked” again and again until all the comments and their content are displayed on the webpage.
Relevant code is provided below and can be found in this Github repository.
View complete list of CBC comments visualizations here.
// SHOW MORE COMMENTS
// div tag will have style="display: none;" if there are no more comments, otherwise it is displayed
<div class="vf-load-more-con" style="display: none;">
    <a href="#" class="vf-load-more vf-text-small vf-strong">Show More</a>
</div>

// SHOW REPLIES
// div tag will have style="display: none;" if there are no more comments, otherwise it is displayed
<div class="vf-comment-replies hidden">
    <a class="vf-replies-button vf-strong vf-text-small" href="#">Show <span class="vf-replies">0</span> older replies</a>
</div>

// SHOW MORE COMMENT TEXT
// tag is displayed only when a comment has hidden text, otherwise the tag is not present
<a href="#" class="vf-show-more" data-action="more">» more</a>
The button clicking was partly automated using the JavaScript below, executed in the Developer tools console. The process currently requires pasting the code into the console and running it manually. Step 1 needs some babysitting to ensure it runs to completion satisfactorily.
The workflow to show all comments and their content is as follows:
- Step 1: Run the “STEP 1 – Show more comments” JavaScript in the browser console.
- Step 2: Run the “STEP 2 – Show replies” JavaScript in the browser console.
- Step 3: Run the “STEP 3 – Show more comment text” JavaScript in the browser console.
At this point, all the comments and their content are displayed on the webpage.
- Step 4: Save webpage locally.
- Step 5: Run the Python script to scrape the local webpage and save the data as a csv file.
- Step 6: Open csv in Excel or analyse using your favourite data visualization tool.
//STEP 1 - Show more comments
// pages with 1000's of comments get slower and the show button can exceed
// the 5000 ms interval, which requires a manual rerun of the script
var timer = setInterval(getMore, 5000);

function getMore() {
    moreDiv = document.getElementsByClassName('vf-load-more-con')[0];
    if (moreDiv.style.display === "none") {
        console.log('vf-load-more comments finished');
        clearInterval(timer);
        return;
    }
    console.log('More comments');
    moreDiv.childNodes[0].nextElementSibling.click();
}

//STEP 2 - Show replies - loops to auto show all comments' replies
var buttons = document.getElementsByClassName('vf-replies-button');
console.log(buttons.length, 'vf-replies-button');
for (var i = 0; i < buttons.length; i++) {
    buttons[i].click();
    console.log('click', i, 'of', buttons.length);
}
console.log('vf-replies-button finished');

//STEP 3 - Show more comment text - loops to show all comments' text
var buttons = document.getElementsByClassName('vf-show-more');
console.log(buttons.length, 'vf-show-more buttons');
for (var i = 0; i < buttons.length; i++) {
    buttons[i].click();
    console.log('click', i, 'of', buttons.length);
}
console.log('vf-show-more comment text finished');
Once all the comments and their content are displayed on the webpage, Step 4 is to save the webpage locally. You need to save it as a complete HTML page so the JavaScript is saved too, otherwise the saved page will be blank.
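Before running the scraper, it is worth checking that the saved page really does contain the fully expanded comments. A minimal sketch, assuming the same class names identified above (here a tiny inline HTML fragment stands in for the saved file):

```python
from bs4 import BeautifulSoup

# in practice, read the locally saved page instead, e.g.:
# html = open('C:/cbc_comments/html/<file_name>.html', encoding='utf8').read()
html = '''
<div class="vf-comments">
  <div class="vf-comment-container" data-id="1"></div>
  <div class="vf-comment-container" data-id="2"></div>
  <div class="vf-load-more-con" style="display: none;"></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# count the comment containers that the scraper will iterate over
threads = soup.find_all('div', class_='vf-comment-container')
print(len(threads), 'comment containers found')

# the "Show More" div should be hidden if every comment was loaded
load_more = soup.find('div', class_='vf-load-more-con')
print('all comments loaded:', load_more['style'] == 'display: none;')
```

If the container count looks far too low, or the "Show More" div is still visible, Step 1 probably did not run to completion.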
Then Step 5 is to run the following Python code to extract the comment data into a csv file.
It uses Python BeautifulSoup to extract the HTML tag data into a Pandas dataframe, which is then saved locally as a csv file.
import re
from datetime import datetime, timedelta

from bs4 import BeautifulSoup
import pandas as pd

file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/', '').replace('/', '_') + '.html'

soup = BeautifulSoup(open(file_path_html + file_name, encoding='utf8').read(), 'html.parser')

publish_date_raw = soup.find('time', class_='timeStamp')['datetime'][:-5]
publish_date = datetime.strptime(str(publish_date_raw), '%Y-%m-%dT%H:%M:%S')

vf_comment_threads = soup.find_all('div', class_='vf-comment-container')

# create comment data list of tuples
comment_data = []
for thread in vf_comment_threads:
    data_id = thread['data-id']
    username = thread.find('button', class_='vf-username').get_text()

    # comment timestamps are relative, e.g. "5 minutes ago" or "2 hours ago"
    comment_time_str = thread.find('span', class_='vf-date').get_text().replace('s ago', '')
    comment_time_int = int(re.sub('[^0-9]', '', comment_time_str))
    elapsed_minutes = 0
    if 'minute' in comment_time_str:
        elapsed_minutes = comment_time_int
    if 'hour' in comment_time_str:
        elapsed_minutes = comment_time_int * 60
    comment_time = publish_date - timedelta(minutes=elapsed_minutes)

    comment_text_raw = thread.find('span', class_='vf-comment-html-content').get_text()

    # replies start with "Reply to @username: "
    if 'Reply to @' in comment_text_raw:
        comment_type = 'reply'
        replied_to_user = comment_text_raw.split(": ", 1)[0].replace('Reply to @', '').strip()
        try:
            comment_text = comment_text_raw.split(": ", 1)[1].strip()
        except IndexError:
            comment_text = 'no text'
    else:
        comment_type = 'parent'
        replied_to_user = ''
        comment_text = comment_text_raw.strip()

    comment_data.append((
        data_id, publish_date, comment_time, username, comment_type,
        replied_to_user, comment_text,
        file_name.replace('.html', ''), file_url))

df_comment_data = pd.DataFrame(
    comment_data,
    columns=[
        'data_id', 'publish time', 'comment_time', 'comment_user',
        'comment_type', 'replied_to_user', 'comment_text',
        'file_name', 'file_url'])

df_comment_data.to_csv(
    file_path_csv + file_name.replace('.html', '.csv'),
    encoding='utf-8', index=False)
Now that you have the data in a tidy tabular csv file, you can do Step 6: open the csv in Excel/Google Sheets or analyse the data using your favourite data visualization tool.
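A few lines of pandas can already answer the “who comments most often” question directly from the csv. This is a minimal sketch, assuming the column names produced by the script above (a small hand-made dataframe stands in for the real csv here):

```python
import pandas as pd

# in practice, read the csv produced by the scraping script, e.g.:
# df = pd.read_csv('C:/cbc_comments/data/<file_name>.csv')
df = pd.DataFrame({
    'comment_user': ['alice', 'bob', 'alice', 'carol', 'alice', 'bob'],
    'comment_type': ['parent', 'reply', 'reply', 'parent', 'parent', 'reply'],
    'replied_to_user': [None, 'alice', 'carol', None, None, 'alice'],
})

# who comments most often
top_users = df['comment_user'].value_counts()
print(top_users.head(10))

# split of parent comments vs replies
type_counts = df['comment_type'].value_counts()
print(type_counts)

# who gets replied to the most (parents have no replied_to_user,
# which read_csv loads as NaN, so drop those first)
most_replied = df['replied_to_user'].dropna().value_counts()
print(most_replied.head(10))
```

The same `value_counts` pattern extends to other columns, such as counting comments per story when several csv files are concatenated.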
Comments Word Cloud
One of the visualizations I created was a comment word cloud. This used the csv file that was created above as a data source.
The Python NLTK (Natural Language Toolkit) was used to remove stop words and punctuation and to tokenize the comment text, and Python WordCloud was used to create the word cloud chart.
import string

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

lemma = nltk.wordnet.WordNetLemmatizer()

# get paths and files
file_path_csv = 'C:/cbc_comments/data/'
file_path_image = 'C:/cbc_comments/image/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/', '').replace('/', '_')

# read csv into df
df = pd.read_csv(file_path_csv + file_name + '.csv')

# drop records with null comment text
df.dropna(subset=['comment_text'], how='all', inplace=True)

# combine all comment text into one large lowercase string
text = ' '.join(comment.lower() for comment in df.comment_text)

# clean up comment text data
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_text1 = [token for token in tokens if token not in stop_words]
filtered_text2 = [idx for idx in filtered_text1 if not any(punc in idx for punc in string.punctuation)]
filtered_text3 = [item for item in filtered_text2 if len(item) > 1]
filtered_text4 = [x for x in filtered_text3 if not isinstance(x, int)]
filtered_text = [lemma.lemmatize(x) for x in filtered_text4]

# create wordcloud
wordcloud = WordCloud(
    width=800,
    height=450,
    font_path='c:/windows/font/ARLRDBD.TTF',
    colormap='rainbow',
    max_font_size=80,
    collocations=False,
    max_words=1000,
    background_color='black'
).generate(' '.join(filtered_text))

plt.figure(figsize=(20, 10), facecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# save the image in the image folder
wordcloud.to_file(file_path_image + file_name + '.png')
The word cloud for this story’s comments looks like this.