CBC news article comments analysis

The CBC (Canadian Broadcasting Corporation) news website articles often have a comments section. It would be interesting to see the interactions between comments and replies, to learn which users make the most comments, and to find frequently used words and phrases.

See the results: https://sitrucp.github.io/cbc_comments/image_grid.html

Comments for a specific CBC opinion article are analysed in detail below.

See a previous post which details how to obtain comments from CBC news and opinion articles. Code for this project can be found in this GitHub repository.

The opinion article was titled “On COVID restrictions, our governments keep firing up the gaslights and shifting the goalposts“. This article garnered 7,800 comments from 1,226 unique users. The comment and user counts include posts and replies. The comments were posted over a two-day period beginning Dec 03, 2021 4:00 AM ET, after which the comments were locked.

Referring to the line chart below, one can see that 50% of the 1,226 users (about 615) made 90% of the comments. Only 9% of the users (about 105) made 50% of the comments!
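The cumulative counts behind this chart are easy to reproduce. Here is a minimal pandas sketch, assuming a csv of comments produced by the scraping process described in the previous post (the file name below is a placeholder):

import pandas as pd

# Load the comment data csv created by the scraping script (placeholder file name)
df = pd.read_csv('C:/cbc_comments/data/article_comments.csv')

# Comments per user, largest first, then the running share of all comments
per_user = df['comment_user'].value_counts()
cum_share = per_user.cumsum() / per_user.sum()

# Number of top commenters needed to account for 50% and 90% of all comments
users_for_50 = (cum_share < 0.50).sum() + 1
users_for_90 = (cum_share < 0.90).sum() + 1
print(users_for_50, 'users made 50% of comments')
print(users_for_90, 'users made 90% of comments')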

The “word cloud” chart below shows the names of the top 200 users by comment and reply count, with name size corresponding to each user’s count.
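A name cloud like this can be generated with the Python WordCloud module (used again in the word cloud post later on), sized by per-user counts via generate_from_frequencies. A sketch, reusing df from the snippet above:

from wordcloud import WordCloud

# Top 200 users by comment and reply count; name size reflects the count
top_users = df['comment_user'].value_counts().head(200).to_dict()

wc = WordCloud(width=800, height=450, background_color='black', colormap='rainbow')
wc.generate_from_frequencies(top_users)
wc.to_file('user_wordcloud.png')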

Of the 7,800 comments, 1,744 (22%) were “top-level” comments, i.e. they were not directly replying to another comment. The remaining 6,056 (78%) were replies to another comment. This indicates a lot of interaction between comments.
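This split can be read straight off the comment_type column that the scraping script records ('parent' for top-level comments, 'reply' for replies):

# Share of top-level comments vs replies
print(df['comment_type'].value_counts(normalize=True))
# reply     0.78
# parent    0.22  (approximate values for this article)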

The next series of “network” charts below provides some insight into the interactions between users, their comments and replies.

The network charts were created using the Python NetworkX module. The code used to create the NetworkX charts is in another post.

The red circles (“nodes”) are users. The circle size corresponds to user comment counts. The lines (“edges”) connecting the red circles represent interactions between users in the form of replies to comments. The line arrows indicate who was replying to whom.

The first chart is a whole view of the 1,140 users that had at least one reply to their comment. It has 1,140 nodes and 6,000 edges so it makes for a very dense visualization and a big image size. Click on the image to open it in your browser where you will be able to zoom into it and download it if you want.

A closer look below shows more detail. The center of the chart has the users with the greatest number of comments and replies. The outer edges show users with fewer comments and replies.

Another closer look shows even more detail of the sparse low comment and reply count users on the edges of the chart.

This final “word cloud” visualization shows the top 200 words in all of the comments.

CBC news comment and replies interaction network visualization


Another post details a method to retrieve the comments. Each comment includes a timestamp of when it was posted, the comment text, the commenter’s user name and, if it is a reply, the name of the user being replied to.

This information can be aggregated to get counts of posts by user name or date/time. It can also be used to learn more about user interactions by visualizing the comment and reply user names in a network visualization created with the Python NetworkX module. The code used is provided below.

The visualization below illustrates the relationships between 104 comments and replies, by comment user, for the article “Unvaccinated travellers over the age of 12 barred from planes and trains as of today”. (Note: the comments data for this visualization were obtained just after the article was posted, when it had about 100 comments and replies. Today it has 2,000+ comments.)

 

The red circles are “nodes”, which represent the comment users. The node size corresponds to the user’s total number of comments and replies. The lines are “edges” and connect nodes; each edge represents a reply from one user to another. The edge arrows indicate the direction, i.e. who replied to whom.

The edge line widths represent the number of interactions between two nodes. Interactions are comment replies from one person to another (in either direction). The more interactions, the wider the edge line.
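The Python code later in this post draws all edges at a fixed width; one way to vary widths by interaction count is to collapse repeated replies between a pair of users into an edge weight. A sketch, assuming the same dataframe of comments and replies used in that code (this counts each direction separately; summing both directions would require normalizing the user pair first):

import networkx as nx

# Count replies from each user to each other user; the count becomes the edge weight
replies = df.dropna(subset=['replied_to_user'])
pair_counts = replies.groupby(['comment_user', 'replied_to_user']).size().reset_index(name='weight')

# Directed graph so the arrows still show who replied to whom
G = nx.from_pandas_edgelist(pair_counts, 'comment_user', 'replied_to_user', 'weight',
                            create_using=nx.DiGraph())

# Wider edge lines for pairs with more interactions
pos = nx.kamada_kawai_layout(G)
widths = [G[u][v]['weight'] for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, width=widths)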

Most of the article comments sections that were analysed had one or more prolific commenters (represented by larger nodes). In addition, some comment users attract a greater number of replies (represented by edges).

Examples of visualizations are provided below. View the complete list of CBC comments visualizations here.

Click on an image to view it full size; some of them are very big, and you will be able to zoom in to get a more detailed view.

On COVID restrictions, our governments keep firing up the gaslights and shifting the goalposts

 

In a fiery speech, O’Toole says Canada is ‘drowning in debt and division’ on Trudeau’s watch

 

Trudeau calls for global carbon tax at COP26 summit

 

RCMP union says it supports a Mountie’s ‘right’ to refuse vaccination

 

View more CBC comments visualizations here.

 

Python code to create the NetworkX charts is provided below and in the GitHub repository.

import networkx as nx
import matplotlib.pyplot as plt
import math
import pandas as pd

# Load the comment data csv created by the scraping script (paths as in that script)
file_path_csv = 'C:/cbc_comments/data/'
file_path_image = 'C:/cbc_comments/image/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/', '').replace('/', '_')
df = pd.read_csv(file_path_csv + file_name + '.csv')

# Minutes elapsed from the first comment, used below to colour the edges
comment_times = pd.to_datetime(df['comment_time'])
df['minutes'] = (comment_times - comment_times.min()).dt.total_seconds() / 60

# Drop comments without any replies
df.dropna(subset=['replied_to_user'], how='all', inplace=True)

# Build a directed NetworkX graph from the dataframe; a directed graph is needed
# so the edge arrows can show who replied to whom
G = nx.from_pandas_edgelist(df, 'comment_user', 'replied_to_user', 'minutes',
                            create_using=nx.DiGraph())

# Create node size variable
d = nx.degree(G)

# create edges, and weights list for edge colors
# weights are minutes from first comment 
edges, weights = zip(*nx.get_edge_attributes(G,'minutes').items())

# create variable to increase graph figure size based on number of nodes to make more readable
factor = math.sqrt(len(G.nodes()) * 0.01)

# Create plot
plt_width = 25 * factor
plt_height = 25 * factor
fig, ax = plt.subplots(figsize=(plt_width, plt_height))
fig.set_facecolor('black')
ax.set_facecolor('black')

# create layout kamada_kawai_layout seemed best!
#pos = nx.spring_layout(G, k=.10, iterations=20)
#pos = nx.spring_layout(G)
pos = nx.kamada_kawai_layout(G)
#pos = nx.fruchterman_reingold_layout(G)

# draw edges
nx.draw_networkx_edges(
    G, 
    pos,
    arrows=True,
    arrowsize=20,
    edgelist=edges,
    edge_color=weights,
    width=1.0,
    edge_cmap=plt.cm.spring,
    node_size=[(d[node]+1) * 200 for node in G.nodes()], # match node sizes so arrows stop at node borders
)

# draw nodes
nx.draw_networkx_nodes(
    G, 
    pos,
    node_color='red',
    alpha = 0.7,
    edgecolors='white', #color of node border
    node_size=[(d[node]+1) * 200 for node in G.nodes()],
)

# draw labels
nx.draw_networkx_labels(
    G, 
    pos, 
    font_size=10, 
    font_color='white', 
    font_family='sans-serif', 
    clip_on=False
)

# create variables to use in chart title
min_comment_time = df['comment_time'].min()[:-3]
max_comment_time = df['comment_time'].max()[:-3]
comment_count = len(df)

# create chart title text
title_text = file_url + '\n' + str(comment_count) + ' comments & replies '+ 'from: ' + min_comment_time + ' to: ' + max_comment_time

# add chart title
plt.title(title_text, fontsize=26 * factor, color='white')

#  save the image in the img folder:
plt.savefig(file_path_image + 'network_' + file_name + '.png', format="PNG")


Canada cell tower distribution

opencellid.org has global cell tower location data that I used to create a series of QGIS maps showing Canada’s cell phone tower distribution. The data includes each tower’s latitude, longitude and radio generation, e.g. 4G (LTE) or 3G/2G (GSM, UMTS or CDMA).
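For anyone wanting to reproduce the map layers, below is a rough sketch of how the OpenCellID csv export can be filtered to Canada and split into the 4G and 3G/2G layers before loading into QGIS. The column names (radio, mcc, lon, lat, etc.) are from the standard OpenCellID export; the file name is a placeholder:

import pandas as pd

# Standard OpenCellID export columns include: radio, mcc, net, area, cell, lon, lat, ...
df = pd.read_csv('cell_towers.csv')

# Keep Canadian towers only (mobile country code 302)
canada = df[df['mcc'] == 302]

# Split into 4G (LTE) and 3G/2G (GSM, UMTS, CDMA) layers for QGIS
canada[canada['radio'] == 'LTE'].to_csv('canada_4g.csv', index=False)
canada[canada['radio'].isin(['GSM', 'UMTS', 'CDMA'])].to_csv('canada_3g2g.csv', index=False)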

Canada cell tower distribution
4G (red) and 3G/2G (blue)

Southern Ontario cell tower distribution
4G (red) and 3G/2G (blue)

Southern Quebec cell tower distribution
4G (red) and 3G/2G (blue)

Canada Maritimes (NB, NS, PEI) cell tower distribution
4G (red) and 3G/2G (blue)

Newfoundland cell tower distribution
4G (red) and 3G/2G (blue)

Canada Prairies (AB, SK, MB) cell tower distribution
4G (red) and 3G/2G (blue)

BC and Alberta cell tower distribution
4G (red) and 3G/2G (blue)

Yukon and Northwest Territories cell tower distribution
4G (red) and 3G/2G (blue)

Nunavut cell tower distribution
4G (red) and 3G/2G (blue)

 

CBC news comments data scraping and word cloud visualization


The comments section is at the end of each CBC news story.

Unfortunately, the comment delivery method makes it very difficult to read all of the comments because it uses an “endless scrolling” format.

This requires clicking a “SHOW MORE” button at the bottom of the comments again and again to show more comments.

In addition, longer comments require clicking a “» more” link to reveal hidden text, and comments with multiple replies require clicking a “SHOW 2 OLDER REPLIES” button to show more replies.

In order to see all of the comments and their complete text, we would need a process that effectively clicks through all of the buttons above until all of the comments and their content are displayed on the webpage.

Once all of the content was visible on the webpage, it could be saved locally, and Python BeautifulSoup could be used to extract all the comments and their content and save them in a tabular data format.

Using the Chrome browser’s “Inspect”, “View page source” (Ctrl-U) and “Developer tools” (Ctrl-Shift-i) quickly revealed the relevant HTML tags behind the buttons identified above. These are the elements that need to be “clicked” again and again until all the comments and their content are displayed on the webpage.

Relevant code is provided below and can be found in this Github repository.

View complete list of CBC comments visualizations here.

// SHOW MORE COMMENTS
// div tag will have style="display: none;" if there are no more comments otherwise it is displayed
<div class="vf-load-more-con" style="display: none;">
<a href="#" class="vf-load-more vf-text-small vf-strong">Show More</a>
</div>

// SHOW REPLIES
// div tag will have class "hidden" if there are no more replies to show, otherwise the replies button is displayed
<div class="vf-comment-replies hidden">
<a class="vf-replies-button vf-strong vf-text-small" href="#">Show <span class="vf-replies">0</span> older replies</a>
</div>

// SHOW MORE COMMENT TEXT
// tag is displayed only when comment has hidden text otherwise the tag is not present
<a href="#" class="vf-show-more" data-action="more">» more</a>

The button clicking was somewhat automated using the JavaScript below, executed in the Developer tools console. The process currently requires pasting the code into the console and manually executing it. Step 1 requires some babysitting to ensure it runs to completion satisfactorily.

The workflow to show all comments and their content is as follows:

    • Step 1: Run “STEP 1 – Show more comments” javascript in browser console.
    • Step 2: Run “STEP 2 – Show replies” javascript in browser console.
    • Step 3: Run “STEP 3 – Show more comment text” javascript in browser console.

At this point, all the comments and their content are displayed on the webpage.

    • Step 4: Save webpage locally.
    • Step 5: Run Python script to scrape local webpage and save data as csv file.
    • Step 6: Open csv in Excel or analyse using your favourite data visualization tool.
//STEP 1 - Show more comments - pages with 1000's of comments get slower and the show button can exceed the 5000 ms interval, so the script requires a manual rerun

var timer = setInterval(getMore, 5000);
function getMore() {
    moreDiv = document.getElementsByClassName('vf-load-more-con')[0];
    if(moreDiv.style.display === "none") {
        console.log('vf-load-more comments finished');
        clearInterval(timer);
        return;
    }
    console.log('More comments');
    moreDiv.childNodes[0].nextElementSibling.click();
}

//STEP 2 - Show replies - loops to auto show all comments' replies

// snapshot the live collection so clicks that mutate the DOM don't skip buttons
var buttons = Array.from(document.getElementsByClassName('vf-replies-button'));
console.log(buttons.length, 'vf-replies-button')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i ,'of', buttons.length) 
}
console.log('vf-replies-button finished');

//STEP 3 - Show more comment text - loops to show all comments' text

var buttons = Array.from(document.getElementsByClassName('vf-show-more'));
console.log(buttons.length, 'vf-show-more buttons')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i, 'of', buttons.length) 
}
console.log('vf-show-more comment text finished');

Once all the comments and their content are displayed on the webpage, Step 4 is to save the webpage locally. You need to save it as a complete HTML page so the JavaScript-rendered content is saved, otherwise the page will be blank.

Then Step 5 is to run the following Python code to extract the comment data into a csv file.

This uses Python BeautifulSoup to extract the HTML tag data into a Pandas dataframe, which is then saved locally as a csv file.

import sys, os
import csv
import re
from datetime import datetime, timedelta
from bs4 import BeautifulSoup 
import pandas as pd

file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'

file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'

file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_') + '.html'

soup = BeautifulSoup(open(file_path_html + file_name, encoding='utf8').read(), 'html.parser')

publish_date_raw = soup.find('time', class_='timeStamp')['datetime'][:-5]
publish_date = datetime.strptime(str(publish_date_raw), '%Y-%m-%dT%H:%M:%S')
vf_comment_threads = soup.find_all('div', class_='vf-comment-container')

# create comment data list of tuples
comment_data = []

for thread in vf_comment_threads:
    data_id = thread['data-id']
    username = thread.find('button', class_='vf-username').get_text()
    # comment age is shown as e.g. "5 minutes ago", "3 hours ago" or "1 day ago"
    comment_time_str = thread.find('span', class_='vf-date').get_text().replace('s ago', '')
    comment_time_int = int(re.sub('[^0-9]', '', comment_time_str))
    if 'minute' in comment_time_str:
        elapsed_minutes = comment_time_int
    if 'hour' in comment_time_str:
        elapsed_minutes = comment_time_int * 60
    if 'day' in comment_time_str:
        elapsed_minutes = comment_time_int * 60 * 24
    comment_text_raw = thread.find('span', class_='vf-comment-html-content').get_text()
    comment_time = publish_date - timedelta(minutes=elapsed_minutes)
    if 'Reply to @' in comment_text_raw:
        comment_type = 'reply'
        replied_to_user = comment_text_raw.split(": ",1)[0].replace('Reply to @', '').strip()
        try:
            comment_text = comment_text_raw.split(": ",1)[1].strip()
        except IndexError:
            comment_text = 'no text'
    else:
        comment_type = 'parent'
        replied_to_user = ''
        comment_text = comment_text_raw.strip()

    comment_data.append((
        data_id, 
        publish_date, 
        comment_time,
        username, 
        comment_type, 
        replied_to_user, 
        comment_text, 
        file_name.replace('.html', ''), 
        file_url))

df_comment_data = pd.DataFrame(
    list(comment_data), 
    columns=[
    'data_id', 
    'publish_time', 
    'comment_time', 
    'comment_user', 
    'comment_type', 
    'replied_to_user', 
    'comment_text', 
    'file_name',
    'file_url'])

df_comment_data.to_csv(
    file_path_csv + file_name.replace('.html', '.csv'), 
    encoding='utf-8', 
    index=False)

Now that you have a nice tabular format csv data file, you can do Step 6: open the csv in Excel/Google Sheets, or analyse the data using your favourite data visualization tool.

Comments Word Cloud

One of the visualizations I created was a comment word cloud. This used the csv file that was created above as a data source.

The Python NLTK (Natural Language Toolkit) was used to remove stop words and punctuation and to tokenize the comment text, and the Python WordCloud module was used to create the word cloud chart.

import string
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# NLTK data needed once: nltk.download('stopwords'), nltk.download('punkt'), nltk.download('wordnet')
lemma = WordNetLemmatizer()

# get paths and files
file_path_csv = 'C:/cbc_comments/data/'
file_path_image = 'C:/cbc_comments/image/'
file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'
file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_')

# read csv into df
df = pd.read_csv(file_path_csv + file_name + '.csv')

# Drop null comment text df records
df.dropna(subset=['comment_text'], how='all', inplace=True)

# Combine all comment text into one string, lower-cased
text = ' '.join(comment.lower() for comment in df.comment_text)

# clean up comment text data: remove stop words and punctuation, drop short tokens, lemmatize
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)
filtered_text1 = [token for token in tokens if token not in stop_words]
filtered_text2 = [token for token in filtered_text1 if not any(punc in token for punc in string.punctuation)]
filtered_text3 = [token for token in filtered_text2 if len(token) > 1]
filtered_text = [lemma.lemmatize(token) for token in filtered_text3]

# Create wordcloud
wordcloud = WordCloud(
    width=800, 
    height=450,
    font_path='c:/windows/fonts/ARLRDBD.TTF',
    colormap='rainbow',
    max_font_size=80,
    collocations=False, 
    max_words=1000, 
    background_color='black'
    ).generate(' '.join(filtered_text))

plt.figure(
    figsize=(20,10), 
    facecolor='k'
    )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# Save the image in the img folder:
wordcloud.to_file(file_path_image + file_name + '.png')

The word cloud for this story’s comments looks like this.

Xero.com data and reporting

Once upon a time, I was using Xero to record and manage a company’s finances and wanted to use Zebra BI for Power BI to create interactive income statement actual vs plan vs budget reporting.

Zebra BI Features

Zebra BI offers very compelling visualization tools to create financial reporting in Power BI and Excel, specifically for comparing actuals vs plan and budget. Zebra BI’s product is a Power BI and Excel plugin. Using it in Power BI is as simple as choosing Zebra BI as a visualization in the same way as any other Power BI chart object.

Zebra BI Example Visualizations

Xero API

The primary challenge to using Power BI with Xero is deciding the best way to get data out of Xero so that it can be used by Power BI.

Xero has an API that provides access to almost all of the Xero data. The API can be used directly or indirectly via many third parties. However, the API features and limitations require careful consideration to determine if they meet your use case.

The Xero API has some good documentation as well as an interactive API explorer that can be used in a no-code way to learn about the API endpoint features and data returned.

Two methods of using the Xero API:

    • Directly from Xero – the API can be used directly by registering for API access and developing your own code and process.
    • Indirectly via third parties – a variety of third parties, including a range of SaaS platforms and tools, connect to the API and retrieve Xero data on your behalf. These provide a wide range of methods and Xero data “modelling” and transformation to enable the data to be readily used for financial reporting.

I ended up conducting a selection process to identify the best solution for getting Xero data, which considered a variety of Xero API and third party limitations.

Requirements Summary

    • Multi-year Xero data retrieval and analysis, e.g. 2019, 2020, 2021.
    • Use Power BI to enable Zebra BI reporting.
    • Data must include actuals, budget and multiple forecasts, and the ability to create previous year metrics.
    • A method to allow ad-hoc “account mapping”, e.g. the ability to categorize accounts beyond Xero tracking categories, such as grouping all travel as “Travel”.
    • Entirely automated to minimize the manual effort required. This includes retrieval and transformation of Xero data and Power BI report refresh.

Xero API Limitations

    • API 11-month per call limit – need workarounds to get more (see the sketch after this list).
    • Scheduling – need a method to schedule data refresh.
    • Update existing data – either delete all or update existing data.
    • Data modelling – the Xero Report APIs provide Xero modelling, however additional modelling is required to get the desired datasets.
    • Budget data – need a method to incorporate Xero budget data. Xero has only recently begun providing budget data via the API.
    • Forecast data – need a method to incorporate forecast data. Xero does not have forecast data functionality; however, a workaround is to create a budget that holds forecast data.
    • Account grouping – how to categorize accounts other than Xero tracking categories, e.g. group all travel as “Travel”.
    • Implementation & support complexity – code, technical mix, processes, components, environments.
    • All methods require extraction of Xero data into some intermediary storage location, e.g. a data warehouse or data mart.
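To illustrate the first limitation above, here is a hedged sketch of looping over years to work around the 11-period cap on the Profit and Loss report endpoint. The endpoint and parameters are from Xero’s Accounting API documentation; OAuth token acquisition is omitted, and the token and tenant id below are placeholders:

import requests

# The Profit and Loss report allows at most 11 comparison periods per call,
# so multi-year monthly history has to be fetched one year at a time.
url = 'https://api.xero.com/api.xro/2.0/Reports/ProfitAndLoss'
headers = {
    'Authorization': 'Bearer YOUR_ACCESS_TOKEN',  # placeholder
    'Xero-tenant-id': 'YOUR_TENANT_ID',           # placeholder
    'Accept': 'application/json',
}

reports = []
for year in (2019, 2020, 2021):
    params = {
        'fromDate': f'{year}-01-01',
        'toDate': f'{year}-12-31',
        'timeframe': 'MONTH',
        'periods': 11,  # 11 comparison periods + the report period = 12 months
    }
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    reports.append(resp.json())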

Summary

The graphic below summarizes the selection investigation results and comparison, and highlights my top two choices that best satisfied my requirements.

Note this comparison doesn’t include costs. However, generally speaking, these solutions have more or less similar costs.

My requirements were best met by Acterys, which provides Xero API data ETL, an advanced data warehouse, and scheduling and user administration capabilities. Of the remaining solutions, FreshBI was ranked second.

Note that there are a lot of other potential solutions, including the many Xero application partners and the many cloud data integration tools, but these were the ones that I found to be most useful.

Solution Details

FreshBI Power BI Custom Connector for Xero Accounting was initially selected, primarily because it was a quick and easy way to get Xero API data.

    • The FreshBI team has developed their own custom Power BI Xero connector that they sell for a fixed price. Once you have the connector you are free to modify it as required, which is what I did.
    • The FreshBI connector connects to the API and returns data to Power BI as Power BI queries. The connector can be accessed and the query code modified as desired to get data into the desired shape, or to retrieve multiple years of data.
    • However, the code required non-trivial modification to get the desired data. The original connector retrieves only 1 year of pivoted-format data. I modified the query code to unpivot the data months from columns to rows, but stopped before making further modifications to retrieve more than 1 year of data. (Interesting note: based on my discussions with FreshBI, they said they may modify the connector to provide this capability, which they would make available as an update to the purchased connector.)
    • The connector cannot refresh from Xero from PBI Online. Only Microsoft ‘certified’ connectors have this capability, and FreshBI said Microsoft will certify their connector only if it has many more users. Hence, a Power BI pbix file published to PBI Online can only refresh data by using the PBI Gateway, a free app that runs the connector on a laptop or server and syncs Xero data and the PBI Online dataset.
    • Support complexity was deemed high due to the modifications, multiple components and environments.

Acterys was ultimately selected because it provides the best supported, most robust Xero data solution.

    • Acterys has developed a sophisticated modern BI data model and process which meets all data requirements, e.g. actuals, budget, forecast, a relational star schema, and processes to import budget and forecast data and modify existing datasets.
    • Includes a Microsoft Azure SQL Server database for Xero data storage as the reporting data source.
    • The Xero data refresh process includes scheduling, status email notifications, and an update process for previously retrieved data: changes made to data within the past 7 days are updated in the Acterys database.
    • Support complexity was deemed medium; though the Acterys solution is quite complex, it is built and supported by Acterys for all of their customers, which reduces end user support complexity.

ODataLink and CData were not selected as they are essentially alternate methods of the FreshBI connector’s API data acquisition, and do not offer additional data modelling beyond the Xero Report API. (Interesting note: as a result of my discussions with them, ODataLink added new functionality to retrieve more than 1 year of data.)

SyncHub was not selected because, while it provides excellent data syncing capabilities, it does not have any data modelling, simply providing raw Xero database table data.

The Xero PBI App is a Power BI application available in the Microsoft Power BI app store. It is best considered a Xero “showcase”. While the reports can be modified, or new reports created using the dataset, the data covers only a rolling 1-year period, and it is not possible to access the report pbix file to inspect or modify the dataset, so it is limited to its “showcase” reporting purpose.

Hopefully this write-up is helpful when deciding how to get data out of Xero for analytical and reporting purposes. While researching this topic, it seemed to me that there is still a lot of opportunity to provide Xero or other ERP / financial application data integration and reporting capability.