CBC news article comments analysis

Articles on the CBC (Canadian Broadcasting Corporation) news website often have a comments section. It would be interesting to see the interactions between comments and replies, which users comment the most, and which words and phrases are used most frequently.

See the results: https://sitrucp.github.io/cbc_comments/image_grid.html

Comments for a specific CBC opinion article are analysed in detail below.

See a previous post which details how to obtain comments from CBC news and opinion articles. Code for this project can be found in this GitHub repository.

The opinion article was titled “On COVID restrictions, our governments keep firing up the gaslights and shifting the goalposts”. The article garnered 7,800 comments from 1,226 unique users. The comment and user counts include both top-level posts and replies. The comments were posted over a two-day period beginning Dec 03, 2021 4:00 AM ET, after which the comments were locked.

Referring to the line chart below, one can see that 50% (about 615) of the 1,226 users made 90% of the comments. Only 9% (about 105 users) of the users made 50% of the comments!
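The concentration figures above come from a cumulative count of comments per user. A minimal sketch of that calculation follows, assuming the input is simply a list with one user name per comment or reply. The function name and the sample names are hypothetical, not taken from the project code.

```python
from collections import Counter

def users_needed_for_share(comment_users, share):
    """Return how many of the most active users account for `share` of all comments."""
    # Comment counts per user, largest first
    counts = sorted(Counter(comment_users).values(), reverse=True)
    total = sum(counts)
    running = 0
    for n_users, count in enumerate(counts, start=1):
        running += count
        if running >= share * total:
            return n_users
    return len(counts)

# Hypothetical sample: one user name per comment or reply (12 in total)
sample = ["ann"] * 6 + ["bob"] * 3 + ["cat"] * 2 + ["dan"]

print(users_needed_for_share(sample, 0.5))  # 1
print(users_needed_for_share(sample, 0.9))  # 3
```

Dividing the returned number by the count of unique users gives the percentage of users behind that share of comments.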

The “word cloud” chart below shows the names of the top 200 users by comment and reply count. The name size corresponds to user comment and reply counts.

Of the 7,800 comments, 1,744 (22%) were “top-level” comments, i.e. they were not directly replying to another comment. The remaining 6,056 (78%) were replies to another comment. This indicates a lot of interaction between commenters.
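The split between top-level comments and replies can be checked with a few lines of Python. This is only a sketch: it assumes each comment carries a replied-to user name that is `None` for top-level comments, and the function name is made up for illustration.

```python
def reply_split(replied_to_users):
    """Return (top_level_count, reply_count, reply_percent) for a list of
    replied-to user names, where None marks a top-level comment."""
    total = len(replied_to_users)
    top_level = sum(1 for name in replied_to_users if name is None)
    replies = total - top_level
    return top_level, replies, round(100 * replies / total)

# Hypothetical sample of five comments, two of them top-level
print(reply_split([None, "ann", "bob", None, "ann"]))  # (2, 3, 60)

# The article's own counts give the percentages quoted above
print(round(100 * 1744 / 7800), round(100 * 6056 / 7800))  # 22 78
```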

The series of “network” charts below provides some insight into the interactions between users, their comments and replies.

The network charts were created using the Python NetworkX module. The code used to create the NetworkX charts is in another post.

The red circles (“nodes”) are users. The circle size corresponds to user comment counts. The lines (“edges”) connecting the red circles represent interactions between users as replies to comments. The line arrows indicate who was replying to whom.

The first chart is a whole view of the 1,140 users that had at least one reply to their comment. It has 1,140 nodes and 6,000 edges so it makes for a very dense visualization and a big image size. Click on the image to open it in your browser where you will be able to zoom into it and download it if you want.

A closer look below shows more detail. The center of the chart has the users with the greatest number of comments and replies. The outer edges show users with fewer comments and replies.

Another, closer look shows even more detail of the sparse, low comment and reply count users on the edges of the chart.

This final “word cloud” visualization shows the top 200 words in all of the comments.
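The word counts behind a chart like this can be sketched with the standard library alone. The tokenizing rule and the small stop-word list below are assumptions for illustration, since the post does not say how words were split or filtered.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "that"}

def top_words(comments, n=200):
    """Return the n most frequent words across all comment texts."""
    words = []
    for text in comments:
        # Lower-case the text and keep runs of letters and apostrophes
        tokens = re.findall(r"[a-z']+", text.lower())
        words.extend(w for w in tokens if w not in STOP_WORDS)
    return Counter(words).most_common(n)

# Hypothetical sample comments
sample = [
    "The mandate is the problem",
    "Mandate or no mandate, the goalposts moved",
]
print(top_words(sample, 1))  # [('mandate', 3)]
```

The resulting word and count pairs are the kind of input a word cloud renderer scales its text by.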

CBC news comment and replies interaction network visualization

Articles on the CBC (Canadian Broadcasting Corporation) news website often have a comments section. It would be interesting to see the interactions between comments and replies, which users comment the most, and which words and phrases are used most frequently.

See the results: https://sitrucp.github.io/cbc_comments/image_grid.html

Another post details a method to retrieve the comments. Each comment includes a timestamp of when it was posted, the comment text, the comment user name and, if it is a reply, the name of the comment user being replied to.

This information can be aggregated to get counts of posts by comment user name or by date/time. It can also be used to learn more about comment user interactions by visualizing the comment and reply user names in a network visualization using the Python NetworkX module. Code used is provided below.
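As a rough illustration of that aggregation, the sketch below counts comments per user and per hour using only the standard library. The record layout and the timestamp format are assumptions, and the function names are hypothetical. With the data in a pandas dataframe, `df.groupby('comment_user').size()` would give the per-user counts.

```python
from collections import Counter
from datetime import datetime

def counts_by_user(comments):
    """Count comments per user from (user, timestamp) pairs."""
    return Counter(user for user, _ in comments)

def counts_by_hour(comments, fmt="%Y-%m-%d %H:%M:%S"):
    """Count comments per clock hour; the timestamp format is an assumption."""
    return Counter(
        datetime.strptime(stamp, fmt).strftime("%Y-%m-%d %H:00")
        for _, stamp in comments
    )

# Hypothetical sample records
comments = [
    ("ann", "2021-12-03 04:05:10"),
    ("bob", "2021-12-03 04:40:00"),
    ("ann", "2021-12-03 05:01:30"),
]
print(counts_by_user(comments).most_common())
print(sorted(counts_by_hour(comments).items()))
```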

The visualization below illustrates the relationships between 104 comments and replies by comment user for the article “Unvaccinated travellers over the age of 12 barred from planes and trains as of today”. (Note: the comments data for this visualization were obtained just after the article was posted, when it had about 100 comments and replies. Today it has 2000+ comments.)

 

The red circles are “nodes” which represent the comment users. The node size corresponds to the user’s total number of comments or replies. The lines are “edges” and connect nodes. Each edge represents a reply from one user to another user. The edges have arrows that indicate the direction, i.e. who replied to whom.

The edge line widths represent the number of interactions between two nodes. Interactions are comment replies from one person to another (in either direction). The more interactions, the wider the edge line.
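One way to derive such widths is to count replies per pair of users regardless of direction and map each count to a line width. This is a sketch under assumptions: the function names and the width formula are hypothetical, and the code listing further down draws every edge with a fixed `width=1.0`.

```python
from collections import Counter

def interaction_counts(replies):
    """Count replies between each pair of users, ignoring direction.
    `replies` is a list of (replying_user, replied_to_user) pairs;
    self-replies are skipped."""
    return Counter(
        frozenset((replier, target))
        for replier, target in replies
        if replier != target
    )

def edge_width(count, base=1.0, step=0.5):
    """Hypothetical mapping: each extra interaction widens the line by `step`."""
    return base + step * (count - 1)

# Hypothetical sample: ann and bob trade three replies, cat replies to ann once
replies = [("ann", "bob"), ("bob", "ann"), ("cat", "ann"), ("ann", "bob")]
counts = interaction_counts(replies)
for pair, n in counts.items():
    print(sorted(pair), n, edge_width(n))
```

A list of such widths, ordered to match the edge list, could be passed as the `width` argument of `nx.draw_networkx_edges`.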

Most of the article comments sections that were analysed had one or more prolific commenters (represented by larger size nodes). In addition, there are comment users that have a greater number of replies (represented by edges).

Examples of visualizations are provided below. View the complete list of CBC comments visualizations here.

Click on an image to view it full size. Some of the images are very big, and zooming in gives a more detailed view.

On COVID restrictions, our governments keep firing up the gaslights and shifting the goalposts

 

In a fiery speech, O’Toole says Canada is ‘drowning in debt and division’ on Trudeau’s watch

 

Trudeau calls for global carbon tax at COP26 summit

 

RCMP union says it supports a Mountie’s ‘right’ to refuse vaccination

 

View more CBC comments visualizations here.

 

Python code to create the NetworkX charts is provided below and in the GitHub repository.

import networkx as nx
import matplotlib.pyplot as plt
import math

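# df is assumed to be a pandas dataframe holding the retrieved comments
# Keep only replies: drop top-level comments, which have no replied_to_user
df.dropna(subset=['replied_to_user'], how='all', inplace=True)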

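# Build a directed NetworkX graph from the dataframe so that edge arrows
# point from the replying user to the user being replied to;
# 'minutes' is kept as an edge attribute
G = nx.from_pandas_edgelist(df, 'comment_user', 'replied_to_user', 'minutes', create_using=nx.DiGraph())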

# Degree of each node (number of edges attached to it), used below for node sizes
d = nx.degree(G)

# create edges, and weights list for edge colors
# weights are minutes from first comment 
edges, weights = zip(*nx.get_edge_attributes(G,'minutes').items())

# create variable to increase graph figure size based on number of nodes to make more readable
factor = math.sqrt(len(G.nodes()) * 0.01)

# Create plot
plt_width = 25 * factor
plt_height = 25 * factor
fig, ax = plt.subplots(figsize=(plt_width, plt_height))
fig.set_facecolor('black')
ax.set_facecolor('black')

# create layout; kamada_kawai_layout seemed best of the options tried
#pos = nx.spring_layout(G, k=.10, iterations=20)
#pos = nx.spring_layout(G)
pos = nx.kamada_kawai_layout(G)
#pos = nx.fruchterman_reingold_layout(G)

# draw edges
nx.draw_networkx_edges(
    G, 
    pos,
    arrows=True,
    arrowsize=20,
    edgelist=edges,
    edge_color=weights,
    width=1.0,
    edge_cmap=plt.cm.spring,
    node_size=[(d[node]+1) * 200 for node in G.nodes()], # tells edge to go join node on border
)

# draw nodes
nx.draw_networkx_nodes(
    G, 
    pos,
    node_color='red',
    alpha = 0.7,
    edgecolors='white', #color of node border
    node_size=[(d[node]+1) * 200 for node in G.nodes()],
)

# draw labels
nx.draw_networkx_labels(
    G, 
    pos, 
    labels=None, 
    font_size=10, 
    font_color='white', 
    font_family='sans-serif', 
    font_weight='normal', 
    alpha=None, 
    bbox=None, 
    horizontalalignment='center', 
    verticalalignment='center', 
    ax=None, 
    clip_on=False
)

# create variables to use in chart title
min_comment_time = df['comment_time'].min()[:-3]
max_comment_time = df['comment_time'].max()[:-3]
comment_count = len(df)

# create chart title text
# (file_url, file_path_image and file_name are assumed to be defined earlier in the script)
title_text = file_url + '\n' + str(comment_count) + ' comments & replies '+ 'from: ' + min_comment_time + ' to: ' + max_comment_time

# add chart title
plt.title(title_text, fontsize=26 * factor, color='white')

#  save the image in the img folder:
plt.savefig(file_path_image + 'network_' + file_name + '.png', format="PNG")
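
The figure-size scaling in the code above grows with the square root of the node count, which helps explain why the larger charts produce such big images. A small worked example of that arithmetic follows. The helper name is hypothetical, and the pixel estimate assumes a resolution of 100 dots per inch.

```python
import math

def figure_side_inches(node_count, base=25):
    """Side length of the square figure, using the same formula as the chart code."""
    factor = math.sqrt(node_count * 0.01)
    return base * factor

# 100 nodes give a scaling factor of 1, so the figure is 25 x 25 inches
print(round(figure_side_inches(100), 1))   # 25.0

# The 1,140-node chart scales to roughly 84 inches per side
side = figure_side_inches(1140)
print(round(side, 1))                      # 84.4

# Approximate pixel width, assuming 100 dots per inch
print(round(side * 100))                   # 8441
```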