Scraping a public health web page to get data

During the 2020 COVID-19 pandemic in Canada I wanted to get confirmed COVID-19 case count data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.

I wanted to use this data in a Leaflet choropleth map visualization. If interested, you can read more about this visualization in another blog post.

There are many ways to get data from web pages. At first I did it manually by copying and pasting into Excel. This is fine for a one-time analysis, and you can even use Excel Power Query’s web feature to automate it a bit more. However, if you want to fully automate getting data from a web page you should use web scraping techniques.

Note that the code described below is available in this Github repository https://github.com/sitrucp/covid_montreal_scrape_data.

Initial data retrieval and transformation done using Excel
To get the web page data, at first I simply copied and pasted it manually into an Excel workbook. This was quite easy to do as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do some more complex data transformations I switched to using Excel Power Query’s web query feature to retrieve the data and Power Query transformations to shape it for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated, with the data retrieval and transformation process run on a scheduled basis.

In addition to being scraped from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

As the choropleth map visualization is hosted on Github.io, I could have used Git on the web server to do an automated, scheduled push of new data from the web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading data to AWS S3 was done in Python. The Python code is run on a scheduled basis using a cron job on a web server. The cron job runs a few times each evening, when the data is usually updated on the website.
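For reference, the cron schedule for this kind of job might look something like the crontab entry below; the script path and log file name here are placeholders, not the actual repository layout:

# run the scrape a few times each evening (8, 9 and 10 pm) and append output to a log
0 20,21,22 * * * /usr/bin/python3 /home/user/covid_montreal_scrape_data/scrape_data.py >> /home/user/logs/scrape_cron.log 2>&1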

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to the AWS S3 bucket and made available to the choropleth map visualization.

The Python module Boto was used to upload the data from the web server to the AWS S3 bucket.

Let’s go through the code.

Requests was used to get the web page and BeautifulSoup4 to parse it and find the specific table that holds the tabular data, as shown below. The table with the counts by neighbourhood was the 4th table on the web page:

import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = 'https://santemontreal.qc.ca/en/public/coronavirus-covid-19/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# get all tables on webpage
tables = soup.find_all('table')
# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and add more readable, consistent column headers, as shown below:

import pandas as pd

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]
# rename columns
df_table_data_all_cols.columns = ['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']
# keep the columns needed for the map
df_table_data = df_table_data_all_cols[['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The web page table dataframe was then merged with the “lookup” dataframe. This merge is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.

Of course, I could have modified the map’s geographical region boundary names to match the Health Montreal naming convention, but creating a “lookup” table was easier and provided flexibility in case the Health Montreal table’s names changed (which they did, in fact, several times!).
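As an illustration, the lookup could be as simple as a small two-column table read into a dataframe. The rows below are made-up examples of the mapping, not the actual lookup values:

import pandas as pd

# hypothetical lookup: one row per neighbourhood, mapping the Health Montreal
# website name to the geojson boundary name used by the Leaflet map
df_montreal_regions_lookup = pd.DataFrame([
    {'website_name': 'Ahuntsic-Cartierville', 'geojson_name': 'Ahuntsic - Cartierville'},
    {'website_name': 'Cote-des-Neiges-Notre-Dame-de-Grace', 'geojson_name': 'Cote-des-Neiges - Notre-Dame-de-Grace'},
    # ... one row for each of the remaining neighbourhoods
])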

The Python code checks whether the current data on the web page is the same as the previously retrieved data, since I only wanted to upload new data to AWS S3 when necessary. This check is done by comparing the total case count on the web page to the previously saved case count:

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    # create scrape result string to print to cron log
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev
    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')
    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)
    # write today's date to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)
    upload_to_aws()

If the counts are the same the code stops. If the new count is different from the previous count, the code creates new data files that are uploaded to the AWS S3 bucket.

A scrape_result string is also created and written to the cron log.
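For context, below is a minimal sketch of how the two totals being compared might be produced. It assumes the scraped table has a total row and that the previous total is saved in a small text file on the web server; the file name and filter are illustrative, not necessarily what the repository code does:

import os

# current total case count taken from the table just scraped
# (assumes the table includes a row whose region_name contains 'Total')
total_row = df_table_data[df_table_data['region_name'].str.contains('Total', case=False, na=False)]
str_total_case_new = str(total_row['case_count'].values[0])

# previous total case count saved after the last successful scrape
prev_total_file = 'prev_total_case_count.txt'
if os.path.exists(prev_total_file):
    with open(prev_total_file) as f:
        str_total_case_prev = f.read().strip()
else:
    str_total_case_prev = ''

# after new data is uploaded, str_total_case_new would be written back to prev_total_file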

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

from boto.s3.connection import S3Connection

# create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a public policy so anyone can read the data files. Each file in the bucket has a public url, so the map visualization can simply reference these to get the data.

However, authentication is required in order to transfer the data from the web server to the S3 bucket, so there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects that handle authentication.

An IAM User Policy was created to allow that User to write, read and delete on that bucket. The User has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course the key and secret should not be exposed, so they are imported into the Python code from another non-public location on the web server.
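As a rough illustration only (not the exact policy used), an IAM user policy granting read, write and delete on the bucket would look something like the JSON below, built here as a Python dictionary:

import json

# illustrative IAM user policy allowing list, read, write and delete on the bucket
# (an assumption for illustration, not the actual policy attached to the user)
iam_user_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::canada-covid-data"]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::canada-covid-data/*"]
        }
    ]
}

print(json.dumps(iam_user_policy, indent=2))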

Once the connection is made, the Python code deletes the existing files in the S3 bucket before uploading the new files:

  
from os import listdir
from os.path import isfile, join
from boto.s3.key import Key

# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    bucket.delete_key(key)

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the updated data the next time it is viewed or the browser page is refreshed.

Dell eCommerce web site scraping analysis

Once upon a time, I needed to find Dell monitor data to analyse.

A quick search brought me to their eCommerce web site, which had all the monitor data I needed; all I had to do was get the data out of the website.

To get the data from the website I used Python with the urllib2 and BeautifulSoup modules to scrape the web pages and write the data to a csv file.

Based on the data I got from the site the counts of monitors by size and country are presented below.


 

However, this data is probably not accurate. In fact I know it isn’t. There was a surprising amount of variation in the monitor descriptions, including screen size, which made it hard to get quick, accurate counts. I had to do some data munging to clean up the data but there is still a bit more to do.

The surprising thing is that there do not appear to be specific data points for each of the monitor description components. This website is being generated from a data source, likely a database, that contains Dell’s products. This database does not appear to have separate fields for each data point used to categorize and describe Dell monitors.

The reason I say this is that each monitor description is a single string of text. Within the text string are things like the monitor size, model, common name, and various other features.

These are not in the same order, and they do not all have the same spelling or format, such as the use of text separators and lower or upper case.

Most descriptions are formatted like this example:

“Dell UltraSharp 24 InfinityEdge Monitor – U2417H”.

However, many variations on this format are listed below. There is obviously no standardization in how Dell enters monitor descriptions on their eCommerce site.

  • Monitor Dell S2240T serie S 21.5″
  • Dell P2214H – Monitor LED – 22-pulgadas – 1920 x 1080 – 250 cd/m2 – 1000:1 – 8 ms – DVI-D
  • Dell 22 Monitor | P2213 56cm(22′) Black No Stand
  • Monitor Dell UltraSharp de 25″ | Monitor UP2516D
  • Dell Ultrasharp 25 Monitor – UP2516D with PremierColor
  • Dell 22 Monitor – S2216M
  • Monitor Dell UltraSharp 24: U2415
  • Dell S2340M 23 Inch LED monitor – Widescreen 60Hz Full HD Monitor

Some descriptions include the monitor size unit of measurement, usually in inches, sometimes in centimeters, and sometimes none at all.

Sometimes hyphens are used to separate description sections, but other times the pipe character ( | ) is used. It’s a real mishmash.

Descriptions do not have a consistent order of components. Sometimes the part number is after the monitor size, sometimes it is elsewhere.

The problem with this is that customers browsing the site will have to work harder to compare monitors while taking these variations into account.

I’d bet this leads to lost sales or poorly chosen sales that result in refunds or disappointed customers.

I’d also bet that Dell enterprise customers and resellers have a hard time parsing these monitor descriptions too.

This affected my ability to easily analyze monitors by their description components, because those components were not in predictable locations and were presented in many different formats.
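To illustrate the parsing problem, below is a rough sketch of pulling the model number and screen size out of a few of the description variants with regular expressions. The patterns are assumptions tuned to the examples above, not the parsing used in the scraper code further down (which simply splits the description on hyphens):

import re

# a few of the description variants listed above
samples = [
    'Monitor Dell S2240T serie S 21.5"',
    "Dell 22 Monitor | P2213 56cm(22') Black No Stand",
    'Dell UltraSharp 24 InfinityEdge Monitor - U2417H',
]

# model numbers look like S2240T, P2213, UP2516D, U2417H
model_re = re.compile(r'\b(U?[A-Z]\d{3,4}[A-Z]{0,2})\b')
# first standalone 2-digit number (optionally with one decimal) as the screen size
size_re = re.compile(r'(?<![\w.])(\d{2}(?:\.\d)?)(?![\w.])')

for desc in samples:
    model = model_re.search(desc)
    size = size_re.search(desc)
    print('%s -> model=%s, size=%s' % (desc,
          model.group(1) if model else 'unknown',
          size.group(1) if size else 'unknown'))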

Another unusual finding was that Dell appears to have assigned a default set of 7 monitors to a large number of two-digit country codes. For example, Bhutan (bt) and Bolivia (rb) both have the same 7 records, as do many others. Take a look at the count of records per country at the bottom of the page; many countries have only 7 monitors.

Here is the step by step process used to scrape this data.

The screenshot below shows the eCommerce web site page structure. The monitor information is presented on the page in a set of nested HTML tags.

dell ecommerce screenshot

These nested HTML tags can be scraped relatively easily. A quick review revealed that the web pages contained identifiable HTML tags that held the data I needed. Those tags are named in the Python code below.

The website’s url also had a consistent structure, so I could automate navigating through paged results as well as through multiple countries to get monitor data for more than one Dell country in the same session.

Below is an example of the url for the Dell Canada eCommerce web site’s page 1:

http://accessories.dell.com/sna/category.aspx?c=ca&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=1

The only two variables in the url that change for crawling purposes are:

  • The “c” variable is a 2-character country code, e.g. “ca” = Canada, “sg” = Singapore, “my” = Malaysia, etc.
  • The “p” variable is the page number within a country’s paged results, with about 10 monitors shown per page. No country I looked at had more than 5 pages of monitors.

Dell is a multi-national corporation so likely has many countries in this eCommerce database.

Rather than guess what they are, I got a list of two-character country codes from Wikipedia that I could use to create urls and see whether each country has data. As a bonus, the Wikipedia list also gives me the country name.

The Wikipedia country code list needs a bit of clean-up. Some entries are clearly not countries but some type of administrative designation, and some countries are listed twice with two country codes; for example, Argentina has “ar” and “ra”. For practical purposes, if a Dell url can’t be retrieved for a country code in the list, the code just skips to the next country code.

The Python code I used is shown below. It outputs a csv file with the website data for each country with the following columns:

  • date (of scraping)
  • country_code (country code entered from Wikipedia)
  • country (country name from Wikipedia)
  • page (page number of website results)
  • desc (HTML tag containing string of text)
  • prod_name (parsed from desc)
  • size (parsed from desc)
  • model (parsed from desc)
  • delivery (HTML tag containing just this string)
  • price (HTML tag containing just this string)
  • url (url generated from country code and page)

The code loops through the list of countries I got from Wikipedia, and within each country it also loops through the pages of results while pagenum < 6.

I hard-coded the page loop limit to 6 as no country had more than 5 pages of results. I could have used other methods, perhaps looping until the url returned a 404 or an empty results page, but it was easier to hard-code it based on manual observation.

Dell eCommerce website scraping Python code

#-*- coding: utf-8 -*-
import urllib2
import urllib
from cookielib import CookieJar

from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

countries={
    'AC':'Ascension Island',
    'AD':'Andorra',
    'AE':'United Arab Emirates',
     ... etc
    'ZM':'Zambia',
    'ZR':'Zaire',
    'ZW':'Zimbabwe'
}

def main():

    output = list()
    todaydate = datetime.today().strftime('%Y-%m-%d')
    
    with open('dell_monitors.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['date', 'country_code', 'country', 'page', 'desc', 'prod_name', 'size', 'model', 'delivery', 'price', 'url'], delimiter = ',')
        writer.writeheader()
        
        for key in sorted(countries):
            country_code = key.lower()
            country = countries[key]
            pagenum = 1      
            while pagenum < 6:
                url = "http://accessories.dell.com/sna/category.aspx?c="+country_code+"&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=" + str(pagenum)
                #HTTPCookieProcessor allows cookies to be accepted and avoid timeout waiting for prompt
                page = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url).read()
                soup = BeautifulSoup(page)           
                if soup.find("div", {"class":"rgParentH"}):
                    tablediv = soup.find("div", {"class":"rgParentH"})
                    tables = tablediv.find_all('table')
                    data_table = tables[0] # outermost table parent =0 or no parent
                    rows = data_table.find_all("tr")
                    
                    for row in rows:
                        rgDescription = row.find("div", {"class":"rgDescription"})
                        rgMiscInfo = row.find("div", {"class":"rgMiscInfo"})
                        pricing_retail_nodiscount_price = row.find("span", {"class":"pricing_retail_nodiscount_price"})

                        if rgMiscInfo: 
                            delivery = rgMiscInfo.get_text().encode('utf-8')
                        else:
                            delivery = ''
                            
                        if pricing_retail_nodiscount_price:
                            price = pricing_retail_nodiscount_price.get_text().encode('utf-8').replace(',','')
                        else:
                            price = ''
                            
                        if rgDescription:
                            desc = rgDescription.get_text().encode('utf-8')
                            prod_name = desc.split("-")[0].strip()
                            try:
                                size1 = [int(s) for s in prod_name.split() if s.isdigit()]
                                size = str(size1[0])
                            except:
                                size = 'unknown'
                            try:
                                model = desc.split("-")[1].strip()
                            except:
                                model = desc
                                
                            results = str(todaydate)+","+country_code+","+country+","+str(pagenum)+","+desc+","+prod_name+","+size+","+model+","+delivery+","+price+","+url
                            
                            file.write(results + '\n')
                    
                    pagenum +=1
                else:
                    #skip to next country
                    pagenum = 6 
                    continue

                
if __name__ == '__main__':
    main()


The Python code scraping output is attached here as a csv file.

The summary below lists the country codes, country names and the count of Dell monitor records scraped, using the country codes Wikipedia had for these countries.

af – Afghanistan – 7 records
ax – Aland – 7 records
as – American Samoa – 7 records
ad – Andorra – 7 records
aq – Antarctica – 7 records
ar – Argentina – 12 records
ra – Argentina – 7 records
ac – Ascension Island – 7 records
au – Australia – 36 records
at – Austria – 6 records
bd – Bangladesh – 7 records
be – Belgium – 6 records
bx – Benelux Trademarks and Design Offices – 7 records
dy – Benin – 7 records
bt – Bhutan – 7 records
rb – Bolivia – 7 records
bv – Bouvet Island – 7 records
br – Brazil – 37 records
io – British Indian Ocean Territory – 7 records
bn – Brunei Darussalam – 7 records
bu – Burma – 7 records
kh – Cambodia – 7 records
ca – Canada – 46 records
ic – Canary Islands – 7 records
ct – Canton and Enderbury Islands – 7 records
cl – Chile – 44 records
cn – China – 46 records
rc – China – 7 records
cx – Christmas Island – 7 records
cp – Clipperton Island – 7 records
cc – Cocos (Keeling) Islands – 7 records
co – Colombia – 44 records
ck – Cook Islands – 7 records
cu – Cuba – 7 records
cw – Curacao – 7 records
cz – Czech Republic – 6 records
dk – Denmark – 23 records
dg – Diego Garcia – 7 records
nq – Dronning Maud Land – 7 records
tp – East Timor – 7 records
er – Eritrea – 7 records
ew – Estonia – 7 records
fk – Falkland Islands (Malvinas) – 7 records
fj – Fiji – 7 records
sf – Finland – 7 records
fi – Finland – 5 records
fr – France – 17 records
fx – Korea – 7 records
dd – German Democratic Republic – 7 records
de – Germany – 17 records
gi – Gibraltar – 7 records
gr – Greece – 5 records
gl – Greenland – 7 records
wg – Grenada – 7 records
gu – Guam – 7 records
gw – Guinea-Bissau – 7 records
rh – Haiti – 7 records
hm – Heard Island and McDonald Islands – 7 records
va – Holy See – 7 records
hk – Hong Kong – 47 records
in – India – 10 records
ri – Indonesia – 7 records
ir – Iran – 7 records
ie – Ireland – 7 records
im – Isle of Man – 7 records
it – Italy – 1 records
ja – Jamaica – 7 records
jp – Japan – 49 records
je – Jersey – 7 records
jt – Johnston Island – 7 records
ki – Kiribati – 7 records
kr – Korea – 34 records
kp – Korea – 7 records
rl – Lebanon – 7 records
lf – Libya Fezzan – 7 records
li – Liechtenstein – 7 records
fl – Liechtenstein – 7 records
mo – Macao – 7 records
rm – Madagascar – 7 records
my – Malaysia – 25 records
mv – Maldives – 7 records
mh – Marshall Islands – 7 records
mx – Mexico – 44 records
fm – Micronesia – 7 records
mi – Midway Islands – 7 records
mc – Monaco – 7 records
mn – Mongolia – 7 records
mm – Myanmar – 7 records
nr – Nauru – 7 records
np – Nepal – 7 records
nl – Netherlands – 8 records
nt – Neutral Zone – 7 records
nh – New Hebrides – 7 records
nz – New Zealand – 37 records
rn – Niger – 7 records
nu – Niue – 7 records
nf – Norfolk Island – 7 records
mp – Northern Mariana Islands – 7 records
no – Norway – 19 records
pc – Pacific Islands – 7 records
pw – Palau – 6 records
ps – Palestine – 7 records
pg – Papua New Guinea – 7 records
pe – Peru – 43 records
rp – Philippines – 7 records
pi – Philippines – 7 records
pn – Pitcairn – 7 records
pl – Poland – 4 records
pt – Portugal – 7 records
bl – Saint Barthelemy – 7 records
sh – Saint Helena – 7 records
wl – Saint Lucia – 7 records
mf – Saint Martin (French part) – 7 records
pm – Saint Pierre and Miquelon – 7 records
wv – Saint Vincent – 7 records
ws – Samoa – 7 records
sm – San Marino – 7 records
st – Sao Tome and Principe – 7 records
sg – Singapore – 37 records
sk – Slovakia – 23 records
sb – Solomon Islands – 7 records
gs – South Georgia and the South Sandwich Islands – 7 records
ss – South Sudan – 7 records
es – Spain – 10 records
lk – Sri Lanka – 7 records
sd – Sudan – 7 records
sj – Svalbard and Jan Mayen – 7 records
se – Sweden – 6 records
ch – Switzerland – 21 records
sy – Syrian Arab Republic – 7 records
tw – Taiwan – 43 records
th – Thailand – 40 records
tl – Timor-Leste – 7 records
tk – Tokelau – 7 records
to – Tonga – 7 records
ta – Tristan da Cunha – 7 records
tv – Tuvalu – 7 records
uk – United Kingdom – 35 records
un – United Nations – 7 records
us – United States of America – 7 records
hv – Upper Volta – 7 records
su – USSR – 7 records
vu – Vanuatu – 7 records
yv – Venezuela – 7 records
vd – Viet-Nam – 7 records
wk – Wake Island – 7 records
wf – Wallis and Futuna – 7 records
eh – Western Sahara – 7 records
yd – Yemen – 7 records
zr – Zaire – 7 records

Grand Total – 1760 records

LinkedIn ‘People You May Know’ web scraping and analysis

A while back LinkedIn sneakily vacuumed up all of my contacts from my phone via the Android CardMunch app. It turns out CardMunch is owned by LinkedIn. There must be fine print somewhere that indicates they do this, but I sure didn’t see it.

Many of my contacts that were imported were out of date and in most cases not someone I wanted to be LinkedIn with. Some had actually passed away.

It took a mini-campaign of customer service interaction to get LinkedIn to delete the imported contacts.

Anyway, I discovered this had happened when large numbers of my contacts suddenly started showing up on my LinkedIn “People You May Know” (PYMK) page.

The PYMK page is a LinkedIn feature that identifies 1,000 people LinkedIn thinks you may know. LinkedIn identifies people you may know by matching contacts they vacuum up from everyone’s address books. They probably also match people on company name, profession, city, LinkedIn groups, etc.

When LinkedIn finally agreed to delete the contacts I monitored the PYMK page to make sure they were doing it and that it was permanent.

My monitoring was a mix of manual work and automation. I regularly manually downloaded and saved the PYMK webpage and extracted the names on the page to see if my stolen contacts were still on the page. The contacts were removed very quickly (thank you LinkedIn : )) but I continued downloading the PYMK page because I was curious to see how the names would change over time. I ended up downloading the page 29 times over a 3 month period.

I used Python and BeautifulSoup to process the downloaded PYMK html pages and scrape them for the data I wanted.

I used Excel add-in Power Query to shape the data and Excel Pivot tables and charts for the visualizations.

After I downloaded a new page I would run the code on a folder containing the PYMK web page files to produce a data file for analysis. I just wanted to see that my 2,000 imported contacts were deleted. Finally after a few weeks they were gone.

Here are some of the results.

Over the 3 month period about 6,300 unique people were on my PYMK page at least once.

The data I have is incomplete because it wasn’t a daily sample of the PYMK page. I downloaded the pages only 29 times over a 3 month period.

Even so it does give some relative information about people’s appearances on my PYMK page.

People’s appearances were not a contiguous series of days. There were gaps in appearances. LinkedIn appears to swap people in and out over a duration of days.

A Gantt chart style visualization made the pattern of people’s appearances obvious. The screenshot below shows an overview of a huge 6,300-row Gantt chart that has one unique person per row. The columns are the 29 downloads. The records are sorted descending by first date of appearance, so the most recent are on top.

The pink cells indicate that a person appeared on that downloaded PYMK page. Blue cells are where they did not appear on the page. At a quick glance you can easily see the regular patterns of appearances on the PYMK web page.

Over time, going from the bottom of the chart to the top (it is sorted by the date people were added to the page, with the most recently added on top), you can see people are always being introduced to the PYMK page. Some people added in the past continue to appear on the page and some appear a few times, never to reappear.

The varying patterns indicate that the methodology LinkedIn uses to select people to show on my PYMK page changed over time.

The two big columns of pink at the very bottom are the 2,000 people that were imported from my contact book. Most appeared for only the first few downloads and then LinkedIn deleted them, so they never appear again.

Gantt chart style presentation of all 6,300 people (one unique person per row). Records are sorted descending by first date of appearance, so the most recent are on top.
Click on image to open in new tab to view full size.

pymk_by_day_added

Using the first and last day people appeared in the 29 downloads I could calculate a ‘duration’ for each person. These durations are shown in the frequency distribution chart below.

duration freq

Many people appeared over most of the 3 months. About 50% remained on the page for more than 2 months. This would have changed had the sampling continued, e.g. more people may have remained on the page that long too.

However, the relative distribution does indicate that there is a split between people that stay on the page and those that are just passing through.

The bulk of the people who appear on the PYMK page only once are the 2,000 contacts that were imported and then deleted. Some of these appeared in the first PYMK page download and never again; some appeared in one or two subsequent downloads until they were all finally deleted.

What is interesting is that LinkedIn was not able to match these contacts to me using their other methods. That hints that the basic mechanism behind the PYMK matching is simple contact name matching between LinkedIn accounts.

Of the 29 downloaded PYMK pages, most people were on the page fewer than 9 times, as shown in the frequency distribution below. Daily sampling would likely see these counts increase, though I expect the relative distribution would be similar.

days freq

I created a ‘presence’ metric that is the number of appearances a person has relative to the total number of days between their first and last appearance. This is shown in the frequency distribution below. The big spike at 100% is the imported contacts, which showed up in only one download (and then were deleted from LinkedIn forever).

presence freq

Daily sampling would have seen the distribution shift to the right towards 100%. I would guess the peak of the distribution would move up to around 30%, i.e. most people appear about 30% of the time between their first and last appearance.
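The shaping and metrics above were done with Excel Power Query and pivot tables, but as a rough sketch the same duration and presence calculations could be done in pandas from a long-format table of appearances. The data below is made up for illustration:

import pandas as pd

# one row per (download_date, person) appearance extracted from the saved PYMK pages
# (made-up example data)
appearances = pd.DataFrame({
    'download_date': pd.to_datetime(['2015-01-02', '2015-01-06', '2015-01-20', '2015-01-02']),
    'person': ['Person A', 'Person A', 'Person A', 'Person B'],
})

per_person = appearances.groupby('person')['download_date'].agg(['min', 'max', 'count'])
# duration: days from first to last appearance (inclusive)
per_person['duration_days'] = (per_person['max'] - per_person['min']).dt.days + 1
# presence: share of the days in that window on which the person appeared
per_person['presence'] = per_person['count'] / per_person['duration_days']
print(per_person)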

The process used to scrape the downloaded PYMK web pages was the following:

  1. Go to the LinkedIn PYMK page and scroll down until all 1,000 contacts are shown. The page has ‘infinite scrolling’ that pages through the entire 1,000 contacts incrementally. I couldn’t find an easy way to get Python to do this automatically, which would have been nice.
  2. Save the resulting web page as an html file on my computer.
  3. Run the Python script with BeautifulSoup (below) to get the list of names.
  4. Compare the list of names day to day to see the changes.

Python code to get names from a single html file:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("pymk.htm"))
for card in soup.findAll('a', 'title'):
    for title in card.find_all('span', 'a11y-headline'):
        print str(card['title'].encode("utf-8")) + ' | ' + str(title.string.encode("utf-8"))

 

Python code to get names from a folder of html files:

import os
from bs4 import BeautifulSoup

path = str('C:/PYMK Page Downloads/')
for file in os.listdir(path):
    file_path = os.path.join(path, file)
    soup = BeautifulSoup(open(file_path))
    for card in soup.findAll('div', 'entityblock profile'):
        name2 = card.find("a", {"class":"title"})['title']
        name = name2.encode("utf-8")
        title2 = card.find("span", {"class":"a11y-headline"})
        title = title2.text.encode("utf-8")
        connections2 = card.find("span", {"class":"glyph-text"})
        if connections2 is None:
            connections = 0
        else:
            connections = connections2.text.encode("utf-8")
        print str(file) + ' | ' + str(name) + ' | ' + str(title) + ' | ' + str(connections)

Analysis of results