Categories
Amazon Web Services (AWS) BeautifulSoup Python

Scraping public health web page to get data

During the 2020 COVID-19 pandemic in Canada I wanted to get COVID-19 confirmed cases counts data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.

I wanted to be able to use this data for this Leaflet choropleth map visualization. If interested, read more details on this visualization in another blog post.

There are many ways to get data from web pages. First I did it manually by copy and paste into Excel. This is ok for a one time analysis. You can even use Excel Power Query web feature to automate this a bit more. However, if you want to fully automate getting data from a web page you should use web scraping techniques.

Note that the code described below is available in this Github repository https://github.com/sitrucp/covid_montreal_scrape_data.

Initial data retrieval and transformation done using Excel
To get the web page data, at first, I simply manually copied and pasted into an Excel workbook. This was quite easy to do as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do some more complex data transformations I switched to using Excel Power Query’s web query feature and Power Query to transform the data for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated, with the data retrieval and transformation process run on a scheduled basis.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

As the choropleth map visualization is hosted on Github.io I could have used Git on the web server to do an automated, scheduled push of new data from web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading the data to AWS S3 was done in Python. The Python code is run on a scheduled basis using a cron job on a web server. The cron job runs a few times each evening, when the data is usually updated on the website.
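For illustration, a crontab entry like the hypothetical one below would run the scraper a few times each evening (the schedule, paths and script name here are placeholders, not the ones used on the actual server):

# run the scraper at 6 pm, 8 pm and 10 pm server time
0 18,20,22 * * * python /home/user/covid_montreal_scrape_data/scrape_data.py >> /home/user/logs/scrape.log 2>&1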

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to an AWS S3 bucket and made available to the choropleth map visualization.

Python module Boto was used to upload the data from web server to an AWS S3 bucket.

Let's go through the code.

BeautifulSoup4 was used to get the web page and find the specific table that holds the tabular data, as below. The table with the counts by neighbourhood was the 4th table in the web page:

# imports needed for this snippet
import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = 'https://santemontreal.qc.ca/en/public/coronavirus-covid-19/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# get all tables on webpage
tables = soup.find_all('table')

# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and add more readable, consistent column headers, as below:

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]

# rename columns
df_table_data_all_cols.columns = ['region_name', 'case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']
df_table_data = df_table_data_all_cols[['region_name','case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']]

The web page table dataframe was merged with the “lookup” dataframe. This merging is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count','case_percent','case_per_100k','mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.

Of course, I could have modified the map’s geographical region boundary names to reflect Health Canada naming convention but creating a “lookup” table was easier and provided flexibility in case Health Montreal table’s names changed (which they did, in fact several times!).
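As a rough sketch of how such a lookup table can be loaded (the file name and the two example rows below are placeholders, not the actual values in the repository):

import pandas as pd

# one row per neighbourhood:
#   website_name = neighbourhood name as it appears on the Health Montreal table
#   geojson_name = matching region name in the Leaflet map's geoJSON boundary file
df_montreal_regions_lookup = pd.read_csv('montreal_regions_lookup.csv')

# equivalent inline example
df_montreal_regions_lookup = pd.DataFrame([
    {'website_name': 'Ahuntsic-Cartierville', 'geojson_name': 'Ahuntsic - Cartierville'},
    {'website_name': 'LaSalle', 'geojson_name': 'LaSalle'},
])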

The Python code does a check to see if the current data on the web page is the same as the previously retrieved data, because I only wanted to upload new data to AWS S3 if it was necessary. This check is done by comparing the total case count on the web page to the previous case count:

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    # create scrape result string to print to cron log
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev

    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')

    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)

    # write today's date to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)

    upload_to_aws()

If the counts are the same then the code stops. If the new count is different than the previous count, the code creates new data files that are uploaded to the AWS S3 bucket.

A scrape_results string is also created that is written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

# import needed for this snippet
from boto.s3.connection import S3Connection

## create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a Public Policy so anyone can read the data files. Each bucket file has a public url so the map visualization can simply reference these to get the data.

However, authentication is required in order to transfer the data from the web server to the S3 bucket. So there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects to do authentication.

An IAM User and User Policy were created to allow that User to write, read and delete on that bucket. The User has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course the key and secret should not be exposed, so they are imported into the Python code from another non-public location on the web server.
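For example, the keys can live in a small Python module outside the public web root that the script imports; the file path and placeholder values below are illustrative only:

# /home/secrets/canada_covid_aws_keys.py  (outside the public web root)
canada_covid_aws_keys = {
    'AWS_KEY': 'xxxxxxxxxxxxxxxxxxxx',
    'AWS_SECRET': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
}

# in the scraping script
import sys
sys.path.insert(0, '/home/secrets/')
from canada_covid_aws_keys import canada_covid_aws_keys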

Once the connection is made, the Python code deletes the existing files on S3 before uploading new files:

  
# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    bucket.delete_key(key)

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the new updated data next time it is viewed or browser page refreshed.

Categories
D3.js Geocoding Javascript Python

Choropleth map of Canada COVID-19 cases by health region using Leaflet and D3

During the 2020 COVID-19 pandemic in Canada I wanted to get a better understanding of the geographical distribution of COVID-19 cases across Canada. So I set about making a choropleth map visualization of confirmed COVID-19 case counts in Canada. I also created a separate choropleth map for Montreal, which is Canada’s COVID-19 “hotspot” with about 25-30% of Canada’s total confirmed COVID-19 cases.

View live Canada map here:
https://sitrucp.github.io/canada_covid_health_regions/index.html

View live Montreal map here:
https://sitrucp.github.io/canada_covid_health_regions/montreal/index.html

The only geographical boundaries with confirmed case counts for all of Canada that I could find were: 1) province/territory and 2) health region.

I chose to use health regions in these choropleth maps because there are already lots of maps by province/territory. The health regions are geographical boundaries defined by provincial health authorities, and they likely correspond roughly to population size.

I used the Leaflet.js open-source JavaScript library to create the interactive choropleth maps, D3.js to retrieve and transform the csv format data, and plain Javascript to retrieve the JSON geographic boundary files.

The confirmed COVID-19 case counts are available in csv file format from the COVID-19 Canada Open Data Working Group. The csv files are maintained on Github and updated daily, collating data from provinces and territories.

The health region geographical boundary descriptions were obtained primarily from the Statscan ArcGIS health region boundary Canada dataset. However, I needed to make some modifications to update the boundaries used by health regions, which is described in more detail in the Github repository README.

The biggest challenge in creating these choropleth maps was a data issue: relating the geographical boundary names to the confirmed case health region names. I basically needed to create a lookup table to match the different names in the boundary data file and the confirmed counts data file. See the Github repository for more details on boundary modifications and the relationship between boundary names and health region names.

Code for this project is maintained on github.com/sitrucp/canada_covid_health_regions.

Montreal COVID-19 confirmed case count
Categories
Power BI Python

How to use Python as a datasource for Power BI

Step 1 – Download the latest version of Power BI to get the new features (the version that came out in early August 2018).

Step 2 – Enable Python in “Options – Preview Features” by checking the “Python support” option. Close and restart Power BI.

Step 3 – Click “Get Data”, select “Other” and “Python script” and click “Connect” button.

Step 4 – Write or paste your Python script in the popup Python script window and click the OK button.

I chose to use Pandas to get csv file:

import numpy as np
import pandas as pd
data = pd.read_csv('C:/mydatafile.csv', delimiter=',')

Note that if you are using Windows as I am, you have to write the file path so Windows can recognize it, otherwise you will get an error. I used forward slashes, but you could also escape the backslashes.

You could also write any Python script that gets or generates data from any data source, eg a database, AWS S3 buckets, etc. Of course this also includes transforming data. The important thing to note is that your script must output one or more named datasets. The named dataset(s) are what Power BI will use when you are finished coding.
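For example, a script like the hypothetical one below would surface two named datasets, sales and sales_by_region, for Power BI to pick from (the file path and column names are made up for illustration):

import pandas as pd

# each top-level DataFrame variable becomes a named dataset in Power BI
sales = pd.read_csv('C:/data/sales.csv')
sales_by_region = sales.groupby('region', as_index=False)['amount'].sum()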

Step 5 – After clicking ok, you get the familiar Power Query data source selection window. Select Python and you will see your named data set(s). Just check the one(s) you want and click ok.

Now just continue working with Power BI as normal! You can always go back to edit the Python script by selecting “Source” in the query editor.

Categories
Python Tableau

BC Executive Compensation Tableau Visualization

The Province of British Columbia publishes public agency executive compensation each year. The data is grouped by:

  • Sector – Employers are categorized in one of six sectors – Colleges, Crown Corporations, Education, Health, Public Service, Universities
  • Employer – Employers belong to one Sector. There are about 190 employers, such as BC Assessment Authority, BC Housing Management Commission, BC Hydro, etc.
  • Employee – An employee belongs to one Employer, and the employee’s full name is published.

The employee’s salary is given as:

  • Base Salary
  • Total Compensation

The data is provided each year as pdf and csv data files. I downloaded csv data files for each year from 2012 to 2016 and combined them into one dataset.

Here is a link to the Tableau Public Dashboard visualizing the data; the chart is embedded below.

 

I used Python to combine the data. Some of the Sector values were missing spaces, one of the years had a “$” symbol in the salary column values which had to be removed, and I added a new year column based on each data file’s name.

Here is the Python code used to combine and clean up the data:

import csv
import pandas as pd
from datetime import date, datetime
import glob
import os

def main():

    ## path for data files
    data_path = 'C:/bc_salary_data/'

    allFiles = glob.glob(data_path + "*.csv")
    frame = pd.DataFrame()
    list_ = []

    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None, header=0)
        df['filename'] = os.path.basename(file_).split('.')[0]
        ## fix Sector values that were missing spaces
        df['Sector'] = df['Sector'].str.replace(' ','').str.replace('CrownCorporations', 'Crown Corporations').str.replace('CrownCorporation', 'Crown Corporations').str.replace('PublicService', 'Public Service')
        ## year comes from the data file's name
        df['year'] = df['filename'].str[-4:]
        ## remove the "$" symbol from salary values (escaped because str.replace treats the pattern as a regex)
        df['Base Salary'] = df['Base Salary'].apply(str).str.replace('\$','')
        df['Total Compensation'] = df['Total Compensation'].apply(str).str.replace('\$','')
        list_.append(df)

    frame = pd.concat(list_)

    ## Create csv file to write results to
    frame.to_csv('merged.csv')

    print str(datetime.now()) + ' data files merged'

    return

if __name__ == '__main__':
    main()

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 

Categories
Amazon Web Services (AWS) API Chart.js Python Twitter API

Twitter Followers and List Membership Tracking

I have created my own historical reporting of my Twitter account followers and list memberships. I have hosted the reporting on this open webpage.

Most people know what followers are but the list memberships aren’t as well known so here are definitions of them.

Twitter Followers – Twitter followers are other Twitter accounts who follow my account. A Twitter Follower may follow my account for a few days and then unfollow me at any time afterwards.

Twitter List Memberships – Twitter accounts can also create their own Lists and then add other Twitter accounts to the lists. For example a Twitter account may create a List called “Tweet about Big Data” to track Twitter accounts they believe “Tweet about Big Data”. Twitter accounts may add my account to their List and remove it at any time afterwards.

The Twitter data retrieval, writing to database, and data querying are all done on a web server.

In order to report changes in these counts you need daily historical data. The Twitter API doesn’t provide historical data, so you need to create it yourself by retrieving and saving daily data somewhere.

Three Python scripts using the Twitter API, Python Tweepy and the AWS SDK are scheduled to run daily using cron jobs.

Two scripts retrieve followers and list memberships and insert the data into a PostgreSQL database. This daily retrieval builds the daily history.
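The followers table behind the INSERT in the code below only needs a few columns; a minimal sketch of its DDL (the column types are my assumption, not taken from the post), run with the same psycopg2 connection the scripts below use, could be:

create_followers_table = """
    CREATE TABLE IF NOT EXISTS twitter_followers (
        date date,
        follower_id text,
        name text,
        screen_name text
    );
"""
cursor.execute(create_followers_table)
conn.commit()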

Another script queries the database tables to create reporting datasets of new, active and dropped followers and list memberships. These are exported as csv files to an AWS S3 folder which also holds the files for an AWS S3-hosted static website.
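As one hedged example of such a query (the actual SQL used is not shown in this post), dropped followers can be found by comparing the latest day’s snapshot against the previous day’s, using the same psycopg2 cursor as the reporting script below:

dropped_followers_sql = """
    SELECT prev.screen_name
    FROM twitter_followers prev
    LEFT JOIN twitter_followers curr
      ON curr.follower_id = prev.follower_id
     AND curr.date = (SELECT max(date) FROM twitter_followers)
    WHERE prev.date = (SELECT max(date) FROM twitter_followers) - 1
      AND curr.follower_id IS NULL;
"""
cur.execute(dropped_followers_sql)
dropped = cur.fetchall()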

The AWS S3-hosted static website uses the Chart.js and D3.js Javascript charting libraries to read and visualize the data. This post does not describe how to read the csv files, but I have written another post that describes using an AWS S3 csv file as a D3 report data source.

A screenshot of one of the visualizations showing active followers by day, with counts of new follows and unfollows by day is shown below.

twitter-follower-chart

A second screenshot of another visualization showing active list memberships by day, with counts of new listings and un-listings by day, is shown below.

twitter-list-membership-chart

The Python code to retrieve the Twitter data, transform it and create csv files and upload them to AWS is below.

The code to retrieve Twitter followers:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get secrets
sys.path.insert(0, '/home/secrets/')
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_followers():

    twitter_followers = []
    
    for user in tweepy.Cursor(api.followers, screen_name="twitterusername").items():
        twitter_followers.append(user)
        
    for user in twitter_followers:
        query = "INSERT INTO twitter_followers \
                (date, \
                follower_id, \
                name, \
                screen_name) \
                VALUES \
                (%s, %s, %s, %s)";
        data = (today,
                str(user.id).strip(),
                str(user.name.encode('utf8','ignore')).replace(',','').strip(),
                str(user.screen_name.encode('utf8','ignore')).strip()
                )
        cursor.execute(query, data)
        conn.commit()

    conn.close()
    
    ## print job status to log
    print str(datetime.now()) + ' twitter followers'    
            
if __name__ == '__main__':
	get_followers()

The code to retrieve Twitter list memberships:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get database creds
sys.path.insert(0, '/home/secrets/') 
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_list_memberships():

    twitter_list_memberships = []
    
    for list in tweepy.Cursor(api.lists_memberships, screen_name="twitterusername").items():
        twitter_list_memberships.append(list)
        #print list.full_name
        
    for list in twitter_list_memberships:

        query = "INSERT INTO twitter_list_memberships \
                (date, \
                id_str, \
                name, \
                full_name, \
                member_count, \
                subscriber_count, \
                mode, \
                following, \
                user_screen_name) \
                VALUES \
                (%s, %s, %s, %s, %s, %s, %s, %s, %s)";
        data = (today,
                list.id_str.encode('utf8','ignore'),
                list.name.encode('utf8','ignore'),
                list.full_name.encode('utf8','ignore'),
                list.member_count,
                list.subscriber_count,
                list.mode,
                list.following,
                list.user.screen_name.encode('utf8','ignore'))
        cursor.execute(query, data)
        conn.commit()
        
    conn.close()
    
    ## print status for log
    print str(datetime.now()) + ' twitter list memberships'    
            
if __name__ == '__main__':
	get_list_memberships()

The code to create csv files and upload to AWS S3 bucket:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from datetime import datetime
from os import listdir
from os.path import isfile, join
import sys
import csv
import psycopg2
import re
from collections import Counter

upload_path = '/home/twitter/aws_s3_site/'

sys.path.insert(0, '/home/secret/')
from aws_keys import aws_keys
from secrets import secrets

## create aws S3 connection
conn = S3Connection(aws_keys['AWS_KEY'], aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('twitter-bucket', validate=False)

## create db connection
conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cur = conn.cursor()

## get data sets from db

## followers
cur.execute ("select something here;")
followers = list(cur.fetchall())

## lists
cur.execute ("select something here;")
lists = list(cur.fetchall())

conn.close()

## write db data to csv files, save in upload folder

## followers
with open(upload_path + 'followers.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in followers:
        writer.writerow(row)

## lists
with open(upload_path + 'lists.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in lists:
        writer.writerow(row)
            	
## upload csv files to S3 twitter-bucket
    
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing bucket keys to reset
for key in bucket.list():
    if '.csv' in key.name:
        bucket.delete_key(key)
  
# complete uploading to AWS
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)
    
## write success to cron log
print str(datetime.now()) + ' aws s3 twitter upload'

Categories
Django Geocoding Python Web Development

Django form geocode submitted address to get lat, lon and postal code

One of my Django applications includes a form where a user can enter and submit a property address.

django form

The user submitting the form might not know the postal code so I left it optional. However, the postal code is a key piece of information for this particular application, so I wanted to ensure that I was getting it.

I also wanted to geocode the address immediately to get the address latitude and longitude so it could be shown to user on a Leaflet.js map.

There are lots of free geocoding services for low-intensity usage, but I ended up using Google Geocoding, which is free up to a certain usage level. You just need to create a Geocoding API project and use the credentials to set up the geocoding.

To interact with the geocoding API I tried the Python geocoding modules geopy and geocoder, but in the end I just used the Python Requests module instead as it was less complicated.

When the user clicks the Submit button, behind the scenes, Requests submits the address to Google’s Geocoding API and gets the JSON response containing the latitude, longitude and postal code, which are then written to the application database.

I will update the code in future to check if the user’s postal code is correct and replace it if it is incorrect. I will wait to see how the postal code accuracy looks, since making geocoding API requests too often could bump me over the free usage limit.

The Django View that contains the code described above is shown below.

def property_add(request):
   
    property_list = Property.objects.filter(user_id=request.user.id).order_by('created')
    
    if request.method == 'POST':
        form = PropertyForm(request.POST)
        if form.is_valid():
            new_property = form.save(commit=False)
            address = "%s, %s, %s, %s" % (new_property.address1, new_property.city, new_property.state, new_property.postcode)
            google_geocode_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'
            url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + "'" + address + "'" + '&key=' + google_geocode_key
            
            try:
                response = requests.get(url)
                geoArray = response.json()
                new_property.lat = geoArray['results'][0]['geometry']['location']['lat']
                new_property.lon = geoArray['results'][0]['geometry']['location']['lng']
                # postal code assumed to be the 8th address component in the geocode response
                new_postcode = geoArray['results'][0]['address_components'][7]['long_name']
                new_fsa = geoArray['results'][0]['address_components'][7]['short_name'][:3]
            except:
                new_property.lat = None
                new_property.lon = None
                new_postcode = None
                new_fsa = None
           
            if new_property.postcode:
                new_property.fsa = new_property.postcode[:3]
            else:
                new_property.postcode = new_postcode
                new_property.fsa = new_fsa
           
            new_property.user_id = request.user.id
            new_property = form.save()
            return HttpResponseRedirect(reverse(property, args=(new_property.pk,)))
    else:
        form = PropertyForm()

    context_dict = {
        'form': form, 
        'property_list': property_list,
    }
        
    return render(
        request,
        'property_form.html',
        context_dict,
        context_instance = RequestContext(
            request,
            {
                'title':'Add Property',
             }
            )
    )    

 

Categories
Django Geocoding Javascript Postgres Python Web Development

Leaflet.js choropleth map color by count using geoJSON datasource

I have a Django web application that needed an interactive map with shapes corresponding to Canadian postal code FSA areas, colored differently based on how many properties were in each FSA. It ended up looking something like the screenshot below.

map1

This exercise turned out to be relatively easy using the awesome open-source Javascript map library Leaflet.js.

I used this Leaflet.js tutorial as the foundation for my map.

One of the biggest challenges was finding a suitable data source for the FSAs. Chad Skelton, a (now former) data journalist at the Vancouver Sun, wrote a helpful blog post about his experience getting a suitable FSA data source. I ended up using his BC FSA data source for my map.

Statistics Canada hosts Canada Post FSA boundary files for all of Canada. As Chad Skelton notes, these have boundaries that extend out into the ocean, among other challenges.

Here is a summary of the steps that I followed to get my choropleth map:

1. Find and download FSA boundary file. See above.

2. Convert FSA boundary file to geoJSON from SHP file using qGIS.

3. Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

4. Create the Leaflet.js map in an HTML page: basically the HTML DIV that holds the map and a separate Javascript script that loads Leaflet.js and the FSA geoJSON boundary data and processes it to create the desired map.

Find and download FSA boundary file.

See above.

Convert FSA boundary file to geoJSON from SHP file using qGIS.

Go to http://www.qgis.org/en/site/ and download qGIS. It’s free and open source.

Use qGIS to convert the data file from Canada Post or another source to geoJSON format. Lots of blog posts and documentation about how to use qGIS for this are just a Google search away.

My geoJSON data source looked like this:

var bcData = {
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::4269" } },
    "features": [
    { "type": "Feature", "properties": { "CFSAUID": "V0A", "PRUID": "59", "PRNAME": "British Columbia \/ Colombie-Britannique" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -115.49499542, 50.780018587000029 ], [ -115.50032807, 50.77718343600003 ], [ -115.49722732099997, 50.772528975000057 ], [ -115.49321284, 50.770504059000075 ], [ -115.49393662599999, 50.768143038000062 ], [ -115.50289288699997, 50.762270941000054 ], [ -115.50846411599997, 50.754243300000041 ], [ -115.5104796, 50.753297703000044 ], [ -115.51397592099994, 50.748953800000038 ], [ -115.51861431199995, 50.745737989000077 ], [ -115.52586378899997, 50.743771099000071 ], [ -115.53026371899995, 50.74397910700003 ], [ -115.53451319199996,

 

Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

I used a SQL query in the Django View to get count of properties by FSA.

This dataset looks like this in the template. These results have only one FSA; if there were more, it would have more FSA / count pairs.

   var fsa_array = [["V3J", 19]];

Below is the code for the Django view query to create the fsa_array FSA / counts data source.

    cursor = connection.cursor()
    cursor.execute(
    "select fsa, count(*) \
    from properties \
    group by fsa \
    order by fsa;")
    fsas_cursor = list(cursor.fetchall())

    fsas_array = [(x[0].encode('utf8'), int(x[1])) for x in fsas_cursor]

My Javascript largely retains the Leaflet tutorial code with some modifications:

1. How the legend colors and intervals are assigned is changed but otherwise legend functions the same.

2. Significantly changed how the color for each FSA is assigned. The tutorial had the color in its geoJSON file so it only had to be referenced directly. My colors were coming from the View, so I changed the code to include a new function that matches FSAs in both my Django view data and the geoJSON FSA boundary file and returns the appropriate color based on the Django view dataset count.


var map = L.map('map',{scrollWheelZoom:false}).setView([ active_city_center_lat, active_city_center_lon], active_city_zoom);

map.once('focus', function() { map.scrollWheelZoom.enable(); });

var fsa_array = fsas_array_safe;

L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw', {
    maxZoom: 18,
    attribution: 'Map data © OpenStreetMap contributors, ' +
        'CC-BY-SA, ' +
        'Imagery © Mapbox',
    id: 'mapbox.light'
}).addTo(map);

// control that shows state info on hover
var info = L.control();

info.onAdd = function (map) {
    this._div = L.DomUtil.create('div', 'info');
    this.update();
    return this._div;
};

info.update = function (props) {
    this._div.innerHTML = (props ?
        '' + props.CFSAUID + ' ' + getFSACount(props.CFSAUID) + ' lonely homes' 
        : 'Hover over each postal area to see lonely home counts to date.');
};

info.addTo(map);

// get color 
function getColor(n) {
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c' 
           : n > 20 ? '#fd8d3c'
           : n > 15  ? '#feb24c'
           : n > 10  ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';
}     

function getFSACount(CFSAUID) {
    var fsaCount;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            fsaCount = ' has ' + fsa_array[i][1];
            break;
        }
    }
    if (fsaCount == null) {
         fsaCount = ' has no '; 
    }
    return fsaCount;
}

function getFSAColor(CFSAUID) {
    var color;
    for (var i = 0; i < fsa_array.length; i++) {
    if (fsa_array[i][0] === CFSAUID) {
        color = getColor(fsa_array[i][1]);
        //console.log(fsa_array[i][1] + '-' + color)
        break;
        }
    }
    return color;
}
    
function style(feature) {
    return {
        weight: 1,
        opacity: 1,
        color: 'white',
        dashArray: '3',
        fillOpacity: 0.7,
        fillColor: getFSAColor(feature.properties.CFSAUID)
    };
}

function highlightFeature(e) {
    var layer = e.target;
    layer.setStyle({
        weight: 2,
        color: '#333',
        dashArray: '',
        fillOpacity: 0.7
    });

    if (!L.Browser.ie && !L.Browser.opera) {
        layer.bringToFront();
    }

    info.update(layer.feature.properties);
}

var geojson;

function resetHighlight(e) {
    geojson.resetStyle(e.target);
    info.update();
}

function zoomToFeature(e) {
    map.fitBounds(e.target.getBounds());
}

function onEachFeature(feature, layer) {
    layer.on({
        mouseover: highlightFeature,
        mouseout: resetHighlight,
        click: zoomToFeature
    });
}

geojson = L.geoJson(bcData, {
    style: style,
    onEachFeature: onEachFeature
}).addTo(map);

var legend = L.control({position: 'bottomright'});

legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend'),
        grades = [0, 1, 5, 10, 15, 20, 25, 30],
        labels = [],
        from, to;

    for (var i = 0; i < grades.length; i++) {
        from = grades[i];
        if (i === 0) {
            var_from_to = grades[i];
            var_color = getColor(from);
        } else {
            var_from_to =  from + (grades[i + 1] ? '–' + grades[i + 1] : '+') ;
            var_color = getColor(from + 1);
        }
        
        labels.push(
            '<i style="background:' + var_color + '"></i> ' +
            var_from_to);
    }

    div.innerHTML = labels.join('<br>');
    return div;
};

legend.addTo(map);

That is pretty much all there is to creating very nice looking interactive free open-source choropleth maps for your Django website application!

Categories
Postgres Python

Python get image color palette

I created a web application that included screenshots of about 190 countries’ national statistics agencies’ website home pages. I created the site to host the results of comparisons of each country’s website home page features.

One of the features I wanted to include was the top 5 colors used on each home page. For example here is an image of the Central Statistical Office of Zambia website home page.

zambia

I wanted a list of the top 5 colors used in the web page; in my case I wanted these as a list of rgb color values that I could save to a database and use to create a color palette image.

For example the Central Statistical Office of Zambia website home page top 5 rgb colors were:

  1. 138, 187, 223
  2. 174, 212, 235
  3. 101, 166, 216
  4. 93, 92, 88
  5. 242, 245, 247

These 5 colors were not equally distributed on the home page and, when plotted as a stacked bar plot and saved as an image, looked like this:

zambia

How to identify top 5 colors on webpage

Identifying dominant colors in an image is a common task for a variety of use cases so there was a number of options available.

The general technique essentially involves reducing an image to a list of pixels and then identifying each pixel’s color and the relative proportion of that color to all other colors and then taking the top 5 pixel counts to get top 5 colors.

However, a web page can have many similar colors, for example many shades of blue. So identification of colors involves a statistical categorization and enumeration to group similar colors into one category, for example grouping all shades of blue into one ‘blue’ category.

There are a couple of different statistical methods to do this categorization that are discussed below.

Modified Median Cut Quantization

This method involves counting image pixels by color and charting them on a histogram from which peaks are counted to get dominant colors.

This method is used in a Python module called color-thief-py. This module uses Pillow to process the image and applies modified median cut quantization.

I used this module in the code below where I loop through a folder of screenshots to open image and then pass it to color-thief-py to process. Then the top 5 colors’ rgb strings are written into my database as a list so they can be used later.

    import os, os.path
    from PIL import Image
    import psycopg2
    from colorthief import ColorThief

    conn_string = \
        "host='localhost' \
        dbname='databasename' \
        user='username' \
        password='userpassword'" 
        
    conn = psycopg2.connect(conn_string)     
    cur = conn.cursor()

    ## dev paths
    screenshots_path = 'C:/screenshots/'

    screenshots_dir = os.listdir(screenshots_path)
    for screenshot in screenshots_dir:
        if screenshot != 'Thumbs.db':
            img = Image.open(screenshots_path + screenshot)
            width, height = img.size
            quantized = img.quantize(colors=5, kmeans=3)
            palette = quantized.getpalette()[:15]
            convert_rgb = quantized.convert('RGB')
            colors = convert_rgb.getcolors(width*height)
            # sort colors by pixel count, descending, then keep just the rgb values
            colors = sorted(colors, reverse=True)
            color_str = str([x[1] for x in colors])
            print screenshot + ' ' + str(img.size[1]) + ' ' + color_str
            
            cur.execute("UPDATE screenshots \
            set color_palette = %s,  \
            height = %s \
            WHERE filename like %s", \
            (str(color_str),img.size[1], '%' + screenshot + '%',))
            conn.commit()
        
    cur.close()
    conn.close()
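Note that the loop above ends up calling Pillow’s quantize() directly; color-thief-py itself exposes a small API that could be used on the same image file instead, roughly like this:

from colorthief import ColorThief

color_thief = ColorThief(screenshots_path + screenshot)
# list of (r, g, b) tuples for the dominant colors
palette = color_thief.get_palette(color_count=5)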

K-means clustering

Another statistical method to identify dominant colors in an image is to use K-means clustering to group image pixels by rgb values into centroids. Centroids with the highest counts are identified as dominant colors.

This method is used in the process I found on this awesome blog post on pyimagesearch.com.

This method also uses the Python module scikit-learn’s k-means implementation to identify the dominant colors. It produced very similar results to the other method.

The code from the pyimagesearch website used the Python module matplotlib to create a plot of the palette and save these as images. I modified the code to simply save the rgb color values as strings to insert into the database and save the matplotlib palette plot as an image. I added plt.close() after each loop iteration to close the rendered plot after the plot image was saved, because if they aren’t closed they accumulate in memory and crash the program.

# USAGE
# python color_kmeans.py --image images/jp.png --clusters 3

# import the necessary packages
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import argparse
import utils
import cv2
import os, os.path
import csv

# construct the argument parser and parse the arguments
#ap = argparse.ArgumentParser()
#ap.add_argument("-i", "--image", required = True, help = "Path to the image")
#ap.add_argument("-c", "--clusters", required = True, type = int, help = "# of clusters")
#args = vars(ap.parse_args())

screenshots_path = 'screenshots/'

screenshot_palette = list()

## Create csv file to write results to
file = csv.writer(open('palettes.csv', 'wb'))
file.writerow(['screenshot','palette'])

screenshots_dir = os.listdir(screenshots_path)
for screenshot in screenshots_dir:
    if screenshot != 'Thumbs.db':
        print screenshot

        # load the image and convert it from BGR to RGB so that
        # we can display it with matplotlib
        image = cv2.imread(screenshots_path + screenshot)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # show our image
        #plt.figure()
        #plt.axis("off")
        #plt.imshow(image)
        #plt.close()

        # reshape the image to be a list of pixels
        image = image.reshape((image.shape[0] * image.shape[1], 3))

        # cluster the pixel intensities
        clt = KMeans(n_clusters = 3)
        clt.fit(image)

        # build a histogram of clusters and then create a figure
        # representing the number of pixels labeled to each color
        hist = utils.centroid_histogram(clt)
        bar = utils.plot_colors(hist, clt.cluster_centers_)
        #color_palette = utils.plot_colors(hist, clt.cluster_centers_)
        #print color_palette
        #row = (screenshot, [tuple(x) for x in color_palette])
        #screenshot_palette.append(row)
        #print row
        
        #print row[0]
        #print row[1]
        
        # render the color bar and save it as an image
        plt.figure()
        plt.axis("off")
        plt.imshow(bar)
        # plt.show() omitted so the batch loop doesn't block on each figure
        plt.savefig('palettes/' + screenshot)
        plt.close()
       
        #file.writerow([
        #        row[0].encode('utf-8', 'ignore'),
        #        row[1],
        #        ])
Categories
Django Python

Django recreate database table

Django’s makemigrations and migrate commands are very useful to update existing database tables to reflect model changes.

However, if you have made many column name changes to an existing table, makemigrations will ask you a series of ‘y/N’ questions about which column names are changed. This can be tedious to cycle through, especially if there are many changes.

Depending on the relationships your table has, it may be easier and quicker to:

  • Create a backup of the table by copying and renaming table or exporting table data to csv
  • Drop the table
  • Recreate table from scratch
  • Reload data into the new updated table

The question is how to recreate the table?

After you drop the table you can remove your table model from the models.py file, then run makemigrations and then migrate --fake, which is the trick to get past migrate wanting your table to exist before it can delete it.

Then, after you run migrate --fake, you can put your updated model for the table back into your models.py, run makemigrations and migrate again, and you will get your new updated table recreated in the database.
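A rough sketch of the command sequence (assuming the standard manage.py setup) is:

# 1. back up and drop the table, remove the model from models.py, then:
python manage.py makemigrations
python manage.py migrate --fake

# 2. put the updated model back into models.py, then:
python manage.py makemigrations
python manage.py migrate

# 3. reload the backed up data with SQL INSERT or a data import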

Then you can recover your data from the backup with SQL INSERT or by using database data import feature.

 

Categories
Python Tableau Twitter API

Canadian TCS FDI Officers Twitter list member analysis

Blog post updated to add the following:

  • Updated Python “TCS Members Details” code to get additional information from the List members’ profiles.
  • New Python code “TCS Members Tweets” to download all List members’ Tweets. Will provide some more analysis on this new data soon.

————————————————–

The Canadian Trade Commissioner Service maintains a Twitter List named CDN TCS FDI Officers that has a bunch of Canadian Trade Commissioners as members. Investment Canada also has a similar list.

I was interested to learn how many people were on these lists, how long they had been on the lists, and what they were Tweeting about.

So I used the Twitter API and Python Tweepy to retrieve data about list members including:

  • screen_name
  • name
  • followers_count
  • friends_count
  • statuses_count (# of tweets)
  • favourites_count
  • created_at (Twitter account create date)
  • account_age_days (calculated as # days from July 9, 2016)
  • listed_count
  • verified
  • profile_image_url
  • profile_sidebar_fill_color
  • profile_text_color
  • profile_image_url_https
  • profile_use_background_image
  • default_profile_image
  • profile_sidebar_border_color
  • profile_background_color
  • profile_link_color

You can find complete definitions of these fields over at Twitter.

The data was output as a simple csv file that I used to create a Tableau Public dashboard which is embedded below. The csv file is attached here.

The Tableau dashboard is interactive and can be sorted by any of the columns by using the sort icon which appears when you hover over the top of column as illustrated in screenshot below.

tcs_click_to_sort

It would be interesting to try to determine if this Twitter activity has measurable impacts on FDI and Canadian trade. For example, perhaps foreign investment finds its way to Canada after a prospect reads a Tweet by one of our Trade Commissioners.

This would require that TCS maintains a CRM (client relationship manager) system and process that records lead sources.

There is some disparity in the use of Twitter by the CDN TCS FDI Officers list members, as shown by Tweets/Day, which is total Tweets divided by the number of days since the account was created. If there is a measurable lift in lead generation from Twitter use then this would be an actionable metric.
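Tweets/Day can be computed directly from two of the columns in the csv file produced by the code below, for example:

import pandas as pd

members = pd.read_csv('tcs_member_details.csv')
# total tweets divided by days since the account was created
members['tweets_per_day'] = members['statuses_count'] / members['account_age_days']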

For the technically minded, the Python code is shown below. Note that you need a Twitter API account to use this code.

There is another file, tweet_auth (not shown here), that contains the Twitter OAuth credentials. It looks like the following; you just have to replace the ‘xxx…’ values with your credentials.

Here is the “TCS Member Details” code:

 

 #twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, 'path/to/folder/with/your/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need access token bc not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
     #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    today = datetime.now().date()
            
    tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')

    member_details_csv = csv.writer(open('tcs_member_details.csv', 'wb'))
    member_details_csv.writerow([
        'screen_name',
        'name',
        'followers_count',
        'friends_count',
        'statuses_count',
        'favourites_count',
        'created_at',
        'account_age_days',
        'time_zone',
        'listed_count',
        'profile_image_url',
        'profile_sidebar_fill_color',
        'profile_text_color',
        'profile_image_url_https',
        'profile_use_background_image',
        'default_profile_image',
        'verified',
        'profile_sidebar_border_color',
        'profile_background_color',
        'profile_link_color'])

    members = []
    member_tweets = []
        
    for member in tcs_list_members.items():
        
        member_details_csv.writerow([
            member.screen_name.encode('utf-8', 'ignore'),
            member.name.encode('utf-8', 'ignore'),
            member.followers_count,
            member.friends_count,
            member.statuses_count,
            member.favourites_count,
            member.created_at,
            (today-member.created_at.date()).days,
            member.time_zone,
            member.listed_count,
            member.profile_image_url,
            member.profile_sidebar_fill_color,
            member.profile_text_color,
            member.profile_image_url_https,
            member.profile_use_background_image,
            member.default_profile_image,
            member.verified,
            member.profile_sidebar_border_color,
            member.profile_background_color,
            member.profile_link_color
            ])
            

It would be interesting to see the Tweet topics, other Twitter user mentions, links to webpages, etc. So the next step is to loop through each list member’s api.user_timeline to retrieve their Tweet content and do some analysis on it. For now here is the code; some analysis and visualization in Tableau will come later.

Here is the “TCS Members’ Tweets” Python code:

    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, '/path/to/your/folder/with/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
    #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    #api = tweepy.API(auth)

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    today = datetime.now().date()

    def get_list_member_tweets():
        ## get invest_canada and cdn-tcs-fdi-officers list members
        tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')
      
        ## open csv file 
        member_tweets_csv = csv.writer(open('tcs_member_tweets.csv', 'wb'))
        
        ## write header row column names
        member_tweets_csv.writerow([
            'status_id',
            'date_time_UTC',
            'screen_name',
            'hashtags',
            'user_mentions',
            'tweet_text'
            ])
        
        ## loop through list members and get their tweets
        for member in tcs_list_members.items():
            ## get list member tweets
            member_tweets = get_member_tweets(member.screen_name)
            
            for status in member_tweets:

                ## reset lists each tweet so values don't carry over from the previous tweet
                hashtags = []
                user_mentions = []

                ## check tweets for hashtags
                if status.entities['hashtags']:
                    for hashtag in status.entities['hashtags']:
                        hashtags.append(hashtag['text'].encode('utf8','ignore'))

                ## check tweets for user_mentions
                if status.entities['user_mentions']:
                    for user_mention in status.entities['user_mentions']:
                        user_mentions.append(user_mention['screen_name'].encode('utf8','ignore'))

                ## write to csv file
                member_tweets_csv.writerow([
                    status.id,
                    str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)),
                    member.screen_name,
                    hashtags,
                    user_mentions,
                    status.text.replace('\n',' ').replace('\r',' ').encode('utf8','ignore')
                    ])

    def get_member_tweets(screen_name):
        ## get list member's tweets
        
        alltweets = []

        ## can only get max 200 tweets at a time
        new_tweets = api.user_timeline(screen_name = screen_name, count=200)
        alltweets.extend(new_tweets)

        ## get oldest tweet already retrieved
        oldest = alltweets[-1].id - 1

        ## iteratively get remaining tweets
        while len(new_tweets) > 0:
            new_tweets = api.user_timeline(screen_name = screen_name, count=200, max_id=oldest)
            alltweets.extend(new_tweets)
            oldest = alltweets[-1].id - 1
        
        ## print out member and # tweets retrieved to show progress
        print screen_name + " %s tweets downloaded" % (len(alltweets))
        
        ## return all tweets
        return alltweets

    if __name__ == '__main__':
        get_list_member_tweets()

Full code on Github:

https://github.com/sitrucp/tcs_fdi_list_twitter_data

Categories
API Google Adsense Python

How to use Google Adsense API to download Adsense data

Google’s APIs make it easy to download Adsense (or any other Google service) data. The code below downloads Adsense data and saves the results to a csv file.

The code uses Google’s AdSense Management API, OAuth 2.0 authorization and the google-api-python-client SDK.

Overview

When you run this code for the first time it will open a web browser to get approval for the API application to access your Adsense account data.

When this approval is granted the code saves a refresh token that is used for all future executions of the code instead of requiring the web browser approval. This is handy if you want to run this code using a cron job like I do.

Here is the code

In summary, the code does the following:

  • Authenticates against API application
  • Queries the scope to get list of accounts
  • Loops through accounts
  • Returns requested metrics and dimensions from Adsense account
  • Writes results to csv file
 
import csv
from datetime import datetime, timedelta
from apiclient import sample_tools
from apiclient.discovery import build
from oauth2client import client

todaydate = datetime.today().strftime('%Y-%m-%d')

def main():    
    ## Authenticate and construct service
    scope = ['https://www.googleapis.com/auth/adsense.readonly']
    client_secrets_file = 'client_secrets.json' 
    
    ## Authenticate and construct service
    service, flags = sample_tools.init('', 'adsense', 'v1.4', __doc__, client_secrets_file, parents=[], scope=scope)
    
    ## Get accounts
    accounts = service.accounts().list().execute()
     
    try:
        ## Get account(s) data
        results = service.accounts().reports().generate(
            accountId='pub-xxxxxxxxxxxxxxxxxx',
            startDate='2012-01-01', # choose your own start date
            endDate= todaydate,
            metric=['EARNINGS', 'CLICKS','AD_REQUESTS'],
            dimension=['DOMAIN_NAME','DATE','AD_FORMAT_NAME']).execute()
    
    except client.AccessTokenRefreshError:
        print ('The credentials have been revoked or expired, please re-run the '
           'application to re-authorize')
           
    ## output results into csv
    header = ['hostname','date','type','earnings','clicks','ad_requests']
    
    with open('output_adsense.csv', 'wb') as file:
        file.write(','.join(header) + '\n')        
        for row in results.get('rows'):
            file.write(','.join(row) + '\n')

    ## print status for log
    print str(datetime.now()) + ' adsense'
    
if __name__ == '__main__':
	main()

Create API Application, get client_secrets.json

As alluded to above, you need to create an API application. This is done in the Google Developers Console. If you haven’t already signed up for this you have to do that first.

Create Project

If you haven’t already, you will also have to create a Project. Your API application will live inside this Project. In the Developers Console you can create a new Project with the drop down at the top right.

create app - create project

 

Once you have a Project you can go to the Enabled APIs tab to select which Google service API(s) you want to include in the project. For this code’s purposes you can just select Adsense Management API.

Create Application – Select API Type

Use the Create credentials button to create your new application. (You are creating the credentials for your new app.)

When you click the button you get a choice of three options (see below). An important point that raises lots of questions: Adsense cannot use a Service Account to authenticate. That was what I thought I could use to avoid having to do user authentication, since I am running the code on my server in a cron job.

No problem though, because as you will see later, there is a one-time-only step to have the user authorize Adsense account access. From that point on, the authentication is done using a refresh token.

So my code above uses the OAuth client ID. So select that for your application too.

create app - choose auth type

Create Application – Select Application Type

Then you will be asked to choose the application type. For my purposes I did not want a web application, which would require real redirect URIs; I just wanted a simple plain app, so I chose Other.

create app - choose app type

You will then get a client secret and client id which you can copy now or retrieve later. You don’t actually need to copy them because they will be in the client_secrets.json file you download next.

Just change the default application name to something unique so you can identify it later.

Create Application – OAuth Consent Screen

This is not critical but you have to provide an email and an application name that will show in the browser pop up during authentication.

The idea is to give a user (you) information about who is asking for permission to access your Adsense account (you are).

create app - consent screen

 

Create Application – Download Client Secret JSON File

Then click the download button to download the client secret JSON file. Once the JSON file is downloaded click create.

create app - download secret json

The JSON file downloads with a longer name like “client_secret-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json”. You will have to rename the JSON file to “client_secrets.json” to match the filename used in the code above.

Use Code and client_secrets.json File to Authenticate

Put the Python code above into a .py file and put it in a folder along with the client_secrets.json file.

Local server (desktop/laptop)

Run the .py file from the command line, which should do the following:

  • Pop up a browser and go to the Google login page, where you sign in and click allow (or just allow if you are already logged in).
  • The command line will advance and finish.
  • Create a .dat file in your folder.

The .dat file name will be whatever you named your .py file. This .dat file holds your refresh token that will be used to authenticate access in future.

Remote Server

Copy your Python, client_secrets.json and .dat (refresh token) files to the server and run the script there with a cron job.

 

Code Variables

You may want to change the Python code to select different AdSense metrics and dimensions. I am only selecting a handful of the available API metrics and dimensions. You can check these out in more detail in the AdSense Management API Metrics and Dimensions documentation.
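For example, to get earnings by country and date instead, you could swap the metric and dimension lists in the generate() call shown above (the names below are only examples; check the exact metric and dimension names in the documentation linked above):

results = service.accounts().reports().generate(
    accountId='pub-xxxxxxxxxxxxxxxxxx',
    startDate='2012-01-01',
    endDate=todaydate,
    metric=['EARNINGS', 'PAGE_VIEWS'],
    dimension=['COUNTRY_NAME', 'DATE']).execute()

Remember to also change the csv header list to match the new columns.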

 

Categories
Python

Create Python 3 Virtualenv on machine with Python 2 & 3 installed

I have been using Python 2.7 for most of my Python work but have a few projects where I want to use Python 3.x. In these cases I want to create a Virtualenv virtual environment with Python 3.x.

Turns out it is no problem to have Python 2.7 and Python 3.x installed on the same machine as long as they are installed in their own folders.

You just have to selectively call whichever one you want to use. You also choose which version goes in your Windows Path, so that when you call ‘python’ at the command line you get that version by default. I have Python 2.7 in my Windows environment Path.

So to create a new Python 2.7 virtual environment just call Virtualenv normally:

c:\path\to\my\project>virtualenv venv

To create a new Python 3.x virtual environment just use the ‘-p’ switch and use the Python 3.x path and executable to have Virtualenv create new virtual environment with Python 3.x:

c:\path\to\my\project>virtualenv -p c:\python34\python.exe venv
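After creating the environments you activate whichever one you want to work in from its Scripts folder; the prompt shows the environment name while it is active. For example, to confirm which Python version the new environment uses (paths shown are examples):

c:\path\to\my\project>venv\Scripts\activate
(venv) c:\path\to\my\project>python --version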

Categories
Azure Office 365 OneDrive Python Sharepoint

Use OneDrive API to upload files to Office 365 Sharepoint Site

I have automated uploading files from my web site host’s server to my Office 365 Sharepoint site using scheduled cron jobs running Python scripts on my web host.

The Python scripts use Microsoft’s Azure Active Directory Library (ADAL) to authenticate against Azure Active Directory (Azure AD, also called AAD), and the OneDrive API with Python Requests to use that authentication to upload the files to Sharepoint from my web host.

Here is the code

import adal
import urllib
import requests

## set variables
username = 'user@contoso.onmicrosoft.com'
password = 'password'
authorization_url = 'https://login.windows.net/contoso.onmicrosoft.com' # aka Authority
redirect_uri = 'https://login.microsoftonline.com/login.srf' # from Azure AD application
client_id = 'd84cbf4f-dc23-24d1-8a7d-08ff8359879a' # from Azure AD application
file_url = 'https://contoso.sharepoint.com/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content'

## use ADAL to create token response
token_response = adal.acquire_token_with_username_password(
        authorization_url,
        username,
        password
    )

## Use ADAL to create refresh token and save as text file to reuse
refresh_token = token_response['refreshToken']
refresh_token_file = open('refresh_token.txt', 'w')
refresh_token_file.write(refresh_token)
refresh_token_file.close()

## Get saved refresh token and use it to get new token response
refresh_token = open('refresh_token.txt', 'r').read()
token_response = adal.acquire_token_with_refresh_token(authorization_url, str(refresh_token))

## get access_token from token response JSON string
access_token = token_response.get('accessToken')

## create http header to send access token to authenticate
headers = {'Authorization':'BEARER ' + str(access_token)}

## example to upload file
upload_file = requests.put(file_url, data = open('myfilename.csv', 'rb'), headers=headers)

 

There are many things to consider when working with Microsoft’s APIs to work with its online services such as Office 365.

The first is how to authenticate. Microsoft is trying to move everyone to use Azure AD to do oAuth authentication. Microsoft services still have their own authentication methods but for this exercise I used Azure AD.

The second is what API to use. Microsoft has recently released their Graph API that is ‘one endpoint to rule them all’. However Microsoft services still have their own APIs, so while the Graph API looks tempting, for this exercise I used the OneDrive API.

Azure AD Authentication

The authentication will be done in two parts.

  1. Create Azure AD application to do the authentication for the Microsoft service(s) you want to interact with.
  2. Use ADAL to interact with Azure AD to do the oAuth flow.

Setup Azure AD – create application

Microsoft provides free use of Azure AD for light authentication needs. You can register and create an account. Once you have your account you need to create a new application.

For my purposes I created an Azure AD native client application. Azure AD also has web application and web API types but both require the user to enter a username and password in a web browser. The native client application technically requires this too, but I hacked past it by using ADAL user authentication and hard coding the username and password into the Python code. Since these are going onto my web host in a protected directory to run as cron jobs they will be safe.

I am not going to go through the detail of creating an Azure AD application as there are some good blog posts and Microsoft does a good job of describing it. For example take a look at this site which has decent information about creating a new Azure AD application.

The Azure AD application lets you choose which Microsoft services it will be used to authenticate against. Confusingly, these are also called ‘applications’. They represent Microsoft services such as Office 365 Sharepoint Online, OneNote, Power BI, etc and are where you assign the permissions (also called ‘scopes’) that the authentication will allow with that Microsoft service.

An Azure AD application might provide authentication for more than one Microsoft service. My native client application has only the Windows Azure Active Directory permissions (which are there by default) and Office 365 Sharepoint Online permissions set to Read and write user files and Read and write items in all site collections.

After you have created your client application make sure to copy the client_id and redirect_uri to use in the code below. The client_id is automatically assigned and the redirect_uri for a native client app can be any url; it is just a unique identifier. I chose the Office 365 login url. Web applications need a real url because that is where the user will be prompted to enter credentials.

Azure Active Directory Library (ADAL)

Microsoft’s Azure Active Directory Library (ADAL) authentication libraries are created for developers to use with Azure AD. I used the ADAL Python SDK which was easily installed with pip install adal.

The oAuth authentication flow can seem very complex but you don’t have to worry about that if you use ADAL. ADAL uses your Azure AD application credentials (client_id, and the resource url in the case of a native client application) to retrieve a token response, which is a text string in JSON format.

This JSON string includes the actual access token that is used to authenticate access to Sharepoint and upload the files. You can use Python to retrieve the access token from the token response (it is a dictionary). Then you simply put the access token into a header that will be used in the PUT request as the method of passing the access token to the OneDrive API.

ADAL also takes care of refreshing tokens, which expire. In my case, where the scripts are running on the server as cron jobs, I want the token to refresh automatically. ADAL gets a refresh token that you can save to get a new access token when the previous one expires. I actually write the refresh token to a text file on the server and refresh the access token each time the code is run; I could instead refresh it only when the previous one expires.
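As a rough illustration of refreshing only when needed, the sketch below retries the upload after a 401 response. It reuses the same variables and the same old adal module-level calls as the code above; it is not the approach my scripts actually use.

## sketch: refresh the access token only when the upload is rejected as unauthorized
upload_file = requests.put(file_url, data=open('myfilename.csv', 'rb'), headers=headers)
if upload_file.status_code == 401:
    # access token expired - use the saved refresh token to get a new one and retry
    refresh_token = open('refresh_token.txt', 'r').read()
    token_response = adal.acquire_token_with_refresh_token(authorization_url, str(refresh_token))
    headers = {'Authorization': 'BEARER ' + str(token_response.get('accessToken'))}
    upload_file = requests.put(file_url, data=open('myfilename.csv', 'rb'), headers=headers)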

OneDrive API

The OneDrive API has different configurations depending on whether you are using it to access a OneDrive Personal, OneDrive Business or Sharepoint Online account.

Be warned that the documentation for the OneDrive API can be very dense and there are different ways of presenting the required syntax to identify interactions. Of course the representations vary with different SDKs too. Also there are different versions of Microsoft’s file storage services over the years. So I recommend focusing on the newest OneDrive API and making sure you are looking at documentation relevant to the newest version.

The Gotchas

ADAL Default Values

ADAL has default client_id and resource values that it uses for the username authentication. I changed these default values to match my Azure AD application.

Before changing these I was getting an Invalid audience Uri error:

{"error":"invalid_client","error_description":"Invalid audience Uri 'https://management.core.windows.net/'."}

This error means the resource url being used to create the token response was not the same as the one the file was being uploaded to.

EDIT August 21, 2016: Microsoft has updated the ADAL library so that you can specify the client id and the resource value, because authentication against different services needs different client ids and resource endpoint urls. That means the hack I used below is no longer required. For more details see https://github.com/AzureAD/azure-activedirectory-library-for-python

In ADAL’s __init__.py file look for the _DefaultValues class at the bottom of the code and replace the default values:

  • I changed client_id to my application’s client_id
  • I changed resource from https://management.core.windows.net/ to https://tenant.sharepoint.com/

The acquire_token_with_username_password function sets these to None so they fall back to the default values. This could be changed so they accept values from the calling code.
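For reference, the edit described above amounted to something like the following inside the old adal __init__.py (a sketch from memory; the exact attribute list in _DefaultValues varies by version, and newer versions accept these as parameters instead):

## old adal __init__.py - replace the default values with your own
class _DefaultValues:
    client_id = 'd84cbf4f-dc23-24d1-8a7d-08ff8359879a' # your Azure AD application client_id
    resource = 'https://tenant.sharepoint.com/' # was https://management.core.windows.net/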

Sharepoint Site and Folder Paths

The OneDrive API dev documentation https://dev.onedrive.com/getting-started.htm demonstrates the different service urls:

  • OneDrive – https://api.onedrive.com/v1.0
  • OneDrive for Business – https://{tenant}-my.sharepoint.com/_api/v2.0
  • SharePoint Online – https://{tenant}.sharepoint.com/{site-relative-path}/_api/v2.0

The {site-relative-path} notation indicates the Sharepoint site name. My site didn’t have this because it was the default site. However you might have to add your site relative path, as shown in the example below.
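For example, if your files were in a site collection named ‘mysite’ (a hypothetical name), the file url used in the code would look like this instead:

https://contoso.sharepoint.com/mysite/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content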

Also the Sharepoint url for the file I was uploading looked like this:

https://contoso.sharepoint.com/Shared%20Documents/myfoldername/myfilename.csv

However you will note that the file_url in the code doesn’t make any reference to the Shared%20Documents:

https://contoso.sharepoint.com/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content

 

Categories
API Azure Office 365 OneDrive Power BI Python Sharepoint

OneDrive API features

Microsoft has three file storage options:

  1. OneDrive Personal
  2. OneDrive Business
  3. Sharepoint

These have recently been unified into one new OneDrive API https://dev.onedrive.com and oAuth is preferred method of authentication.

However there are some key differences in how the API authenticates each of them:

  • OneDrive Personal authenticates against an oAuth application created at the Microsoft Application Registration Portal using a Microsoft account (Live, Microsoft.com). The authentication url is: https://login.live.com/oauth20_authorize.srf
  • OneDrive Business and Sharepoint authenticate against an oAuth application created in Azure Active Directory, and this must be done with an Office 365 account. The authentication url is: https://login.microsoftonline.com/common/oauth2/

You can create two types of applications that will have different methods and parameters:

  • Web application – a web site based application that the user can sign into. Requires definition of an active redirect url and a client secret. Scopes or permissions are identified on the fly when authentication is made.
  • Native client application – Android, ‘head-less’ server app, etc. Requires only definition of a unique redirect uri and the scopes or permissions that the Office 365 account / user will have available, eg read & write files, etc.

The process for authentication is similar:

  • Sign-in with user account/password or send authentication url to authentication server to get authentication code.
  • Server sends back url with authentication code.
  • Retrieve authentication code from url.
  • Send another url comprised of code and other parameters back to server to get tokens.
  • Use tokens to list, view, upload, download files etc.

There are development SDKs available for popular languages.

I was only interested in the Python SDK. Some key notes about it include:

  • It is created specifically for web applications, not native client applications. The SDK authentication method relies on using a web browser to pass urls with parameters, codes and tokens back and forth. A native client application will not use a web browser. A work around is to use a head-less browser but that is a bit hacky.
  • It only has the OneDrive Personal authentication urls. To use it with OneDrive Business or Sharepoint these urls are easily replaced with the OneDrive Business authentication urls in two files: auth_provider.py and onedrivesdk_helper.py.

The change to the unified OneDrive API and oAuth authentication only happened in late 2015 so this is pretty new.

There wasn’t much well developed or well documented OneDrive Python example code available.

Note it is still also possible to work with OneDrive Business, Sharepoint and OneDrive Personal without using oAuth authentication and this new OneDrive API, by simply using urllib, Requests and templating along with a hard coded Office 365 username and password in your code to authenticate.

Finally, the Microsoft Graph API can be used to interact with OneDrive Business, Sharepoint and OneDrive Personal once oAuth is set up.

 

Categories
Google Charts MySQL Python Twitter API Web Development

Tracking Cuba Gooding Jr’s Twitter follower count

Happened to see Cuba Gooding Jr’s first tweet about 30 minutes or so after he created it.

Update: @cubagoodingjr is no longer active so I am not getting tweets from it any longer.

At the time his profile said he had 559 followers. A few minutes later I refreshed his profile and saw the follower count had increased to 590 and every few minutes it kept increasing. By the end of the day he had about 4,000 followers.

I thought it would be interesting to track how his follower count changed going forward. So I used the Twitter API to get his follower count once per day, save the data, and then created a web page to visualize his follower count per day.

cubatweet

After 2 days Cuba had about 7,000 followers, which averaged out to about 175 new followers per hour. However, the number of new followers slowed down quickly to 30 or so per day and after about 3 months he only gets about 10 new followers per day. In fact, some days he has a net loss of followers, eg more people unfollow him than follow him that day.

For the technically inclined, I setup an automatic retrieval of Cuba’s Tweets using Twitter’s API and the Tweepy Python module scheduled to run daily using a cron job.
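The daily check boils down to something like the sketch below (credentials are placeholders; my actual script stores the result as described next):

import tweepy

# placeholder Twitter API credentials - replace with your own
consumer_key = 'xxxx'
consumer_secret = 'xxxx'
access_token = 'xxxx'
access_token_secret = 'xxxx'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# get the current follower count for @cubagoodingjr
user = api.get_user('cubagoodingjr')
print(str(user.followers_count))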

The follower counts get put into a database. I created a PHP web page application to retrieve the data from the database, and create a web page that includes the Google Charts API to create a simple line chart to show Cuba’s regularly updated follower count by day.

You can get the cron job and PHP web page code from my Github repository.

If you want to run this code yourself you will need a Twitter developer account and an OAuth file.

Categories
BeautifulSoup Python

Dell ecommerce web site scraping analysis

Once upon a time, I needed to find Dell monitor data to analyse.

A quick search brought me to their eCommerce web site which had all the monitor data I needed and all I had to do was get the data out of the website.

To get the data from the website I used Python and the BeautifulSoup module to scrape the webpage and write the data to a csv file.

Based on the data I got from the site the counts of monitors by size and country are presented below.


 

However this data is probably not accurate. In fact I know it isn’t. There was a surprising amount of variance in the monitor descriptions, including screen size, which made it hard to get quick accurate counts. I had to do some data munging to clean up the data but there is still a bit more to do.

The surprising thing is that there do not appear to be specific data points for each of the monitor description components. The website is generated from a data source, likely a database, that contains Dell’s products. This database does not appear to have fields for each independent data point used to categorize and describe Dell monitors.

The reason I say this is that each monitor description is a single string of text. Within the text string are things like the monitor size, model, common name, and various other features.

These are not in the same order and do not have consistent spelling or formatting, such as the use of text separators and lower or upper case.

Most descriptions are formatted like this example:

Dell UltraSharp 24 InfinityEdge Monitor – U2417H

However the many variations on this format are listed below. There is obviously no standardization in how Dell enters monitor descriptions for their ecommerce site.

  • Monitor Dell S2240T serie S 21.5″
  • Dell P2214H – Monitor LED – 22-pulgadas – 1920 x 1080 – 250 cd/m2 – 1000:1 – 8 ms – DVI-D
  • Dell 22 Monitor | P2213 56cm(22′) Black No Stand
  • Monitor Dell UltraSharp de 25″ | Monitor UP2516D
  • Dell Ultrasharp 25 Monitor – UP2516D with PremierColor
  • Dell 22 Monitor – S2216M
  • Monitor Dell UltraSharp 24: U2415
  • Dell S2340M 23 Inch LED monitor – Widescreen 60Hz Full HD Monitor

Some descriptions include the monitor size unit of measurement, usually in inches, sometimes in centimeters, and sometimes none at all.

Sometimes hyphens are used to separate description sections but other times the pipe character ( | ) is used to separate content. It’s a real mish mash.

Descriptions do not have a consistent order of components. Sometimes the part number is after the monitor size, sometimes it is elsewhere.

The problem with this is that customers browsing the site will have to work harder to compare monitors taking into account these variances.

I’d bet this leads to lost sales or poorly chosen sales that result in refunds or disappointed customers.

I’d also bet that Dell enterprise customers and resellers also have a hard time parsing these monitor descriptions too.

This did affect my ability to easily get the data to do analysis of monitors by description categories because they were not in predictable locations and were presented in many different formats.

Another unusual finding was that Dell looks to have assigned a default set of 7 monitors to a large number of two digit country codes. For example Bhutan (bt) and Bolivia (rb) both have the same 7 records, as do many others. Take a look at the count of records per country at the bottom of the page. Many countries have only 7 monitors.

Here is the step by step process used to scrape this data.

The screenshot below shows the ecommerce web site page structure. The monitor information is presented on the page in a set of nested HTML tags which contain the monitor data.

dell ecommerce screenshot

These nested HTML tags can be scraped relatively easily. A quick review revealed that the web pages contained identifiable HTML tags that held the data I needed. Those tags are named in Python code below.

The website’s url also had a consistent structure so I could automate navigating through paged results as well as through multiple countries to get monitor data for more than one Dell country in the same session.

Below is an example of the url for the Dell Canada eCommerce web site’s page 1:

http://accessories.dell.com/sna/category.aspx?c=ca&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=1

The only two variables in the url that change for crawling purposes are:

  • The “c” variable was a 2 character country code eg “ca” = Canada, “sg” = Singapore, “my” = Malaysia, etc.
  • The “p” variable was a number representing the results page, with about 10 monitors shown per page. No country I looked at had more than 5 pages of monitors.

Dell is a multi-national corporation so likely has many countries in this eCommerce database.

Rather than guess what they are I got a list of two character country codes from Wikipedia that I could use to create urls to see if that country has data. As a bonus the Wikipedia list gives me the country name.

The Wikipedia country code list needs a bit of clean-up. Some entries are clearly not countries but some type of administrative designation. Some countries are listed twice with two country codes. For example Argentina has “ar” and “ra”. For practical purposes, if the Dell url can’t be retrieved for a country code in this list then the code just skips to the next country code.

The Python code I used is shown below. It outputs a csv file with the website data for each country with the following columns:

  • date (of scraping)
  • country_code (country code entered from Wikipedia)
  • country (country name from Wikipedia)
  • page (page number of website results)
  • desc (HTML tag containing string of text)
  • prod_name (parsed from desc)
  • size (parsed from desc)
  • model (parsed from desc)
  • delivery (HTML tag containing just this string)
  • price (HTML tag containing just this string)
  • url (url generated from country code and page)

The code loops through the list of countries that I got from Wikipedia and within each country it also loops through the pages of results while pagenum < 6:.

I hard coded the number of page loops to 6 as no country had more than 5 pages of results. I could have used other methods perhaps looping until url returned 404 or page not found. It was easier to hard code based on manual observation.

Dell eCommerce website scraping Python code

#-*- coding: utf-8 -*-
import urllib2
import urllib
from cookielib import CookieJar

from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

countries={
    'AC':'Ascension Island',
    'AD':'Andorra',
    'AE':'United Arab Emirates',
     ... etc
    'ZM':'Zambia',
    'ZR':'Zaire',
    'ZW':'Zimbabwe'
}

def main():

    todaydate = datetime.today().strftime('%Y-%m-%d')
    
    with open('dell_monitors.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['date', 'country_code', 'country', 'page', 'desc', 'prod_name', 'size', 'model', 'delivery', 'price', 'url'], delimiter = ',')
        writer.writeheader()
        
        for key in sorted(countries):
            country_code = key.lower()
            country = countries[key]
            pagenum = 1      
            while pagenum < 6:
                url = "http://accessories.dell.com/sna/category.aspx?c="+country_code+"&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=" + str(pagenum)
                #HTTPCookieProcessor allows cookies to be accepted and avoid timeout waiting for prompt
                page = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url).read()
                soup = BeautifulSoup(page, 'html.parser')
                if soup.find("div", {"class":"rgParentH"}):
                    tablediv = soup.find("div", {"class":"rgParentH"})
                    tables = tablediv.find_all('table')
                    data_table = tables[0] # outermost table parent =0 or no parent
                    rows = data_table.find_all("tr")
                    
                    for row in rows:
                        rgDescription = row.find("div", {"class":"rgDescription"})
                        rgMiscInfo = row.find("div", {"class":"rgMiscInfo"})
                        pricing_retail_nodiscount_price = row.find("span", {"class":"pricing_retail_nodiscount_price"})

                        if rgMiscInfo: 
                            delivery = rgMiscInfo.get_text().encode('utf-8')
                        else:
                            delivery = ''
                            
                        if pricing_retail_nodiscount_price:
                            price = pricing_retail_nodiscount_price.get_text().encode('utf-8').replace(',','')
                        else:
                            price = ''
                            
                        if rgDescription:
                            desc = rgDescription.get_text().encode('utf-8')
                            prod_name = desc.split("-")[0].strip()
                            try:
                                size1 = [int(s) for s in prod_name.split() if s.isdigit()]
                                size = str(size1[0])
                            except:
                                size = 'unknown'
                            try:
                                model = desc.split("-")[1].strip()
                            except:
                                model = desc
                                
                            # use the csv writer so commas inside descriptions are quoted properly
                            writer.writerow({
                                'date': todaydate,
                                'country_code': country_code,
                                'country': country,
                                'page': pagenum,
                                'desc': desc,
                                'prod_name': prod_name,
                                'size': size,
                                'model': model,
                                'delivery': delivery,
                                'price': price,
                                'url': url
                            })
                    
                    pagenum +=1
                else:
                    #skip to next country
                    pagenum = 6 
                    continue

                
if __name__ == '__main__':
    main()


The Python code scraping output is attached here as a csv file.

The summary below lists the scraping output: the country code, country name and count of Dell monitor records scraped from the web pages, using the country codes Wikipedia had for these countries.

af – Afghanistan – 7 records
ax – Aland – 7 records
as – American Samoa – 7 records
ad – Andorra – 7 records
aq – Antarctica – 7 records
ar – Argentina – 12 records
ra – Argentina – 7 records
ac – Ascension Island – 7 records
au – Australia – 36 records
at – Austria – 6 records
bd – Bangladesh – 7 records
be – Belgium – 6 records
bx – Benelux Trademarks and Design Offices – 7 records
dy – Benin – 7 records
bt – Bhutan – 7 records
rb – Bolivia – 7 records
bv – Bouvet Island – 7 records
br – Brazil – 37 records
io – British Indian Ocean Territory – 7 records
bn – Brunei Darussalam – 7 records
bu – Burma – 7 records
kh – Cambodia – 7 records
ca – Canada – 46 records
ic – Canary Islands – 7 records
ct – Canton and Enderbury Islands – 7 records
cl – Chile – 44 records
cn – China – 46 records
rc – China – 7 records
cx – Christmas Island – 7 records
cp – Clipperton Island – 7 records
cc – Cocos (Keeling) Islands – 7 records
co – Colombia – 44 records
ck – Cook Islands – 7 records
cu – Cuba – 7 records
cw – Curacao – 7 records
cz – Czech Republic – 6 records
dk – Denmark – 23 records
dg – Diego Garcia – 7 records
nq – Dronning Maud Land – 7 records
tp – East Timor – 7 records
er – Eritrea – 7 records
ew – Estonia – 7 records
fk – Falkland Islands (Malvinas) – 7 records
fj – Fiji – 7 records
sf – Finland – 7 records
fi – Finland – 5 records
fr – France – 17 records
fx – Korea – 7 records
dd – German Democratic Republic – 7 records
de – Germany – 17 records
gi – Gibraltar – 7 records
gr – Greece – 5 records
gl – Greenland – 7 records
wg – Grenada – 7 records
gu – Guam – 7 records
gw – Guinea-Bissau – 7 records
rh – Haiti – 7 records
hm – Heard Island and McDonald Islands – 7 records
va – Holy See – 7 records
hk – Hong Kong – 47 records
in – India – 10 records
ri – Indonesia – 7 records
ir – Iran – 7 records
ie – Ireland – 7 records
im – Isle of Man – 7 records
it – Italy – 1 records
ja – Jamaica – 7 records
jp – Japan – 49 records
je – Jersey – 7 records
jt – Johnston Island – 7 records
ki – Kiribati – 7 records
kr – Korea – 34 records
kp – Korea – 7 records
rl – Lebanon – 7 records
lf – Libya Fezzan – 7 records
li – Liechtenstein – 7 records
fl – Liechtenstein – 7 records
mo – Macao – 7 records
rm – Madagascar – 7 records
my – Malaysia – 25 records
mv – Maldives – 7 records
mh – Marshall Islands – 7 records
mx – Mexico – 44 records
fm – Micronesia – 7 records
mi – Midway Islands – 7 records
mc – Monaco – 7 records
mn – Mongolia – 7 records
mm – Myanmar – 7 records
nr – Nauru – 7 records
np – Nepal – 7 records
nl – Netherlands – 8 records
nt – Neutral Zone – 7 records
nh – New Hebrides – 7 records
nz – New Zealand – 37 records
rn – Niger – 7 records
nu – Niue – 7 records
nf – Norfolk Island – 7 records
mp – Northern Mariana Islands – 7 records
no – Norway – 19 records
pc – Pacific Islands – 7 records
pw – Palau – 6 records
ps – Palestine – 7 records
pg – Papua New Guinea – 7 records
pe – Peru – 43 records
rp – Philippines – 7 records
pi – Philippines – 7 records
pn – Pitcairn – 7 records
pl – Poland – 4 records
pt – Portugal – 7 records
bl – Saint Barthelemy – 7 records
sh – Saint Helena – 7 records
wl – Saint Lucia – 7 records
mf – Saint Martin (French part) – 7 records
pm – Saint Pierre and Miquelon – 7 records
wv – Saint Vincent – 7 records
ws – Samoa – 7 records
sm – San Marino – 7 records
st – Sao Tome and Principe – 7 records
sg – Singapore – 37 records
sk – Slovakia – 23 records
sb – Solomon Islands – 7 records
gs – South Georgia and the South Sandwich Islands – 7 records
ss – South Sudan – 7 records
es – Spain – 10 records
lk – Sri Lanka – 7 records
sd – Sudan – 7 records
sj – Svalbard and Jan Mayen – 7 records
se – Sweden – 6 records
ch – Switzerland – 21 records
sy – Syrian Arab Republic – 7 records
tw – Taiwan – 43 records
th – Thailand – 40 records
tl – Timor-Leste – 7 records
tk – Tokelau – 7 records
to – Tonga – 7 records
ta – Tristan da Cunha – 7 records
tv – Tuvalu – 7 records
uk – United Kingdom – 35 records
un – United Nations – 7 records
us – United States of America – 7 records
hv – Upper Volta – 7 records
su – USSR – 7 records
vu – Vanuatu – 7 records
yv – Venezuela – 7 records
vd – Viet-Nam – 7 records
wk – Wake Island – 7 records
wf – Wallis and Futuna – 7 records
eh – Western Sahara – 7 records
yd – Yemen – 7 records
zr – Zaire – 7 records

Grand Total – 1760 records

Categories
Django Python

Always remember to start Python Virtualenv

When learning to develop Django and Flask Python applications using a virtual environment I did not know that I should activate the virtual environment before running the application. Well, more precisely, I did know the virtual env should be activated, but I assumed it would be activated somehow auto-magically.

I was using MS Visual Studio 2015 RC and its new Python Tools which work fine by the way. One can install new Python/Flask/Django/etc modules via the IDE.

This was also the first time I was using Python virtual environment. I hadn’t previously done any Python web development. I had used Python for analytical purposes in files, command line and iPython but all of these used the ‘native’ operating system Python installation.

In MS VS 2015 RC, after Python Tools are installed, there are templates to create new Django or Flask projects, and one of the steps asks if I want to install virtual env or not.

So after a bit of searching and reading I realized virtual env is the way to go. It is better for many reasons to have a standalone Python environment for each new Python/Flask/Django project.

I just assumed that since I had created my new Python application with a virtual env that when I opened Visual Studio and started working on it, it would be in virtual environment by default somehow auto-magically.

But no, the virtual environment has to be activated manually each time the project is opened or before being able to interact with the project via web browser. So remember to activate your virtual env before running your Python/Flask/Django application.

What misled me was that running the application without first activating the virtual environment can often seem ok, because the native operating system Python installation has the required modules, so the application runs just fine.

But I ran into problems after installing new Python modules, only to see the application complaining that they weren’t available, eg I got an error message in the browser and the command line from the server saying ‘no module named xxx’. This was confusing because I was thinking hey, I just installed that.

So remember activating the virtual env before running the Python application is required. Now it is second nature to do this.

To activate the virtual env in Windows simply navigate to the ‘Scripts’ folder in your virtual env folder and run the ‘activate.bat’ file.

In command line enter ‘activate’ in that folder. Or you can enter the full path to the activate.bat from anywhere in command line.

You can easily see if virtual env has been started because when it is you will see ‘(env)’ at the start of the command line.

Then you can go back to your application folder with the ‘runserver.py’ (or whatever you call it) and then start the application!
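For example, assuming the virtual env folder is named ‘env’ and the startup script is ‘runserver.py’, the whole sequence from the command line looks like this:

c:\path\to\my\project>env\Scripts\activate.bat
(env) c:\path\to\my\project>python runserver.py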

Categories
Flask Python SQLAlchemy

Flask SQLAlchemy relationship associations and backref

Flask-SQLAlchemy relationship associations are useful to easily retrieve related records from a related Model.

In a Python Flask application I wanted to use a many-to-many relationship between Photos and Tags (eg many tags assigned to one or more photos).

To do this I had the following things set up in my application:

  • 3 database tables
    • Photos
    • Tags
    • Photo_Tags
  • 3 model.py sections
    • Photos
    • Tags
    • photo_tags
  • 1 views.py section ‘photo’
  • 1 template ‘photo.html’

The models looked like this:


# association table - defined first so the Photos model can reference it
photo_tags = db.Table('photo_tags',
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.id')),
    db.Column('photo_id', db.Integer, db.ForeignKey('photos.id'))
)

class Photos(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    filename = db.Column(db.String(100))
    tags = db.relationship('Tags', secondary=photo_tags, backref='photos')

class Tags(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tagname = db.Column(db.String(100))

The tags = db.relationship(…) line above shows how the ‘tags’ relationship between the Photos and Tags models is added to the Photos model. SQLAlchemy wants us to add this ‘tags’ association to only one of the two models (more on that below).

The ‘tags’ relationship is physically recorded in the database table ‘photo_tags’, which doesn’t get a Model class like Photos and Tags have. Instead, it is set up as shown above to indicate that it is a table with foreign keys to the Photos and Tags tables.

Key things to note about the ‘tags’ reference in the Photos model:

  • ‘Tags’ model is related model
  • ‘photo_tags’ is referenced as the ‘secondary’ or relationship table.
  • ‘photos’ is referenced as the ‘backref’. The significance of this is that you don’t have to add a ‘tags’ relationship to the Tags model. The ‘backref’ lets you reference Photos and Tags in both directions (see the sketch after this list), eg you can get
    • Tags related to a specific Photo
    • Photos related to a specific Tag
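For example, using the models above, the relationship and its backref work in both directions (a quick sketch):

# tags related to a specific photo
photo = Photos.query.get(1)
tags_for_photo = photo.tags

# photos related to a specific tag, via the 'photos' backref
tag = Tags.query.get(1)
photos_with_tag = tag.photos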

An example of a view that can use the relationship established above follows. This view retrieves all tags for a specific photo:


@app.route('/photo/<int:requested_photo_id>')
def photo(requested_photo_id=None):
    requested_photo = Photos.query.get(requested_photo_id)
    return render_template('photo.html', requested_photo=requested_photo)

Here is what happens in this view:

  • The requested_photo_id is retrieved from the url.
  • The SQLAlchemy query Photos.query.get(requested_photo_id) uses that requested_photo_id to retrieve the photo from the Photos model/MySQL database table, and its related tags.
  • The retrieved requested_photo object is passed to template ‘photo.html’ along with the related tag info.
  • Finally it is in the photo.html template where everything comes together. The template below is a simple HTML page that presents the requested_photo_id and its related tags.






Photo Details

Photo ID: {{ requested_photo.id }}

Photo file name: {{ requested_photo.filename }}

{% for tag in requested_photo.tags %}

tag.id: {{ tag.id }}, tag.tagname: {{ tag.tagname }}

{% endfor %}

The requested_photo id and filename come from the Photos Model/MySQL table.

The photo’s related tags come from the requested_photo ‘tags’ association by referencing it as ‘requested_photo.tags’ (this is the awesome part) and then looping through the tags to retrieve each one so they can be printed on the webpage individually.

Categories
Google Analytics Python

Get list of custom segments from Google Analytics API

This is a post on how to create custom Google Analytics Profile Segments for the purpose of removing referral spam (and there is increasingly so much of it!) from GA reporting.

However if you want to use these new Custom Segments to filter results using Google Analytics API with a Service Account there are some challenges.

If you are retrieving GA results for many web sites you need the GA API to loop through each site’s Views / Profiles in your GA Account to retrieve the data for each.

The challenge is that each Profile has its own Custom Segment. In order to filter out referral spam completely, two types of filters are required: the ‘exclude’ filter, which is the same for all Profiles, and the ‘include’ filter, which is specific to each Profile as it refers to the Profile’s domain.

So that makes looping through each Profile a bit more challenging. You need a dictionary of each Profile’s Custom Segment Id so it can be applied for each Profile’s data.

These Custom Segment Id’s look something like “gaid::BXxFLXZfSAeXbm4RZuFd9w”

The Custom Segment Id needs to be used in the service.data().ga().get().execute() criteria.


data = service.data().ga().get(
    ids=ids,
    start_date='2015-07-01',
    end_date='2015-07-19',
    segment='gaid::BXxFLXZfSAeXbm4RZuFd9w',
    metrics=metrics
).execute()

It wasn’t easy to find these Custom Segment Id’s. First I tried looping through the segments() as follows:


    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location,
        service_account_email)

    segments = service.management().segments().list().execute()

    for segment in segments.get('items', []):
        print 'Segment ID ' + segment.get('id') + ' - ' + segment.get('name')

But that only retrieved the standard Google Segments, not the Custom Segments; apparently retrieving Custom Segments is not possible with a Service Account.

So I found that you are able to see the Custom Segment Ids in the Query Explorer at https://ga-dev-tools.appspot.com/query-explorer.

But while you can see the Custom Segments there, it wasn’t very helpful as you have to go one by one in the Segments criteria field. If you have many sites it will be time consuming.

Then I finally found the “stand alone explorer” at the bottom of the GA API Segments documentation page.

https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/segments/list#try-it

This outputs a json file containing all of the Segment details. Unfortunately this isn’t usable as a ready-made dictionary as it only has the segment details, not the account id. But it does have the Custom Segment Ids, which can be used to create a manual dictionary of Account Id and Segment Id that can be used in the loop.
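Once you have the ids, the manual dictionary approach looks something like the sketch below, where service is the authenticated service object from the snippet above and the profile and segment ids are made-up examples:

# manually built lookup of GA View/Profile id -> its custom segment id
profile_segments = {
    '12345678': 'gaid::BXxFLXZfSAeXbm4RZuFd9w',
    '87654321': 'gaid::AAaBBBcccDDDeeeFFFggg1',
}

for profile_id, segment_id in profile_segments.items():
    data = service.data().ga().get(
        ids='ga:' + profile_id,
        start_date='2015-07-01',
        end_date='2015-07-19',
        segment=segment_id,
        metrics='ga:sessions').execute()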

Perhaps it might also be possible to do a reverse lookup and find the Custom Segment Id by looping through the Segments and matching on the name.

Hope that helps someone!

Categories
Power BI Power Query Python Twitter API

BC Hydro’s amazing #BCStorm social media turnaround

BC Hydro made an amazing social media turnaround to communicate with customers in crisis! Go BC Hydro!

On August 29, 2015 high winds caused tree falls that took out BC Hydro power lines throughout BC’s Lower Mainland, Vancouver Island and Sunshine Coast, leaving up to 500,000 customers without electricity. Many, including me, were without power for 48 hours.

BC Hydro’s web site was their primary channel for communicating with customers about the extent of damage and expected time for repairs, but the site also went down during this time.

They had used Twitter in the past to communicate with customers but they weren’t using it much on the first day of the crisis.

However on the second day of the crisis, with 100,000+ customers still without power, BC Hydro dramatically increased Twitter communication by responding directly to customer tweets about the extent of damage and expected time for repairs.

To visualize this dramatic increase in BC Hydro Twitter usage I downloaded all @BCHydro tweets for August 29 and 30 from Twitter’s API using Python Tweepy and used Microsoft Power BI to create a visualization of BC Hydro tweet counts which is shown below.

Some notes on chart:

  • x axis shows date and hour of day
  • y-axis shows count of tweets
  • ‘response’ tweets are light green part of bar
  • ‘status’  tweets are dark green part of bar

You can see that on August 30 at 11 AM, about 28 hours after the storm hit, BC Hydro suddenly starts tweeting responses to customers’ questions. They continued for the next few days until their website was back up in addition to their regular ‘status’ tweets.

The chart clearly shows a very amazing social media turnaround! BC Hydro dramatically increased their use of Twitter to get customers answers and information. Go BC Hydro!

Note: the Twitter data was last updated Aug 31, 2015 at 16.45 PM PST.

bchtweets

 

Historically BC Hydro did not use Twitter this way.  The chart below shows BC Hydro’s tweeting before the storm. They were tweeting once or twice per day with occasional spikes to 7 tweets per day.

 

bchtweets_before

 

The ‘response‘ category includes tweets by BC Hydro responding to a customer tweet. Example ‘response’ tweet:

2015-08-30 14:26:01, @Adamhillz Crews will be on site as soon as possible. Stay back from any downed lines and call 911 if it is on fire.

The ‘status‘ category includes tweets by BC Hydro about ongoing status of repair work etc. Example ‘status’ tweet:

2015-08-30 14:28:32, Crews have been brought in from across province, including Prince George, Terrace, Kamloops, Smithers, Vernon & Vancouver Island #bcstorm
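One simple way to make the response/status split (a sketch, not necessarily how the chart above was built) is to treat tweets that start with an @mention as responses:

# classify a BC Hydro tweet as a 'response' (reply to a customer) or a 'status' update
def tweet_category(tweet_text):
    if tweet_text.startswith('@'):
        return 'response'
    return 'status'

print(tweet_category('@Adamhillz Crews will be on site as soon as possible.'))  # response
print(tweet_category('Crews have been brought in from across province #bcstorm'))  # status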

FYI if anyone is interested, here is the text file with tweet text – bchtweets

Here is the Python code used to get the BC Hydro Tweets using the Twitter API.

import tweepy
#import MySQLdb
from datetime import datetime
import random
import sys, os
import json
from dateutil import tz
from tweet_auth import * #this is another .py file with the twitter api credentials

#get twitter auth
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def main():
    with open('bchtweets.csv', 'wb') as file:
        for status in tweepy.Cursor(api.user_timeline, id='bchydro').items(1800):
            # convert created_at from UTC to Pacific time, then write timestamp and tweet text
            file.write(str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)) + ', ' + status.text.encode('utf8') + '\n')

if __name__ == '__main__':
    main()

Note that there is a separate file with my Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

#twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'