BC Executive Compensation Tableau Visualization

The Province of British Columbia publishes public agency executive compensation each year. The data is grouped by:

  • Sector – Employers are categorized in one of six sectors – Colleges, Crown Corporations, Education, Health, Public Service, Universities
  • Employer – Employers belong to one Sector. There are about 190 employers such as BC Assessment Authority, BC Housing Management Commission, BC Hydro, etc
  • Employee – An employee belongs to one Employer and they publish the full name of employee.

The employee’s salary is given as:

  • Base Salary
  • Total Compensation

The data is provided each year as pdf and csv data files. I downloaded csv data files for each year from 2012 to 2016 and combined them into one dataset.

Here is link to Tableau Public Dashboard visualizing the data and the chart is embedded below.

 

I used Python to combine the data. Some of the Sector values where missing spaces, one of the years had a “$” symbol in salary column values which had to be removed, and I added a new year column based on the data file’s name.

Here is the Python code used to combine and clean up the data:

import csv
    import pandas as pd
    from datetime import date, datetime
    import glob
    import os

    def main():

        ## path for data files
        data_path = 'C:/bc_salary_data/'

        allFiles = glob.glob(data_path + "*.csv")
        frame = pd.DataFrame()
        list_ = []
        
        for file_ in allFiles:
            df = pd.read_csv(file_,index_col=None, header=0)
            df['filename'] = os.path.basename(file_).split('.')[0]
            df['Sector'] = df['Sector'].str.replace(' ','').str.replace('CrownCorporations', 'Crown Corporations').str.replace('CrownCorporation', 'Crown Corporations').str.replace('PublicService', 'Public Service')
            df['year'] =  df['filename'].str[-4:]
            df['Base Salary'] = df['Base Salary'].apply(str).str.replace('

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 
,'')
            df['Total Compensation'] = df['Total Compensation'].apply(str).str.replace('

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 
,'')
            list_.append(df)
            
        frame = pd.concat(list_)

        ## Create csv file to write results to
        frame.to_csv('merged.csv')

        print str(datetime.now()) + ' data files merged'
        
        return

    if __name__ == '__main__':
        main()

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 

Canadian Canola seed crushing more efficient at extracting canola oil

Statistics Canada regularly tweet links to various Canadian statistics. I have occasionally created quick Tableau visualizations of the data and replied with a link to my Tableau Public site.

The idea is to encourage Statistics Canada to start communicating more visually and create their own visualizations for all data. Its extra work but value will be realized when Statistics Canada visitors can more easily understand the data instantly by looking at visualizations instead of munging about with boring data in tables.

This particular tweet was to Statistics Canada data reporting on Canadian canola seed crush, canola oil and meal output. http://www5.statcan.gc.ca/cansim/a47

The data shows Canada’s canola seed production and efficiency at extracting canola oil has increased significantly since 1971.

Canola seed is crushed to extract the oil and the seed meal is leftover. The ratio of oil to meal is about .8 in 2016 compared to .3 in 1971. That is a impressive increase in oil extraction efficiency.

Canadian TCS FDI Officers Twitter list member analysis

Blog post updated to add the following:

  • Updated Python “TCS Members Details” code to get additional information from the List members’ profiles.
  • New Python code “TCS Members Tweets” to download all List members’ Tweets. Will provide some more analysis on this new data soon.

————————————————–

The Canadian Trade Commissioner Service maintains a Twitter List named CDN TCS FDI Officers that has a bunch of Canadian Trade Commissioners as members. Investment Canada also has a similar list.

I was interested to learn how many people were on these lists, how long they had been on the lists, what they were Tweeting about.

So I used the Twitter API and Python Tweepy to retrieve data about list members including:

  • screen_name
  • name
  • followers_count
  • friends_count
  • statuses_count (# of tweets)
  • favourites_count
  • created_at (Twitter account create date)
  • account_age_days (calculated as # days from July 9, 2016)
  • listed_count
  • verified
  • profile_image_url
  • profile_sidebar_fill_color
  • profile_text_color
  • profile_image_url_https
  • profile_use_background_image
  • default_profile_image
  • profile_sidebar_border_color
  • profile_background_color
  • profile_link_color

You can get find complete definitions about these over at Twitter.

The data was output as a simple csv file that I used to create a Tableau Public dashboard which is embedded below. The csv file is attached here.

The Tableau dashboard is interactive and can be sorted by any of the columns by using the sort icon which appears when you hover over the top of column as illustrated in screenshot below.

tcs_click_to_sort

 

 

 

 

It would be interesting to try to determine if this Twitter activity has measurable impacts on FDI and Canadian Trade.  For example perhaps foreign investment finds it way to Canada after reading Tweet by one of our Trade Commissioners.

This would require that TCS maintains a CRM (client relationship manager) system and process that records lead sources.

There is some disparity between use of Twitter by the CDN TCS FDI Officers list members as shown by Tweets/Day which total Tweets divided by # days since the account was created. If there is a measurable lift in lead generation by Twitter use then this would be actionable metric.

For the technically minded the Python code is shown below. Note that you need an API account to use with this code.

There is another file tweet_auth import not shown here that contains Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

Here is the “TCS Member Details” code:

 

 #twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
  import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, 'path/to/folder/with/your/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
     #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    a
    pi = tweepy.API(auth)

    today = datetime.now().date()
            
    tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')

    member_details_csv = csv.writer(open('tcs_member_details.csv', 'wb'))
    member_details_csv.writerow([
        'screen_name',
        'name',
        'followers_count',
        'friends_count',
        'statuses_count',
        'favourites_count',
        'created_at',
        'account_age_days',
        'time_zone',
        'listed_count',
        'profile_image_url',
        'profile_sidebar_fill_color',
        'profile_text_color',
        'profile_image_url_https',
        'profile_use_background_image',
        'default_profile_image',
        'verified',
        'profile_sidebar_border_color',
        'profile_background_color',
        'profile_link_color'])

    members = []
    member_tweets = []
        
    for member in tcs_list_members.items():
        
        member_details_csv.writerow([
            member.screen_name.encode('utf-8', 'ignore'),
            member.name.encode('utf-8', 'ignore'),
            member.followers_count,
            member.friends_count,
            member.statuses_count,
            member.favourites_count,
            member.created_at,
            (today-member.created_at.date()).days,
            member.time_zone,
            member.listed_count,
            member.profile_image_url,
            member.profile_sidebar_fill_color,
            member.profile_text_color,
            member.profile_image_url_https,
            member.profile_use_background_image,
            member.default_profile_image,
            member.verified,
            member.profile_sidebar_border_color,
            member.profile_background_color,
            member.profile_link_color
            ])
            

It would be interesting to see what Tweet topics, other Twitter user mentions, links to webpages, etc. So the next step is to loop through each of the list member’s api.user_timeline to retrieve their Tweet content and do some analysis on them.  For now here is the code and some analysis and visualization in Tableau later.

Here is the “TCS Members’ Tweets” Python code:

    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, '/path/to/your/folder/with/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
    #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    #api = tweepy.API(auth)

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    today = datetime.now().date()

    def get_list_member_tweets():
        ## get invest_canada and cdn-tcs-fdi-officers list members
        tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')
      
        ## open csv file 
        member_tweets_csv = csv.writer(open('tcs_member_tweets.csv', 'wb'))
        
        ## write header row column names
        member_tweets_csv.writerow([
            'status_id',
            'date_time_UTC',
            'screen_name',
            'hashtags',
            'user_mentions',
            'tweet_text'
            ])
        
        ## loop through list members and get their tweets
        for member in tcs_list_members.items():
            ## get list member tweets
            member_tweets = get_member_tweets(member.screen_name)
            
            for status in member_tweets:
                
                ## check tweets for hashtags
                if status.entities['hashtags']:
                        hastags=[]
                        for hashtag in status.entities['hashtags']:
                            hastags.append(hashtag['text'].encode('utf8','ignore'))
                
                ## check tweets for user_mentions
                if status.entities['user_mentions']:
                        user_mentions=[]
                        for user_mention in status.entities['user_mentions']:
                            user_mentions.append(user_mention['screen_name'].encode('utf8','ignore'))
                
                ## write to csv file      
                member_tweets_csv.writerow([
                    status.id,
                    str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)),
                    member.screen_name,
                    hastags,
                    user_mentions,
                    status.text.replace('\n',' ').replace('\r',' ').encode('utf8','ignore')
                    ])

    def get_member_tweets(screen_name):
        ## get list member's tweets
        
        alltweets = []

        ## can only get max 200 tweets at a time
        new_tweets = api.user_timeline(screen_name = screen_name, count=200)
        alltweets.extend(new_tweets)

        ## get oldest tweet already retrieved
        oldest = alltweets[-1].id - 1

        ## iteratively get remaining tweets
        while len(new_tweets) > 0:
            new_tweets = api.user_timeline(screen_name = screen_name, count=200, max_id=oldest)
            alltweets.extend(new_tweets)
            oldest = alltweets[-1].id - 1
        
        ## print out member and # tweets retrieved to show progress
        print screen_name + " %s tweets downloaded" % (len(alltweets))
        
        ## return all tweets
        return alltweets

    if __name__ == '__main__':
        get_list_member_tweets()

Full code on Github:

https://github.com/sitrucp/tcs_fdi_list_twitter_data

Tableau vizualization of Toronto Dine Safe data

The City of Toronto’s open data site includes the results of the city’s regular food and restaurant inspections. This data was as of August 2014.

The interactive chart below allows filtering by review criteria and can be zoomed into to view more detail and specific locations.

 

The data file for Dine Safe contained about 73,000 rows and was in XML format. In order to work with it I transformed it to csv format.

from xml.dom.minidom import parse
from csv import DictWriter
 
fields = [
	'row_id',
	'establishment_id',
	'inspection_id',
	'establishment_name',
	'establishmenttype',
	'establishment_address',
	'establishment_status',
	'minimum_inspections_peryear',
	'infraction_details',
	'inspection_date',
	'severity',
	'action',
	'court_outcome',
	'amount_fined'
]
 
doc = parse(file('dinesafe.xml'))
 
writer = DictWriter(file('dinesafe.csv', 'w'), fields)
writer.writeheader()
 
row_data = doc.getElementsByTagName('ROWDATA')[0]
 
for row in row_data.getElementsByTagName('ROW'):
	row_values = dict()
	for field in fields:
		text_element = row.getElementsByTagName(field.upper())[0].firstChild
		value = ''
		if text_element:
			value = text_element.wholeText.strip()
			row_values[field] = value
	writer.writerow(row_values)

This data was not geocoded so I had to do that before it could be mapped. I wanted to use a free geocoding service but they have limits on the number of records that could be geooded per day. I used MapQuest’s geocoding API using a Python library that would automate the geocoding in daily batched job so that I could maximize free daily geocoding.

The Dine Safe data file addresses needed some cleaning up so that the geocoding service could read them. For example street name variations needed to be conformed to something that MapQuest would accept. This was a manual batch find and replace effort. If I was automating this I would use a lookup table of street name variations and replace them with accepted spellling/format.

#Uses MapQuest's Nominatim mirror.

import anydbm
import urllib2
import csv
import json
import time

# set up the cache. 'c' means create if necessary
cache = anydbm.open('geocode_cache', 'c')

# Use MapQuest's open Nominatim server.
# https://developer.mapquest.com/web/products/open/nominatim
API_ENDPOINT = 'http://open.mapquestapi.com/nominatim/v1/search.php?format=json&q={}'

def geocode_location(location):
    '''
    Fetch the geodata associated with the given address and return
    the entire response object (loaded from json).
    '''
    if location not in cache:
        # construct the URL
        url = API_ENDPOINT.format(urllib2.quote(location))
        
        # load the content at the URL
        print 'fetching %s' % url
        result_json = urllib2.urlopen(url).read()
        
        # put the content into the cache
        cache[location] = result_json
        
        # pause to throttle requests
        time.sleep(1)
    
    # the response is (now) in the cache, so load it
    return json.loads(cache[location])

if __name__ == '__main__':
    # open the input and output file objects
    with open('dinesafe.csv') as infile, open('dinesafe_geocoded.csv', 'w') as outfile:
      
        # wrap the files with CSV reader objects.
        # the output file has two additional fields, lat and lon
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['lat', 'lon'])
        
        # write the header row to the output file
        writer.writeheader()
        
        # iterate over the file by record 
        for record in reader:
            # construct the full address
            address = record['establishment_address']
            address += ', Toronto, ON, Canada'
            
            # log the address to the console
            print address
            
            try:
                # Nominatim returns a list of matches; take the first
                geo_data = geocode_location(address)[0]
                record['lat'] = geo_data['lat']
                record['lon'] = geo_data['lon']
            except IndexError:
                # if there are no matches, don't raise an error
                pass
            writer.writerow(record)

After the Dine Safe data was geocoded so that it had two new columns, one for latitude and another for longitude, all that was left to do was bring the data into Tableau and create the Tableau map visualization which is shown below.

Tableau vizualization of Vancouver Car2Go vehicle locations

I happened to have an dataset of Car2Go vehicle locations in Vancouver and used Tableau to plot them by latitude and longitude.

 

 

There are a lot of observations over the time data was collected. There are limited physical number of parking spots so cars are parked in locations that are very close to other car locations. This makes for a very dense plot.

But if you zoom into the view the detail will resolve. For example below is a screenshot of a zoomed into downtown Vancouver. You can very clearly see that cars are parked along city streets. Fixed parking spots are identified by darker color.

 

car2go vehicle locations