How to use Python as a datasource for Power BI

Step 1 – Download latest version of Power BI to get new features (version that came out in early August 2018)

Step  2 – Enable Python in the “Options – Preview Features” check the “Python support” option. Close and restart Power BI.

Step 3 – Click “Get Data”, select “Other” and “Python script” and click “Connect” button.

Step 4 – Write or paste Python script in the popup Python script window and click Ok button.

I chose to use Pandas to get csv file:

import numpy as np
import pandas as pd
data = pd.read_csv(‘C:/mydatafile.csv’, delimiter=’,’)

Note if you are using Windows as I am, you have to write file path so Windows can recognize it otherwise you will get an error. I used foreslashes but you could also escape the backslashes.

You could also write any Python script that gets or generates data from any data source eg a database, from AWS S3 buckets, etc. Of course this also includes transforming data. The important thing to note is your script must output a named dataset(s). The named dataset(s) will be thing Power BI will use when you are finished coding.

Step 5 – After clicking ok, you get the familiar Power Query data source selection window. Select Python and you will see your named data set(s). Just check the one(s) you want and click ok.

Now just continue working with Power BI as normal! You can always go back to edit the Python script by select “Source” in query editor.

BC Executive Compensation Tableau Visualization

The Province of British Columbia publishes public agency executive compensation each year. The data is grouped by:

  • Sector – Employers are categorized in one of six sectors – Colleges, Crown Corporations, Education, Health, Public Service, Universities
  • Employer – Employers belong to one Sector. There are about 190 employers such as BC Assessment Authority, BC Housing Management Commission, BC Hydro, etc
  • Employee – An employee belongs to one Employer and they publish the full name of employee.

The employee’s salary is given as:

  • Base Salary
  • Total Compensation

The data is provided each year as pdf and csv data files. I downloaded csv data files for each year from 2012 to 2016 and combined them into one dataset.

Here is link to Tableau Public Dashboard visualizing the data and the chart is embedded below.

 

I used Python to combine the data. Some of the Sector values where missing spaces, one of the years had a “$” symbol in salary column values which had to be removed, and I added a new year column based on the data file’s name.

Here is the Python code used to combine and clean up the data:

import csv
    import pandas as pd
    from datetime import date, datetime
    import glob
    import os

    def main():

        ## path for data files
        data_path = 'C:/bc_salary_data/'

        allFiles = glob.glob(data_path + "*.csv")
        frame = pd.DataFrame()
        list_ = []
        
        for file_ in allFiles:
            df = pd.read_csv(file_,index_col=None, header=0)
            df['filename'] = os.path.basename(file_).split('.')[0]
            df['Sector'] = df['Sector'].str.replace(' ','').str.replace('CrownCorporations', 'Crown Corporations').str.replace('CrownCorporation', 'Crown Corporations').str.replace('PublicService', 'Public Service')
            df['year'] =  df['filename'].str[-4:]
            df['Base Salary'] = df['Base Salary'].apply(str).str.replace('

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 
,'')
            df['Total Compensation'] = df['Total Compensation'].apply(str).str.replace('

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 
,'')
            list_.append(df)
            
        frame = pd.concat(list_)

        ## Create csv file to write results to
        frame.to_csv('merged.csv')

        print str(datetime.now()) + ' data files merged'
        
        return

    if __name__ == '__main__':
        main()

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 

Twitter Followers and List Membership Tracking

I have created my own historical reporting of my Twitter account followers and list memberships. I have hosted the reporting on this open webpage.

Most people know what followers are but the list memberships aren’t as well known so here are definitions of them.

Twitter Followers – Twitter followers are other Twitter accounts who follow my account. A Twitter Follower may follow my account for a few days and then unfollow me at any time afterwards.

Twitter List Memberships – Twitter accounts can also create their own Lists and then add other Twitter accounts to the lists. For example a Twitter account may create a List called “Tweet about Big Data” to track Twitter accounts they believe “Tweet about Big Data”. Twitter accounts may add my account to their List and remove it at any time afterwards.

The Twitter data retrieval, writing to database, and data querying are all done on a web server.

In order to record changes in counts of these you need to have daily historical data. The Twitter API doesn’t provide historical data so you need create it yourself by retrieving and saving daily data somewhere.

Three Python scripts using Twitter API, Python Tweepy and AWS SDK are scheduled to run daily using cron jobs.

Two scripts retrieve followers and list memberships and insert the data into a PostgreSQL database. This daily retrieval builds the daily history.

Another script queries the database table to create reporting datasets of new, active and dropped followers and list memberships that are exported as csv files to a AWS S3 folder which also has files for a AWS S3 hosted static website.

The AWS S3 hosted static website uses Chart.js and D3.js Javascript charting libraries to read and visualize the data. This post does not describe how to read the csv files but I have written another post that describes this AWS S3 csv file as D3 report data source

Screenshots of visualizations are shown below.

Daily Follower Counts

Weekly New and Dropped Follow Counts

Follower Duration Days

The Python code to retrieve the Twitter data, transform it and create csv files and upload them to AWS is below.

The code to retrieve Twitter followers:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get secrets
sys.path.insert(0, '/home/secrets/')
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_followers():

    twitter_followers = []
    
    for user in tweepy.Cursor(api.followers, screen_name="twitterusername").items():
        twitter_followers.append(user)
        
    for user in twitter_followers:
        query = "INSERT INTO twitter_followers \
                (date, \
                follower_id, \
                name, \
                screen_name) \
                VALUES \
                (%s, %s, %s, %s)";
        data = (today,
                str(user.id).strip(),
                str(user.name.encode('utf8','ignore')).replace(',','').strip(),
                str(user.screen_name.encode('utf8','ignore')).strip()
                )
        cursor.execute(query, data)
        conn.commit()

    conn.close()
    
    ## print job status to log
    print str(datetime.now()) + ' twitter followers'    
            
if __name__ == '__main__':
	get_followers()

The code to retrieve Twitter list memberships:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get database creds
sys.path.insert(0, '/home/secrets/') 
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_list_memberships():

    twitter_list_memberships = []
    
    for list in tweepy.Cursor(api.lists_memberships, screen_name="twitterusername").items():
        twitter_list_memberships.append(list)
        #print list.full_name
        
    for list in twitter_list_memberships:

        query = "INSERT INTO twitter_list_memberships \
                (date, \
                id_str, \
                name, \
                full_name, \
                member_count, \
                subscriber_count, \
                mode, \
                following, \
                user_screen_name) \
                VALUES \
                (%s, %s, %s, %s, %s, %s, %s, %s, %s)";
        data = (today,
                list.id_str.encode('utf8','ignore'),
                list.name.encode('utf8','ignore'),
                list.full_name.encode('utf8','ignore'),
                list.member_count,
                list.subscriber_count,
                list.mode,
                list.following,
                list.user.screen_name.encode('utf8','ignore'))
        cursor.execute(query, data)
        conn.commit()
        
    conn.close()
    
    ## print status for log
    print str(datetime.now()) + ' twitter list memberships'    
            
if __name__ == '__main__':
	get_list_memberships()

The code to create csv files and upload to AWS S3 bucket:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from datetime import datetime
from os import listdir
from os.path import isfile, join
import sys
import csv
import psycopg2
import re
from collections import Counter

upload_path = '/home/twitter/aws_s3_site/'

sys.path.insert(0, '/home/secret/')
from aws_keys import aws_keys
from secrets import secrets

## create aws S3 connection
conn = S3Connection(aws_keys['AWS_KEY'], aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('twitter-bucket', validate=False)

## create db connection
conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cur = conn.cursor()

## get data sets from db

## followers
cur.execute ("select something here;")
followers = list(cur.fetchall())

## lists
cur.execute ("select something here;")
lists = list(cur.fetchall())

conn.close()

## write db data to csv files, save in upload folder

## followers
with open(upload_path + 'followers.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in followers:
        writer.writerow(row)

## lists
with open(upload_path + 'lists.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in lists:
        writer.writerow(row)
            	
## upload csv files to S3 twitter-bucket
    
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing bucket keys to reset
for key in bucket.list():
    if '.csv' in key:
        bucket.delete_key(key)
  
# complete uploading to AWS
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)
    
## write success to cron log
print str(datetime.now()) + ' aws s3 twitter upload'

Django form geocode submitted address to get lat, lon and postal code

One of my Django applications includes a form where user can enter and submit a property address.

django form

The user submitting the form might not know the postal code so I left it optional. However the postal code is a key piece of information for this particular application so I wanted to ensure that I was getting it.

I also wanted to geocode the address immediately to get the address latitude and longitude so it could be shown to user on a Leaflet.js map.

There are lots of free geocoding services for low intensity usage but I ended up using Google Geocoding which is free under certain usage level. You just need to create a Geocoding API project and use the credentials to set up the geocoding.

To interact with the geocoding API I tried Python gecoding modules geopy and geocoder but in the end just used Python Requests module instead as it was less complicated.

When the user clicked the Submit button, behind the scenes, Requests submitted the address to Google’s Geocoding API, gets the JSON response containing the latitude, longitude and postal code which are then written to the application database.

I will update the code in future to check if the user’s postal code is correct and replace it if it is incorrect. Will wait to see how the postal code accuracy looks. Making geocoding API requests too often could bump me over the free usage limit.

The Django View that contains the code described above is shown below.

def property_add(request):
   
    property_list = Property.objects.filter(user_id=request.user.id).order_by('created')
    
    if request.method == 'POST':
        form = PropertyForm(request.POST)
        if form.is_valid():
            new_property = form.save(commit=False)
            address = "%s, %s, %s, %s" % (new_property.address1, new_property.city, new_property.state, new_property.postcode)
            google_geocode_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'
            url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + "'" + address + "'" + '&key=' + google_geocode_key
            
            try:
                response = requests.get(url)
                geoArray = response.json()
                new_property.lat = geoArray['results'][0]['geometry']['location']['lat']
                new_property.lon = geoArray['results'][0]['geometry']['location']['lng']
                new_postcode = geoArray['results'][0]['address_components'][7]['long_name']
                new_fsa = geoArray['results'][0]['address_components'][7]['short_name'][:3]
            except:
                new_property.lat = None
                new_property.lon = None
                new_postcode = None
                new_fsa = None
           
            if new_property.postcode:
                new_property.fsa = new_property.postcode[:3]
            else:
                new_property.postcode = new_postcode
                new_property.fsa = new_fsa
           
            new_property.user_id = request.user.id
            new_property = form.save()
            return HttpResponseRedirect(reverse(property, args=(new_property.pk,)))
    else:
        form = PropertyForm()

    context_dict = {
        'form': form, 
        'property_list': property_list,
    }
        
    return render(
        request,
        'property_form.html',
        context_dict,
        context_instance = RequestContext(
            request,
            {
                'title':'Add Property',
             }
            )
    )    

 

Leaflet.js choropleth map color by count using geoJSON datasource

I have a Django web application that needed an interactive map with shapes corresponding to Canadian postal code FSA areas that were different colors based on how many properties were in each FSA. It ended up looking something like the screenshot below.

map1

This exercise turned out to be relatively easy using the awesome open-source Javascript map library Leaflet.js.

I used this Leaflet.js tutorial as the foundation for my map.

One of the biggest challenges was finding a suitable data source for the FSAs. Chad Skelton (now former) data journalist at the Vancouver Sun wrote a helpful blog post about his experience getting a suitable FSA data source. I ended up using his BC FSA data source for my map.

Statistics Canada hosts a Canada Post FSA boundary files for all of Canada. As Chad Skelton notes these have boundaries that extend out into the ocean among other challenges.

Here is a summary of the steps that I followed to get my choropleth map:

1. Find and download FSA boundary file. See above.

2. Convert FSA boundary file to geoJSON from SHP file using qGIS.

3. Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

4. Create Leaflet.js map in HTML page basically the HTML DIV that holds the map and separate Javascript script that loads Leaflet.js, the FSA geoJSON boundary data and processes it to create the desired map.

Find and download FSA boundary file.

See above.

Convert FSA boundary file to geoJSON from SHP file using qGIS.

Go to http://www.qgis.org/en/site/ and download qGIS. Its free and open source.

Use qGIS to convert the data file from Canada Post or other source to geoJSON format. Lots of blog posts and documentation about how to use qGIS for this just a Google search away.

My geoJSON data source looked like this:


var bcData = {
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::4269" } },
    "features": [
    { "type": "Feature", "properties": { "CFSAUID": "V0A", "PRUID": "59", "PRNAME": "British Columbia \/ Colombie-Britannique" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -115.49499542, 50.780018587000029 ], [ -115.50032807, 50.77718343600003 ], [ -115.49722732099997, 50.772528975000057 ], [ -115.49321284, 50.770504059000075 ], [ -115.49393662599999, 50.768143038000062 ], [ -115.50289288699997, 50.762270941000054 ], [ -115.50846411599997, 50.754243300000041 ], [ -115.5104796, 50.753297703000044 ], [ -115.51397592099994, 50.748953800000038 ], [ -115.51861431199995, 50.745737989000077 ], [ -115.52586378899997, 50.743771099000071 ], [ -115.53026371899995, 50.74397910700003 ], [ -115.53451319199996,

 

Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

I used a SQL query in the Django View to get count of properties by FSA.

This dataset looks like this in the template. These results have only one FSA, if it had more it would have more FSA / count pairs.

Below is code for  the Django view query to create the fsa_array FSA / counts data source.

    

cursor = connection.cursor()
    cursor.execute(
    "select fsa, count(*) \
    from properties \
    group by fsa \
    order by fsa;")
    fsas_cursor = list(cursor.fetchall())

    fsas_array = [(x[0].encode('utf8'), int(x[1])) for x in fsas_cursor]

My Javascript largely retains the Leaflet tutorial code with some modifications:

1. How the legend colors and intervals are assigned is changed but otherwise legend functions the same.

2. Significantly changed how the color for each FSA is assigned. The tutorial had the color in its geoJSON file so only had to reference it directly. My colors were coming from the View so I had to change code to include new function to match FSA’s in both my Django view data and the geoJSON FSA boundary file and return the appropriate color based on the Django View data set count.


var fsa_array = [["V3J", 19]];

var map = L.map('map',{scrollWheelZoom:false}).setView([ active_city_center_lat, active_city_center_lon], active_city_zoom);

map.once('focus', function() { map.scrollWheelZoom.enable(); });

var fsa_array = fsas_array_safe;

L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw', {
    maxZoom: 18,
    attribution: 'Map data ©' + OpenStreetMap CC-BY-SA' + 'Imagery ©' + 'Mapbox'
    id: 'mapbox.light'
}).addTo(map);

// control that shows state info on hover
var info = L.control();

info.onAdd = function (map) {
    this._div = L.DomUtil.create('div', 'info');
    this.update();
    return this._div;
};

info.update = function (props) {
    this._div.innerHTML = (props ?
        '' + props.CFSAUID + ' ' + getFSACount(props.CFSAUID) + ' lonely homes' 
        : 'Hover over each postal area to see lonely home counts to date.');
};

info.addTo(map);

// get color 
function getColor(n) {
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c' 
           : n > 25 ? '#fc4e2a' 
           : n > 20 ? '#fd8d3c'
           : n > 15  ? '#feb24c'
           : n > 10  ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';
}     

function getFSACount(CFSAUID) {
    var fsaCount;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            fsaCount = ' has ' + fsa_array[i][1];
            break;
        }
    }
    if (fsaCount == null) {
         fsaCount = ' has no '; 
    }
    return fsaCount;
}

function getFSAColor(CFSAUID) {
    var color;
    for (var i = 0; i < fsa_array.length; i++) {
    if (fsa_array[i][0] === CFSAUID) {
        color = getColor(fsa_array[i][1]);
        //console.log(fsa_array[i][1] + '-' + color)
        break;
        }
    }
    return color;
}
    
function style(feature) {
    return {
        weight: 1,
        opacity: 1,
        color: 'white',
        dashArray: '3',
        fillOpacity: 0.7,
        fillColor: getFSAColor(feature.properties.CFSAUID)
    };
}

function highlightFeature(e) {
    var layer = e.target;
    layer.setStyle({
        weight: 2,
        color: '#333',
        dashArray: '',
        fillOpacity: 0.7
    });

    if (!L.Browser.ie && !L.Browser.opera) {
        layer.bringToFront();
    }

    info.update(layer.feature.properties);
}

var geojson;

function resetHighlight(e) {
    geojson.resetStyle(e.target);
    info.update();
}

function zoomToFeature(e) {
    map.fitBounds(e.target.getBounds());
}

function onEachFeature(feature, layer) {
    layer.on({
        mouseover: highlightFeature,
        mouseout: resetHighlight,
        click: zoomToFeature
    });
}

geojson = L.geoJson(bcData, {
    style: style,
    onEachFeature: onEachFeature
}).addTo(map);

var legend = L.control({position: 'bottomright'});

legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend'),
        grades = [0, 1, 5, 10, 15, 20, 25, 30],
        labels = [],
        from, to;

    for (var i = 0; i < grades.length; i++) {
        from = grades[i];
        if (i === 0) {
            var_from_to = grades[i];
            var_color = getColor(from);
        } else {
            var_from_to =  from + (grades[i + 1] ? '–' + grades[i + 1] : '+') ;
            var_color = getColor(from + 1);
        }
        
        labels.push(
            ' ' +
             var_from_to);
    }

    div.innerHTML = labels.join('');
    return div;
};

legend.addTo(map);

That is pretty much all there is to creating very nice looking interactive free open-source choropleth maps for your Django website application!