Leaflet.js choropleth map color by count using geoJSON datasource

I have a Django web application that needed an interactive map with shapes corresponding to Canadian postal code FSA areas that were different colors based on how many properties were in each FSA. It ended up looking something like the screenshot below.


This exercise turned out to be relatively easy using the awesome open-source Javascript map library Leaflet.js.

I used this Leaflet.js tutorial as the foundation for my map.

One of the biggest challenges was finding a suitable data source for the FSAs. Chad Skelton (now former) data journalist at the Vancouver Sun wrote a helpful blog post about his experience getting a suitable FSA data source. I ended up using his BC FSA data source for my map.

Statistics Canada hosts a Canada Post FSA boundary files for all of Canada. As Chad Skelton notes these have boundaries that extend out into the ocean among other challenges.

Here is a summary of the steps that I followed to get my choropleth map:

1. Find and download FSA boundary file. See above.

2. Convert FSA boundary file to geoJSON from SHP file using qGIS.

3. Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

4. Create Leaflet.js map in HTML page basically the HTML DIV that holds the map and separate Javascript script that loads Leaflet.js, the FSA geoJSON boundary data and processes it to create the desired map.

Find and download FSA boundary file.

See above.

Convert FSA boundary file to geoJSON from SHP file using qGIS.

Go to http://www.qgis.org/en/site/ and download qGIS. Its free and open source.

Use qGIS to convert the data file from Canada Post or other source to geoJSON format. Lots of blog posts and documentation about how to use qGIS for this just a Google search away.

My geoJSON data source looked like this:

var bcData = {
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::4269" } },
    "features": [
    { "type": "Feature", "properties": { "CFSAUID": "V0A", "PRUID": "59", "PRNAME": "British Columbia \/ Colombie-Britannique" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -115.49499542, 50.780018587000029 ], [ -115.50032807, 50.77718343600003 ], [ -115.49722732099997, 50.772528975000057 ], [ -115.49321284, 50.770504059000075 ], [ -115.49393662599999, 50.768143038000062 ], [ -115.50289288699997, 50.762270941000054 ], [ -115.50846411599997, 50.754243300000041 ], [ -115.5104796, 50.753297703000044 ], [ -115.51397592099994, 50.748953800000038 ], [ -115.51861431199995, 50.745737989000077 ], [ -115.52586378899997, 50.743771099000071 ], [ -115.53026371899995, 50.74397910700003 ], [ -115.53451319199996,


Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

I used a SQL query in the Django View to get count of properties by FSA.

This dataset looks like this in the template. These results have only one FSA, if it had more it would have more FSA / count pairs.

   var fsa_array = [["V3J", 19]];

Below is code for  the Django view query to create the fsa_array FSA / counts data source.

    cursor = connection.cursor()
    "select fsa, count(*) \
    from properties \
    group by fsa \
    order by fsa;")
    fsas_cursor = list(cursor.fetchall())

    fsas_array = [(x[0].encode('utf8'), int(x[1])) for x in fsas_cursor]

My Javascript largely retains the Leaflet tutorial code with some modifications:

1. How the legend colors and intervals are assigned is changed but otherwise legend functions the same.

2. Significantly changed how the color for each FSA is assigned. The tutorial had the color in its geoJSON file so only had to reference it directly. My colors were coming from the View so I had to change code to include new function to match FSA’s in both my Django view data and the geoJSON FSA boundary file and return the appropriate color based on the Django View data set count.

var map = L.map('map',{scrollWheelZoom:false}).setView([ active_city_center_lat, active_city_center_lon], active_city_zoom);

map.once('focus', function() { map.scrollWheelZoom.enable(); });

var fsa_array = fsas_array_safe;

L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw', {
    maxZoom: 18,
    attribution: 'Map data © OpenStreetMap contributors, ' +
        'CC-BY-SA, ' +
        'Imagery © Mapbox',
    id: 'mapbox.light'

// control that shows state info on hover
var info = L.control();

info.onAdd = function (map) {
    this._div = L.DomUtil.create('div', 'info');
    return this._div;

info.update = function (props) {
    this._div.innerHTML = (props ?
        '' + props.CFSAUID + ' ' + getFSACount(props.CFSAUID) + ' lonely homes' 
        : 'Hover over each postal area to see lonely home counts to date.');


// get color 
function getColor(n) {
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c' 
           : n > 25 ? '#fc4e2a' 
           : n > 20 ? '#fd8d3c'
           : n > 15  ? '#feb24c'
           : n > 10  ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';

function getFSACount(CFSAUID) {
    var fsaCount;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            fsaCount = ' has ' + fsa_array[i][1];
    if (fsaCount == null) {
         fsaCount = ' has no '; 
    return fsaCount;

function getFSAColor(CFSAUID) {
    var color;
    for (var i = 0; i < fsa_array.length; i++) {
    if (fsa_array[i][0] === CFSAUID) {
        color = getColor(fsa_array[i][1]);
        //console.log(fsa_array[i][1] + '-' + color)
    return color;
function style(feature) {
    return {
        weight: 1,
        opacity: 1,
        color: 'white',
        dashArray: '3',
        fillOpacity: 0.7,
        fillColor: getFSAColor(feature.properties.CFSAUID)

function highlightFeature(e) {
    var layer = e.target;
        weight: 2,
        color: '#333',
        dashArray: '',
        fillOpacity: 0.7

    if (!L.Browser.ie && !L.Browser.opera) {


var geojson;

function resetHighlight(e) {

function zoomToFeature(e) {

function onEachFeature(feature, layer) {
        mouseover: highlightFeature,
        mouseout: resetHighlight,
        click: zoomToFeature

geojson = L.geoJson(bcData, {
    style: style,
    onEachFeature: onEachFeature

var legend = L.control({position: 'bottomright'});

legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend'),
        grades = [0, 1, 5, 10, 15, 20, 25, 30],
        labels = [],
        from, to;

    for (var i = 0; i < grades.length; i++) {
        from = grades[i];
        if (i === 0) {
            var_from_to = grades[i];
            var_color = getColor(from);
        } else {
            var_from_to =  from + (grades[i + 1] ? '–' + grades[i + 1] : '+') ;
            var_color = getColor(from + 1);
            ' ' +

    div.innerHTML = labels.join('
'); return div; }; legend.addTo(map);

That is pretty much all there is to creating very nice looking interactive free open-source choropleth maps for your Django website application!

Canadian Canola seed crushing more efficient at extracting canola oil

Statistics Canada regularly tweet links to various Canadian statistics. I have occasionally created quick Tableau visualizations of the data and replied with a link to my Tableau Public site.

The idea is to encourage Statistics Canada to start communicating more visually and create their own visualizations for all data. Its extra work but value will be realized when Statistics Canada visitors can more easily understand the data instantly by looking at visualizations instead of munging about with boring data in tables.

This particular tweet was to Statistics Canada data reporting on Canadian canola seed crush, canola oil and meal output. http://www5.statcan.gc.ca/cansim/a47

The data shows Canada’s canola seed production and efficiency at extracting canola oil has increased significantly since 1971.

Canola seed is crushed to extract the oil and the seed meal is leftover. The ratio of oil to meal is about .8 in 2016 compared to .3 in 1971. That is a impressive increase in oil extraction efficiency.

Chart.js tooltip format number with commas

Chart.js V2.0 is a useful javascript charting library. It looks great, has ton of features though it is new enough that there is still some work to find out how to get some relatively simple things done.

In this case I wanted to format the chart’s tooltip. Tooltips are the pop-ups that show when you hover mouse over a bar or line in a chart and show the yAxis value along with any other information you want to include.

By default Chart.js tooltips do not format numbers with commas and there was no simple option to do this.

Instead after some Googling about I found out it required using Chart.js callbacks feature which can be used to format chart elements. Note V1 used a different method that modified tooltip template but that is deprecated in V2.0.

The callback is in the Options’ tooltips section. You put function into the callback that uses regex to insert commas.

callbacks: {
    label: function(tooltipItem, data) {
        return tooltipItem.yLabel.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");

This can be done as a global change to all charts in the page or just to a specific chart which is what I used and is shown in the example below.

The result is that the tooltip now has a commas.

chartjs tooltip number comma

Python get image color palette

I created a web application that included screenshots of about 190 country’s national statistics agencies website home page. I created the site to host the results of comparisons of each country’s website home page features.

One of the features I wanted to include was the top 5 colors used on each home page. For example here is an image of the Central Statistical Office of Zambia website home page.


I wanted a list of the top 5 colors used in the web page, in my case I wanted these as a list of  rgb color values that I could save to a database and use to create a color palette image.

For example the Central Statistical Office of Zambia website home page top 5 rgb colors were:

  1. 138, 187, 223
  2. 174, 212, 235
  3. 101, 166, 216
  4. 93, 92, 88
  5. 242, 245, 247

These 5 colors were not equally distributed on the home page and when plotted as stacked bar plot and saved as an image, looked like this:zambia

How to identify top 5 colors on webpage

Identifying dominant colors in an image is a common task for a variety of use cases so there was a number of options available.

The general technique essentially involves reducing an image to a list of pixels and then identifying each pixel’s color and the relative proportion of that color to all other colors and then taking the top 5 pixel counts to get top 5 colors.

However a web page can have many similar colors for example many shades of blue. So  identification of colors involves a statistical categorization and enumeration to group similar colors into one category, for example, group all shades of blue into one ‘blue’ category.

There are a couple of different statistical methods to do this categorization that are discussed below.

Modified Median Cut Quantization

This method involves counting image pixels by color and charting them on a histogram from which peaks are counted to get dominant colors.

This method is used in a Python module called color-thief-py. This module uses Pillow to process the image and modified median cut quantization

I used this module in the code below where I loop through a folder of screenshots to open image and then pass it to color-thief-py to process. Then the top 5 colors’ rgb strings are written into my database as a list so they can be used later.

    import os, os.path
    from PIL import Image
    import psycopg2
    from colorthief import ColorThief

    conn_string = \
        "host='localhost' \
        dbname='databasename' \
        user='username' \
    conn = psycopg2.connect(conn_string)     
    cur = conn.cursor()

    ## dev paths
    screenshots_path = 'C:/screenshots/'

    screenshots_dir = os.listdir(screenshots_path)
    for screenshot in screenshots_dir:
        if screenshot != 'Thumbs.db':
            img = Image.open(screenshots_path + screenshot)
            width, height = img.size
            quantized = img.quantize(colors=5, kmeans=3)
            palette = quantized.getpalette()[:15]
            convert_rgb = quantized.convert('RGB')
            colors = convert_rgb.getcolors(width*height)
            color_str = str(sorted(colors, reverse=True))
            color_str = str([x[1] for x in colors])
            print screenshot + ' ' + str(img.size[1]) + ' ' + color_str
            cur.execute("UPDATE screenshots \
            set color_palette = %s,  \
            height = %s \
            WHERE filename like %s", \
            (str(color_str),img.size[1], '%' + screenshot + '%',))

K-means clustering

Another statistical method to identify dominant colors in an image is using K-means clustering to group image pixels by rgb values into centroids. Centroid’s with highest counts are identified as dominant colors.

This method is used in the process I found on this awesome blog post on pyimagesearch.com.

This method also uses Python module sklearn k-means to identify the dominant colors. It produced very similar results to the other method.

This code from the pyimagesearch website used Python module matplotlib to create plot of the palette and saved these as images. I modified the code to instead simply save the rgb color values as strings to insert into database and save the matplotlib palette rendered plot as an image. I  added plt.close() after each loop to close the rendered plot after the plot image was saved because if they aren’t closed, they accumulate in memory and crashed the program.

# python color_kmeans.py --image images/jp.png --clusters 3

# import the necessary packages
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import argparse
import utils
import cv2
import os, os.path
import csv

# construct the argument parser and parse the arguments
#ap = argparse.ArgumentParser()
#ap.add_argument("-i", "--image", required = True, help = "Path to the image")
#ap.add_argument("-c", "--clusters", required = True, type = int, help = "# of clusters")
#args = vars(ap.parse_args())

screenshots_path = 'screenshots/'

screenshot_palette = list()

## Create csv file to write results to
file = csv.writer(open('palettes.csv', 'wb'))

screenshots_dir = os.listdir(screenshots_path)
for screenshot in screenshots_dir:
    if screenshot != 'Thumbs.db':
        print screenshot

        # load the image and convert it from BGR to RGB so that
        # we can dispaly it with matplotlib
        image = cv2.imread(screenshots_path + screenshot)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # show our image

        # reshape the image to be a list of pixels
        image = image.reshape((image.shape[0] * image.shape[1], 3))

        # cluster the pixel intensities
        clt = KMeans(n_clusters = 3)

        # build a histogram of clusters and then create a figure
        # representing the number of pixels labeled to each color
        hist = utils.centroid_histogram(clt)
        bar = utils.plot_colors(hist, clt.cluster_centers_)
        #color_palette = utils.plot_colors(hist, clt.cluster_centers_)
        #print color_palette
        #row = (screenshot, [tuple(x) for x in color_palette])
        #print row
        #print row[0]
        #print row[1]
        # show our color bar
        plt.savefig('palettes/' + screenshot)
        #        row[0].encode('utf-8', 'ignore'),
        #        row[1],
        #        ])

Excel Power Query tutorial using Canadian potato production statistics

This data comes from Statistics Canada.


Statistics Canada download pages often provide the opportunity to modify the data structure and content before it is downloaded.

For example clicking on the [Add/Remove data] link provides options select different groupings of data by provinces. I choose to group data by provinces.


I also limited the content to only yield and production. There are other metrics available though they are often differing presentations of the data eg relative amounts by category etc.


I selected option to download the data as a csv formatted text file.

This csv file data had to be transformed before it could be used for analysis.

Statistics Canada often includes non-data text such as report titles and descriptions in the first rows of text files and footer notes in rows below the actual data.


Before we can work with this data these title and footer rows have to be removed.

For this job I used Excel Power Query which can do the ETL. It can extract the data from the csv file, transform the data to a format that is amenable to analysis and load the data into a worksheet (or model) so it can be used for analysis.

Power Query has capabilities and features that match those of many more advanced ETL / integration software such as SSIS, Talend, Informatica, etc.

I have been encouraging and providing training to Excel users to move all of their ETL work in to Power Query. It brings the advanced capabilities of these tools to the relatively non technical desktop business user.

People too often copy and paste data from other sources into their Excel file.

With Power Query the original data file is a data source, and is linked into the Power Query, thus remains untouched and can simply be refreshed to get new or modified data at any time, providing that the columns and content remain unchanged. This makes it easy to get updated time sequence data as it comes available.

So back to our job. The first step was to link to the csv data file downloaded from Statistics Canada into Power Query by using New Query – From File – From CSV


Select the data file and click Import.


Here is where we begin to bump into data file format challenges. Power Query can’t automatically determine file format because there are non-data text strings at top (report titles, descriptions) and bottom (footer notes).

I would like to encourage Statistics Canada and any other data provider to avoid doing this. A data file should be in a predictable tabular (CSV, tab separated) or other format (JSON, XML) without any other content that needs to be processed out.


But no problem.  Power Query is equipped to deal with these situations.

Power Query assumed this is csv file but the top rows of titles and descriptions do not have commas so it doesn’t  split top rows into columns and as a result the split rows below are hidden.

One solution which I used was to go into the Power Query Source step settings to change the delimiter from Comma to Equals Sign. Since there are no equals signs the data is not split into columns and remains in one column that we can remove top and bottom non-data rows.



The Power Query Source step now looks like this.

= Csv.Document(File.Contents("C:\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252])


Now we can remove the non-data text from the data file. Select the Remove Rows – Remove Top Rows and enter the number of rows to remove (in this case it is top 7 rows are non-data rows).

= Table.Skip(#"Changed Type",7)


Do the same for the bottom rows but select Remove Rows – Remove Bottom Rows.

= Table.RemoveLastN(#"Removed Top Rows",12)

Now we can load the data to the worksheet and we will see one column of comma separated values which we just need to split to get nice columns of data.


Tell Power Query to split columns by comma.

= Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"})


Also tell Power Query to use first row as headers. If the non-data text rows were not in the file Power Query would have done this automatically.

= Table.PromoteHeaders(#"Changed Type1")


A key data transformation to get this data into a useful tabular format is to unpivot the years columns to rows. This is one of Power Query’s most useful transformations. Many data sources, especially in enterprise business, often have time series or other categories presented as columns but to do dimensional analysis these have to be unpivoted so they are in rows instead of columns.

Note that the Statistics Canada data formatting feature often includes moving time series to rows which would be similar transformation to this Power Query Unpivot.

Here is the pivoted data.


And here is the unpivoted data. This now gives us unique columns of data that can be used by any analytical software.



I also did some other things to clean up the data:

  • Renaming column headers to be shorter, more readable.
  • Changing the dimension values to shorter strings.
  • Changing hundredweight values to pounds multiplying production values by 1000 to get full values.
  • Filtered out Newfoundland because it was missing 2013, 2014, 2015 values.
  • Removed the 2016 column which had no values.

Once loaded back to worksheet we now have nice clean dimensional data ready for us to begin working with in any analytical software. This can be used as data source for Tableau, Qlikview, Cognos, Birst, etc.


Lets take a quick look at the data using Excel Pivot Tables and Charts. Select this worksheet table and select Insert – Pivot Table to create a new pivot table like the one below.



Select Insert Pivot Chart to create new pivot chart from this pivot table.

In this case I selected a stacked bar chart that show total Canadian potato production by province.

This does a good job of demonstrating relative contribution to total by province.

A stacked line chart does a better job of illustrating changes in production by provinces by year.



Here is a Tableau 9 report from the Power Query results.


Here is the complete Power Query M-code copied from the Power Query that gets and transforms the Statistics Canada raw csv data file into the transformed data set ready to be used in Excel Pivot Tables/Charts and Tableau or any other reporting tool.

    Source = Csv.Document(File.Contents("C:\Users\bb\Documents\Dropbox\Data\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Removed Top Rows" = Table.Skip(#"Changed Type",7),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Removed Top Rows",12),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type1"),
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"2016"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each ([Geography] <> "Newfoundland and Labrador (2)")),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",{{"Attribute", "Year"}, {"Area, production and farm value of potatoes", "Dimension"}, {"Geography", "Province"}, {"Value", "Hundredweight Value"}}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Renamed Columns",{{"Hundredweight Value", type number}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type2","Average yield, potatoes (hundredweight per harvested acres) (5,6)","pounds per acre",Replacer.ReplaceText,{"Dimension"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Production, potatoes (hundredweight x 1,000)","production",Replacer.ReplaceText,{"Dimension"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)
    #"Added Custom"

Django recreate database table

Django’s makemigrations and migrate commands are very useful to update existing database tables to reflect model changes.

However if you have made many existing table column name changes, migrate will ask you a series of ‘y/N’ questions about which column names are changed. This can be tedious to cycle through especially if there are many changes.

Depending on the relationships your table has, it may be easier and quicker to:

  • Create a backup of the table by copying and renaming table or exporting table data to csv
  • Drop the table
  • Recreate table from scratch
  • Reload data into the new updated table

The question is how to recreate the table?

After you drop the table you can remove your table model from the models.py field and then run makemigrations and then run migrate —fake which is the special trick to get past migrate wanting your table to exist before it can delete it.

Then after you run migrate –fake, you can put your update model for your table back into your models.py and then makemigrations and migrate and you will get your new updated table recreated in database.

Then you can recover your data from the backup with SQL INSERT or by using database data import feature.


Canadian TCS FDI Officers Twitter list member analysis

Blog post updated to add the following:

  • Updated Python “TCS Members Details” code to get additional information from the List members’ profiles.
  • New Python code “TCS Members Tweets” to download all List members’ Tweets. Will provide some more analysis on this new data soon.


The Canadian Trade Commissioner Service maintains a Twitter List named CDN TCS FDI Officers that has a bunch of Canadian Trade Commissioners as members. Investment Canada also has a similar list.

I was interested to learn how many people were on these lists, how long they had been on the lists, what they were Tweeting about.

So I used the Twitter API and Python Tweepy to retrieve data about list members including:

  • screen_name
  • name
  • followers_count
  • friends_count
  • statuses_count (# of tweets)
  • favourites_count
  • created_at (Twitter account create date)
  • account_age_days (calculated as # days from July 9, 2016)
  • listed_count
  • verified
  • profile_image_url
  • profile_sidebar_fill_color
  • profile_text_color
  • profile_image_url_https
  • profile_use_background_image
  • default_profile_image
  • profile_sidebar_border_color
  • profile_background_color
  • profile_link_color

You can get find complete definitions about these over at Twitter.

The data was output as a simple csv file that I used to create a Tableau Public dashboard which is embedded below. The csv file is attached here.

The Tableau dashboard is interactive and can be sorted by any of the columns by using the sort icon which appears when you hover over the top of column as illustrated in screenshot below.






It would be interesting to try to determine if this Twitter activity has measurable impacts on FDI and Canadian Trade.  For example perhaps foreign investment finds it way to Canada after reading Tweet by one of our Trade Commissioners.

This would require that TCS maintains a CRM (client relationship manager) system and process that records lead sources.

There is some disparity between use of Twitter by the CDN TCS FDI Officers list members as shown by Tweets/Day which total Tweets divided by # days since the account was created. If there is a measurable lift in lead generation by Twitter use then this would be actionable metric.

For the technically minded the Python code is shown below. Note that you need an API account to use with this code.

There is another file tweet_auth import not shown here that contains Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

Here is the “TCS Member Details” code:


 #twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
  import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, 'path/to/folder/with/your/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
     #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    pi = tweepy.API(auth)

    today = datetime.now().date()
    tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')

    member_details_csv = csv.writer(open('tcs_member_details.csv', 'wb'))

    members = []
    member_tweets = []
    for member in tcs_list_members.items():
            member.screen_name.encode('utf-8', 'ignore'),
            member.name.encode('utf-8', 'ignore'),

It would be interesting to see what Tweet topics, other Twitter user mentions, links to webpages, etc. So the next step is to loop through each of the list member’s api.user_timeline to retrieve their Tweet content and do some analysis on them.  For now here is the code and some analysis and visualization in Tableau later.

Here is the “TCS Members’ Tweets” Python code:

    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, '/path/to/your/folder/with/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
    #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    #api = tweepy.API(auth)

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    today = datetime.now().date()

    def get_list_member_tweets():
        ## get invest_canada and cdn-tcs-fdi-officers list members
        tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')
        ## open csv file 
        member_tweets_csv = csv.writer(open('tcs_member_tweets.csv', 'wb'))
        ## write header row column names
        ## loop through list members and get their tweets
        for member in tcs_list_members.items():
            ## get list member tweets
            member_tweets = get_member_tweets(member.screen_name)
            for status in member_tweets:
                ## check tweets for hashtags
                if status.entities['hashtags']:
                        for hashtag in status.entities['hashtags']:
                ## check tweets for user_mentions
                if status.entities['user_mentions']:
                        for user_mention in status.entities['user_mentions']:
                ## write to csv file      
                    status.text.replace('\n',' ').replace('\r',' ').encode('utf8','ignore')

    def get_member_tweets(screen_name):
        ## get list member's tweets
        alltweets = []

        ## can only get max 200 tweets at a time
        new_tweets = api.user_timeline(screen_name = screen_name, count=200)

        ## get oldest tweet already retrieved
        oldest = alltweets[-1].id - 1

        ## iteratively get remaining tweets
        while len(new_tweets) > 0:
            new_tweets = api.user_timeline(screen_name = screen_name, count=200, max_id=oldest)
            oldest = alltweets[-1].id - 1
        ## print out member and # tweets retrieved to show progress
        print screen_name + " %s tweets downloaded" % (len(alltweets))
        ## return all tweets
        return alltweets

    if __name__ == '__main__':

Full code on Github:


Use Excel Power Query to scrape & combine Wikipedia tables

Power Query is quick and easy way to scrape HTML tables on web pages. Here is step by step on getting multiple tables from Wikipedia article and appending them into one Power Query Excel table.

The Wikipedia article List of national and international statistical services has multiple tables with lists of countries’ statistics agencies and their website urls. These are separated by world region.


Step 1: Click the Power Query From Web menu icon and paste the url above into the URL address field and click OK.


Step 2: Power Query finds and shows you a list of tables that are in this web page. Select the tables you want and click Load.

By default you can only select one of the tables.


However you can check the Select multiple items checkbox and you can then select more than one of the tables.


You can also select any of the tables to get a preview of the data they contain. After you have selected all the tables you want, click Load.


After you click Load Power Query will create a new query for each of the selected tables.


Step 3: Append all of the tables into one new combined dataset. You do this with the Table.Combine feature which is confusingly also called Append in the Power Query menu icons.  The menu icon feature combines only two tables by default, however you can simply manually edit the resulting code in the address bar to include all of the tables in the Table.Combine formula.


After you update the Table.Combine formula you will have a new query with all of the tables combined into one dataset.


You can refresh the queries to get any changes to the Wikipedia tables.

Here is M code for the Power Query that gets the Wikipedia table


Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services")),
Data0 = Source{0}[Data]

Here is M code for the Power Query that combines the tables into one dataset:

Source = Table.Combine({#"Africa[edit]",#"Americas[edit]",#"Asia[edit]",#"Europe[edit]",#"Oceania[edit]"})

How to update Office 365 password in Power BI dataset refresh

I recently changed my Office 365 user password for an account that I was using for a Power BI Dataset Scheduled Refresh.

The result was that my Power BI Refresh failed which looked like screenshot below.

powerbi sharepoint dataset - update password-01

So all I had to do was update the authentication Power BI was using to access my Office 365 Sharepoint folder.

It was clear that the Edit credentials link was where I needed to update the password.

That link got me the following page where I selected oAuth2 which is referring to the authentication that Power BI uses with my Office 365 user credentials.

powerbi sharepoint dataset - update password-02

Selecting oAuth2 popped a new browser window where I could enter my Office 365 user and new password and authenticate Power BI.

powerbi sharepoint dataset - update password-1

After clicking Sign In I was returned to the Power BI page and the credential errors above were gone and I could successfully refresh the dataset from my Office 365 Sharepoint files.

powerbi sharepoint dataset - update password-4

The Refresh Schedule log showed the previous failed refresh attempt and the just completed successful refresh. Back in business!

powerbi sharepoint dataset - update password-3

How to avoid wide margins on a Power BI dashboard

A Power BI Report with multiple charts or other objects can be added to a Dashboard in Power BI Online using the pin to dashboard feature.

However this results in a dashboard with very wide margins. This is especially problematic on a mobile device as the screenshot from the Power BI Android application shows. There is a lot of wasted white space.

wide margins

The Desktop app view is a bit better but there is still a lot of white space around the edges.

wide margins desktop


The resolution, until Power BI team make the margins smaller or add feature to change margin width, is to pin charts one by one to the dashboard in order to have them fill out width.

The screenshot below highlights where you can pin your report to the dashboard using different pins.

You can select an individual report’s pin (the one to the right) which will give you dashboard without the wide margins.

Using the pin on the top toolbar will add the entire report with the multiple reports to the dashboard and results in the wide margins seen above.
pin to dashboard

How to use Google Adsense API to download Adsense data

Google’s APIs make getting Adsense (or any other Google service) data easy to download. The code below downloads Adsense data saving results to csv file.

The code uses Google’s AdSense Management API, OAuth 2.0 authorization and the google-api-python-client SDK.


When you run this code for the first time it will open a web browser to get approval for the API application to access your Adsense account data.

When this approval is granted the code saves a refresh token that is used for all future executions of the code instead of requiring the web browser approval. This is handy if you want to run this code using a cron job like I do.

Here is the code

In summary, the code does the following:

  • Authenticates against API application
  • Queries the scope to get list of accounts
  • Loops through accounts
  • Returns requested metrics and dimensions from Adsense account
  • Writes results to csv file
import csv
from datetime import datetime, timedelta
from apiclient import sample_tools
from apiclient.discovery import build
from oauth2client import client

todaydate = datetime.today().strftime('%Y-%m-%d')

def main():    
    ## Authenticate and construct service
    scope = ['https://www.googleapis.com/auth/adsense.readonly']
    client_secrets_file = 'client_secrets.json' 
    ## Authenticate and construct service
    service, flags = sample_tools.init('', 'adsense', 'v1.4', __doc__, client_secrets_file, parents=[], scope=scope)
    ## Get accounts
    accounts = service.accounts().list().execute()
        ## Get account(s) data
        results = service.accounts().reports().generate(
            startDate='2012-01-01', # choose your own start date
            endDate= todaydate,
            metric=['EARNINGS', 'CLICKS','AD_REQUESTS'],
    except client.AccessTokenRefreshError:
        print ('The credentials have been revoked or expired, please re-run the '
           'application to re-authorize')
    ## output results into csv
    header = ['hostname','date','type','earnings','clicks','ad_requests']
    with open('output_adsense.csv', 'wb') as file:
        file.write(','.join(header) + '\n')        
        for row in results.get('rows'):
            file.write(','.join(row) + '\n')

    ## print status for log
    print str(datetime.now()) + ' adsense'
if __name__ == '__main__':

Create API Application, get client_secrets.json

As alluded to above you need to create an API application. This is done in the Google Developers Console . If you haven’t already signed up for this you have to do that first.

Create Project

If you haven’t already you will also have to create a Project. Your API application will be inside this Project. In the Developers Console you can create new Project with drop down at top right.

create app - create project


Once you have a Project you can go to the Enabled APIs tab to select which Google service API(s) you want to include in the project. For this code’s purposes you can just select Adsense Management API.

Create Application – Select API Type

Use the Create credentials button to create your new application. (You are creating the credentials for your new app.)

When you click the button you get choice of three options (see below).  Important point that raises lots of questions: Adsense cannot use a Service Account to authenticate.  That was what I thought I could use to avoid having to do user authentication since I am running code on my server in cron job.

No problem though because as you will see later, there is a one time only step to have user authorize Adsense account access. From that point on, the authentication is done using a refresh token.

So my code above uses the OAuth client ID. So select that for your application too.

create app - choose auth type

Create Application – Select Application Type

Then you will be asked to choose the application type. For my purposes I do not want web application which will require real redirect URIs. I just want simple plain app. So I chose Other.

create app - choose app type

You will then get client secret and client id which you can copy or get later. You don’t need these as you will get them in the client_secret.json file you download next.

So just change default application name to something unique so you can identify it later.

Create Application – OAuth Consent Screen

This is not something critical but you have to provide an email and application name to show in the browser pop up for authentication.

The idea is to give a user (you) information about who is asking for permission to access your Adsense account (you are).

create app - consent screen


Create Application – Download Client Secret JSON File

Then click the download button to download the client_secret.sjon file.  Once the JSON file is downloaded click create.

create app - download secret json

The JSON file downloads with longer name like “client_secret-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json”. You will have to rename the JSON file so it is spelled “client_secret.json”.

Use Code and client_secret.json File to Authenticate 

Put the python code above into .py file along with the client_secret.json file into a folder.

Local server (desktop/laptop)

Run the .py file using the command line which should do the following:

  • Pop up browser and go to Google login page where and ask you to sign in and allow or allow if you already logged in.
  • The command line will advance and finish.
  • Creates a .dat file in your folder.

The .dat file name will be whatever you named your .py file. This dat file is your refresh token that will be used to authenticate access in future.

Remote Server

Copy your Python, client_secret and .dat (refresh token) files to server and run their by cron jobs.


Code Variables

You may want to change the Python code to select different Adsense metrics and dimensions. I am only selecting a handful of the API metrics and dimensions. You can check out  these in more detail AdSense Management API Metrics and Dimensions documentation.


Create Python 3 Virtualenv on machine with Python 2 & 3 installed

I have been using Python 2.7 for most of my Python work but have a few projects where I want to use Python 3.x. In these cases I want to create a Virtualenv virtual environment with Python 3.x.

Turns out that is is no problem to have Python 2.7 and Python 3.x installed on the same machine as long as they are installed in their own folders.

You just have to selectively call either when you want to use them but you choose which version you want to have in your Windows path so when you call ‘python’ at command line, you run that version by default. I have Python 2.7 in my Windows environment Path.

So to create a new Python 2.7 virtual environment just call Virtualenv normally:

c:\path\to\my\project>virtualenv venv

To create a new Python 3.x virtual environment just use the ‘-p’ switch and use the Python 3.x path and executable to have Virtualenv create new virtual environment with Python 3.x:

c:\path\to\my\project>virtualenv -p c:\python34\python.exe venv

Use OneDrive API to upload files to Office 365 Sharepoint Site

I have automated uploading files from my web site host’s server to my Office 365 Sharepoint site using scheduled cron jobs running Python scripts on my web host.

The Python scripts use Microsoft’s Azure Active Directory Library (ADAL) to authenticate off Azure Active Directory (Azure AD or ADD), and  OneDrive API and Python Requests to use the authentication to upload the files to Sharepoint from my web host.

Here is the code

import adal
import urllib
import requests

## set variables
username = 'user@contoso.onmicrosoft.com'
password = 'password'
authorization_url = 'https://login.windows.net/contoso.onmicrosoft.com' # aka Authority
redirect_uri = 'https://login.microsoftonline.com/login.srf' # from Azure AD application
client_id = 'd84cbf4f-dc23-24d1-8a7d-08ff8359879a' # from Azure AD application
file_url = 'https://contoso.sharepoint.com/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content'

## use ADAL to create token response
token_response = adal.acquire_token_with_username_password(

## Use ADAL to create refresh token and save as text file to reuse
refresh_token = token_response['refreshToken']
refresh_token_file = open('refresh_token.txt', 'w')

## Get saved refresh token and use it to get new token response
refresh_token = open('refresh_token.txt', 'r').read()
token_response = adal.acquire_token_with_refresh_token(authorization_url, str(refresh_token))

## get access_token from token response JSON string
access_token = token_response.get('accessToken')

## create http header to send access token to authenticate
headers = {'Authorization':'BEARER ' + str(access_token)}

## example to upload file
upload_file = requests.put(file_url, data = open('myfilename.csv', 'rb'), headers=headers)


There are many things to consider when working with Microsoft’s APIs to work with its online services such as Office 365.

The first is how to authenticate. Microsoft is trying to move everyone to use Azure AD to do oAuth authentication. Microsoft services still have their own authentication methods but this exercise I used Azure AD.

The second is what API to use. Microsoft has recently released their Graph API that is ‘one endpoint to rule them all’. However Microsoft services still have their own API’s so while Graph API looks tempting for this exercise I used the OneDrive API.

Azure AD Authentication

The authentication will be done in two parts.

  1. Create Azure AD application to do the authentication for the Microsoft service(s) you want to interact with.
  2. Use ADAL to interact with Azure AD to do the oAuth flow.

Setup Azure AD – create application

Microsoft provides free use of Azure AD for light authentication needs. You can register and create account. Once you have your account you need to create a new application.

For my purposes I created an Azure AD native client application. Azure AD also has web application and web APIs but both require user to enter username and password in web browser. The native client application does technically also require user to enter these too but I hacked past this by using ADAL user authentication and hard coding username and password into the Python code. Since these are going onto my web host in protected directory to run as cron jobs they will be safe.

I am not going to go through the detail of creating an Azure AD application there are some good blog posts and Microsoft does good job of describing it. For example take a look at this site which has decent information about creating a new Azure AD application.

The Azure AD applications allow you to choose which Microsoft services it will be used to authenticate. Confusingly these are also called ‘applications’ too. They are represent Microsoft Services such as Office 365 Sharepoint Online, OneNote, Power BI, etc and is the place where you assign the permissions (also called ‘scopes’) that authentication will allows with that Microsoft service.

An Azure AD application might provide authentication for more than one Microsoft Service. But my native client application has only Windows Azure Active Directory permissions (which are there by default) and Office 365 Sharepoint Online permissions set to Read and write user files and Read and write items in all site collections.

After you have created your client application make sure to copy the client_id and resource_uri to use in code below. The client_id is automatically assigned and the resource_uri for a native client app can be any url and is just a unique identifier. I chose the Office 365 login url. The web applications need a real url because that is where the user will be prompted to enter credentials.

Azure Active Directory Library (ADAL)

Microsoft’s Azure Active Directory Library (ADAL) authentication libraries are created for developer’s to use with Azure AD. I used the ADAL Python SDK which was easily installed with pip install adal. 

The oAuth authentication flow can seem very complex but you don’t have to worry about that if you use ADAL. ADAL uses your Azure AD application credentials (client_id, resource_uri in case of native client application) to retrieve a token response which is a text string in JSON format.

This JSON string includes the actual access token that is used to authenticate accessing Sharepoint and upload the files. You can use Python to retrieve the access token (it is a Dictionary). Then you simply put the access token into a header that will be used in the Put Request as the method of passing the access token to the OneDrive API.

ADAL also takes care of refreshing tokens which expire. In my case where the scripts are running on the server as cron jobs I want the token to refresh automatically. ADAL gets a refresh token that you can save to get a new access token when previous one expires. I actually write the refresh token to a text file on the server and refresh the access token each time code is run. I could only refresh it if the previous one expires.

OneDrive API

The OneDrive API has different configurations depending on whether you are using it to access a OneDrive Personal, OneDrive Business or Sharepoint Online account.

Be warned that the documentation for OneDrive API can be very dense and there are different ways of presenting required syntax to identify interactions. Of course the representations vary with different SDKs too.  Also there are different versions of Microsofts file storage services over the years. So I recommend to focus on the newest OneDrive API and make sure you are looking at documentation relevant to newest version.

The Gotchas

ADAL Default Values

ADAL has default client_id and resource values that it uses for the username authentication. I changed these default values to match my Azure AD application.

Before changing these I was getting an Invalid audience Uri error

{“error”:”invalid_client”,”error_description”:”Invalid audience Uri ‘https:\/\/m

This error means the url being used to create the token response was not same as the one that the file was being uploaded to.

EDIT August 21, 2016 Microsoft has updated the ADAL library so that you can specify the client id and the resource value because authentication against different services needs different client id and resource endpoint urls. That means the hack I used below is no longer required. For more details seehttps://github.com/AzureAD/azure-activedirectory-library-for-python

In ADAL’s __init__.py file look for the class _DefaultValues class at bottom of code and replace the default values:

  • I changed client_id to my application’s client_id
  • I changed resource from https://management.core.windows.net/ to https://tenant.sharepoint.com/

The acquire_token_with_username_password function sets these to None so they get set to default values. So this could be changed so they accept values from the code.

Sharepoint Site and Folder Paths

The OneDrive API dev documentation https://dev.onedrive.com/getting-started.htm demonstrates the different service urls:

  • OneDrive – https://api.onedrive.com/v1.0
  • OneDrive for Business – https://{tenant}-my.sharepoint.com/_api/v2.0
  • SharePoint Online – https://{tenant}.sharepoint.com/{site-relative-path}/_api/v2.0

The {site-relative-path} notation indicates the Sharepoint site name. My site didn’t have this because it was default site. However you might have to add your site relative path.

Also the Sharepoint url for the file I was uploading looked like this:


However you will note that the file_url in the code doesn’t make any reference to the Shared%20Documents:



Careful! Don’t click “Try Power BI for free”

This was a weird quirk.

I have Power BI Free account and uploaded a report to Power BI Service. The report has dataset that gets data from a Sharepoint file.

In Power BI Service I went to the dataset “Schedule Refresh”, selected “Connect Directly”, “Enter Credentials” as oAuth, then entered my Office 365 credentials. This setup the connection successfully to the Sharepoint file, and then I could switch the “Keep your data up to date” to “Yes”.

Then I accidentally clicked the “Try Pro for Free” button.

From that point on, every time I selected the Power BI Service dataset or the report, I got a pop up blocking message “To see this report upgrade to power bi pro”.

pro upgrade

The only way to make it stop was to switch the “Keep your data up to date” to “No”.

The only Pro feature is hourly updates. The Free Power BI Service version only allows daily updates. I hadn’t selected hourly updates so that wasn’t the problem. Just some weird quirk.

The resolution was to delete the report and dataset that I just scheduled refresh for, and then upload the report again and then redo the schedule refresh as per above (without accidentally clicking on the Try Pro for free button) to make it work again.

How to schedule Power BI dataset refresh

Do you want to create a Power BI Report that gets a daily scheduled refresh of data from a Sharepoint csv file?

The first step is to create your Power BI report in Power BI Desktop using the Sharepoint csv file as data source.

In Power BI Desktop use Get Data – File – Sharepoint Folder to connect to your Sharepoint Folder.

The resulting dataset query (Power Query) will look something like mine below. You will replace “mydomain” with your Sharepoint account name or domain.

You will also replace “datafile.csv” with your csv file name. The Power BI connection is to a Sharepoint folder which might have more than one file like I did. If you have only one file in the folder the filter will be redundant but can’t hurt.

Source = SharePoint.Files("https://mydomain.sharepoint.com", [ApiVersion = 15]),
#"Filtered Rows" = Table.SelectRows(Source, each ([Name] = "datafile.csv")),
#"Combined Binaries" = Binary.Combine(#"Filtered Rows"[Content]),
#"Imported CSV" = Csv.Document(#"Combined Binaries",[Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV")
#"Promoted Headers"

After you publish your report to your Power BI Online account you can select your newly uploaded dataset’s “Schedule Refresh” property where you can set up the refresh schedule.



First go to “Gateway connection”.

I selected “Connect Directly” which requires that you also enter Sharepoint credentials in the “Edit credentials” link which pops up a web page that prompts you to login into your Sharepoint account. This gives Power BI Service permission to access your Sharepoint account to refresh file.

If you have an enterprise gateway setup you could try “Enterprise Gateway” and enter the required credentials for that.




If you entered credentials correctly you should now be able to select the “Keep your data up to date” switch to “Yes”.

Then you can select which four 6-hour window you want refresh to run. Power BI Service free accounts can do daily refreshes. Pro accounts can have hourly updates.

As an aside be warned that if you click the “Try Pro for free” button you might get a blocking message that you are using Pro feature. This happened to me and was clearly a quirky error. I had to delete my report and dataset and re-upload them and redo the scheduling to get rid of the error.



You can try refreshing the dataset manually (On demand) or wait for the next scheduled refresh (Scheduled) to happen to see if the data does refresh. You can see refreshes are successful and when they ran by clicking the “refresh history ” link.




Power BI Online – get data from Office 365 Sharepoint file

I want to create a Power BI Online report with a data source from a file on a remote web server that updates automatically so my Power BI report is always up to date.

Power BI Desktop and Online have lots of data connectors to third party ‘Online Services’ eg Salesforce, Mailchimp, Github, etc, as well as file and database connectors. But none of these help to get the file from my remote server directly.

There is no feature to connect to a file on a remote server. I could put my remote file data into MySQL or Postgres database and Power BI could connect to those but my remote server doesn’t allow external connections to hosted databases. So that is not an option for me.

A Power BI Online report can get data from a Sharepoint site file that will update automatically on schedule.

Since I have an Office 365 E3 account which has Sharepoint site I upload my remote file to the Sharepoint site and create a Power BI Online report linked to that Sharepoint file.

I would still have to figure out how to automate uploading my remote server file to my Office 365 E3 account Sharepoint site. But I am pretty sure I can do that with the OneDrive API but more on that in Part 2.

Here is a diagram outlining what I think my solution could be.


In the meantime to test using Office 365 Sharepoint file as data source for a Power BI Online report, I created a Power BI report in Desktop with file data source from my Office 365 Sharepoint site and Published it to my Power BI Online account.

After publishing the report to my Power BI Online account, I logged into Power BI Online, opened the newly published report and went to data source options and selected ‘Schedule Refresh’ which produced screen below.


I set ‘Keep your data up to date’ to ‘Yes’ and selected ‘Connect Directly’ which gave me error message telling me I had to update credentials.

Not surprisingly the report I published to Power BI Online didn’t ‘remember’ that I had already authorized the Power BI Desktop report to get file from my Office 365 Sharepoint site so I have to do it again in Power BI Online.

So I selected ‘Edit Credentials’ and then selected ‘oAuth’ as type of credentials which popped up Office 365 login screen where I entered my user and password clicked login and was returned back to Power BI online page.

The error message was gone so this must have created oAuth authentication to link the file data source from my Office 365 Sharepoint site into the data source in Power BI Online.

Then I set the automatic refresh to one of the daily 6-hour windows (below i have selected 12 pm to 6 pm) for refresh to run (hourly refresh is a Power BI Pro feature).

The report data source now refreshes from my Sharepoint file on daily automatic schedule so it looks like I am half way to my solution.

I will write another blog post detailing how I will automate moving my data file from my remote server to my Office 365 Sharepoint site. Pretty sure I will be using the OneDrive API https://dev.onedrive.com but there are other options too.

In the meantime talk a look at the previous blog post summarizing the OneDrive and Sharepoint API options.

One challenge I have encountered so far is that the OneDrive Python SDK is made for web apps and I want to setup server app (native client app).  More to come.

OneDrive API features

Microsoft has three file storage options:

  1. OneDrive Personal
  2. OneDrive Business
  3. Sharepoint

These have recently been unified into one new OneDrive API https://dev.onedrive.com and oAuth is preferred method of authentication.

However there are some key differences how the API:

  • OneDrive Personal authenticates against oAuth account created at Microsoft Application Registration Portal using a Microsoft account (Live, Microsoft.com). Authentication url is: https://login.live.com/oauth20_authorize.srf
  • OneDrive Business and Sharepoint authenticate against oAuth account created in Azure Active Directory and must be done with Office 365 account.  Authentication url is: https://login.microsoftonline.com/common/oauth2/

You can create two types of applications that will have different methods and parameters:

  • Web application – web site based application that user can sign into. Require definition of an active redirect url and definition of client secret. Scopes or permissions are identified on the fly when authentication is made.
  • Native client application – Android, ‘head-less’ server app, etc. Requires only definition of an unique redirect uri and scopes or permissions that Office 365 account / user have available eg read & write files, etc.

The process for authentication is similar:

  • Sign-in with user account/password or send authentication url to authentication server to get authentication code.
  • Server sends back url with authentication code.
  • Retrieve authentication code from url.
  • Send another url comprised of code and other parameters back to server to get tokens.
  • Use tokens to list, view, upload, download files etc.

There are development SDKs available for popular languages.

I was only interested in thePython SDK . Some key notes about it include:

  • It is created specifically for web applications, not native client applications. The SDK authentication method relies on using a web browser to pass urls with parameters, codes and tokens back and forth. A native client application will not use web browser. A work around was to use head-less browser but that is a bit hacky.
  • It only has OneDrive Personal authentication urls. To use this with OneDrive Business or Sharepoint these urls are easily replaced with the OneDrive Business authentication urls in two files: auth_provider.py and the onedrivesdk_helper.py.

The change to the unified OneDrive API and oAuth authentication only happened in late 2015 so this is pretty new.

There weren’t many well developed or well documented OneDrive Python example code available.

Note it is still also possible to work with OneDrive Business, Sharepoint and OneDrive Personal without using oAuth authentication and this new OneDrive API by simply using urllib, request and templating along with hard coded Office 365 username and password in your code to authenticate.

Finally Microsoft Graph API can be used to interact with OneDrive Business, Sharepoint and OneDrive Personal  once oAuth is setup.


Tracking Cuba Gooding Jr’s Twitter follower count

Happened to see Cuba Gooding Jr’s first tweet about 30 minutes or so after he created it.

Update: @cubagoodingjr is no longer active so not getting tweets from it any longer

At the time his profile said he had 559 followers. A few minutes later I refreshed his profile and saw the follower count had increased to 590 and every few minutes it kept increasing. By the end of the day he had about 4,000 followers.

I thought it would be interesting to track how his follower count changed going forward. So I used the Twitter API to get his follower count once per day, save the data, and then created a web page to visualize his follower count per day.


After 2 days Cuba had about 7,000 followers which averaged out to about 175 new followers per hour.  However, the number of new followers slowed down quickly to 30 or so new followers per day and after about 3 months he only gets about 10 new followers per day.  In fact, some days, he has net loss of followers, eg more people unfollow than him, than new follows on that day.

For the technically inclined, I setup an automatic retrieval of Cuba’s Tweets using Twitter’s API and the Tweepy Python module scheduled to run daily using a cron job.

The follower counts get put into a database. I created a PHP web page application to retrieve the data from the database, and create a web page that includes the Google Charts API to create a simple line chart to show Cuba’s regularly updated follower count by day.

You can get the cron job and PHP web page code from my  Github repository. 

If you want to run this code yourself you will need a Twitter developer account and an OAuth file.

Dell ecommerce web site scraping analysis

Once upon a time, I needed to find Dell monitor data to analyse.

A quick search brought me to their eCommerce web site which had all the monitor data I needed and all I had to do was get the data out of the website.

To get the data from the website I used the Python and Python module Scrapy to scrape the webpage and write data to a csv file.

Based on the data I got from the site the counts of monitors by size and country are presented below.


However this data is probably not accurate. In fact I know it isn’t. There was a surprising number of variances in the monitor descriptions including screen size which made it hard to get quick accurate counts. I had to do some data munging to clean up the data but there is still a bit more to do.

The surprising thing is that there do not appear to be specific data points for each of the monitor descriptions components. This website is being generated from a data source likely a database that contains Dell’s products. This database does not appear to have fields for each independent data point that are used to categorize and describe Dell monitors.

The reason I say this is that the monitor descriptions single string of text. Within the text string are things like the monitor size, model, common name, and various other features.

These are not in same order, do not all have same spelling, format such as use of text separators, lower or upper case.

Most descriptions are formatted like this example:

Dell UltraSharp 24 InfinityEdge Monitor – U2417H”.

However the many variations on this format at listed below. There is obviously no standardization for Dell to enter monitor descriptions for their ecommerce site.

  • Monitor Dell S2240T serie S 21.5″
  • Dell P2214H – Monitor LED – 22-pulgadas – 1920 x 1080 – 250 cd/m2 – 1000:1 – 8 ms – DVI-D
  • Dell 22 Monitor | P2213 56cm(22′) Black No Stand
  • Monitor Dell UltraSharp de 25″ | Monitor UP2516D
  • Dell Ultrasharp 25 Monitor – UP2516D with PremierColor
  • Dell 22 Monitor – S2216M
  • Monitor Dell UltraSharp 24: U2415
  • Dell S2340M 23 Inch LED monitor – Widescreen 60Hz Full HD Monitor

Some descriptions include the monitor size unit of measurement, usually in inches, sometimes in centimeters, and sometimes none at all.

Sometimes hyphens are used to separate description sections but other times the pipe character ( | ) is used to separate content. Its a real mish mash.

Description do not have consistent order of description components. Sometimes part number is after monitor size, sometimes it is elsewhere.

The problem with this is that customers browsing the site will have to work harder to compare monitors taking into account these variances.

I’d bet this leads to lost sales or poorly chosen sales that result in refunds or disappointed customers.

I’d also bet that Dell enterprise customers and resellers also have a hard time parsing these monitor descriptions too.

This did affect my ability to easily get the data to do analysis of monitors by description categories because they were not in predictable locations and were presented in many different formats.

Another unusual finding was that it looks like Dell has designated default set of 7 monitors to a large number of two digit country codes. For example Bhutan (bt) and Bolivia (rb) both have the same 7 records, as do many others. Take look at the count of records per country at bottom of page. Many countries have only 7 monitors.

Here is the step by step process used to scrape this data.

The screenshot below shows the ecommerce web site page structure. The monitor information is presented on the page in a set of nested HTML tags which contain the monitor data.

dell ecommerce screenshot

These nested HTML tags can be scraped relatively easily. A quick review revealed that the web pages contained identifiable HTML tags that held the data I needed. Those tags are named in Python code below.

The website’s url also had consistent structure so I could automate navigating through paged results as well as navigate through multiple countries to get monitor data for more than one Dell country in the same sessions.

Below is an example of the url for the Dell Canada eCommerce web site’s page 1:


The only two variables in url that change for the crawling purposes are:

  • The “c” variable was a 2 character country code eg “ca” = Canada, “sg” = Singapore, “my” = Malaysia, etc.
  • The “p” variable was a number representing the count of web pages that a country’s monitors are shown on about 10 monitors per page. No country I looked at had more than 5 pages of monitors.

Dell is a multi-national corporation so likely has many countries in this eCommerce database.

Rather than guess what they are I got a list of two character country codes from Wikipedia that I could use to create urls to see if that country has data. As a bonus the Wikipedia list gives me the country name.

The Wikipedia country code list needs a bit of clean-up. Some entries are clearly not countries but some type of administrative designation. Some countries are listed twice with two country codes. For example Argentina has “ar” and “ra”. For practical purposes if the Dell url can’t be created from this country codes in this list then the code just skips to next one country code.

The Python code I used is shown below. It outputs a csv file with the website data for each country with the following columns:

  • date (of scraping)
  • country_code (country code entered from Wikipedia)
  • country (country name from Wikipedia)
  • page (page number of website results)
  • desc (HTML tag containing string of text)
  • prod_name (parsed from desc)
  • size (parsed from desc)
  • model (parsed from desc)
  • delivery (HTML tag containing just this string)
  • price (HTML tag containing just this string)
  • url (url generated from country code and page)

The code loops through the list of countries that I got from Wikipedia and within each country it also loops through the pages of results while pagenum < 6:.

I hard coded the number of page loops to 6 as no country had more than 5 pages of results. I could have used other methods perhaps looping until url returned 404 or page not found. It was easier to hard code based on manual observation.

Dell eCommerce website scraping Python code

#-*- coding: utf-8 -*-
import urllib2
import urllib
from cookielib import CookieJar

from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

    'AC':'Ascension Island',
    'AE':'United Arab Emirates',
     ... etc

def main():

    output = list()
    todaydate = datetime.today().strftime('%Y-%m-%d')
    with open('dell_monitors.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['date', 'country_code', 'country', 'page', 'desc', 'prod_name', 'size', 'model', 'delivery', 'price', 'url'], delimiter = ',')
        for key in sorted(countries):
            country_code = key.lower()
            country = countries[key]
            pagenum = 1      
            while pagenum < 6:
                url = "http://accessories.dell.com/sna/category.aspx?c="+country_code+"&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=" + str(pagenum)
                #HTTPCookieProcessor allows cookies to be accepted and avoid timeout waiting for prompt
                page = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url).read()
                soup = BeautifulSoup(page)           
                if soup.find("div", {"class":"rgParentH"}):
                    tablediv = soup.find("div", {"class":"rgParentH"})
                    tables = tablediv.find_all('table')
                    data_table = tables[0] # outermost table parent =0 or no parent
                    rows = data_table.find_all("tr")
                    for row in rows:
                        rgDescription = row.find("div", {"class":"rgDescription"})
                        rgMiscInfo = row.find("div", {"class":"rgMiscInfo"})
                        pricing_retail_nodiscount_price = row.find("span", {"class":"pricing_retail_nodiscount_price"})

                        if rgMiscInfo: 
                            delivery = rgMiscInfo.get_text().encode('utf-8')
                            delivery = ''
                        if pricing_retail_nodiscount_price:
                            price = pricing_retail_nodiscount_price.get_text().encode('utf-8').replace(',','')
                            price = ''
                        if rgDescription:
                            desc = rgDescription.get_text().encode('utf-8')
                            prod_name = desc.split("-")[0].strip()
                                size1 = [int(s) for s in prod_name.split() if s.isdigit()]
                                size = str(size1[0])
                                size = 'unknown'
                                model = desc.split("-")[1].strip()
                                model = desc
                            results = str(todaydate)+","+country_code+","+country+","+str(pagenum)+","+desc+","+prod_name+","+size+","+model+","+delivery+","+price+","+url
                            file.write(results + '\n')
                    pagenum +=1
                    #skip to next country
                    pagenum = 6 

if __name__ == '__main__':

The Python code scraping output is attached here as a csv file.

The summary is a list of the scraping output that shows a list of country codes, countries and count of Dell monitor records scraped from a web page using the country code Wikipedia had for these countries.

af – Afghanistan – 7 records
ax – Aland – 7 records
as – American Samoa – 7 records
ad – Andorra – 7 records
aq – Antarctica – 7 records
ar – Argentina – 12 records
ra – Argentina – 7 records
ac – Ascension Island – 7 records
au – Australia – 36 records
at – Austria – 6 records
bd – Bangladesh – 7 records
be – Belgium – 6 records
bx – Benelux Trademarks and Design Offices – 7 records
dy – Benin – 7 records
bt – Bhutan – 7 records
rb – Bolivia – 7 records
bv – Bouvet Island – 7 records
br – Brazil – 37 records
io – British Indian Ocean Territory – 7 records
bn – Brunei Darussalam – 7 records
bu – Burma – 7 records
kh – Cambodia – 7 records
ca – Canada – 46 records
ic – Canary Islands – 7 records
ct – Canton and Enderbury Islands – 7 records
cl – Chile – 44 records
cn – China – 46 records
rc – China – 7 records
cx – Christmas Island – 7 records
cp – Clipperton Island – 7 records
cc – Cocos (Keeling) Islands – 7 records
co – Colombia – 44 records
ck – Cook Islands – 7 records
cu – Cuba – 7 records
cw – Curacao – 7 records
cz – Czech Republic – 6 records
dk – Denmark – 23 records
dg – Diego Garcia – 7 records
nq – Dronning Maud Land – 7 records
tp – East Timor – 7 records
er – Eritrea – 7 records
ew – Estonia – 7 records
fk – Falkland Islands (Malvinas) – 7 records
fj – Fiji – 7 records
sf – Finland – 7 records
fi – Finland – 5 records
fr – France – 17 records
fx – Korea – 7 records
dd – German Democratic Republic – 7 records
de – Germany – 17 records
gi – Gibraltar – 7 records
gr – Greece – 5 records
gl – Greenland – 7 records
wg – Grenada – 7 records
gu – Guam – 7 records
gw – Guinea-Bissau – 7 records
rh – Haiti – 7 records
hm – Heard Island and McDonald Islands – 7 records
va – Holy See – 7 records
hk – Hong Kong – 47 records
in – India – 10 records
ri – Indonesia – 7 records
ir – Iran – 7 records
ie – Ireland – 7 records
im – Isle of Man – 7 records
it – Italy – 1 records
ja – Jamaica – 7 records
jp – Japan – 49 records
je – Jersey – 7 records
jt – Johnston Island – 7 records
ki – Kiribati – 7 records
kr – Korea – 34 records
kp – Korea – 7 records
rl – Lebanon – 7 records
lf – Libya Fezzan – 7 records
li – Liechtenstein – 7 records
fl – Liechtenstein – 7 records
mo – Macao – 7 records
rm – Madagascar – 7 records
my – Malaysia – 25 records
mv – Maldives – 7 records
mh – Marshall Islands – 7 records
mx – Mexico – 44 records
fm – Micronesia – 7 records
mi – Midway Islands – 7 records
mc – Monaco – 7 records
mn – Mongolia – 7 records
mm – Myanmar – 7 records
nr – Nauru – 7 records
np – Nepal – 7 records
nl – Netherlands – 8 records
nt – Neutral Zone – 7 records
nh – New Hebrides – 7 records
nz – New Zealand – 37 records
rn – Niger – 7 records
nu – Niue – 7 records
nf – Norfolk Island – 7 records
mp – Northern Mariana Islands – 7 records
no – Norway – 19 records
pc – Pacific Islands – 7 records
pw – Palau – 6 records
ps – Palestine – 7 records
pg – Papua New Guinea – 7 records
pe – Peru – 43 records
rp – Philippines – 7 records
pi – Philippines – 7 records
pn – Pitcairn – 7 records
pl – Poland – 4 records
pt – Portugal – 7 records
bl – Saint Barthelemy – 7 records
sh – Saint Helena – 7 records
wl – Saint Lucia – 7 records
mf – Saint Martin (French part) – 7 records
pm – Saint Pierre and Miquelon – 7 records
wv – Saint Vincent – 7 records
ws – Samoa – 7 records
sm – San Marino – 7 records
st – Sao Tome and Principe – 7 records
sg – Singapore – 37 records
sk – Slovakia – 23 records
sb – Solomon Islands – 7 records
gs – South Georgia and the South Sandwich Islands – 7 records
ss – South Sudan – 7 records
es – Spain – 10 records
lk – Sri Lanka – 7 records
sd – Sudan – 7 records
sj – Svalbard and Jan Mayen – 7 records
se – Sweden – 6 records
ch – Switzerland – 21 records
sy – Syrian Arab Republic – 7 records
tw – Taiwan – 43 records
th – Thailand – 40 records
tl – Timor-Leste – 7 records
tk – Tokelau – 7 records
to – Tonga – 7 records
ta – Tristan da Cunha – 7 records
tv – Tuvalu – 7 records
uk – United Kingdom – 35 records
un – United Nations – 7 records
us – United States of America – 7 records
hv – Upper Volta – 7 records
su – USSR – 7 records
vu – Vanuatu – 7 records
yv – Venezuela – 7 records
vd – Viet-Nam – 7 records
wk – Wake Island – 7 records
wf – Wallis and Futuna – 7 records
eh – Western Sahara – 7 records
yd – Yemen – 7 records
zr – Zaire – 7 records

Grand Total – 1760 records

Power Query MySQL database connection

Excel Power Query can make a connection to MySQL database but requires that you have a MySQL Connector/Net 6.6.5 for Microsoft Windows. Instructions for that on Microsoft site.

Once you have the connector you can get MySQL db table data into your Excel file using Power Query in two ways. The method you select will be dependent on how you want to work with the MySQL data you retrieve.

The choice is made after you have selected Power Query – From Database – From MySQL Database.

After you select From MySQL Database, you will see the MySQL Database connection popup.

Now you can make a choice. Either method requires you to specify the Server and Database to make direct connection to a single table (or view) but you can optionally enter a SQL Statement to effectively connect to multiple tables and views in one connection. Bonus is that the SQL query is pushed back into the server so the Power Query client doesn’t have to do the work.

Method # 1. Native SQL query – when connecting you have option to enter SQL, and if you enter SQL query there you will get:

    Source = MySQL.Database(“”, “database_name”, [Query=”select * from database_table_name”])

Method # 2. Power Query Navigation – instead of entering SQL, just leave that field blank, then continue on, and Power Query will present you with list of MySQL server database names (Schemas) and tables names. Simply select the “Table” link in the database and table that you want and that will add Navigation step and retrieve that table’s data.

    Source = MySQL.Database(“”, “database_name”),
   database_table_name = Source{[Schema=”database_name”,Item=”database_table_name”]}[Data]

Either way you get the same table results in the Power Query.

Of course, if you want to join in other tables from your MySQL database(s), then method #1 will be more direct. Method #1 also assumes that you can use SQL to get what you want from the MySQL database tables.

However, you could also use method #2 to retrieve all desired tables (even from different databases on that server), and then use Merge or Append to get the desired results.

Method #2 allows you to retrieve and work with your MySQL database table data without using SQL and rely on Excel Power Query instead. That opens the door for relatively non-technical data workers to use the data which is a pretty cool thing!

You can use these methods with MySQL databases on your local computer or on a remote computer. You just have to make sure to enter the correct server url, database name, user and password.

On occasion I have had challenges before I could get a remote MySQL database connection to work. I had to clear the Power Query cache, update Power Query, turn off the Privacy option to make the connection work.