Javascript Web Development

Legend and polygon colors for Leaflet choropleth using Chroma.js

A Leaflet tutorial uses the following hard-coded getColor function to return colors.

// get color based on hard-coded breaks
// (breaks must be strictly decreasing; the tutorial's duplicated "n > 25"
// branch was unreachable, so the breaks are adjusted here)
function getColor(n) {
    return n > 35 ? '#b10026'
         : n > 30 ? '#e31a1c'
         : n > 25 ? '#fc4e2a'
         : n > 20 ? '#fd8d3c'
         : n > 15 ? '#feb24c'
         : n > 10 ? '#fed976'
         : n > 5  ? '#ffeda0'
         : n > 0  ? '#ffffcc'
         : '#ffffff';
}

However, I wanted to use Chroma.js to generate the legend colors dynamically. So I needed a new getColor function.

Chroma.js has a variety of methods to return colors. The one I chose uses scale and classes. These can then be passed as variables to a getColor function that returns colors to use in the legend and map.

scale can be a single color or an array of two colors (either as hex values or color names). In my case, the first is a light blue and the second a darker blue. Chroma.js will then return gradients between these two colors. See the colorHex variable below.

classes is an array of legend ‘breaks’ for the color gradients. For example, they could be the numerical values from the Leaflet tutorial getColor function above (eg 5, 10, 15, etc). See the classBreaks variable below.

The new getColor function is shown below:

var classBreaks = [1,50,100,250,500,1000,2000,3000,6000,9000];
var colorHex = ['#deebf7','#08306b'];

function getColor(n, classBreaks, colorHex) {
    var mapScale = chroma.scale(colorHex).classes(classBreaks);
    var regionColor;
    if (n === 0) {
        regionColor = '#ffffff';
    } else {
        regionColor = mapScale(n).hex();
    }
    return regionColor;
}

This getColor function can then be used as described in the Leaflet tutorial to set choropleth polygon fill colors. It can also be used to create the legend by looping through the classes to get a color for each legend entry.

However, there is an important consideration when creating the legend. Using scale and classes, Chroma.js only returns classBreaks.length - 1 colors. For example, the classBreaks array above has 10 elements, so it will only return 9 colors. To work around this I push a dummy element larger than the last break (eg 99999, since the class breaks must stay in ascending order) onto the array so Chroma.js returns 10 colors, and then ignore the dummy element when creating the legend.
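To see why, note that the breaks are interval boundaries: 10 boundaries enclose only 9 intervals. A quick Python sketch of the counting (an illustration only, not Chroma.js itself):

```python
# breaks define interval boundaries, so n breaks enclose n - 1 intervals
class_breaks = [1, 50, 100, 250, 500, 1000, 2000, 3000, 6000, 9000]

# pair each break with the next one to form the color intervals
intervals = list(zip(class_breaks, class_breaks[1:]))
print(len(class_breaks), len(intervals))  # 10 breaks -> 9 intervals

# push a dummy break beyond the last real one: now there are 10 intervals,
# ie one color per real break; the dummy itself is ignored in the legend
class_breaks.append(99999)
intervals = list(zip(class_breaks, class_breaks[1:]))
print(len(intervals))  # 10
```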

The legend code below includes a hard-coded zero (0) entry set to white (#ffffff), then loops through classBreaks, calling the getColor function each time to return the legend color for that break value.

var legend = L.control({position: 'topright'});

legend.onAdd = function (map) {
    var div = L.DomUtil.create('div', 'legend');
    // hard-coded zero value set to white
    div.innerHTML += '<i style="background: #ffffff;"></i> 0<br>';
    // add dummy class beyond the last real break so chroma returns
    // classBreaks.length colors (it only returns classBreaks.length - 1)
    classBreaks.push(99999);
    for (var i = 0; i < classBreaks.length - 1; i++) {
        if (i + 2 === classBreaks.length) {
            // last real break gets an open-ended "n+" label
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
                classBreaks[i] + '+';
        } else {
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
                classBreaks[i] + '–' + classBreaks[i + 1] + '<br>';
        }
    }
    return div;
};

The final map legend looks like this:

Amazon Web Services (AWS) BeautifulSoup Python

Scraping public health web page to get data

During the 2020 COVID-19 pandemic in Canada I wanted to get COVID-19 confirmed cases counts data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.

I wanted to be able to use this data for this Leaflet choropleth map visualization. If interested, read more details on this visualization in another blog post.

There are many ways to get data from web pages. At first I did it manually by copying and pasting into Excel. This is fine for a one-time analysis. You can even use Excel Power Query's web feature to automate this a bit more. However, if you want to fully automate getting data from a web page, you should use web scraping techniques.

Note that the code described below is available in this Github repository

Initial data retrieval and transformation done using Excel
To get the web page data, at first, I simply manually copied and pasted into an Excel workbook. This was quite easy to do as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do some more complex data transformations I switched to using Excel Power Query’s web query feature and Power Query to transform the data for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated, with the data retrieval and transformation process run on a scheduled basis.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

Since the choropleth map visualization is hosted in a Github repository, I could have used Git on the web server to do an automated, scheduled push of new data from the web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading data to AWS S3 was done in Python. The Python code is run on scheduled basis using a cron job on a web server. The cron job runs a few times each evening when the data is usually updated on the website.
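For illustration, the cron schedule described might look something like this in the crontab. The run times and script path here are hypothetical:

```
# m h dom mon dow  command
# run the scraper at 8pm, 9pm and 10pm every evening
0 20,21,22 * * * /usr/bin/python3 /home/user/covid_scraper/scrape.py >> /home/user/covid_scraper/cron.log 2>&1
```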

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to an AWS S3 bucket and made available to the choropleth map visualization.

Python module Boto was used to upload the data from web server to an AWS S3 bucket.

Let's go through the code.

Requests was used to get the web page and BeautifulSoup4 to find the specific table that holds the tabular data, as below. The table with the counts by neighbourhood was the 4th table in the web page:

import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = ''
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# get all tables on webpage
tables = soup.find_all('table')
# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and add more readable, consistent column headers, as below:

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]
# rename columns
df_table_data_all_cols.columns = ['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']
df_table_data = df_table_data_all_cols[['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The web page table dataframe was merged with a “lookup” dataframe. This merge is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.

Of course, I could have modified the map’s geographical region boundary names to reflect Health Canada naming convention but creating a “lookup” table was easier and provided flexibility in case Health Montreal table’s names changed (which they did, in fact several times!).
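As a minimal sketch of how such a lookup merge behaves (the neighbourhood names below are hypothetical examples of the kind of mismatch involved):

```python
import pandas as pd

# hypothetical lookup table: website names mapped to geojson boundary names
df_lookup = pd.DataFrame({
    'website_name': ['Ahuntsic-Cartierville', 'Verdun'],
    'geojson_name': ['Ahuntsic - Cartierville', 'Verdun/Ile-des-Soeurs'],
})

# hypothetical scraped counts keyed by the website's neighbourhood name
df_counts = pd.DataFrame({
    'region_name': ['Ahuntsic-Cartierville', 'Verdun'],
    'case_count': [500, 300],
})

# left join: every lookup row is kept even if the scrape misses a region,
# and each row gains the geojson_name the map needs
df_merged = pd.merge(df_lookup, df_counts,
                     left_on='website_name', right_on='region_name', how='left')
print(df_merged[['geojson_name', 'case_count']])
```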

The Python code checks whether the current data on the web page is the same as the previously retrieved data, because I only wanted to upload new data to AWS S3 when necessary. The check is done by comparing the total case count on the web page to the previous total:

import json

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    # create scrape result string to print to cron log
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev
    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')
    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)
    # write today's date (todays_date is set earlier in the script)
    # to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)

If the counts are the same, the code stops. If the new count differs from the previous count, the code creates the new data files that are uploaded to the AWS S3 bucket.

A scrape_result string is also created and written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

from boto.s3.connection import S3Connection

# create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a Public Policy so anyone can read the data files. Each bucket file has a public url so the map visualization can simply reference these to get the data.

However, authentication is required in order to transfer the data from the web server to the S3 bucket. So there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects to do authentication.

An IAM User Policy was created to allow that User to write, read and delete on that bucket. The User has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course the key and secret should not be exposed and are imported into the Python code from another non-public location on the web server.
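For illustration, a minimal IAM policy of the kind described might look like the following. The bucket name comes from the code above, but the exact statements are a sketch, not the production policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::canada-covid-data/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::canada-covid-data"
    }
  ]
}
```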

Once the connection is made, the Python code deletes the existing files on S3 before uploading the new files:

from os import listdir
from os.path import isfile, join
from boto.s3.key import Key

# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    key.delete()

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the new updated data next time it is viewed or browser page refreshed.

D3.js Excel Pivot Chart Excel Pivot Table Javascript Power Query Web Development

Heat maps of Canadian activity changes due to COVID-19 using Google Community Mobility Reports

During the 2020 COVID-19 pandemic in Canada I wanted to get better understanding of the geographical distribution of COVID-19 related activity changes across Canada.

Google has helpfully provided freely available global “Community Mobility Reports” which show Google location history changes compared to baseline by country and country sub-region. These report changes in activity by location category: Workplace, Retail & Recreation, Transit Stations, Grocery & Pharmacy, Parks, and Residential. For Canada, data is available by province. As of April 19, the data contained daily values from Feb 15 to Apr 11.

The Community Mobility Report data is available as a single csv file for all countries at the Google Community Mobility Report site. In addition, Google provides a feature to filter for a specific country or country sub-region (eg state or province) and download the result in PDF format.

As the COVID-19 lockdowns occurred across Canada you would expect that people were less likely to be in public spaces and more likely to be at home. The Community Mobility Reporting location history allows us to get some insight into whether or not this happened, and if it did, to what degree and how this changed over time.

I used the Community Mobility Report data to create a D3.js heat map visualization which is described in more detail below and in this Github repository.

I also created an Excel version of this heat map visualization using Pivot Table & Chart plus conditional formatting. This Excel file, described in more detail below, is available in the Github repository.

More detail and screenshots of the visualizations are provided below:

Heatmaps are grids where columns represent dates and rows represent provinces/territories. Each heatmap is a grid for a single mobility report category. The grid cell colors represent the percent change value, which can be positive or negative. As lockdowns occurred, activity in public locations decreased relative to baseline. Inversely, residential activity increased relative to baseline as people sheltered in place at home.
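The grid layout described above is essentially a pivot of the long-format csv (one row per region per date). A small pandas sketch with made-up numbers shows the reshape:

```python
import pandas as pd

# made-up long-format mobility rows: one row per province per date
df = pd.DataFrame({
    'province': ['Quebec', 'Quebec', 'Ontario', 'Ontario'],
    'date': ['2020-03-01', '2020-03-15', '2020-03-01', '2020-03-15'],
    'workplaces_pct_change': [2, -45, 1, -40],
})

# pivot: rows = province, columns = date, cell values = percent change
heatmap_grid = df.pivot(index='province', columns='date',
                        values='workplaces_pct_change')
print(heatmap_grid)
```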

1) Heatmap created using Excel / Power Query: For this heatmap visualization the global csv data was transformed using Excel Power Query. The Excel file has two Pivot Table and Chart combos. The Excel files and Power Query M code are available in the Github repository.

2) Heatmap created using D3.js: For this heatmap visualization the global csv data was transformed using Excel Power Query. The heatmap visualization was created using slightly modified code from ONSvisual.

Bar charts
These were created using Excel / Power Query to visualize percent change by province/territory and location category. They allow comparison between provinces by date and category. This Excel / Power Query file can be used for analytical purposes to slice and dice the global data by date, country, sub-region 1 & 2 and category. The Excel files are available in the Github repository.

D3.js Geocoding Javascript Python

Choropleth map of Canada COVID-19 cases by health region using Leaflet and D3

During the 2020 COVID-19 pandemic in Canada I wanted to get a better understanding of the geographical distribution of COVID-19 cases across Canada. So I set about making a choropleth map visualization of confirmed COVID-19 case counts in Canada. I also created a separate choropleth map for Montreal, which is Canada’s COVID-19 “hotspot” with about 25-30% of Canada’s total confirmed COVID-19 cases.

View live Canada map here:

View live Montreal map here:

The only geographical boundaries with confirmed case counts for all of Canada that I could find were: 1) province/territory and 2) health region.

I chose to use health regions in these choropleth maps because there are already lots of maps by province/territory. The health regions are geographical boundaries defined by provincial health authorities, and they likely roughly correspond to population size.

I used the Leaflet.js open-source JavaScript library to create the interactive choropleth maps, D3.js to retrieve and transform the csv data, and JavaScript to retrieve the JSON geographic boundary files.

The confirmed COVID-19 case counts are available in csv format from the COVID-19 Canada Open Data Working Group. The csv files are maintained on Github and are updated daily, collating data from the provinces and territories.

The health region geographical boundary descriptions were obtained primarily from the Statscan ArcGIS health region boundary Canada dataset. However, I needed to make some modifications to update the boundaries used by health regions, which is described in more detail in the Github repository README.

The biggest challenge in creating these choropleth maps was relating the geographical boundary names to the confirmed-case health region names. Basically, I needed to create a lookup table to match the different names in the boundary data file and the confirmed counts data file. See the Github repository for more details on the boundary modifications and the relationship between boundary names and health region names.

Code for this project is maintained in the Github repository.

Montreal COVID-19 confirmed case count
Google Calendar Google Script

Delete Google Calendar events by date using Google Script

To quickly and easily delete Google Calendar events by date, you can run the following Google Script.

Modify the script to include your Google Calendar name, which is likely your Google email address. Then modify the date variables fromDate and toDate. Note that months start from 0 (Jan).

function delete_events() {
    var calendarName = '';
    // for month 0 = Jan, 1 = Feb etc
    // below deletes from Jan 1 2015 to Jan 1 2019
    var fromDate = new Date(2015, 0, 1, 0, 0, 0);
    var toDate = new Date(2019, 0, 1, 0, 0, 0);
    var calendar = CalendarApp.getCalendarsByName(calendarName)[0];
    var events = calendar.getEvents(fromDate, toDate);
    for (var i = 0; i < events.length; i++) {
        var ev = events[i];
        // show event name in log
        Logger.log(ev.getTitle());
        // delete the event
        ev.deleteEvent();
    }
}

This will delete your calendar events so be careful!

Check out Google Script documentation for more on how to work with Calendar.

Azure Office 365 OneDrive

OneDrive SharePoint Migration Tool (SPMT)

I have an Office 365 Business Premium account. I wanted to sync many large folders with OneDrive for Business. However, the upload was taking a very long time. While searching for improvements, I saw many other people have observed that OneDrive is much slower than Dropbox, Box, AWS, Google Drive etc. Apparently this is a ‘known issue’ that Microsoft is addressing.

One of the things they have done is provide a standalone desktop application called the ‘SharePoint Migration Tool (SPMT)’. While the name doesn’t specifically mention OneDrive, the tool is intended for use with both SharePoint and OneDrive. However, there are a few things I learned in order to use it for OneDrive, and that is the purpose of this post.

First, the user who will be using SPMT must have permissions for the OneDrive for Business account you want to upload files to.

Then you need to download SPMT from here

The SPMT tool can be used in any of several migration ‘scenarios’, but the one I am focusing on here is the ‘File share to OneDrive and SharePoint’ scenario. I wanted to use SPMT as a potentially faster way to upload files from my desktop computer to OneDrive for Business.

The guide for this scenario is here

After you download, install and start SPMT, the start screen will ask you to sign into your Office 365 account. This should be the user that has permission on the OneDrive.

After successfully signing in you will see 3 scenarios to choose from:

The two that I was interested in testing were:

  • File Share
  • CSV file for bulk migration

Note / spoiler alert
While SPMT provides much more flexibility in terms of identifying source and target, and capability to batch migration jobs, it was not dramatically faster than just using OneDrive desktop sync tool. It took me a couple of hours to upload only 1000 files total 3 GB in size.

The Microsoft documentation refers to migrations with multiple TBs of files, but my experience suggests that would be impractical if the files were coming from outside of the Microsoft cloud. For large migrations, then, SPMT seems intended primarily for use within the Microsoft cloud.

File Share
The File share option is pretty straightforward. Use SPMT to select the local folder to upload, then manually enter the OneDrive location url to upload them to. However, there are some things to pay attention to here:

  • When you select a local (aka source) folder, only its contents will be uploaded, not the folder itself with the selected local folder name.
  • The string you enter for the OneDrive (aka target) location url is not 100% intuitive or completely described in the instructions. See below.

OneDrive (aka target) location url
The exact specifications for this are not given in a way that is intuitively helpful. There are subtle differences in the url between SharePoint and OneDrive. The linked documentation does indicate these but I had to search for additional clarification which I pass on below.

If my Office 365 account domain is ‘mycompany’ and the user is ‘myuser’, then myuser’s OneDrive target location url is built from those two names.

You can also simply copy this from your browser address bar when you are viewing myuser’s OneDrive folder online.
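As a sketch, the url pattern can be expressed as a small helper function. The tenant and user names are the placeholders from above, and the underscore-replacement pattern is my generalization of it, so verify the result against your own address bar:

```python
def onedrive_target_url(tenant, user_email):
    """Build the OneDrive for Business target url pattern described above.

    Assumes the standard pattern where '@' and '.' in the user's sign-in
    email are replaced by underscores; verify against your own OneDrive
    url in the browser address bar.
    """
    user_part = user_email.replace('@', '_').replace('.', '_')
    return 'https://{0}-my.sharepoint.com/personal/{1}'.format(tenant, user_part)

print(onedrive_target_url('mycompany', 'myuser@mycompany.onmicrosoft.com'))
# https://mycompany-my.sharepoint.com/personal/myuser_mycompany_onmicrosoft_com
```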

CSV file for bulk migration
This option uses similar information as above but you put the information for source and target into a CSV (or JSON) file format.

CSV file format:

  • Has 6 columns:
    • Column A: Mandatory. Source folder path as a regular Windows path. Use double quotes if the path has spaces.
    • Column B: Optional. For SharePoint migrations only, not for a file migration like the one described here.
    • Column C: Optional. Subfolder of the source path.
    • Column D: Mandatory. OneDrive (aka target) location url. Do not include anything beyond the url itself; columns E and F are used to specify folders.
    • Column E: Mandatory. The standard value “Documents”.
    • Column F: Optional. Subfolders that will be created if they are not already present under your OneDrive root url. Enter as “Folder/subfolder/”. Note the forward slash, and use double quotes if there are spaces.
  • The file has one row per ‘job’. For example, in the File Share method above you select a single folder to migrate; that would be one row in the CSV (or JSON) file.
  • Use SPMT to upload the CSV (or JSON) file and it processes all of the migration ‘jobs’.
  • However, it processes them in parallel, which seems counterproductive. I would rather it processed them in sequence, one row at a time.
  • This bulk migration method would be useful for processing many folders.
  • I did not try the JSON method but expect it requires the same information, just in the specified JSON format.
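For illustration, here is what a hypothetical one-job row might look like, generated with Python's csv module so the quoting comes out consistently (the path and url are placeholders, not real locations):

```python
import csv
import io

# columns A-F as described above; B and C left blank for a simple file migration
row = [
    r'C:\My Files\Photos',  # A: source folder path (needs quotes: path has spaces)
    '',                     # B: SharePoint migrations only, blank here
    '',                     # C: optional subfolder of the source path
    'https://mycompany-my.sharepoint.com/personal/myuser_mycompany_onmicrosoft_com',  # D: target url
    'Documents',            # E: standard value
    'Photos/2019/',         # F: optional target subfolder
]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
print(buf.getvalue().strip())
```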

In both scenarios above, when the migration is happening you will see a screen similar to the one below.

Note that SPMT ‘scans’ the files to be uploaded first in any migration scenario to check that they meet specifications. There are strict file name restrictions, but it renamed ‘bad’ files automatically, which is nice.

Once it indicates files are being uploaded you should see corresponding folders and files in OneDrive online.

At any time during migration you can click the “View Status” link on the form to open the Windows folder where SPMT keeps job status, including failures, in csv files. These are updated like log files so you can monitor status. This was helpful to see what failed and why.

I experienced a few quirks, such as pausing or stopping a migration freezing the SPMT app, which I then had to close using Task Manager. One very weird result of closing a job was that a local file was corrupted and I could not move or delete it, getting the error “windows 10 cannot delete file the tag present in the reparse point buffer”. I had to run chkdsk to fix the file.

Overall, SPMT does the job advertised. As mentioned, it does not dramatically speed up uploading files from a local computer to OneDrive, but it does offer more control and flexibility than OneDrive desktop sync for large uploads with many folders and files.

Amazon Web Services (AWS) D3.js Javascript Web Development

AWS S3 csv file as D3 report data source

This is an example of how to read a csv file retrieved from an AWS S3 bucket as a data source for a D3 javascript visualization.

The D3 visualization would be an HTML document hosted on a web server. 

You will use the AWS SDK to get the csv file from the S3 bucket, and so you need an AWS S3 bucket key and secret, but I won’t cover that in this post.

The key point of this post is that the bucket.getObject callback data is read into D3 using d3.csv.parse(data.Body.toString()).

Another note is that d3.csv.parse is for D3 version 3; D3 version 4 and later use d3.csvParse.

Once implemented, whenever the webpage is refreshed it retrieves latest csv file from the S3 bucket and the D3 visualization is updated.

<script src=""></script>

<script type="text/javascript">

// aws key and secret (note these should be retrieved from server not put as plain text into html code)
AWS.config.accessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.secretAccessKey = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.region = 'us-east-1';

// create the AWS S3 object
var bucket = new AWS.S3();

// use AWS SDK getObject to retrieve the csv file and pass it to awsDataFile
bucket.getObject({
    Bucket: 'my-S3-bucket',
    Key: 'myfile.csv'
}, awsDataFile);

// callback function to use the retrieved data
function awsDataFile(error, data) {
    if (error) {
        return console.log(error);
    }

    // this is where the magic happens: d3.csv.parse reads data.Body.toString()
    var myCSVdata = d3.csv.parse(data.Body.toString());

    // now loop through data and coerce the fields desired for the visualization
    myCSVdata.forEach(function(d) {
        d.field1 = +d.field1;
        d.field2 = +d.field2;
    });

    // now you can create the rest of the D3 visualization here

}
</script>



Power BI Power Query

Introduction to Excel Power Query aka Power BI Query

What is Power Query?

Power Query is a Microsoft tool for Excel that is used to do the following:

  • Connect to a data source
  • Extract data from source
  • Transform the extracted data
  • Load the transformed data into Excel

This functionality is called ETL (Extract, Transform and Load).

Why use Power Query?

Excel already has another tab of features called Get External Data, why not use that?

  • Excel’s other methods for connecting to and retrieving data may still be useful, but they cannot transform the data.
  • Power Query is user friendly for non-technical users and provides much more versatile and robust functionality.

Where is Power Query in Excel?

You can find Power Query in the Data tab's Get & Transform icon group in Excel 2016.

Note that in previous versions of Excel, Power Query is an add-in that has its own tab. However, in Excel 2016 Power Query is built into Excel.

Get & Transform has 4 icons

  • Show Queries – this hides/shows the Query pane docked on right side of Excel.
  • New Query – choose data source & modify Power Query settings and options.
  • From Table – choose data source as currently selected Excel table.
  • Recent Sources – choose from recently selected data sources.

Step 1 – Connect to a data source and retrieve data

Select the New Query icon to select which type of data you want to connect to

  • From File (csv, Excel)
  • From Database (SQL Server db, Access db)
  • From Online Services (Facebook)
  • From Other Sources (webpage, SharePoint List, ODBC connection)

The screenshot shows an example of connecting to an Excel workbook.

  • Select From File – From Workbook
  • Popup dialogue allows you to select an Excel file.

Step 2 – Transform the data – common usage

  • “Calculated columns” replace any worksheet formula requirement
  • Power Query has many more formulas than plain old Excel, including VBA type functionality
  • Power Query has been called “vlookup on steroids”
  • “Pivot” and “unpivot” data
  • Group and aggregate data into sums, min, max, averages
  • Simple to combine the contents of many files in a folder into one table.
  • Filter, sort, replace with complex logic.
  • Merge two or more separate tables of data together on matching values. If you are familiar with SQL, this is a JOIN query.
  • Append two or more separate tables of data into one combined table of data. If you are familiar with SQL, this is a UNION query.
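Power Query's M is its own language, but the two operations above map cleanly onto pandas, which makes the JOIN/UNION analogy concrete:

```python
import pandas as pd

sales = pd.DataFrame({'id': [1, 2], 'amount': [100, 200]})
names = pd.DataFrame({'id': [1, 2], 'name': ['Ann', 'Bob']})

# Merge = JOIN: combine the columns of two tables on matching key values
merged = sales.merge(names, on='id')

# Append = UNION: stack the rows of two tables with the same columns
more_sales = pd.DataFrame({'id': [3], 'amount': [50]})
appended = pd.concat([sales, more_sales], ignore_index=True)

print(merged)
print(appended)
```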

Step 3 – Load the data into Excel

  • Power Query queries are loaded to a worksheet by default.
  • However, you can choose to save a query as a connection only by selecting Load To – Only Create Connection. This query can be used in other Queries eg in Merge or Append, and by other Excel features eg it can be used as datasource for a Pivot Table.

Step 4 – Click and refresh to retrieve new data

  • When data source(s) are updated, just click Refresh to retrieve and transform new data.
  • No copy & paste of data from other files.
  • No worry about formulas being deleted or ranges being moved.
  • All transformations are safely contained inside the Query.

Power Query is in Microsoft Power BI

  • Power Query has been integrated into Power BI as the “Get Data” feature.
  • Has the same functionality and features as the Power Query in Excel.
  • Same programming language ‘M’.

Amazon Web Services (AWS)

AWS RDS Postgres database external secure connection

Require an SSL Connection to a PostgreSQL DB Instance? Here are some notes from my investigations.

There are not many great examples for this, but take a look at the documentation for secure connections to a Redshift database, which has working examples.

Go to EC2 Security Groups and modify the Inbound PostgreSQL rule's source from the server's security group (eg something like “sg-c2a232b8”) to either your current IP or to Anywhere (

Setting it to Anywhere means both you and your EC2 application can connect to the database, but so can the rest of the world. Setting it to just your current IP means only you, and anyone else on that IP (if it is a work or university IP), can connect to the database. Of course, they would still need the key, username and password, so it is reasonably secure.

Hadoop Power BI

How to use Hadoop HDFS as a datasource for Power BI

I have a local Hadoop server running on my Windows laptop. To connect Power BI to this Hadoop file system I used Hadoop’s WebHDFS REST API which is very well documented.

The WebHDFS API specifies a url in which you specify the Hadoop file system directory that you want to get files from, as follows. In my case I had a folder named “myhdfsfolder”:


The myhdfsfolder folder contained a csv file which had previously been imported into Hadoop.
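The WebHDFS url pattern is documented as http://<host>:<port>/webhdfs/v1/<path>?op=OPERATION. A small Python helper shows how the url for the folder above is assembled. The host and port here are assumptions (50070 is the default NameNode web port on Hadoop 2; Hadoop 3 uses 9870):

```python
def webhdfs_url(host, port, path, op='LISTSTATUS'):
    # WebHDFS REST API url pattern: /webhdfs/v1/<path>?op=<OPERATION>
    return 'http://{0}:{1}/webhdfs/v1/{2}?op={3}'.format(
        host, port, path.lstrip('/'), op)

print(webhdfs_url('localhost', 50070, '/myhdfsfolder'))
# http://localhost:50070/webhdfs/v1/myhdfsfolder?op=LISTSTATUS
```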

In Power BI, select “Get Data”, “Other”, then “Hadoop File (HDFS)” which will pop a window that asks you to enter a url. The url requested is the WebHDFS REST API url specified above.

After entering your url and clicking Ok, you should see a list of the objects in the Hadoop directory. In my case I can see my csv file along with a bunch of Hadoop log files.

Since I only want the csv file, I clicked the Power BI Edit button so I could filter the list to only my csv file. This part is just standard Power BI/Power Query stuff. If you had multiple csv files, you could filter to csv file types, or some other feature of the file name, or file attribute such as create date etc.

After filtering, only the csv file I am interested in shows in the list. There is one more step to do which is also just standard Power BI/Power Query stuff. Since the file is a binary file you first need to extract its contents. To do this, select the Content column double down arrow selector. This will extract the binary file content’s into a new table which is the data you want.

Now you can continue transforming the csv file data in Power BI/Power Query or just load the query to start using the file's data in Power BI data modeling and reporting!

Power BI Python

How to use Python as a datasource for Power BI

Step 1 – Download the latest version of Power BI to get the new features (the version that came out in early August 2018).

Step 2 – Enable Python: in “Options – Preview Features”, check the “Python support” option. Close and restart Power BI.

Step 3 – Click “Get Data”, select “Other” and “Python script” and click “Connect” button.

Step 4 – Write or paste a Python script in the popup Python script window and click the Ok button.

I chose to use Pandas to get a csv file:

import numpy as np
import pandas as pd
data = pd.read_csv('C:/mydatafile.csv', delimiter=',')

Note if you are using Windows as I am, you have to write the file path so Windows can recognize it, otherwise you will get an error. I used forward slashes, but you could also escape the backslashes.

You could also write any Python script that gets or generates data from any data source, eg a database, AWS S3 buckets, etc. Of course this also includes transforming data. The important thing to note is that your script must output a named dataset(s). The named dataset(s) will be the thing Power BI uses when you are finished coding.
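
For example, a minimal sketch of such a script (the data and dataset names here are made up) that outputs two named datasets:

```python
import pandas as pd

# Hypothetical example: each top-level DataFrame becomes a named
# dataset that Power BI offers for selection after you click Ok.
sales = pd.DataFrame({'region': ['West', 'East', 'West'],
                      'amount': [100, 150, 50]})

# A second named dataset derived from the first.
summary = sales.groupby('region', as_index=False)['amount'].sum()
```

Power BI would present both sales and summary as selectable datasets.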

Step 5 – After clicking ok, you get the familiar Power Query data source selection window. Select Python and you will see your named data set(s). Just check the one(s) you want and click ok.

Now just continue working with Power BI as normal! You can always go back and edit the Python script by selecting “Source” in the query editor.

Palm Pilot

Comparison of my old Palm vs my new Pocket PC (from 2004)

Here is a blog post from my old WordPress site, written back on July 7, 2004.

Comparison of my old Palm vs my new Pocket PC

I thought I would write this to share some of my experiences making the transition from the Palm platform to Windows CE/Pocket PC, as well as some of the things I did to my Pocket PC 2003/Dell Axim X30 to make it function better for me. At the end I get into a more polemic editorial on the looming Microsoft hegemony on the PDA OS world – a good thing, and a bad thing.

What did I use my Palm VIIx for?

  • Calendar (Desktop To Go)
  • Contacts (Desktop To Go)
  • Notes (Desktop To Go)
  • Tasks (Desktop To Go)
  • Reading eBooks (Mobi Book, AportisDoc, Palm Reader, Word To Go)
  • Documents and spreadsheets (My Documents To Go)
  • 3rd party applications, mostly productivity, eg expense tracking
  • Handwritten notes/graphics (Doodle)

What more can I do on my Dell Axim PPC 2003?

  • Wifi internet access
  • Wifi streaming audio
  • Wifi streaming LIVE NASA TV!! (and lots of others)
  • Play relatively sophisticated games like Quake
  • Color picture viewer
  • MP3 player (need a big SD card now)
  • Wireless synchronization of Pocket PC Outlook to PC Outlook
  • Wireless access to PC (using shared folders)
  • Control Desktop by wireless connection
  • Use Pocket PC as a remote control for TV (and anything else requiring an IR remote)
  • Use Excel to manage my list of books and daily balances and projected income
  • Bluetooth connection to cellphone to dialup internet access
  • Bluetooth to exchange photos from other persons cellphone with camera (also with IR)
  • Much more to come!

Side by Side Views

The Palm VIIx is the black colored device on the left, and the Dell Axim is the silver colored device on the right.


I never used the “Wireless” capability of my Palm VIIx. I actually became annoyed by the antenna popping up and down (it was so loose that it always moved) and activating the modem software, so I Crazy Glued it to the Palm.

But in any event, look at the size of that antenna!

Palm VIIx with wireless antenna extended.

Ok let’s get into the details

There are some very fundamentally important things that I really like about my Palm compared to my new Axim. Maybe I am used to the Palm but it got what I wanted done quickly and efficiently. I hoped that my Dell Axim would far surpass that. It hasn’t .. yet! I am continuing my search for applications, tricks, and new ways to do old things!

Anyways, after many years, I was very happy with what I was getting from using Palm. Even though the Palm VIIx is relatively stone-age in terms of hardware, size, etc, I found it, and a Palm 100 I had a few years before, to be incredibly useful. While the PPC is the more flexible platform, the Palm does things in a much more efficient, and often more powerful way. I could treat my Palms like the bricks of plastic they were. The VIIx screen crapped out on me once, so I opened it up, took it apart, cleaned connections, reseated cables, and voila, it worked again. I read tons of ebooks on it. I also used it on a daily basis for Appointments, Expenses, Tasks, Notes, and Excel and Word document editing.

That said, I am overwhelmed by the advances that have been made in the less than four years between when the Palm VII came out (2001) and the current slate of PPCs.

Dammit, with my Axim X30 312 MHz, I can be listening to MP3s (the sound on my Axim really rocks: equalizer, extra bass) while reading Bill Clinton’s My Life in MobiBook! I can play games like Quake, which was a real struggle on my old 486 50 MHz Pentium but is smooth as butter on my new Dell Axim X30. Truly amazing.

This brings to mind the exponentially increasing processing power ala Ray Kurzweil .. but dammit again, this just reinforces my consternation about what I can/cannot do with my new PPC vs the Palm given the enormous boost in hardware capabilities! Maybe it’s not the hardware .. then again, why am I complaining when it is a bit slow? My Palm couldn’t even multi-task, let alone exhibit such brute strength. Clearly, I am conflicted. Anyways, I got my Pocket PC a couple of weeks ago. The amazing price of the Dell Axim, with built-in WiFi & Bluetooth, made it an easy choice.

After shopping around for Pocket PC PDAs and lots of reading, the only contender was the Ipaq 4150, but even with its supposedly greater “form factor”, longevity of service, etc, it was almost $300 (hello US citizens, this is Canadian dollars) more expensive. Also, the Dell’s 312 MHz CPU rates as fast or faster in many categories than the Ipaq’s 400 MHz. So, I fired off my money order (not only do I believe in reducing my ecological footprint, I believe in reducing my electronic footprint) by overnight courier to Dell here in Toronto.

I initially had four orders placed with Dell because the Canadian website was confusing, so I kept going back to make sure I was ordering what I wanted. A little aside here, I know a high level technology manager at Dell Canada who says the guy in charge of the Dell website is a doofus. The US site was great. Anyways, a few strategic phone calls to iron this out probably didn’t hurt in elevating my profile. Then I put “RUSH, per Customer Rep” accentuated with highlighter on the order printout attached to my money order. I was trying to push some buttons in the machine to make sure my order wouldn’t get lost in the in-basket. I think it worked. Lots of people seem to be waiting a long time for their Axims. Four days later my Axim was in a truck on its way to me. It arrived 2 days after that. Not too shabby eh?

Before getting my Dell Axim, I didn’t really know much about what to expect from the Pocket PC or Windows Mobile CE OS. I assumed that the much more powerful hardware in my new Axim/Pocket PC would make everything “snappier”. However, I have discovered a confounding paradox. The Palm PDAs I had let me find and open things faster, but the Axim/Pocket PC is obviously the more powerful system when it comes down to processing stuff.

For example, even though the steps are the same, opening MS Excel spreadsheets on the Palm VIIx is much faster. But calculating formulas and the like on the Palm VIIx takes so long that I can literally tap-a-tap-tap my fingers in idleness. On the Axim/Pocket PC, it seems as fast as my desktop computer. Blink, it’s done.

Let’s move on to another thing: the 3rd party application factor. Those Palm programmers are everywhere. There are guys in Amsterdam, Thailand, Borneo, Chicoutimi churning out tons of Palm applications for free. The PPC world is populated by programmers with $ signs in their eyes. Nothing wrong with making money, but what is happening, from my perspective, is that because M$ has bankrolled PPC development, most of the players involved are bigger and better funded, and expect a better ROI by targeting the enterprise market and the high end consumer market.

This leaves the plain vanilla user like me with diminished opportunities. I think it would be interesting to do a comparison of the demo trial length (in days) for applications developed for Palm versus PPC. I bet it would be much shorter for PPC. There would be no contest in terms of freeware or pay-as-you-like-ware. Palm would win hands down.

Now this brings my consternation about what I can actually DO with my PPC versus my Palm to the fore. Because the benchmark for ROI in the PPC application world is higher than in the Palm world, the level of application sophistication is higher. Essentially, developers want to give more to justify higher software cost, and thus the competitive spiral elevates. It’s only human nature. Well, almost always, because often people just don’t have what it takes, and they fake it, or they stuff it up, or bloat the ware just to add more “functionality”.

There is nothing wrong with this. Everybody should try to give the user the best experience. But from my perspective, as a user, I am more likely to get the application that meets my needs when 100 so-so developers make my expense program, instead of 3 super high-end developers making my program. Take note, we are talking about variety, about competition!

No doubt about it, M$ knows what it is doing. Bill Gates is not the richest man in the world for nothing. The company manages every fundamental aspect of its business, from making sure that product reaches the loading dock so that revenue accrues smoothly, to making sure that its platforms, such as Windows CE, are $ attractors. What does this mean?

This is a recurring theme with Win CE. I have downloaded eMbedded Visual C++ 4.0 (eVC4.exe), the Microsoft Pocket PC 2003 SDK (.msi), and TodayScreenSampleSetup.exe from Microsoft. It is clear from their application development documentation that their 80/20 rule is strategically focused on maintaining the vortex. Microsoft will develop the nexus around which dissatisfaction will bring in 3rd party developers, but only the $ seekers. The best survive, M$ consumes them, and they become part of the Hive Mind.

There is absolutely no doubt that Bill Gates’ primary “directive” to Microsoft product managers is to ensure that their products leave strategic “gaps”, but limited ones, so that they bring in only the high end 3rd party developers. But the directive will be explicit. LEAVE GAPS.

Bright, talented, motivated they may be, but Microsoft developers are corporate booger-heads (Scott, are you reading this? : )) and ultimately, they CANNOT create. They can only expand on existing corporate paradigms. 3rd parties are the only ones who can do this. This has been the strength of Mac and Palm. But alas, Mac has required propping up, by M$, as will Palm in the near future, to ensure that M$ gets the competitive motivating force it needs. Gates knows monopolies are death. It is a fine balance for Microsoft, as all the anti-competition action against them in North America and in Europe can attest.

For now, can I just ask: where are the Linux developers for PPC hardware! Or perhaps, Palm developers of the world, learn the CE platform! You could certainly bring some sanity to the irrelevant-ware out there.

Palm VIIx vs Dell Axim X30 312 MHz

  • Length of Service – Palm VIIx: 3 years | Axim X30: 2 weeks
  • Cost – Palm VIIx: a gift from a friend, but about US$399 new (back in the day) | Axim X30: $464 Cdn incl tax and shipping
  • WiFi – Palm VIIx: umm, sort of .. a 9600 baud modem with a special email address and downloaded ‘Web Clipping Applications’ | Axim X30: YES! (and BlueTooth too)
  • Processor – Palm VIIx: Motorola MC68EZ328 DragonBall EZ 20 MHz | Axim X30: 312 MHz Intel XScale with WMMX
  • RAM – Palm VIIx: 8 MB (not expandable) | Axim X30: 32 MB (buy SD cards)
  • ROM – Palm VIIx: 2 MB Flash memory | Axim X30: 32 MB
  • Screen – Palm VIIx: 16 gray scales | Axim X30: 3.5″ QVGA TFT color 16-bit, transflective
  • Battery – Palm VIIx: two AAA batteries | Axim X30: 950 mAh Li-Ion rechargeable
  • OS – Palm VIIx: Palm OS Version 3.5 | Axim X30: Windows CE 2003
  • Input – Palm VIIx: Graffiti | Axim X30: Graffiti aka Block Recognizer, Keyboard, Transcriber, Letter Recognizer


What Have I Done to Customize My Dell Axim/Pocket PC?

Screen Protectors – My Palm VIIx was a giveaway. I treated it like a pocket book (a paper one, ie I didn’t give it any special attention). The result was that the screen became scratched and scuffed, but it didn’t matter to me. It was a totally utilitarian thing.

So, I thought I could do the same thing with my brand new Dell Axim. Oh boy, was I wrong. After a few days of not using screen protectors and thinking I could continue as I did with my giveaway Palm, I had faint scratches in the Block Recognizer text area : ( just from the stylus! Ok, this is not good. As it happens, there seemed to be something wrong with the screen, a kind of rainbow effect appearing when viewing the screen at extreme angles. So, I called Dell, went through the motions, ended up getting a replacement, and sent the first one back. So, I had a virgin Axim screen, and then decided I would be a wuss and bought screen protectors for my new Dell Axim. They are a bit big (they were made for the Ipaq but were on sale) and I have to trim them using scissors, an exacto blade, or my pocket knife. Sometimes a protector overhangs the screen and catches on things, but hey, it still protects the screen.

After applying the screen protector, I have an unexpected sense of comfort (you can feel the protector and it has some heft) when I stuff my Axim into my pocket (even though there are never ever any keys, coins or anything else in there .. ever!) As it turns out, I fell in love with the case that comes with the Axim. It is really great. It is tough. The colour is black and it is not big, so it hooks/clamps onto my belt really nicely. I can also just throw it into my pocket or bag. It is easy to pull the Axim out and stuff it back in. After nearly a year of use, the clip and the elastic that holds the case together are still like new. Very impressed.

Re-Mapped Buttons – I re-mapped buttons so that my new Axim resembled my old Palm. The Message (email) button is mapped to Tasks, and the Home button is mapped to Notes. Now it is like my Palm.
Re-Mapped WiFi Button – using a program called WiFiPower.exe (a hacked version of the WirelessPower.exe in the \Windows folder), you can start only the WiFi and not the BlueTooth.

Benefit? If you don’t use Bluetooth at all, or not very often, not starting BT every time you need the WiFi radio might save battery power. At least only the green light flashes when you use WiFi and not the blue! That has to save some milliamps of juice there.

Simply use the Settings\Buttons setting to remap the Wireless button to WiFiPower.exe.

X Marks the Spot – A fundamental, though well-intentioned, flaw in the whole PPC Windows CE paradigm is the concept of the X button. The X button in the top right hand corner doesn’t Exit the window, it simply Minimizes it. This was done because it was deemed too slow to re-start a program compared to simply refocusing on it.

Dumb, IMHO, because given the multi-tasking propensity of the average user, you soon have 5 or 10 applications open, all sucking up valuable RAM. It wouldn’t have been difficult to give users a choice. There is a program called Switcher that can close programs from the taskbar (not sure if it comes from Dell or PPC 2003).

Many 3rd party solutions are offered to do what Switcher does: Magic Button, Gigabar. So far, WisBar is my favourite. Just click the X, and it Exits the application! Wow, that is just way too easy! It even has a task selector and you can close applications from it. There are many versions of WisBar and I have no idea what the differences are. I have Version 1.4.2, which is free and does everything that I want.

Today Screen – Ok, why did Microsoft create both a “Home” and a “Today Screen”? I am baffled by this. The Today screen comes from the PC version of Outlook, but I never used it there as it was unwieldy, and those bubbles just look too goofy. Ok, so let’s deal with it, because it seems we are stuck with the Today screen in the PPC/Win CE world.

How can you make the Today screen better? There are a lot of 3rd party solutions to bring application shortcuts onto the Today Screen.

  • I am using cLaunch to avoid using Home as a program launcher. It allows me to group shortcuts to applications how I want. I can import one or more folders of shortcuts from Windows\Start Menu\Programs\ into one group.
  • Many other plugins bring program functionality into the Today screen; for example, PocketMusic’s player can be controlled from the Today screen.
  • Generally though, I have noticed that using Today screen programs slows things down. Just pick the one or two that work for you and avoid overdoing it. From my experience, you have to be careful because they use a lot of RAM and can diminish system performance. The Today screen is like a webpage with programming built in, and there are several layers of application communication which add to the CPU load and slow things down.

I think, from my limited knowledge of the architecture of the Windows CE OS, that this is where it starts to fail. Some of the Windows 95 programming has been recycled to form the core parts of Win CE. Given what M$ had to work with, and given their business philosophy, ie to be the vortex, they have done something almost elegant. The core programming has been set up so that it can sit inside of almost any mobile or non-full-fledged hardware to do the basic stuff, while a surrounding layer of programming lets it “talk” to the user using the hardware available.

Palm is slim, spartan, bare, economical. Win CE is one size fits all, most core functionality for a wide variety of hardware. Problem is that Win CE is relying on hardware to give it the processing power to punch through all of those application layers. That is why I am finding that a 20 Mhz Palm VIIx processor could keep me happy while the 312 Mhz Intel® XScale Processor with WMMX struggles to keep up. I guess it is the same as the desktop. There is a transition to “bloatware” which gives more and more features and functionality but its increase still overwhelms the faster hardware. Hmm.

File Explorer – the one that comes with PPC 2003 isn’t so hot. Get another one. I am using Total Commander. There are others but in my short time as a PPC user, I am liking this one.
Zoom Zoom – After years of daily use of MS Excel on my old Palm, I am taken, absolutely taken with the 78% zoom View of my spreadsheets. Those numbers are so small, so clear, so .. goddammed cute!
File Sync – the one that comes with PPC 2003 is not good. MS ActiveSync file synchronization forces you to copy the files you want synced on your PDA into a special folder on your PC. That is not cool. Why should I change my file structure just to sync with a Microsoft product? What were they thinking!

  • Get Mighty Synch which synchronizes files to/from PDA/PC without moving files to special folders.
  • Also, try Wireless synchronization with your PC
Clearing Out Unnecessary Files – I regularly rid myself of Internet Explorer cached files. Also, use Where Is My Ram to find out which folders/files are taking up room in your RAM and in your ROM. Eliminate email attachments from the attachment folder (after you save them elsewhere if you want them).
Registry Hacks – RegKing lets you make some categorized registry changes. I was using Pocket Tweak, but for some reason, I have no idea why, it made the font in some places super tiny (I had to do a hard reset to fix this). So, for now, I am going to forego the few beneficial hacks it offers.
Picture Graphics Viewer – The picture/graphics viewer that comes with PPC 2003 is not very useful.

  • I have a demo of Resco File Viewer and will probably buy it soon because it is really powerful and easy to use; it is a really great product.
  • XnView Pocket is also quite good and is free, for now.
Music/MP3’s – I used PocketMusic, but it installed a Today screen plugin that slowed my system down; the features other than that were good. I have yet to reinstall it and am still looking for something else. WinAmp seemed ok, although after one install it was missing a file, but I will try installing it again.

I decided to remove those apps and have been using Media Player, which works even though it is quirky in that Microsoft way. MS seems to have a really different way of looking at the world compared to my perspective. That playlist thing is something that just seems contrary. WinAmp works the way I expect a music player should work. However, as Media Player is already installed and the other programs are large-ish, I will use Media Player. I did remap the buttons so that when Media Player is on top, I can adjust volume, go forward/previous, or stop/play, and they remap to normal function when Media Player is minimized.

eBooks – Mobibook does the job. I use MS Word for ebooks I have in Word or txt format too, although scrolling is a bit of a chore.
Weather – Shane Church has built Weather Info, one of the nicest, simplest, most functional, does-what-it-says-and-no-more programs I have seen. The weather data comes from a web service and is cached, so the data will still display if the device does not have an internet connection.
Python Tableau

BC Executive Compensation Tableau Visualization

The Province of British Columbia publishes public agency executive compensation each year. The data is grouped by:

  • Sector – Employers are categorized in one of six sectors – Colleges, Crown Corporations, Education, Health, Public Service, Universities
  • Employer – Employers belong to one Sector. There are about 190 employers such as BC Assessment Authority, BC Housing Management Commission, BC Hydro, etc
  • Employee – An employee belongs to one Employer, and the full name of each employee is published.

The employee’s salary is given as:

  • Base Salary
  • Total Compensation

The data is provided each year as pdf and csv data files. I downloaded csv data files for each year from 2012 to 2016 and combined them into one dataset.

Here is a link to the Tableau Public dashboard visualizing the data; the chart is embedded below.


I used Python to combine the data. Some of the Sector values were missing spaces, one of the years had a “$” symbol in salary column values which had to be removed, and I added a new year column based on the data file’s name.

Here is the Python code used to combine and clean up the data:

import csv
import pandas as pd
from datetime import date, datetime
import glob
import os

def main():

    ## path for data files
    data_path = 'C:/bc_salary_data/'

    ## read each year's csv file, clean it, and collect it in a list
    allFiles = glob.glob(data_path + "*.csv")
    frame = pd.DataFrame()
    list_ = []
    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None, header=0)
        df['filename'] = os.path.basename(file_).split('.')[0]
        ## fix Sector values that were missing spaces
        df['Sector'] = df['Sector'].str.replace(' ','').str.replace('CrownCorporations', 'Crown Corporations').str.replace('CrownCorporation', 'Crown Corporations').str.replace('PublicService', 'Public Service')
        ## new year column based on the data file's name
        df['year'] = df['filename'].str[-4:]
        ## remove "$" symbols from the salary columns ("\$" escapes the
        ## dollar sign because str.replace treats the pattern as a regex)
        df['Base Salary'] = df['Base Salary'].apply(str).str.replace('\$', '')
        df['Total Compensation'] = df['Total Compensation'].apply(str).str.replace('\$', '')
        list_.append(df)

    frame = pd.concat(list_)

    ## Create csv file to write results to (output file name is illustrative)
    frame.to_csv(data_path + 'combined_salaries.csv', index=False)

    print str(datetime.now()) + ' data files merged'

if __name__ == '__main__':
    main()


Business Intelligence

Business Intelligence – IT Black Box Challenge

The process of engaging IT teams can be difficult to understand for many non-technical people. IT is perceived as a ‘black box’ because its inner workings are often not transparent.


BI teams may not realize that their business side coworkers do not know how to engage them.


When IT workers reach out to engage business users they may often confuse their potential customer by using too much jargon.


Technical teams also often make too many assumptions about non-technical co-workers’ understanding of IT terms and descriptions.


Therefore, it is important for any IT worker to start any engagement with non-technical co-workers at the most basic level of communication.

Do not assume anything. Do not use jargon. Explain your technical job title and role. Describe the process of engagement, what you will do, and what will be expected of the non-technical co-worker. Shine some light on the inside of that IT black box and make it transparent!

Amazon Web Services (AWS) API Chart.js Python Twitter API

Twitter Followers and List Membership Tracking

I have created my own historical reporting of my Twitter account followers and list memberships. I have hosted the reporting on this open webpage.

Most people know what followers are, but list memberships aren’t as well known, so here are definitions of both.

Twitter Followers – Twitter followers are other Twitter accounts who follow my account. A Twitter Follower may follow my account for a few days and then unfollow me at any time afterwards.

Twitter List Memberships – Twitter accounts can also create their own Lists and then add other Twitter accounts to the lists. For example a Twitter account may create a List called “Tweet about Big Data” to track Twitter accounts they believe “Tweet about Big Data”. Twitter accounts may add my account to their List and remove it at any time afterwards.

The Twitter data retrieval, writing to database, and data querying are all done on a web server.

In order to record changes in counts of these you need to have daily historical data. The Twitter API doesn’t provide historical data, so you need to create it yourself by retrieving and saving daily data somewhere.

Three Python scripts using Twitter API, Python Tweepy and AWS SDK are scheduled to run daily using cron jobs.
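
The scheduling itself is plain cron; hypothetical crontab entries (the script names, paths, and times here are examples only, not the actual setup) might look like:

```
# Run the three Twitter jobs daily; paths and times are illustrative
0 6 * * * python /home/twitter/get_followers.py >> /home/twitter/cron.log 2>&1
5 6 * * * python /home/twitter/get_list_memberships.py >> /home/twitter/cron.log 2>&1
10 6 * * * python /home/twitter/update_s3_site.py >> /home/twitter/cron.log 2>&1
```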

Two scripts retrieve followers and list memberships and insert the data into a PostgreSQL database. This daily retrieval builds the daily history.

Another script queries the database tables to create reporting datasets of new, active and dropped followers and list memberships, which are exported as csv files to an AWS S3 folder that also holds the files for an AWS S3 hosted static website.
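
The new/dropped reporting boils down to comparing consecutive daily snapshots. A minimal sketch using set differences (the follower ids here are made up):

```python
# Two daily snapshots of follower ids (made-up values).
yesterday = {'101', '102', '103'}
today = {'102', '103', '104'}

new_followers = today - yesterday        # followed since yesterday
dropped_followers = yesterday - today    # unfollowed since yesterday
active_followers = today & yesterday     # still following
```

The same set logic applies to list memberships.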

The AWS S3 hosted static website uses the Chart.js and D3.js Javascript charting libraries to read and visualize the data. This post does not describe how to read the csv files, but I have written another post, AWS S3 csv file as D3 report data source, that describes this.

A screenshot of one of the visualizations showing active followers by day, with counts of new follows and unfollows by day is shown below.


A second screenshot of another visualization showing active list memberships by day, with counts of new listings and un-listing by day is shown below.


The Python code to retrieve the Twitter data, transform it and create csv files and upload them to AWS is below.

The code to retrieve Twitter followers:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get secrets
sys.path.insert(0, '/home/secrets/')
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
)
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = date.today()

def get_followers():

    twitter_followers = []
    for user in tweepy.Cursor(api.followers, screen_name="twitterusername").items():
        twitter_followers.append(user)

    for user in twitter_followers:
        query = "INSERT INTO twitter_followers \
                (date, \
                follower_id, \
                name, \
                screen_name) \
                VALUES \
                (%s, %s, %s, %s)";
        data = (today,
                user.id_str,
                user.name,
                user.screen_name)
        cursor.execute(query, data)

    conn.commit()

    ## print job status to log
    print str(datetime.now()) + ' twitter followers'

if __name__ == '__main__':
    get_followers()

The code to retrieve Twitter list memberships:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get database creds
sys.path.insert(0, '/home/secrets/') 
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
)
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = date.today()

def get_list_memberships():

    twitter_list_memberships = []
    for membership in tweepy.Cursor(api.lists_memberships, screen_name="twitterusername").items():
        #print membership.full_name
        twitter_list_memberships.append(membership)

    for membership in twitter_list_memberships:

        query = "INSERT INTO twitter_list_memberships \
                (date, \
                id_str, \
                name, \
                full_name, \
                member_count, \
                subscriber_count, \
                mode, \
                following, \
                user_screen_name) \
                VALUES \
                (%s, %s, %s, %s, %s, %s, %s, %s, %s)";
        data = (today,
                membership.id_str,
                membership.name,
                membership.full_name,
                membership.member_count,
                membership.subscriber_count,
                membership.mode,
                membership.following,
                membership.user.screen_name)
        cursor.execute(query, data)

    conn.commit()

    ## print status for log
    print str(datetime.now()) + ' twitter list memberships'

if __name__ == '__main__':
    get_list_memberships()

The code to create csv files and upload to AWS S3 bucket:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from datetime import datetime
from os import listdir
from os.path import isfile, join
import sys
import csv
import psycopg2
import re
from collections import Counter

upload_path = '/home/twitter/aws_s3_site/'

sys.path.insert(0, '/home/secret/')
from aws_keys import aws_keys
from secrets import secrets

## create aws S3 connection
conn = S3Connection(aws_keys['AWS_KEY'], aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('twitter-bucket', validate=False)

## create db connection
conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
)
cur = conn.cursor()

## get data sets from db

## followers
cur.execute ("select something here;")
followers = list(cur.fetchall())

## lists
cur.execute ("select something here;")
lists = list(cur.fetchall())


## write db data to csv files, save in upload folder

## followers
with open(upload_path + 'followers.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in followers:
        writer.writerow(row)

## lists
with open(upload_path + 'lists.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in lists:
        writer.writerow(row)

## upload csv files to S3 twitter-bucket
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing bucket keys to reset
for key in bucket.list():
    if '.csv' in key.name:
        bucket.delete_key(key)

# complete uploading to AWS
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)
## write success to cron log
print str(datetime.now()) + ' aws s3 twitter upload'


Installing Hadoop on Windows 8.1

It turned out to be quite simple to install Hadoop on Windows 8.1.

I have been using Hadoop virtual machine images from Cloudera, Hortonworks or MapR when I wanted to work with Hadoop on my laptop. However these VMs are big files and take up precious hard drive space, so I thought it would be nice to install Hadoop directly on my Windows 8.1 laptop.

After some searches I found excellent instructions on installing Hadoop on Windows 8.1 on Mariusz Przydatek’s blog.

Note that these instructions are not for complete beginners. There are lots of assumptions that the reader understands basic Windows environment configuration, the Java SDK, Java applications, and building binaries from source files. So beginners will also want to look at more detailed step-by-step instructions and use Mariusz’s instructions as a guide to make sure they are doing things correctly.

The most important thing to note is that you do not need Cygwin to install Hadoop on Windows, despite the many other tutorials and how-to blog posts that insist Cygwin is required.

Also, because you need to build Hadoop on your computer, you need MS Visual Studio; the trial version is fine, as you don’t need it after you build the Hadoop binaries. Other tutorials and how-to blog posts vary on which version of MS Visual Studio you need, but this blog makes it clear.

At a high level, the Hadoop installation follows these steps:

  • Download and extract Hadoop source files
  • Download and install packages and helper software
  • Make sure System Environment Path and Variables are correct
  • Make sure configuration files and package paths are ok
  • Execute command that uses packages and helper software to build Hadoop binaries
  • Copy newly created Hadoop binaries into new Hadoop production folder
  • Execute command to run the new Hadoop binaries
  • Hadoop is running and ready to be used

It’s pretty obvious but worth stating again that this tutorial installs only a single-node Hadoop cluster, which is useful for learning and development. I quickly found out that I had to increase memory limits so it could successfully run medium-sized jobs.

After Hadoop is successfully installed, you can then install Hive, Spark, and other big data ecosystem tools that work with Hadoop.


Web Development

Vultr and Digital Ocean make it easy to get projects going fast

Recently I developed a Django web application for a friend. I wanted to be able to get it up and out onto the internet so she could view it and get others to do user testing.

Normally I host my web applications on my Webfaction account. But I wanted to keep this project separate so I went over to Digital Ocean and spun up a virtual server to host the Django web application.

We didn’t have a domain name for this site yet, so we just used the Digital Ocean-provided IP address to access the site, created a self-signed SSL certificate for Apache, and forced all traffic to HTTPS to enable secure browsing.

I have used Digital Ocean for development purposes and to share projects with others from time to time. I have also used Vultr.

  • Digital Ocean – Use this link to open account and get $10 credit.
  • Vultr – use this link to open account and get $20 credit (summer 2016 promotion).


Both have monthly price plans that can also be pro-rated depending on how many days you use them, so you can spin them up for a few minutes or days and pay accordingly. I am using the basic plans on both, which are good enough for my demonstration and development web application purposes.

  • Digital Ocean – $5 plan includes 1 cpu, 512MB of RAM, 20GB SSD storage and 1000GB Bandwidth
  • Vultr – $5 plan includes 1 cpu, 768MB Memory, 15 GB SSD storage and 1000GB Bandwidth

Beware though that unlike the huge number of shared hosting such as Webfaction, Digital Ocean and Vultr virtual servers are yours to own and manage. You get root access to the server and are responsible for managing the server, OS, how it is configured and secured, etc.

That is definitely something you should consider. There are lots of new things to learn. However there are tutorials and lots of help available on the internet, so I recommend at least attempting to set up either or both to learn first-hand. The cost is small and the risk very low. The knowledge gained and benefits are well worth it.

Many comparison reviews seem to prefer Vultr over Digital Ocean for performance and features. However Digital Ocean was first and has done an excellent job of documenting how to set up many web applications. But Vultr has good documentation too. You can’t really go wrong with either for development projects. If you don’t like one, just switch to the other!



Django Geocoding Python Web Development

Django form geocode submitted address to get lat, lon and postal code

One of my Django applications includes a form where a user can enter and submit a property address.

django form

The user submitting the form might not know the postal code so I left it optional. However the postal code is a key piece of information for this particular application so I wanted to ensure that I was getting it.

I also wanted to geocode the address immediately to get the address latitude and longitude so it could be shown to user on a Leaflet.js map.

There are lots of free geocoding services for low-intensity usage, but I ended up using Google Geocoding, which is free under a certain usage level. You just need to create a Geocoding API project and use the credentials to set up the geocoding.

To interact with the geocoding API I tried the Python geocoding modules geopy and geocoder, but in the end I just used the Python Requests module instead, as it was less complicated.

When the user clicks the Submit button, behind the scenes Requests submits the address to Google’s Geocoding API and gets the JSON response containing the latitude, longitude and postal code, which are then written to the application database.

I will update the code in future to check if the user’s postal code is correct and replace it if it is incorrect. Will wait to see how the postal code accuracy looks. Making geocoding API requests too often could bump me over the free usage limit.
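A side note on building the request URL: concatenating the raw address works for simple cases, but URL-encoding it avoids trouble with spaces and special characters. A sketch using the standard library (the key is a placeholder and the address is made up):

```python
from urllib.parse import urlencode

google_geocode_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'  # placeholder key
address = '123 Main St, Vancouver, BC, V5K 1A1'      # made-up address

# urlencode escapes spaces and commas safely
params = urlencode({'address': address, 'key': google_geocode_key})
url = '' + params
# the response would then be fetched with requests.get(url).json()
```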

The Django View that contains the code described above is shown below.

def property_add(request):
    property_list = Property.objects.order_by('created')
    if request.method == 'POST':
        form = PropertyForm(request.POST)
        if form.is_valid():
            new_property =
            address = "%s, %s, %s, %s" % (new_property.address1,, new_property.state, new_property.postcode)
            google_geocode_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'
            url = '' + "'" + address + "'" + '&key=' + google_geocode_key
            try:
                response = requests.get(url)
                geoArray = response.json()
       = geoArray['results'][0]['geometry']['location']['lat']
                new_property.lon = geoArray['results'][0]['geometry']['location']['lng']
                new_postcode = geoArray['results'][0]['address_components'][7]['long_name']
                new_fsa = geoArray['results'][0]['address_components'][7]['short_name'][:3]
            except (requests.RequestException, LookupError):
                # geocoding failed; leave the location fields empty
       = None
                new_property.lon = None
                new_postcode = None
                new_fsa = None
            if new_property.postcode:
                new_property.fsa = new_property.postcode[:3]
                new_property.postcode = new_postcode
                new_property.fsa = new_fsa
            new_property.user_id =
            return HttpResponseRedirect(reverse('property', args=(,)))
        form = PropertyForm()

    context_dict = {
        'form': form,
        'property_list': property_list,
        'title': 'Add Property',
    return render(request, 'property_add.html', context_dict)  # template name assumed


Django Geocoding Javascript Postgres Python Web Development

Leaflet.js choropleth map color by count using geoJSON datasource

I have a Django web application that needed an interactive map with shapes corresponding to Canadian postal code FSA areas that were different colors based on how many properties were in each FSA. It ended up looking something like the screenshot below.


This exercise turned out to be relatively easy using the awesome open-source Javascript map library Leaflet.js.

I used this Leaflet.js tutorial as the foundation for my map.

One of the biggest challenges was finding a suitable data source for the FSAs. Chad Skelton, a (now former) data journalist at the Vancouver Sun, wrote a helpful blog post about his experience getting a suitable FSA data source. I ended up using his BC FSA data source for my map.

Statistics Canada hosts Canada Post FSA boundary files for all of Canada. As Chad Skelton notes, these have boundaries that extend out into the ocean, among other challenges.

Here is a summary of the steps that I followed to get my choropleth map:

1. Find and download FSA boundary file. See above.

2. Convert FSA boundary file to geoJSON from SHP file using qGIS.

3. Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

4. Create the Leaflet.js map in an HTML page: basically an HTML DIV that holds the map, plus a separate Javascript script that loads Leaflet.js and the FSA geoJSON boundary data and processes it to create the desired map.

Find and download FSA boundary file.

See above.

Convert FSA boundary file to geoJSON from SHP file using qGIS.

Download and install qGIS. It’s free and open source.

Use qGIS to convert the data file from Canada Post or another source to geoJSON format. Lots of blog posts and documentation about how to use qGIS for this are just a Google search away.

My geoJSON data source looked like this:

var bcData = {
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::4269" } },
    "features": [
    { "type": "Feature", "properties": { "CFSAUID": "V0A", "PRUID": "59", "PRNAME": "British Columbia \/ Colombie-Britannique" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -115.49499542, 50.780018587000029 ], [ -115.50032807, 50.77718343600003 ], [ -115.49722732099997, 50.772528975000057 ], [ -115.49321284, 50.770504059000075 ], [ -115.49393662599999, 50.768143038000062 ], [ -115.50289288699997, 50.762270941000054 ], [ -115.50846411599997, 50.754243300000041 ], [ -115.5104796, 50.753297703000044 ], [ -115.51397592099994, 50.748953800000038 ], [ -115.51861431199995, 50.745737989000077 ], [ -115.52586378899997, 50.743771099000071 ], [ -115.53026371899995, 50.74397910700003 ], [ -115.53451319199996,


Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

I used a SQL query in the Django View to get count of properties by FSA.

This dataset looks like this in the template. These results have only one FSA; if there were more, there would be more FSA / count pairs.

   var fsa_array = [["V3J", 19]];

Below is code for  the Django view query to create the fsa_array FSA / counts data source.

    cursor = connection.cursor()
        "select fsa, count(*) \
        from properties \
        group by fsa \
        order by fsa;")
    fsas_cursor = list(cursor.fetchall())

    fsas_array = [(x[0].encode('utf8'), int(x[1])) for x in fsas_cursor]
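To get `fsas_array` from the view into the template as the Javascript literal shown above, one approach (an assumption on my part about the exact mechanism; the `fsas_array_safe` name in the Javascript below suggests the value is passed through Django's `safe` filter) is to serialize it with `json.dumps`. Python 3 here, so no `encode('utf8')` step:

```python
import json

# pairs as produced by the query above
fsas_array = [('V3J', 19)]

# tuples serialize as JSON arrays, giving the [["V3J", 19]] shape the map expects
fsas_array_json = json.dumps(fsas_array)
```

In the template, rendering this string with `{{ fsas_array_json|safe }}` would produce the `var fsa_array = [["V3J", 19]];` literal.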

My Javascript largely retains the Leaflet tutorial code with some modifications:

1. How the legend colors and intervals are assigned is changed, but otherwise the legend functions the same.

2. Significantly changed how the color for each FSA is assigned. The tutorial had the color in its geoJSON file, so it only had to be referenced directly. My colors were coming from the View, so I had to change the code to include a new function that matches FSAs in both my Django view data and the geoJSON FSA boundary file and returns the appropriate color based on the Django view data set count.

var map ='map',{scrollWheelZoom:false}).setView([ active_city_center_lat, active_city_center_lon], active_city_zoom);

map.once('focus', function() { map.scrollWheelZoom.enable(); });

var fsa_array = fsas_array_safe;

L.tileLayer('{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw', {
    maxZoom: 18,
    attribution: 'Map data © OpenStreetMap contributors, ' +
        'CC-BY-SA, ' +
        'Imagery © Mapbox',
    id: 'mapbox.light'

// control that shows state info on hover
var info = L.control();

info.onAdd = function (map) {
    this._div = L.DomUtil.create('div', 'info');
    return this._div;

info.update = function (props) {
    this._div.innerHTML = (props ?
        '' + props.CFSAUID + ' ' + getFSACount(props.CFSAUID) + ' lonely homes'
        : 'Hover over each postal area to see lonely home counts to date.');



// get color
function getColor(n) {
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c'
           : n > 20 ? '#fd8d3c'
           : n > 15 ? '#feb24c'
           : n > 10 ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';

function getFSACount(CFSAUID) {
    var fsaCount;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            fsaCount = ' has ' + fsa_array[i][1];
    if (fsaCount == null) {
        fsaCount = ' has no ';
    return fsaCount;

function getFSAColor(CFSAUID) {
    var color;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            color = getColor(fsa_array[i][1]);
            //console.log(fsa_array[i][1] + '-' + color)
    return color;

function style(feature) {
    return {
        weight: 1,
        opacity: 1,
        color: 'white',
        dashArray: '3',
        fillOpacity: 0.7,
        fillColor: getFSAColor(

function highlightFeature(e) {
    var layer =;

        weight: 2,
        color: '#333',
        dashArray: '',
        fillOpacity: 0.7

    if (! && !L.Browser.opera) {



var geojson;

function resetHighlight(e) {

function zoomToFeature(e) {

function onEachFeature(feature, layer) {
        mouseover: highlightFeature,
        mouseout: resetHighlight,
        click: zoomToFeature

geojson = L.geoJson(bcData, {
    style: style,
    onEachFeature: onEachFeature

var legend = L.control({position: 'bottomright'});

legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend'),
        grades = [0, 1, 5, 10, 15, 20, 25, 30],
        labels = [],
        from, to,
        var_from_to, var_color;

    for (var i = 0; i < grades.length; i++) {
        from = grades[i];
        if (i === 0) {
            var_from_to = grades[i];
            var_color = getColor(from);
        } else {
            var_from_to = from + (grades[i + 1] ? '–' + grades[i + 1] : '+');
            var_color = getColor(from + 1);
            '<i style="background:' + var_color + '"></i> ' + var_from_to);

    div.innerHTML = labels.join('<br>');
    return div;


That is pretty much all there is to creating a very nice looking, interactive, open-source choropleth map for your Django web application!


Canadian Canola seed crushing more efficient at extracting canola oil

Statistics Canada regularly tweets links to various Canadian statistics. I have occasionally created quick Tableau visualizations of the data and replied with a link to my Tableau Public site.

The idea is to encourage Statistics Canada to start communicating more visually and create their own visualizations for all data. It’s extra work, but the value will be realized when Statistics Canada visitors can understand the data instantly by looking at visualizations instead of munging about with boring data tables.

This particular tweet linked to Statistics Canada data reporting on Canadian canola seed crush, canola oil and meal output.

The data shows Canada’s canola seed production and efficiency at extracting canola oil has increased significantly since 1971.

Canola seed is crushed to extract the oil, and the seed meal is left over. The ratio of oil to meal was about .8 in 2016 compared to .3 in 1971. That is an impressive increase in oil extraction efficiency.