Microsoft Windows Copilot General Availability September 26, 2023

Microsoft is rolling out Copilot across its products, bringing an LLM-powered Copilot user experience to Bing, Edge, Microsoft 365, and Windows.

Read the Microsoft blog post for details.

Watch this YouTube video for details.

These new AI features are in four major product areas.

Windows Copilot

The first release is Windows Copilot, which starts to become available on September 26, 2023 as a free Windows 11, version 22H2 update. This Windows 11 update brings over 150 new features and Copilot AI-powered experiences to apps like Paint, Photos, Clipchamp and more, right on your Windows PC.

Bing and Edge Copilot

While Bing already has ChatGPT-powered chat enabled, Bing and Edge will be supercharged by the latest models, delivering the most advanced AI capabilities available. The most exciting of these is the DALL-E 3 image and visual generative AI capability. This should also be available soon, though no exact date was given.

Microsoft 365 Copilot

Microsoft 365 enterprise customers will have to wait until November 1, 2023 to get Microsoft 365 Copilot for Microsoft 365 apps such as Word, Excel, PowerPoint, Outlook, Teams, and more.

This will include Microsoft 365 Chat, which promises to be a very powerful enterprise feature. Microsoft 365 Chat combs across your entire universe of data at work (all your emails, meetings, chats, documents, and more, plus the web) so you can interact with it, for example by asking for meeting summaries.

Microsoft Surface Devices

New Surface devices are coming that have these AI features integrated. These include laptops and whiteboard devices.


Voip.ms getCDR API – get all call detail records

Voip.ms is a reliable, affordable and customizable VoIP service.

Recently I wanted to get all of my Voip.ms telephone number call detail records (CDR). Voip.ms provides a nice UI to manually view these, but I had hundreds of records that I wanted to analyse. So I turned to the Voip.ms REST/JSON API, which turned out to be very easy to work with (link to API docs).

The API has a getCDR endpoint. To get started I downloaded one of the provided PHP examples (a link near the top of the API documentation page downloads a zip file with code examples), which made it easy to quickly understand the endpoint’s requirements for getting all of my telephone number’s call detail records. The API documentation page also includes a list of parameters and response values for each endpoint.

Before you can use the API you must enable it in your Voip.ms account and register the IP address(es) that will be making the calls.

The getCDR endpoint also limits how much data can be returned per API call. Working around this only required looping through the date range from my first CDR to the current date, fetching a 90-day window of CDRs per call.

Each 90-day response was appended to a dataset that was then exported to a csv file.

Code is provided below. Note that the password used is not your Voip.ms account password but another API password you create in their API management form at the top of the API docs page.

Aside from the getCDR endpoint, the VoIP.ms API has over 100 functions to help you integrate its services into your own website or applications.

<?php

    $user = "username";
    $pass = "api_password";  /* API pw diff from account pw */

    $currentDate = new DateTime(date("Y-m-d"));
    $startDate = new DateTime('2019-08-08');
    $endDate = clone $startDate;
    $endDate->modify('+90 days');
    
    $cdr = []; // Array to hold all records
    
    $filename = "path\\to\\your\\folder\\" . $currentDate->format('Y-m-d') . "-api-getcdr_response-data.csv"; // Specify your desired path and filename
    
    // Check if the file exists
    if (file_exists($filename)) {
        // Delete the existing file
        unlink($filename);
    }
    
    $handle = fopen($filename, 'w'); // Open file for writing
    
    while ($startDate < $currentDate) {
        // Cap the end date at the current date
        if ($endDate > $currentDate) {
            $endDate = clone $currentDate;
        }
        
        // Fetch records for this date window and append them to the full set
        $records = fetchRecords($startDate->format('Y-m-d'), $endDate->format('Y-m-d'));
        $cdr = array_merge($cdr, $records);
        
        echo $startDate->format('Y-m-d') . ", " . $endDate->format('Y-m-d') . ", " . $currentDate->format('Y-m-d') . "\n";
        
        // If endDate has reached the current date, all records have been fetched
        if ($endDate == $currentDate) {
            break;
        }
        
        // Advance to the next window: start the day after the last endDate
        $startDate = clone $endDate;
        $startDate->modify('+1 day');
        $endDate = clone $startDate;
        $endDate->modify('+90 days');
    }
    
    function fetchRecords($from, $to) {
        global $user, $pass;
    
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt($ch, CURLOPT_URL, "https://voip.ms/api/v1/rest.php?api_username={$user}&api_password={$pass}&method=getCDR&date_from={$from}&date_to={$to}&answered=1&timezone=-5");
        $result = curl_exec($ch);
        curl_close($ch);
    
        $response=json_decode($result,true);
        if($response['status'] != 'success'){
            echo $response['status'];
            return [];
        }
    
        return $response['cdr'];
    }
    
    fputcsv($handle, array('Date', 'Caller ID', 'Destination', 'Description', 'Account', 'Disposition', 'Duration', 'Seconds', 'Rate', 'Total', 'Unique ID', 'Destination Type'));
    
    foreach ($cdr as $row) {
        fputcsv($handle, array(
            $row['date'],
            $row['callerid'],
            $row['destination'],
            $row['description'],
            $row['account'],
            $row['disposition'],
            $row['duration'],
            $row['seconds'],
            $row['rate'],
            $row['total'],
            $row['uniqueid'],
            $row['destination_type']
        ));
    }
    
    fclose($handle);
    
    ?>

 

OpenAI Code Interpreter river flow analysis

OpenAI Code Interpreter is pretty amazing!

I put it to the test on some data I analysed recently.

The data was a multi-year daily set of a river’s level and discharge measurements at a specific location on the river. The river level is measured in meters and the discharge in cubic meters per second. The data comes from a Canadian government website.

The data was downloaded from the website as a single csv file with one row per measurement per day per parameter and had the following columns:

    • Date – formatted as YYYY/MM/DD
    • Parameter – either level or discharge
    • Value – decimal number

In a new ChatGPT chat, I selected GPT-4 with the Code Interpreter model. This provides a prompt input that includes a file upload feature. I could then upload the csv file and make the first prompt below, asking “what is the relationship between the parameters level and discharge?”

The response gave some basic insights about the shape of the data in the csv file, along with an impressive observation that the data had to be reshaped in order to analyse the relationship between level and discharge, so that it could create a scatter plot of level vs discharge and calculate a correlation coefficient.
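
For reference, that reshaping and correlation step looks roughly like the simplified pandas sketch below. This is my own illustration, not the generated code verbatim; the column names come from the csv described above, while the file name and the exact “level”/“discharge” labels are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

# Load the daily measurements (columns: Date, Parameter, Value)
df = pd.read_csv('river_daily_data.csv')  # hypothetical file name

# Reshape so level and discharge become separate columns, one row per date
wide = df.pivot(index='Date', columns='Parameter', values='Value')

# Correlation coefficient between level and discharge
print(wide['level'].corr(wide['discharge']))

# Scatter plot of level vs discharge
wide.plot.scatter(x='level', y='discharge')
plt.show()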

In the responses you can click the “Show work” button to reveal the Python code and results generated. In this next screenshot you can see it load the csv and show its contents.

The resulting scatter plot is something I had created manually in my previous analysis. In the world of water flow analysis this visualization is called a “rating curve”. Each location on a river can be characterized by its rating curve; afterwards one only needs to collect river level values and calculate the discharge from the rating curve.

The response also included some analysis of the results along with the calculated correlation coefficient to support the identification of a positive correlation between level and discharge.

I then told it that these were water flow measurements and asked it to visualize each parameter by year.

The response reshaped the data again so it could create a stacked line chart by year for each of level and discharge over the months and days of the year.

This visualization allows a visual comparison of annual flows by month and day. This can help identify anomalies visually. The overall annual trend is more water flow in the spring in this river’s case primarily due to snow melt.

In order to better visualize level and discharge changes over time I asked it to plot both values by day over all years.

The analysis in the response isn’t very novel, but it does include an obligatory caution about comparing parameters in a chart with two y-axes. However, in this case, level and discharge are related.

It may be interesting to note that while discharge is relatively constant over time, the level is decreasing. This indicates that the river channel is becoming  wider and more shallow at this location.

I also asked it to identify anomalous time periods for each parameter. It described what it could do to respond to this and proceeded to do the analysis. In this case it explained that it could calculate “z-scores” and then identify statistical outliers based on daily z-score values.

It then listed the years that had anomalies, along with the count of days with high absolute z-score values.
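
A rough sketch of that kind of z-score outlier count is below. Again this is my own simplified version, assuming the reshaped level/discharge dataframe from the earlier sketch and a |z| > 3 outlier threshold.

import pandas as pd

# wide: dataframe indexed by date with 'level' and 'discharge' columns (see earlier sketch)
wide.index = pd.to_datetime(wide.index)

# Daily z-scores for each parameter
z = (wide - wide.mean()) / wide.std()

# Flag days where the absolute z-score exceeds 3, then count anomalous days per year
outliers = z.abs() > 3
anomalies_by_year = outliers.groupby(outliers.index.year).sum()
print(anomalies_by_year)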

I asked it to visualize these and it produced a clustered bar chart of level and discharge anomaly counts by year.

I paused during my analysis and came back a few hours later to continue. In the meantime, the chat had timed out, which dropped the Python cache of its previous work, including the uploaded csv file, so it had to apologize for not being able to continue until it could rerun its code.

It also needed me to re-upload the csv file so it could rerun all of its work from the beginning.

As an example of reworking the analysis, I then had it modify the level and discharge value line chart so it was split by year using sub-plots, and it did a pretty good job of that.

I was able to successfully continue modifying the visualization. Some of these modifications are described below by way of example.

Code Interpreter made this analysis fast and easy. Normally when fine-tuning like this I have to do a lot of manual experiments with the visualization module features. It is a lot faster to do it using Code Interpreter.

Fabric Dataflow Gen2 Web API connection with dynamic current date parameter

I created a Fabric Dataflow Gen2 Web API connection to retrieve csv file data from an Environment Canada daily environmental data web API.

The API provides parameters for specific date ranges eg start and end dates.

I wanted to get data for a date range that had a fixed start date but dynamic end date = current date.

An example of the url for a hard-coded date range is:

https://api.weather.gc.ca/collections/climate-daily/items?datetime=2013-01-01%2000:00:00/2023-06-11%2000:00:00&STN_ID=51558&sortby=PROVINCE_CODE,STN_ID,LOCAL_DATE&f=csv&limit=150000&startindex=0

In this example I want to replace the “2023-06-11” with a dynamic current date.

This turned out to be relatively simple. The steps to do this are provided below.

Create a new Dataflow Gen 2 dataflow.

 

Choose the Web API connector.


In the Web API connector settings the url field is empty.

 

Enter the full hardcoded url eg from my example above. (We will edit the url later to include a dynamic end date.) Then click Next to retrieve the data.

Preview the data to make sure that the connector works. Then click Next.

 

You can then see the dataflow query steps and note that the Source is our hardcoded url. You can edit the query Source to include a  dynamic end date.

Instead of the hardcoded end date, use DateTime.LocalNow() formatted to match the Web API parameter requirements:

"https://api.weather.gc.ca/collections/climate-daily/items?datetime=2013-01-01%2000:00:00/" & Date.ToText(Date.From(DateTime.LocalNow()), "yyyy-MM-dd") & "%2000:00:00&STN_ID=51558&sortby=PROVINCE_CODE,STN_ID,LOCAL_DATE&f=csv&limit=150000&startindex=0"

Replace the hardcoded url with this new dynamic url. Preview that the connector is still working with the edits and click Next.

 

That is it. Now when you refresh your dataflow it will retrieve an updated range of data.

CBC news article comments analysis

This post focuses on a project analysing comments scraped from CBC News website articles. (BTW, there are lots of “CBCs” in the world. The one I am referring to is the Canadian Broadcasting Corporation.) Another post details the web scraping method used to get the articles and their comments.

As a Canadian often travelling and working abroad I use the CBC News site as a way to stay current on Canadian news. Not all of the articles have comments enabled.

Frankly, the quality of the comments is quite low. While the comments are actively moderated, it is very easy to create an account, so there are probably a lot of bots and throwaway accounts. The comments can represent the worst of social media: often very snarky, aggressive and politically partisan.

However, as a data worker I was curious to analyze them. I wanted to learn more about who made how many comments, frequently used words and phrases, and interactions between people commenting (comments and their replies) on the site.

I used a variety of techniques to scrape the articles and their comments and save the data as csv files which I could then analyse. Each comment includes a timestamp of when it was posted, the comment text, and the commenter’s user name, and, if it is a reply, the name of the user being replied to.

After an article is posted, comments are made dynamically, and after a while commenting is disabled, so each article can have a varying number of comments. Many articles never have commenting enabled, while some articles can have more than 10,000 comments by thousands of people.

One of the analyses I did on the scraped data was to create “network charts” for each article’s comments using the Python NetworkX module. Network charts are helpful to show interactions between nodes. In this case the nodes are the people making comments and replies.

A presentation of the resulting network charts for each article I analysed can be seen at this Github repository web page  https://sitrucp.github.io/cbc_comments/image_grid.html.

The code used to create the NetworkX charts is shown below and is also available in this Github repository.

import math

import matplotlib.pyplot as plt
import networkx as nx

# Assumes df (the comments dataframe loaded from the scraped csv file), file_url,
# file_name and file_path_image are already defined by the scraping/loading script.

# Drop comments without any replies
df.dropna(subset=['replied_to_user'], how='all', inplace=True)

# Build NetworkX graph from the dataframe edge list
# (edge attribute 'minutes' is the minutes elapsed from the first comment)
G = nx.from_pandas_edgelist(df, 'comment_user', 'replied_to_user', 'minutes')

# Create node size variable
d = nx.degree(G)

# create edges, and weights list for edge colors
# weights are minutes from first comment 
edges, weights = zip(*nx.get_edge_attributes(G,'minutes').items())

# create variable to increase graph figure size based on number of nodes to make more readable
factor = math.sqrt(len(G.nodes()) * 0.01)

# Create plot
plt_width = 25 * factor
plt_height = 25 * factor
fig, ax = plt.subplots(figsize=(plt_width, plt_height))
fig.set_facecolor('black')
ax.set_facecolor('black')

# create layout kamada_kawai_layout seemed best!
#pos = nx.spring_layout(G, k=.10, iterations=20)
#pos = nx.spring_layout(G)
pos = nx.kamada_kawai_layout(G)
#pos = nx.fruchterman_reingold_layout(G)

# draw edges
nx.draw_networkx_edges(
    G, 
    pos,
    arrows=True,
    arrowsize=20,
    edgelist=edges,
    edge_color=weights,
    width=1.0,
    edge_cmap=plt.cm.spring,
    node_size=[(d[node]+1) * 200 for node in G.nodes()], # tells edge to go join node on border
)

# draw nodes
nx.draw_networkx_nodes(
    G, 
    pos,
    node_color='red',
    alpha = 0.7,
    edgecolors='white', #color of node border
    node_size=[(d[node]+1) * 200 for node in G.nodes()],
)

# draw labels
nx.draw_networkx_labels(
    G, 
    pos, 
    labels=None, 
    font_size=10, 
    font_color='white', 
    font_family='sans-serif', 
    font_weight='normal', 
    alpha=None, 
    bbox=None, 
    horizontalalignment='center', 
    verticalalignment='center', 
    ax=None, 
    clip_on=False
)

# create variables to use in chart title
min_comment_time = df['comment_time'].min()[:-3]
max_comment_time = df['comment_time'].max()[:-3]
comment_count = len(df)

# create chart title text
title_text = file_url + '\n' + str(comment_count) + ' comments & replies '+ 'from: ' + min_comment_time + ' to: ' + max_comment_time

# add chart title
plt.title(title_text, fontsize=26 * factor, color='white')

#  save the image in the img folder:
plt.savefig(file_path_image + 'network_' + file_name + '.png', format="PNG")

The network charts have the following features.

Nodes: The red circles (“nodes”) are users. The red circle size corresponds to user comment counts. If you zoom in on the chart you can also see the commenter’s name in white lettering.

Edges: The lines (“edges”) connecting the red circles represent interactions between users as replies to comments. The line’s arrows indicate direction of interaction eg who replied.

It’s quite interesting to note that each article’s comments and replies have their own unique pattern of interactions, but also that overall there are general patterns of interaction.

The first network chart below contains only 104 comments and replies on the article Unvaccinated travellers over the age of 12 barred from planes and trains as of today. This provides a very clear, simple view of the nodes and lines.

(Note the comment data for this visualization was obtained just after the article was posted, when it had about 100 comments and replies. It eventually had 2,000+ comments.)

 

The second network chart below shows an article, On COVID restrictions, our governments keep firing up the gaslights and shifting the goalposts, with 6,052 comments and replies. These comments and replies were made by 1,140 people.

This makes for a very dense visualization and a big image size compared to the first network chart above. You can see that some people comment much more than others, represented by the larger red circles. Not surprisingly, these people garner more interactions, eg replies from others, represented by the blue lines.

Click on the network chart image to open it in your browser or save and download the image to your desktop and zoom into it to see the detail.

Zooming in for a closer look below shows more detail. The center of the chart has the users with the greatest number of comments and replies. The outer edges show users with fewer comments and replies.

And another closer look shows even more detail of the sparse, low comment and reply count users on the edges of the chart.

 

A few other articles’ network charts are shown below as examples of the variations they can have.

In a fiery speech, O’Toole says Canada is ‘drowning in debt and division’ on Trudeau’s watch

 

Trudeau calls for global carbon tax at COP26 summit

 

RCMP union says it supports a Mountie’s ‘right’ to refuse vaccination

 

 

Additional analysis

Comment word frequency

In addition to the network charts, it was quite easy to make a “word cloud” visualization showing the top 200 words in all of the comments and replies.

The word cloud highlights themes that arise in the comments. I think it would be interesting to do sentiment analysis on these. Perhaps that will come in the future.
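
For anyone curious, a word cloud like this takes only a few lines of Python using the wordcloud package. Below is a minimal sketch; the csv file name is hypothetical and the comment_text column comes from the scraping script described in the related post.

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Load the scraped comments and join all comment text into one string
df = pd.read_csv('article_comments.csv')  # hypothetical file name
text = ' '.join(df['comment_text'].dropna())

# Word cloud of the top 200 words (common English stopwords are removed by default)
wc = WordCloud(width=1200, height=800, max_words=200, background_color='black').generate(text)

plt.figure(figsize=(12, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()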

Comment people frequency

The line chart below shows percent users vs percent comments for a single article. This visualization highlights that about 50% (about 615) of 1,226 users made 90% of the comments. Only 9% (about 105 users) of the users made 50% of the comments!
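​
The underlying calculation is a simple cumulative share, roughly like the sketch below. This is my own simplified version; the csv file name is hypothetical and comment_user comes from the scraping script.

import pandas as pd

df = pd.read_csv('article_comments.csv')  # hypothetical file name

# Comments per user, sorted from most to least active
counts = df.groupby('comment_user').size().sort_values(ascending=False)

# Cumulative share of comments and of users (both ordered most active first)
cum_comments = counts.cumsum() / counts.sum()
cum_users = pd.Series(range(1, len(counts) + 1), index=counts.index) / len(counts)

# Smallest share of users accounting for 50% and 90% of all comments
print('users making 50% of comments:', round(cum_users[cum_comments >= 0.5].iloc[0], 2))
print('users making 90% of comments:', round(cum_users[cum_comments >= 0.9].iloc[0], 2))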

The “word cloud” chart below shows the names of the top 200 users by comment and reply count. The name size corresponds to user comment and reply counts.

Of the 7,800 comments 1,744 (22%) were “top-level” comments eg they were not directly replying to another comment. The rest 6,056 (78%) were replies to another comment. This indicates a lot of interaction between comments.

Canada cell tower distribution

opencellid.org has global cell tower location data that I used to create a series of QGIS maps showing Canada’s cell phone tower distribution. The data includes latitude, longitude and radio generation eg 4G (LTE) or 3G/2G (GSM, UMTS or CDMA).
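
Before mapping in QGIS, the tower data just needs to be filtered to Canada and split by radio generation. A minimal pandas sketch is below; it assumes the OpenCellID export has radio, mcc, lon and lat columns, and uses Canada’s mobile country code 302.

import pandas as pd

# Load the OpenCellID cell tower export (assumed columns: radio, mcc, lon, lat, ...)
towers = pd.read_csv('cell_towers.csv', usecols=['radio', 'mcc', 'lon', 'lat'])

# Keep only Canadian towers (MCC 302) and split by radio generation
canada = towers[towers['mcc'] == 302]
lte = canada[canada['radio'] == 'LTE']                         # 4G (red)
older = canada[canada['radio'].isin(['GSM', 'UMTS', 'CDMA'])]  # 3G/2G (blue)

# Save as separate csv files to load as point layers in QGIS
lte.to_csv('canada_towers_4g.csv', index=False)
older.to_csv('canada_towers_2g3g.csv', index=False)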

Canada cell tower distribution
4G (red) and 3G/2G (blue)

Southern Ontario cell tower distribution
4G (red) and 3G/2G (blue)

Southern Quebec cell tower distribution
4G (red) and 3G/2G (blue)

Canada Maritimes (NB, NS, PEI) cell tower distribution
4G (red) and 3G/2G (blue)

Newfoundland cell tower distribution
4G (red) and 3G/2G (blue)

Canada Prairies (AB, SK, MB) cell tower distribution
4G (red) and 3G/2G (blue)

BC and Alberta cell tower distribution
4G (red) and 3G/2G (blue)

Yukon and Northwest Territories cell tower distribution
4G (red) and 3G/2G (blue)

Nunavut cell tower distribution
4G (red) and 3G/2G (blue)

 

CBC news article comments web scraping

This post focuses on a project where I scraped comments from CBC News website articles. (BTW, there are lots of “CBCs” in the world. The one I am referring to is the Canadian Broadcasting Corporation.)

There is another post that focuses on analysis and visualization of the scraped data.

As a Canadian often travelling and working abroad I use the CBC News site as a way to stay current on Canadian news.

However, as a data worker I was curious to analyze the comments. I wanted to learn more about who made how many comments, frequently used words and phrases, and interactions between people commenting (comments and their replies) on the site.

The technique used to scrape the articles and their comments was to reveal all available comments on the page and then save the entire html page locally.

Then Python was used to extract the article details and the comments and replies from the downloaded document into a csv file. The csv file could then be analysed using Python.

Let’s continue with a bit more detail on the technique used.

The comments section is found below the news article (though comments are not enabled for all articles). The screenshot below shows the comments section.

 

After an article is posted, comments are added dynamically and the list of comments grows longer. Pagination requires manually clicking a “load more” button to see new comments.

After a while commenting is disabled. So each article can have varying numbers of comments. As previously mentioned, many articles never have commenting enabled. Some articles can have more than 10,000 comments by 1,000’s of people.

The comment delivery method makes it very difficult to read all of the comments because it uses the “endless scrolling” format.

In fact, for articles with more than a couple of hundred comments, it is likely no one will read all comments and their replies because the endless clicking is so tedious.

Pagination requires clicking a “SHOW MORE” button at the bottom of the comments again and again to show more comments.

Longer comments require clicking a “» more” link to reveal hidden text.

Comments with multiple replies require clicking a “SHOW 2 OLDER REPLIES” button to show more replies.

The scraping method needed to accommodate these. In order to see all of the comments and their complete text, we need a process that effectively clicks through all of the buttons above until all of the comments and their content are displayed on the webpage.

Once all of the content was visible on the webpage it could be saved locally and Python BeautifulSoup could be used to extract all comments and their content and save it in a tabular data format.

Using Chrome browser’s “Inspect”, “View page source” (Ctrl-U) and “Developer tools” (Ctrl-Shift-i) quickly revealed the relevant HTML tags behind the buttons identified above. These are the things that need to be “clicked” again and again until all the comments and their content are displayed on the webpage.

Each comment includes a timestamp of when it was posted, the comment text, and the commenter’s user name, and, if it is a reply, the name of the user being replied to.

Relevant code is provided below and can be found in this Github repository.

View complete list of CBC comments visualizations here.

// SHOW MORE COMMENTS
// div tag will have style="display: none;" if there are no more comments otherwise it is displayed
<div class="vf-load-more-con" style="display: none;">
<a href="#" class="vf-load-more vf-text-small vf-strong">Show More</a>
</div>

// SHOW REPLIES
// div tag will have the class "hidden" if there are no replies to show otherwise it is displayed
<div class="vf-comment-replies hidden">
<a class="vf-replies-button vf-strong vf-text-small" href="#">Show <span class="vf-replies">0</span> older replies</a>
</div>

// SHOW MORE COMMENT TEXT
// tag is displayed only when comment has hidden text otherwise the tag is not present
<a href="#" class="vf-show-more" data-action="more">» more</a>

The button clicking was somewhat automated using the Javascript below executed in the Developer tools console. The process currently requires pasting the code into the console and manually executing it. Step 1 required some babysitting to ensure it runs to completion satisfactorily.

The workflow to show all comments and their content is as follows:

    • Step 1: Run “STEP 1 – Show more comments” javascript in browser console.
    • Step 2: Run “STEP 2 – Show replies” javascript in browser console.
    • Step 3: Run “STEP 3 – Show more comment text” javascript in browser console.

At this point, all the comments and their content are displayed on the webpage.

    • Step 4: Save webpage locally.
    • Step 5: Run Python script to scape local webpage and save data as csv file.
    • Step 6: Open csv in Excel or analyse using your favourite data visualization tool.
//STEP 1 - Show more comments - pages with 1000's of comments get slower and the show button can take longer than 5000 ms to appear, so the script may require manual reruns

var timer = setInterval(getMore, 5000);
function getMore() {
    moreDiv = document.getElementsByClassName('vf-load-more-con')[0];
    if(moreDiv.style.display === "none") {
        console.log('vf-load-more comments finished');
        clearInterval(timer);
        return;
    }
    console.log('More comments');
    moreDiv.childNodes[0].nextElementSibling.click();
}

//STEP 2 - Show replies - loops to auto show all comments' replies

var buttons = document.getElementsByClassName('vf-replies-button');
console.log(buttons.length, 'vf-replies-button')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i, 'of', buttons.length) 
}
console.log('vf-replies-button finished');

//STEP 3 - Show more comment text - loops to show all comments' text

var buttons = document.getElementsByClassName('vf-show-more');
console.log(buttons.length, 'vf-show-more buttons')
for(var i = 0; i < buttons.length; i++) { 
    buttons[i].click(); 
    console.log('click', i, 'of',buttons.length) 
}
console.log('vf-show-more comment text finished');

Once all the comments and their content are displayed on the webpage, Step 4 is to save the webpage locally. You need to save it as a complete html page in order to keep the javascript, otherwise the page will be blank.

Then Step 5 is to run the following Python code to extract the comment data into a csv file.

This uses Python BeautifulSoup to extract HTML tag data into a Pandas dataframe which is then saved locally as a csv file.

import sys, os
import csv
import re
from datetime import datetime, timedelta
from bs4 import BeautifulSoup 
import pandas as pd

file_path_html = 'C:/cbc_comments/html/'
file_path_csv = 'C:/cbc_comments/data/'

file_url = 'https://www.cbc.ca/news/politics/trudeau-carbon-tax-global-1.6233936'

file_name = file_url.replace('https://www.cbc.ca/news/','').replace('/','_') + '.html'

soup = BeautifulSoup(open(file_path_html + file_name, encoding='utf8').read(), 'html.parser')

publish_date_raw = soup.find('time', class_='timeStamp')['datetime'][:-5]
publish_date = datetime.strptime(str(publish_date_raw), '%Y-%m-%dT%H:%M:%S')
vf_comments = soup.find('div', class_='vf-comments')
vf_comment_threads = soup.find_all('div', class_='vf-comment-container')
vf_usernames = soup.find_all('button', class_='vf-username')

# create comment data list of lists
comment_data = []
replies = []

for thread in vf_comment_threads:
    # children = data_ids.findChildren()
    # div_data_id = soup.find('div', class_='vf-comment')
    data_id = thread['data-id']
    username = thread.find('button', class_='vf-username').get_text()
    comment_time_str = thread.find('span', class_='vf-date').get_text().replace('s ago', '')
    comment_time_int = int(re.sub('[^0-9]', '', comment_time_str))
    # Convert the relative "x minute/hour/day(s) ago" value to elapsed minutes
    elapsed_minutes = 0
    if 'minute' in comment_time_str:
        elapsed_minutes = comment_time_int
    elif 'hour' in comment_time_str:
        elapsed_minutes = comment_time_int * 60
    elif 'day' in comment_time_str:
        elapsed_minutes = comment_time_int * 60 * 24
    comment_text_raw = thread.find('span', class_='vf-comment-html-content').get_text()
    comment_time = publish_date - timedelta(minutes=elapsed_minutes)
    if 'Reply to @' in comment_text_raw:
        comment_type = 'reply'
        replied_to_user = comment_text_raw.split(": ",1)[0].replace('Reply to @', '').strip()
        try:
            comment_text = comment_text_raw.split(": ",1)[1].strip()
        except:
            comment_text = 'no text'
    else:
        comment_type = 'parent'
        replied_to_user = ''
        comment_text = comment_text_raw.strip()

    comment_data.append((
        data_id, 
        publish_date, 
        comment_time,
        username, 
        comment_type, 
        replied_to_user, 
        comment_text, 
        file_name.replace('.html', ''), 
        file_url))

df_comment_data = pd.DataFrame(
    list(comment_data), 
    columns=[
    'data_id', 
    'publish time', 
    'comment_time', 
    'comment_user', 
    'comment_type', 
    'replied_to_user', 
    'comment_text', 
    'file_name',
    'file_url'])

df_comment_data.to_csv(
    file_path_csv + file_name.replace('.html', '.csv'), 
    encoding='utf-8', 
    index=False)

Now that we have created a nice tabular format csv data file, we can analyse the data using our favourite data analysis and visualization tools.

Read this other post that describes some of the data analysis and visualization done on this data.

 

Xero.com data and reporting

Once upon a time, I was using Xero to record and manage a company’s finances and wanted to use Zebra BI for Power BI to create interactive income statement actual vs plan vs budget reporting.

Zebra BI Features

Zebra BI offers very compelling visualization tools to create financial reporting in Power BI and Excel, specifically for comparing actuals vs plan and budget. Zebra BI’s product is a Power BI and Excel plugin. Using it in Power BI is as simple as choosing Zebra BI as a visualization in the same way as any other Power BI chart object.

Zebra BI Example Visualizations

Xero API

The primary challenge to using Power BI with Xero is deciding the best way to get data out of Xero so that it can be used by Power BI.

Xero has an API that provides access to almost all of the Xero data. The API can be used directly or indirectly via many third parties. However, the API features and limitations require careful  consideration to determine if they meet your use case.

The Xero API has some good documentation as well as an interactive API explorer that can be used in a no-code way to learn about the API endpoint features and data returned.

Two methods of using the Xero API:

    • Directly from Xero – The API can be used directly by registering for API access and developing your own code and process.
    • Indirectly via third parties – There are also a variety of third parties, including a range of SaaS platforms and tools, that connect to the API and retrieve Xero data on your behalf. These provide a wide range of methods and Xero data “modelling” and transformation to enable it to be readily used for financial reporting.

I ended up conducting a selection process to identify the best solution to get Xero data which considered a variety of Xero API and third party limitations.

Requirements Summary

    • Multi-year Xero data retrieval and analysis eg 2019, 2020, 2021.
    • Use Power BI to enable Zebra BI reporting.
    • Data must include actuals, budget and multiple forecasts, and ability to create previous year metrics.
    • Method to allow ad-hoc “account mapping” eg ability to categorize accounts other than Xero tracking categories eg group all travel as “Travel”.
    • Entirely automated to minimize manual effort required. This includes retrieval and transformation of Xero data and Power BI report refresh.

Xero API Limitations

    • API 11 month per call limit – need workarounds to get more.
    • Scheduling – need method to schedule data refresh.
    • Update existing data – either delete all or update existing data.
    • Data modelling – Xero Report API’s provide Xero modelling however additional modelling required to get desired datasets.
    • Budget data – need a method to incorporate Xero budget data. Xero is only recently providing budget data via API.
    • Forecast data – need a method to incorporate forecast data. Xero does not have forecast data functionality. However, a workaround is to create a budget that has forecast data.
    • Account grouping – how to categorize accounts other than Xero tracking categories eg group all travel as “Travel”.
    • Implementation & support complexity – code, technical mix, processes, components, environments.
    • All methods require extraction of Xero data into some intermediary storage location eg a datawarehouse or datamart.

Summary

The graphic below summarizes the selection investigation results and comparison, and highlights my top two choices that best satisfied my requirements.

Note this comparison doesn’t include costs. However, generally speaking, these solutions have more or less similar costs.

My requirements were best met by Acterys, which provides Xero API data ETL, an advanced data warehouse, and scheduling and user administration capabilities. Of the remaining solutions, FreshBI was ranked second.

Note that there are a lot of other potential solutions including the many Xero application partners  or the many cloud data integration tools but these were the ones that I found to be most useful.

Solution Details

FreshBI Power BI Custom Connector for Xero Accounting was initially selected primarily because it was quick and easy way to get Xero API data.

    • The FreshBI team has developed their own custom Power BI Xero connector that they sell for a fixed price. Once you have the connector you are free to modify it as required which is what I did.
    • The FreshBI connector connects to API and returns data to Power BI as Power BI queries. The connector can be accessed and the query code modified as desired to get data into desired shape or to modify it to retrieve multiple years of data.
    • However, the code required non-trivial modification to get the desired data. The original connector retrieves only 1 year of pivoted-format data. I modified the query code to unpivot the data months from columns to rows, but stopped before making further modifications to retrieve more than 1 year of data. (Interesting note: based on my discussions with FreshBI, they said they may modify the connector to provide this capability, which they would make available as an update to the purchased connector.)
    • The connector cannot refresh from Xero in PBI Online. Only Microsoft ‘certified’ connectors have this capability, and FreshBI said Microsoft will certify their connector only if it has many more users. Hence, a Power BI pbix file published to PBI Online can only refresh data by using the PBI Gateway, which is a free app that runs the connector on a laptop or server and syncs the Xero data to the PBI Online dataset.
    • Support complexity was deemed high due to the modifications and the multiple components and environments.

Acterys was ultimately selected because it provides the best supported, most robust Xero data solution.

    • Acterys has developed a sophisticated modern BI data modelling and process which meets all data requirements eg actuals, budget, forecast, relational star schema, process to import budget and forecast, modify existing datasets.
    • Includes Microsoft Azure SQL Server database for Xero data storage as reporting data source.
    • The Xero data refresh process includes scheduling, status email notifications, and updating of previously retrieved data. Changes made to data within the past 7 days are updated in the Acterys database.
    • Support complexity was deemed medium; though the Acterys solution is quite complex, it is built and supported by Acterys for all of their customers, which reduces end user support complexity.

ODataLink and CData were not selected as they are essentially alternate methods of the same API data acquisition as the FreshBI connector, and do not offer additional data modelling beyond the Xero Report APIs. Interesting note: as a result of my discussions with them, ODataLink added new functionality to retrieve more than 1 year of data.

Synchub was not selected because, while it provides excellent data syncing capabilities, it does not have any data modelling, providing only raw Xero database table data.

Xero PBI App is a Power BI application that is available in the Microsoft Power BI app store. It is best considered a Xero “showcase”. While the reports can be modified, or new reports created using the dataset, the data covers only a rolling 1 year period, and it is not possible to access the report pbix file to inspect or modify the dataset, so it is limited to “showcase” reporting.

Hopefully this write-up was helpful when deciding how to get data out of Xero for analytical and reporting purposes. It seemed to me while researching this topic that there is still a lot of opportunity to provide Xero or other ERP / financial application data integration and reporting capability.

Canadian Government First Nations long term water advisory data

The Government of Canada is working with Canadian First Nations communities to end long-term drinking water advisories which have been in effect for more than 12 months since November 2015. This includes projects to build or renew drinking water infrastructure in these communities.

The First Nation drinking water advisories were a big part of the 2021 Canadian federal election campaigns. The current government had made big promises to fix all of the advisories in 2015. The opposition criticized them for not having achieved that goal to-date. However there was poor communication about the actual number and status of the advisory projects. So I wanted to find some data to learn the actual status for myself.

TLDR: the results can be seen on a Github.io hosted page. A screenshot is provided below. The code to retrieve and transform the data and create that web page is in my Github https://github.com/sitrucp/first_nations_water repository.

Finding the data I needed took only a few minutes of Google searching, which led to an Indigenous Services Canada website that maps the water advisories with counts of advisories by First Nations community, project status, advisory dates, and project type.

While the web page also included a link to download the map data, after a bit of web page scraping and inspection, I learned that the downloadable map data was created by web page Javascript from another larger data source referenced in the map page code https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt.

This larger data source is a text file containing JSON format records of all Canadian First Nations communities and is many thousands of records. As a side note this larger file appears to be an official Canadian government dataset used in other government websites. The map data is limited to only about 160 First Nations communities with drinking water issues.

So rather than download the 160 record map page data, I retrieved the larger JSON format text file and used similar logic as that for the map web page code to get the 160 records. This was done in two steps.

Step 1: retrieve the JSON format text file using a Python script water_map_data.py, which downloads and saves the data file locally. I may yet automate this using a scheduled task so the map gets regularly updated data as advisory status changes over time.
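
For illustration, the retrieval step only needs a few lines of Python, roughly like the sketch below (the output file name is my assumption; the URL is the larger data source referenced above).

import requests

url = ("https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/"
       "texte-text/lTDWA_map_data_1572010201618_eng.txt")

# Download the JSON format text file and save it locally for the web page and Excel file to use
response = requests.get(url)
response.raise_for_status()

with open('lTDWA_map_data_eng.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)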

Step 2: process the saved data file and present it in an HTML web page as Plotly.js charts and an HTML tabular format using Javascript in this file  first_nations_water.js

Finally, I also separately created an Excel file with a Pivot Table and Chart that you can download and use to do your own analysis. Download this Excel file from the Github repository. The file contains an Excel Power Query link to the larger JSON text file described above. You can simply refresh the query to get the latest data from the Indigenous Services Canada website.

COVID-19 Data Analysis and Visualization Summary

This is a list of Canadian COVID-19 related data analysis and visualization that I created during 2020/21 pandemic.

Canada COVID-19 Case Map – COVID-19 new and total cases and mortalities by Canadian provincial health regions.
https://sitrucp.github.io/canada_covid_health_regions

Canada COVID-19 Vaccination Schedule – Canada’s current and historical vaccine doses administered and doses distributed. Also includes two distribution forecasts: 1) based on Canadian government vaccine planning and 2) based on Sep 30 2021 goal to vaccinate all 16+ Canadians. Uses COVID-19 Canada Open Data Working Group data.
https://sitrucp.github.io/covid_canada_vaccinations

Canada COVID-19 Vaccination vs World – Canada’s current and historical ranking vs all countries in the Our World in Data coronavirus data by total doses, daily doses and total people vaccinated.
https://sitrucp.github.io/covid_global_vaccinations

Global COVID-19 Vaccination Ranking – Ranking of all countries in the Our World in Data coronavirus data by daily vaccine dose administration. Includes small visualization of all time, population, vaccines used and trend. Can sort by these measures.
https://sitrucp.github.io/covid_world_vaccinations

COVID-19 New Case Rate by Canadian Health Regions Animation – SVG animation of new cases visualized as daily rate for each Canadian provincial health regions. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_canada

COVID-19 New Case Rate by Country Animation – SVG animation of new cases visualized as daily rate for each country in the Our World in Data dataset. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_world

COVID-19 New Case Rate by US State Animation – SVG animation of new cases visualized as daily rate for each US state. Like a horse race, faster moving dot means higher daily rate.
https://sitrucp.github.io/covid_rate_us

Apple Mobility Trends Reports – Canadian Regions Data – Apple cell phone mobility data tracking data used to create heat map visualizations of activity over time.
https://sitrucp.github.io/covid_canada_mobility_apple

WHO Draft landscape of COVID-19 Candidate Vaccines – AWS Textract used to extract tabular data from WHO pdf file. Python and Javascript code then used to create webpages from extracted data.
https://sitrucp.github.io/covid_who_vaccine_landscape

Montreal Confirmed COVID-19 Cases By City Neighbourhoods – Code and process used to scrape Quebec Health Montreal website to get COVID-19 case data for Montreal city boroughs.
https://github.com/sitrucp/covid_montreal_scrape_data

Use Excel Power Query to get data from Our World In Data – How to use Excel’s Power Query to get Our World in Data Github csv files automatically and update with simple refresh.
https://009co.com/?p=1491

 

TendiesTown.com – WallStreetBets gain and loss analysis

For a 16 month period, from February 2020 to June 2021, TendiesTown.com used the Reddit API in an automated process to retrieve r/wallstreetbets posts tagged with “Gain” or “Loss” flair. The subReddit r/wallstreetbets is a stock and option trading discussion board that gained popularity during the 2020/2021 “meme stock” media frenzy.

The retrieved posts were automatically categorized by gain or loss, post date, Reddit username, trade verification image url, and post url.

These posts were then manually processed to identify the gain or loss amount and reported stock ticker(s).

As of June 9, 2021, there were 4,582 trades with 2,585 gains and 1,997 losses recorded. Gain amounts totalled USD $389 million and losses USD $117 million.

While 6,844 posts were retrieved from the subreddit, 2,262 (33%) of these were rejected because they were either duplicate posts, not actual gain or loss trades (per r/wallstreetbets flair guidelines) or it was not possible to identify the amount.

The trades and more detailed analysis are available on the tendiestown.com website. As of June 9, 2021, the manual processing is no longer being done, so no new data will appear on the site.

“Gain” and “Loss” flair are described in more detail in the community flair guidelines but essentially “Gain” = trade made money and “Loss”= trade lost money.

Table 1: Summary statistics

            Gains          Losses
Count       2,585          1,997
Sum         $389,329,571   $117,347,985
Avg         $150,611       $58,762
Median      $24,138        $12,120
Min         $138           $100
Max         $8,000,000     $14,776,725

The 4,582 trades were made by 3,903 unique Reddit users. 492 of these Reddit users have more than one trade as shown in table 2 below.

Table 2: Trade counts by unique Reddit users

# trades    Count of Reddit users
1           3,411
2           378
3           73
4           22
5           11
6           5
7           1
8           2

 

Charts 1 & 2: Daily and cumulative counts

Bar charts daily count of gain and loss

Bar charts cumulative count of gain and loss

Charts 3 & 4: Daily and cumulative amounts

Bar charts daily amount of gain and loss

Bar charts cumulative amount of gain and loss

Notable observations

Early 2021 increase in trades

The count and amount of trades rose significantly in the early months of 2021. This can be explained by the surge of new users to the r/wallstreetbets subreddit due to the huge increase in popular media reporting surrounding the GME and AMC meme stock craze.

Gains greater than losses

Gains consistently led losses over the 16-month period. Rather than simply concluding that r/wallstreetbets traders gain more than they lose, it is suspected that the difference can be explained by the fact that it is easier to tell the world that you won, and harder to admit that you lost.

Use Excel Power Query to get data from Our World In Data

Our World In Data (OWID) has been doing a hero’s job of collating the world’s covid vaccine distribution and administration data. They generate additional metrics and allow site visitors to analyse them in awesome interactive visualizations.

However, I wanted to do some analysis that went beyond what was possible on the site. No problem, because the OWID team also provides the data as csv files in their Github repository.

These csv files are easily manually downloaded from Github to be used. But as the OWID vaccine data is updated regularly, often twice per day, you will probably want to automate the retrieval of updated csv files from Github.

There are many tools and methods to automate this but in this blog post I want to quickly highlight how Excel users can use Excel’s Power Query Get Data->From Other Sources->From Web feature to link to the csv files in Github. (Read my other blog post to learn more about Power Query.)

Once implemented you only need to click refresh to get latest csv data from OWID’s Github repository.

First, you need to get a url for the csv file in the OWID Github repository.

The easiest way to get correct url is to open the csv file in Github in Raw format by clicking on the file name and then clicking on the Raw button on top right corner of page. This will open the csv file in native “raw” format. Then you can simply copy the url from browser location bar.

You can also manually create a url to any Github repository file as follows below. Let’s use the vaccinations.csv file in OWID’s Github repository as an example.

https://raw.githubusercontent.com/owid/COVID-19-data/master/public/data/vaccinations/vaccinations.csv

The vaccinations.csv Github repository file url has the following parts:

    • Base url: https://raw.githubusercontent.com
    • Repository owner: owid
    • Repository name: COVID-19-data
    • Repository branch: master
    • Repository base element: public
    • Repository folders: data/vaccinations
    • Repository file name: vaccinations.csv

Based on this example you should be able to manually create a url for any file in any Github repository.

Once you have your url, you can test if it is correct by simply copy pasting it into a browser location bar and verifying the url is good, and retrieves the data you are looking for.

After verifying the url is correct, go to your Excel file, find the Excel Power Query Get Data feature and select From Other Sources->From Web, which will show an input field for a url. Paste your url into the input field and simply click Load, which will load the csv file data into an Excel worksheet.

Now instead of manually downloading data from Github, you can simply click refresh and automatically retrieve the most recent csv file from the OWID Github repository.

Here is full M-code for a Query connected to the OWID Github “vaccinations.csv”. You can copy the code below and paste it into a blank Query Advanced Editor and it will return OWID data to your Excel file.

 let
Source = Csv.Document(Web.Contents("https://raw.githubusercontent.com/owid/COVID-19-data/master/public/data/vaccinations/vaccinations.csv"),[Delimiter=",", Columns=12, Encoding=65001, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"location", type text}, {"iso_code", type text}, {"date", type date}, {"total_vaccinations", Int64.Type}, {"people_vaccinated", Int64.Type}, {"people_fully_vaccinated", Int64.Type}, {"daily_vaccinations_raw", Int64.Type}, {"daily_vaccinations", Int64.Type}, {"total_vaccinations_per_hundred", type number}, {"people_vaccinated_per_hundred", type number}, {"people_fully_vaccinated_per_hundred", type number}, {"daily_vaccinations_per_million", Int64.Type}})
in
#"Changed Type"

 

@SirPatStew #ASonnetADay dashboard

Sir Patrick Stewart @SirPatStew was doing Shakespeare Sonnet readings from his home during the COVID-19 lockdown and they were really good.

I wanted to track the sonnet reading tweets’ like and retweet counts over time and show this to other @SirPatStew fans. I also thought it would be very helpful to provide an automatically updated list of links to the sonnet tweets.

So I created a daily automated job that retrieved Sir Patrick’s Twitter data and saved it in the cloud. I also created a web page that includes a visualization showing the #ASonnetADay hashtag tweets’ like and retweet counts (screenshot below), along with a table listing the tweets in chronological order with like and retweet counts as well as a link to each tweet.

@SirPatStew has long finished posting new #ASonnetADay tweets, but they continue to get visitors, and like and retweet counts continue to increase. The automated daily job is still running and the visualization continues to be updated.

View visualization: https://sitrucp.github.io/sir_pat_sonnet_a_day_tweets/

Code hosted on Github: https://github.com/sitrucp/sir_pat_sonnet_a_day_tweets

Data was retrieved from Twitter using the Twitter API and Tweepy, and the visualization was created using a Plotly.js dot plot and is hosted on Github.io.
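
As an illustration, the daily retrieval can be done with a few Tweepy calls along the lines of the sketch below. This uses Tweepy v3-style API calls; the credential placeholders and filtering are my assumptions, not the exact job code in the repository.

import tweepy

# Twitter API credentials (placeholders)
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect @SirPatStew tweets tagged #ASonnetADay with their like and retweet counts
sonnet_tweets = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name='SirPatStew',
                           tweet_mode='extended').items():
    if '#ASonnetADay' in tweet.full_text:
        sonnet_tweets.append({
            'created_at': tweet.created_at,
            'likes': tweet.favorite_count,
            'retweets': tweet.retweet_count,
            'url': 'https://twitter.com/SirPatStew/status/' + tweet.id_str,
        })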

The Tweet data was used to create the following:

    • A categorical dot plot of each sonnet’s tweet like and retweet counts.
    • A tabular list of #ASonnetADay tweets with links to tweet to allow others to easily find and watch them.

 

The human radio-broadcast bubble

“Space… is big. Really big. You just won’t believe how vastly, hugely, mindbogglingly big it is” (quote from The Hitchhiker’s Guide to the Galaxy)

One measure of  space is the size of a “bubble” created by radio waves radiating outwards from the Earth over the past 139 years since they were first intentionally produced by humans in the 1880’s. Wikipedia: History of radio

The image below, created by science blogger Adam Grossman, illustrates this “bubble” of radio waves as a tiny blue dot on an artist visualization of the Milky Way galaxy.

The human radio-broadcast bubble compared to Milky Way galaxy.

The visualization has an inset in the bottom-right with a close-up of the Earth’s region of the Milky Way galaxy to make the blue dot more readily visible. This shows that the blue dot is very small compared to the Milky Way.

However, that tiny blue dot is actually an enormous sphere with a radius of 139 light-years Wikipedia: light-year with the Earth at its center, and the sphere’s radius grows another light-year every year. For the non-science people, those radio waves are travelling at the speed of light!

Tiny blue dot is actually an enormous sphere with a radius of 139 light-years and the Earth at its center!

The blue dot is very small compared to the Milky Way and even smaller when compared to the universe.

The distance from Earth to the edge of the observable universe is 46 billion light-years Wikipedia: Universe and the observable universe contains about 200 billion galaxies according to latest Hubble observations Nasa: Hubble.

Feeling small yet? Even though that blue dot is arguably the biggest thing humans have created, its 139 light-year radius is about 331 million times smaller than the 46 billion light-year distance to the edge of the observable universe.

Also as a final note, it is not likely that the radio-waves are actually detectable, at least not by our current technology, after travelling 139 light-years. But, perhaps somewhere out there, aliens with some magic technology are entertaining themselves by listening to human radio and tv broadcasts, air traffic controllers, police and EMT calls, and cell phone conversations. Nice thought .. scary thought?

Also this image has been used often without attribution to its creators. This planetary.org blog post gives more history of the creation of this image in 2011 by Adam Grossman and Nick Risinger. The original blog post by Adam Grossman can be found archived on github.io and it was featured on YCombinator’s Hacker News here.


Amazon AWS Transcribe to get 2020 presidential debate #1 speaker segments

TLDR: I used Amazon Transcribe to transcribe the first presidential debate audio that included timestamps for each word, to create the following speaker timeline visualization (created using a Plotly timeline chart). Click image to view full size visualization.

After watching the US 2020 Presidential Debate #1, I was curious to see if there was an automated way to identify when a debater was interrupted while speaking during their 2 minutes of allotted time.

I envisioned a timestamped transcription that could be used to create a timeline of each speaker talking, identifying overlaps where one speaker was talking and a second speaker started during that talking ‘segment’.

Unfortunately Amazon Transcribe purposefully shifts all of the transcribed words’ start times to eliminate overlapping word time periods. Therefore, it wasn’t possible to get the data I wanted to satisfy my curiosity using Amazon Transcribe.

It may be possible to infer overlapping speaker talking and interruptions with multiple small interleaving speaker segments but that would be hard to distinguish from two people having a conversation with the Amazon Transcribe results. Might investigate alternative automated transcription methods and make new post. TBD.

Here is a link to the Github repository containing the code.

Getting Debate Audio

I used youtube-dl to download the debate audio from a CSPAN video recording of the debate which was on YouTube. The audio produced by youtube-dl was an mp3 file. I used Libav to trim off the first ~30 minutes of the audio, as it was not the actual debate but pre-debate stage prep.

Using Amazon Transcribe

I used Amazon Transcribe to create a transcription of the debate audio.

Amazon Transcribe can only process audio files that are stored in an AWS S3 bucket. Uploading the file and running the Amazon Transcribe job were done using Python and AWS Boto3 SDK.
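
A minimal boto3 sketch of those two steps is below. The bucket, file and job names are placeholders; the actual script is in the Github repository linked above.

import boto3

s3_bucket = 'debate-audio-bucket'    # placeholder bucket name
audio_file = 'debate1_trimmed.mp3'   # placeholder local file name

# Upload the audio file to S3 (Transcribe can only read from S3)
s3 = boto3.client('s3')
s3.upload_file(audio_file, s3_bucket, audio_file)

# Start the transcription job with speaker identification enabled
transcribe = boto3.client('transcribe')
transcribe.start_transcription_job(
    TranscriptionJobName='debate1-speaker-segments',
    Media={'MediaFileUri': 's3://' + s3_bucket + '/' + audio_file},
    MediaFormat='mp3',
    LanguageCode='en-US',
    Settings={'ShowSpeakerLabels': True, 'MaxSpeakerLabels': 3},
)

# Check the job status (in practice, poll until it reports COMPLETED)
status = transcribe.get_transcription_job(TranscriptionJobName='debate1-speaker-segments')
print(status['TranscriptionJob']['TranscriptionJobStatus'])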

The Transcribe transcription output is a JSON file that contains “segments” for each identified speaker in the audio. These segments identify the speaker and have start and end times. Amazon Transcribe seemed to be pretty good at identifying the speakers in the audio. The transcription itself was not perfect.

The output JSON file is saved in a new S3 bucket created by Transcribe. The JSON file contains the following content and high-level structure:

    • jobName – the job name specified for the transcription.
    • accountId – Amazon account or IAM account?
    • results – contains the elements below:
      • transcripts – the complete text of the audio transcription.
      • speaker_labels – contains the elements below:
        • speakers – the number of speakers specified for the transcription.
        • segments – one or more time-based segments by speaker. Each has a start and end time and a speaker label.
          • items – each segment has one or more time-based items. Each has a start and end time and a speaker label, but does not include the word itself.
      • items – a separate section with one item for each word or inferred punctuation mark. Includes the word along with alternatives and a confidence value for each word. Has a start and end time, but no speaker label.
    • status – the status of the transcription job, eg in-progress, failed or completed.

Processing Transcription

The Transcribe output JSON file was read into a Pandas dataframe which was used as the data source for the Plotly timeline chart shown above.

The timeline chart data was created from the ['results']['speaker_labels']['segments'] elements, which identify the speaker and have segment start and end times. The x-axis was populated by the segment timestamps and the y-axis was populated by categorical values of the speaker names.

An important data transformation was required because a Plotly timeline chart needs datetime values for the period start and end on the x-axis, while the Transcribe output JSON file only has start and end times expressed as elapsed seconds from the beginning of the audio.

Therefore, the elapsed seconds were transformed into "fake" dates by adding an arbitrary date (in this case "1970-01-01") to an "HH:mm:ss" value created from the JSON file's seconds values.

The Plotly timeline chart formatting was set to create nice vertical bars for each speaker segment.
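Here is a minimal sketch of that processing, assuming the Transcribe output JSON has been downloaded locally (the file name is an illustrative assumption):

import json
import pandas as pd
import plotly.express as px

# load the Transcribe output JSON (illustrative file name)
with open('asrOutput.json') as f:
    data = json.load(f)

# build a dataframe from the speaker_labels segments
segments = data['results']['speaker_labels']['segments']
df = pd.DataFrame([
    {
        'speaker': s['speaker_label'],
        'start_sec': float(s['start_time']),
        'end_sec': float(s['end_time'])
    }
    for s in segments
])

# Plotly timeline needs datetimes, but Transcribe gives elapsed seconds,
# so anchor the seconds to an arbitrary "fake" date
base = pd.Timestamp('1970-01-01')
df['start'] = base + pd.to_timedelta(df['start_sec'], unit='s')
df['end'] = base + pd.to_timedelta(df['end_sec'], unit='s')

# one horizontal bar per segment, one row per speaker
fig = px.timeline(df, x_start='start', x_end='end', y='speaker')
fig.show()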

D3.js SVG animation – COVID-19 rate “race” visualization

This visualization shows COVID-19 new cases as a “race” of dots moving from left to right.

The dot’s “speed” or how long it takes to move from left to right is based on the number of cases per day.

If a country has one case per day, it will take an entire day for the dot to move from left to right. Some countries have many thousands of new cases daily, and the dot moves from left to right in minutes or seconds.

There are three visualizations, for the following geographical regions (click "viz" to view each visualization and "github code" to view its code):

The screenshot below shows the countries of the world. Countries that have not had any new cases over the past 7 days are shown as gray. Those that have had new cases over the past 7 days are shown as a white circle (no change from the previous 7 days), red (increase from the previous 7 days) or green (decrease from the previous 7 days).

The visualization is sorted by country name by default, but sorting can be changed to average new cases. In addition, you can toggle between showing new cases as an actual count or as new cases per million population.

The visualization uses D3.js and SVG to create a canvas for each location with the location name text, counts, circle shape and transitions, and to retrieve the csv file and process the data, including filtering to the most recent 7 days and grouping by location to get mean case counts.

The most important aspect for this visualization was how to use D3.js to animate the movement of the white circle across the canvas, and how to repeat the movement in an ‘endless’ loop.

The code block below highlights a function that uses D3.js .on("end", repeat); to call the repeat function 'endlessly', so the shape moves across the canvas, snaps back to its original position, and moves across the canvas again and again. See the bl.ocks.org 'Looping a transition in v5' example.

The duration() value is the proxy for rate in this visualization and is calculated in another function separately for each location SVG. I also added a counter that would increment an SVG text value to show each loop’s count on canvas.

// repeat transition endless loop
function repeat() {
    svgShape
    .attr("cx", 150)
    .transition()
    .duration(cycleDuration)
    .ease(d3.easeLinear)
    .attr("cx", 600)
    .transition()
    .duration(1)
    .attr("cx", 150)
    .on("end", repeat);
    
    svgTextMetric
    .text(counter + ' / ' + metric);
    counter++;
  };

This visualization was inspired by Jan Willem Tulp’s COVID-19 spreading rates and Dr James O’Donoghue’s  relative rotation periods of planets, and uses same data as Tulp’s spreading rates.

Amazon AWS Textract – WHO “Draft landscape of COVID-19 candidate vaccines” – convert PDF to csv

TLDR: I extracted the text from the WHO's vaccine candidate PDF file using AWS Textract and turned it into a set of interactive web pages. View the AWS Textract PDF extract output csv files in this Github repository, and view and interact with the web pages here.

The World Health Organization (WHO) maintains a regularly updated PDF document named Draft landscape of COVID-19 candidate vaccines which contains all COVID-19 vaccine candidates and treatments currently being developed and their status.

2020-01-03 EDIT: Note that the WHO is now providing an Excel file which contains the same data previously contained in the PDF file referred to in this post.

The main content in this PDF document is in a tabular format with one vaccine or treatment per row. There is also some non-tabular text content, including introduction text and footer notes.

I wanted a machine-readable version of this PDF document's table data so I could do some analysis. This meant I needed to do PDF text extraction, and there are lots of solutions for that. I ended up using Amazon Textract to extract the PDF tables into csv format.

“Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables”

Using Textract

You need an AWS account to use Textract, and the service is not free (see the costs for this analysis at the bottom of the post).

The Textract service has a UI that you can use to upload and process documents manually, ie without code. However, it is also included in the AWS Boto3 SDK, so you can use code or command-line automation with Textract too.

I used the manual UI for this one-time processing of the PDF. However, if I were to automate this to regularly extract data from the PDF, I would use Python and the Boto3 SDK. The Boto3 SDK Textract documentation and example code are here.
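A minimal sketch of what that automation might look like with Boto3 is below; the bucket and document names are illustrative assumptions, and result pagination is omitted for brevity.

import time
import boto3

textract = boto3.client('textract')

# illustrative S3 location of the WHO PDF
bucket = 'my-textract-input-bucket'
document = 'who-draft-landscape-covid-19-candidate-vaccines.pdf'

# start an async analysis job that extracts tables and form fields
response = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': document}},
    FeatureTypes=['TABLES', 'FORMS']
)
job_id = response['JobId']

# poll until the job finishes
while True:
    result = textract.get_document_analysis(JobId=job_id)
    if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(10)

# 'Blocks' contains TABLE, CELL and WORD blocks to reassemble into csv rows
# (use the NextToken value to page through large results)
blocks = result['Blocks']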

Use this link to go directly to the AWS Textract UI once you are logged into your AWS console.

The Textract UI is quite intuitive and easy to use. You manually upload your PDF file, Textract processes it and shows you the interpreted content described as "blocks" of text, tables and images, and then you can select which blocks to extract from the document.

During this process, Textract asks permission to create a new S3 folder in your account, where it uploads the PDF before processing it. This is because Textract will only accept documents from S3.

Screenshot of Textract UI

The Textract PDF extract output is a zip file containing a bunch of files that is automatically downloaded to your computer. The zip file contained the files listed below.

These 3 files appear to be standard information for any AWS Textract job.

    • apiResponse.json
    • keyValues.csv
    • rawText.txt

The rest of the AWS Textract output will vary depending on your document. In this case it returned a file for each table in the document.

    • table-1.csv
    • table-2.csv
    • table-3.csv
    • table-4.csv
    • table-5.csv
    • table-6.csv
    • table-7.csv
    • table-8.csv
    • table-9.csv

To process the AWS Textract table csv files, I imported them into a Pandas dataframe. The Python code used is in Github.

There was some minor clean-up of the OCR/interpreted tabular data, which included stripping trailing white space from all text and removing a few blank table rows. In addition, the PDF tables had two header rows that were removed and manually replaced with a single header row. There were also some minor OCR mistakes, for example some zeros were rendered as the capital letter 'O' and some words were missing their last letter.
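A minimal sketch of this clean-up, assuming the table csv files are in the working directory (the column used for the zero/'O' fix is just an illustration):

import glob
import pandas as pd

# read all the Textract table csv files into one dataframe
frames = [pd.read_csv(f, header=None, dtype=str) for f in sorted(glob.glob('table-*.csv'))]
df = pd.concat(frames, ignore_index=True)

# strip leading/trailing white space from all text cells
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# remove blank rows left over from the PDF layout
df = df.dropna(how='all')

# fix a common OCR mistake: capital letter 'O' where a zero was meant
# (column index 3 is just an illustration)
df[3] = df[3].replace('O', '0')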

The columns in the WHO Draft landscape of COVID-19 candidate vaccines PDF tables are shown below. Textract did a good job of capturing these columns.

Vaccine columns:

    • COVID-19 Vaccine developer or manufacturer
    • Vaccine platform
    • Type of candidate vaccine
    • Number of doses
    • Timing of doses
    • Route of administration
    • Stage – Phase 1
    • Stage – Phase 1/2
    • Stage – Phase 2
    • Stage – Phase 3

Treatment columns:

    • Platform
    • Type of candidate vaccine
    • Developer
    • Coronavirus target
    • Current stage of clinical evaluation/regulatory -Coronavirus candidate
    • Same platform for non-Coronavirus candidates

Cost for processing the 9-page PDF file 3 times:

I copied the costs from AWS Billing below to give readers some idea of what Textract costs.

Amazon Textract USE1-AsyncFormsPagesProcessed $1.35
AsyncPagesProcessed: 0-1M pages of AnalyzeDocument Forms, $50 USD per 1000 pages, 27 pages, $1.35

Amazon Textract USE1-AsyncTablesPagesProcessed $0.41
AsyncPagesProcessed: 0-1M pages of AnalyzeDocument Tables, $15 USD per 1000 pages, 27 pages, $0.41

Amazon Textract USE1-SyncFormsPagesProcessed $0.30
SyncPagesProcessed: 0-1M pages of AnalyzeDocument Forms, $50 USD per 1000 pages, 6 pages, $0.30

Amazon Textract USE1-SyncTablesPagesProcessed $0.09
SyncPagesProcessed: 0-1M pages of AnalyzeDocument Tables, $15 USD per 1000 pages, 6 pages, $0.09

HTML tabular presentation of Textract output data

View it here.

 

Periodic chart elements by origin from SVG using Python

This cool periodic chart of the elements shows the source/origin of the chemical elements. Source: Wikipedia, created by Cmglee.

It was really interesting to learn that elements may be created from more than one source/origin. The sources/origins are listed below:

    • Big Bang fusion
    • Exploding white dwarfs
    • Exploding massive stars
    • Cosmic ray fission
    • Merging neutron stars
    • Dying low-mass stars
    • Human synthesis

After learning this, I wanted to see counts of elements by origin. At first I thought I might have to do some manual data entry from the graphic.

However, after a bit of digging, it turned out that the author of the SVG file shown above had embedded the data I wanted, along with the Python code necessary to create the SVG, inside the SVG file itself, which is very cool!

With some minor modification to the SVG file Python code I was able to extract the data into a csv data file and then use that as data source for the visualizations of counts of elements by origin below.

Read more about the SVG and the Python code modifications that I made in this Github repository:  https://github.com/sitrucp/periodic_elements.

The first chart below shows the counts of elements by their source/origin which answers my original question.

It was very interesting to learn that only 4 elements were created by the Big Bang and that all of the rest were created afterwards, by sources/origins that came into being after the Big Bang.

The second chart shows counts of elements by their number of sources/origins. It was also very interesting to learn that 136 elements have more than one source/origin.
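As a rough illustration of how those counts could be produced, here is a minimal sketch; the csv file name and column layout are assumptions (the actual extraction code is in the Github repository).

import pandas as pd

# assumed structure: one row per element, with a semicolon-separated
# 'origins' column listing that element's sources/origins
df = pd.read_csv('periodic_elements.csv')

# chart 1: count of elements by each source/origin
counts_by_origin = (
    df['origins']
    .str.split(';')
    .explode()
    .str.strip()
    .value_counts()
)

# chart 2: count of elements by how many sources/origins they have
counts_by_n_origins = (
    df['origins']
    .str.split(';')
    .str.len()
    .value_counts()
    .sort_index()
)

print(counts_by_origin)
print(counts_by_n_origins)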

Retrieve and process Environment Canada hydrometric and climate data using Python

Recently I needed to get flow and level data for a watercourse hydrological station, as well as regional precipitation data for relevant upstream location(s) within the station's watershed.

The objective was to combine two decades of watercourse flow and level data with regional watershed precipitation data into a single analysis and report.

Environment Canada has all of this data and provides it in a variety of different ways.

Flow and level data

For many users the Water Level and Flow web portal wateroffice.ec.gc.ca site is sufficient. However I needed multiple years’ data so I looked around for alternatives.

I found that the Hydat National Water Data Archive, which is provided as a SQLite database, was a good source. It contains all Canadian watercourse flow and level data up until about a month or so prior to the current date, and was a quick, easy download and a convenient way to get the required data.

Real-time / more recent flow data is available using a nice search interface in the same wateroffice.ec.gc.ca portal.

In addition, this same real-time / more recent data can also be obtained by going directly to an FTP site by clicking the ‘Datamart’ link on wateroffice.ec.gc.ca.

More details about the FTP site content are available here.

Both of these methods result in downloading csv files.

The website portal allows you to search by station name or reference id. The FTP site files are named with station reference id so you need to know the reference id to get the correct csv file.

This hydrological flow and level data is also available through the MSC GeoMet API. The API was a bit too complex to use for this data retrieval and analysis so I passed on it, but it looks very powerful and well documented.

Precipitation data
Precipitation data is available for stations across Canada and this website https://climate.weather.gc.ca/historical_data/search_historic_data_e.html allows search by name, reference id or geographical location.

However, since this only allows download by single month of data, I needed to find another method to more quickly get multiple years and months of data.

When you use the website search and follow the links to the download page you will also see a link to get more historical data. This link brings you to a Google Drive folder. This folder documents how to use their ‘wget’ method to download files. However, I was using Windows and didn’t want to mess around with Cygwin or Windows linux to be able to use wget.

It turned out to be relatively simple to replicate the wget process using Python Requests and then loop through the csv files to process them. Python io.StringIO was used to stream the request content into a 'fake csv' file in each loop, which was appended to a Python list; the list was then converted into a Pandas dataframe so Pandas could be used to process the data.
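Here is a minimal sketch of that loop; the bulk-data url pattern and station id are shown from memory as illustrations, so check the readme in the Google Drive folder for the exact form.

import io
import requests
import pandas as pd

# bulk-data url pattern documented in the Google Drive folder readme (illustrative)
BULK_URL = (
    'https://climate.weather.gc.ca/climate_data/bulk_data_e.html'
    '?format=csv&stationID={station}&Year={year}&Month={month}&timeframe=1'
)

station_id = 51157  # illustrative station reference id
frames = []

for year in range(2000, 2021):
    for month in range(1, 13):
        resp = requests.get(BULK_URL.format(station=station_id, year=year, month=month))
        resp.raise_for_status()
        # stream the response text into a 'fake' csv file object for pandas
        # (depending on the file vintage you may need skiprows for preamble rows)
        frames.append(pd.read_csv(io.StringIO(resp.text)))

df_precip = pd.concat(frames, ignore_index=True)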

The Python code is in my Github ‘Environment-Canada-Precip-and-Flow-Data’ Repository.

Plotly Express Python remove legend title

As of Plotly.py 4.5, Plotly Express no longer puts the = in trace names, because legends now support titles (source).

Prior to Plotly.py 4.5, I had used this ‘hover_data’ trick to remove the ‘=’ from the legend trace names.

hover_data=['gain_loss']).for_each_trace(lambda t: t.update(name=t.name.split("=")[0]))

However, now with Plotly.py 4.5, I want to remove the legend title itself. The new trick to do that is to set an empty string as the legend title in the fig.update_layout() call.

fig.update_layout({
    'legend_title_text': ''
})
This is a much cleaner look for legends where the trace names and your chart title are sufficiently explanatory and a legend title would be superfluous.
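Putting it together, a minimal example using Plotly Express's built-in sample data looks like this:

import plotly.express as px

# built-in sample data, colored by a categorical column
df = px.data.gapminder().query('year == 2007')
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', log_x=True)

# remove the legend title ('continent') that Plotly.py 4.5+ adds
fig.update_layout(legend_title_text='')
fig.show()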

Legend and polygon colors for Leaflet choropleth using Chroma.js

A Leaflet tutorial uses the following hard-coded getColor function to return colors.

// get color 
function getColor(n) {
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c' 
           : n > 25 ? '#fc4e2a' 
           : n > 20 ? '#fd8d3c'
           : n > 15  ? '#feb24c'
           : n > 10  ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';
}

However, I wanted to use Chroma.js to generate the legend colors dynamically. So I needed a new getColor function.

Chroma.js has a variety of methods to return colors. The one I chose uses scale and classes. These can then be sent as variables to a getColor function to return the colors used in the legend and map.

scale can be a single color or an array of two colors (either hex values or color names). In my case, the first is a light blue and the second is a darker blue. Chroma.js will then return gradients between these two colors. See the colorHex variable below.

classes is an array of legend ‘breaks’ for the color gradients. For example they could be the numerical values from the Leaflet tutorial getColor function above (eg 10, 20, 50, etc). See classBreaks variable below.

The new getColor function is shown below:

var classBreaks = [1,50,100,250,500,1000,2000,3000,6000,9000];
var colorHex = ['#deebf7','#08306b'];

function getColor(n, classBreaks, colorHex) {
    var mapScale = chroma.scale(colorHex).classes(classBreaks);
    var regionColor;
    if (n === 0) {
        regionColor = '#ffffff';
    } else {
        regionColor = mapScale(n).hex();
    }
    return regionColor;
}

This getColor function can then be used, as described in the Leaflet tutorial, to set the choropleth polygon fill colors. It can also be used similarly to create the legend by looping through the classes to get a color for each legend entry.

However, there is an important consideration when creating the legend. Using scale and classes, Chroma.js only returns (classes – 1) colors. For example, the classBreaks array above with 10 elements will only return 9 colors. To work around this I push a dummy element (999) onto the array so Chroma.js returns 10 colors, and then ignore the dummy element when creating the legend.

The legend code below includes a hard-coded zero (0) value set to the color white (#ffffff). It loops through classBreaks, each time using the getColor function to return the legend color based on the break value.

var legend = L.control({position: 'topright'});

legend.onAdd = function (map) {
    var div = L.DomUtil.create('div', 'legend');
    // hard-coded zero value shown as white
    div.innerHTML += '<i style="background: #ffffff;"></i>0<br>';
    // add dummy class so chroma returns a color for the last real class
    // (chroma only returns classBreaks.length - 1 colors)
    classBreaks.push(999);
    for (var i = 0; i < classBreaks.length; i++) {
        if (i + 2 === classBreaks.length) {
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
            classBreaks[i] + '+';
            break;
        } else {
            div.innerHTML += '<i style="background: ' + getColor(classBreaks[i], classBreaks, colorHex) + ';"></i> ' +
            classBreaks[i] + '–' + classBreaks[i+1] + '<br>';
        }
    }
    return div;
};
legend.addTo(map);

The final map legend looks like this:

Scraping public health web page using Python, BeautifulSoup4, Requests and Pandas

During the 2020 COVID-19 pandemic in Canada I wanted to get COVID-19 confirmed cases counts data for the city of Montreal.

The data I wanted was made freely available by the Quebec Government’s Health Montreal website in a tabular format that was updated regularly.  EDIT: Health Montreal stopped publishing data so the link no longer works!

I wanted to be able to use this data for this Leaflet choropleth map visualization. If interested, read more details on this visualization in another blog post.

There are many ways to get data from web pages. First I did it manually by copy and paste into Excel. This is ok for a one time analysis. You can even use Excel Power Query web feature to automate this a bit more. However, if you want to fully automate getting data from a web page you should use web scraping techniques.

Note that the code described below is available in this Github repository https://github.com/sitrucp/covid_montreal_scrape_data.

Initial data retrieval and transformation done using Excel
To get the web page data, at first I simply copied and pasted it manually into an Excel workbook. This was quite easy to do, as the tabular format copies and pastes nicely into an Excel grid.

To automate this a bit more and do some more complex data transformations, I switched to using Excel Power Query's web query feature to retrieve the data and Power Query transformations to shape it for the choropleth map visualization.

Full automation and scheduling using Python, cron job and AWS S3
However, this was intended to be an ongoing analysis, so it needed to be fully automated and the data retrieval and transformation process run on a scheduled basis.

In addition to scraping the data from the web page, the data had to be made available somewhere on the internet where the choropleth map visualization could freely access it by a url.

As the choropleth map visualization is hosted on Github.io I could have used Git on the web server to do an automated, scheduled push of new data from web server to the Github repository. I decided to give this a pass and try it some other time.

Instead, I chose to upload the data to a public AWS S3 bucket that the choropleth map visualization could access with a simple url for each data file.

Everything from scraping the website to uploading data to AWS S3 was done in Python. The Python code is run on scheduled basis using a cron job on a web server. The cron job runs a few times each evening when the data is usually updated on the website.

Python, BeautifulSoup4, Requests and Pandas were used to retrieve and transform the web page data and create a JSON file that could be uploaded to the AWS S3 bucket and made available to the choropleth map visualization.

Python module Boto was used to upload the data from web server to an AWS S3 bucket.

Let's go through the code.

BeautifulSoup4 was used to get the web page and find the specific table that holds the tabular data, as below. The table with the counts by neighbourhood was the 4th table in the web page:

import requests
from bs4 import BeautifulSoup

# get health montreal webpage html
url = 'https://santemontreal.qc.ca/en/public/coronavirus-covid-19/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# get all tables on webpage
tables = soup.find_all('table')

# select 4th table in list of tables on webpage
table = tables[3]

Then Pandas is used to read that table into a dataframe and then add more readable consistent column headers as below:

import pandas as pd

# read table into pandas dataframe
df_table_data_all_cols = pd.read_html(str(table))[0]

# rename columns
df_table_data_all_cols.columns = ['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']
df_table_data = df_table_data_all_cols[['region_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The web page table dataframe was merged with the "lookup" dataframe. This merging is basically equivalent to a SQL JOIN:

# join lookup table to scraped data to get geojson_name field to use on map
df_table_data_w_lookup = pd.merge(df_montreal_regions_lookup, df_table_data, left_on='website_name', right_on='region_name', how='left')
df_table_data_final = df_table_data_w_lookup[['website_name', 'region_name', 'geojson_name', 'case_count', 'case_percent', 'case_per_100k', 'mort_count', 'mort_per_100k']]

The lookup table has one row per Montreal neighbourhood with 2 columns: one for the Health Montreal website neighbourhood name and a second for the Leaflet map’s geoJSON geographical region boundary names. This is required because the Health Montreal website neighbourhood names were not identical to the map’s geographical region boundary names.

Of course, I could have modified the map's geographical region boundary names to reflect the Health Montreal naming convention, but creating a "lookup" table was easier and provided flexibility in case the Health Montreal table's names changed (which they did, in fact, several times!).

The Python code does a check to see if the current data on the web page is the same as the previously retrieved data. I only wanted to upload new data to AWS S3 when necessary. This check is done by comparing the total case count on the web page to the previous case count:

# if new is diff from prev, update files and upload to aws
if str_total_case_prev == str_total_case_new:
    scrape_result = 'no change, case total is still same as prev case total: ' + str_total_case_prev
else:
    # create scrape result string to print to cron log
    scrape_result = 'new cases found: ' + str_total_case_new + ' prev case total: ' + str_total_case_prev
    # transform pandas dataframe into dictionary to write as json
    json_table = df_table_data_final.to_dict('records')
    # write new montreal covid_data to json file for map to use
    with open('uploads/montreal_covid_data.json', 'w') as f:
        f.write('var covid_data = \n')
        json.dump(json_table, f, ensure_ascii=True)
    # write today's date to use in index page as last updated date
    with open('uploads/last_update_date.json', 'w') as f:
        f.write('var last_update_date = \n')
        json.dump(todays_date, f)
    upload_to_aws()

If the counts are the same, the code stops. If the new count is different from the previous count, the code creates new data files that are uploaded to the AWS S3 bucket.

A scrape_results string is also created that is written to the cron log.

Uploading to an AWS S3 bucket is conceptually quite straightforward. The Python module Boto makes it easy to create the connection and bucket definitions:

from boto.s3.connection import S3Connection

# create aws S3 connection
conn = S3Connection(canada_covid_aws_keys['AWS_KEY'], canada_covid_aws_keys['AWS_SECRET'])
bucket = conn.get_bucket('canada-covid-data')

The bucket itself has a Public Policy so anyone can read the data files. Each bucket file has a public url so the map visualization can simply reference these to get the data.

However, authentication is required in order to transfer the data from the web server to the S3 bucket. So there is some behind-the-scenes setup work to do on the AWS side: first, to create and configure the bucket, and second, to create and configure the IAM objects to do the authentication.

An IAM User Policy was created to allow that User to write, read and delete on that bucket. The User has an AWS key and secret that are provided as part of the Boto connection to do the S3 authentication. Of course the key and secret should not be exposed and are imported into the Python code from another non-public location on the web server.

Once the connection is made, the Python code deletes the existing files in the bucket before uploading the new files:

  
from os import listdir
from os.path import isfile, join
from boto.s3.key import Key

# identify files to be uploaded to aws
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing files from bucket
for key in bucket.list():
    bucket.delete_key(key)

# write new files to bucket
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)

The Leaflet map visualization will then show the new updated data next time it is viewed or browser page refreshed.

Heat maps of Canadian activity changes due to COVID-19 using Google Community Mobility Reports

During the 2020 COVID-19 pandemic in Canada I wanted to get better understanding of the geographical distribution of COVID-19 related activity changes across Canada.

Google has helpfully provided freely available global "Community Mobility Reports" which show Google location history changes compared to a baseline by country and country sub-region. These provide changes in activity by location category: Workplaces, Retail & Recreation, Transit Stations, Grocery & Pharmacy, Parks, and Residential. For Canada it is available by province. As of April 19, the data contained daily values from Feb 15 to Apr 11.

The Community Mobility Report data is available as a single csv file for all countries at the Google Community Mobility Report site. In addition, Google provides a feature to filter for a specific country or country sub-region (eg state or province) and download the result in PDF format.

As the COVID-19 lockdowns occurred across Canada you would expect that people were less likely to be in public spaces and more likely to be at home. The Community Mobility Reporting location history allows us to get some insight into whether or not this happened, and if it did, to what degree and how this changed over time.

I used the Community Mobility Report data to create a D3.js heat map visualization which is described in more detail below and in this Github repository.

I also created an Excel version of this heat map visualization using Pivot Table & Chart plus conditional formatting. This Excel file, described in more detail below, is available in the Github repository.

More detail and screenshots of the visualizations are provided below:

Heatmaps
Heatmaps are grids where columns represent dates and rows represent provinces/territories. Each heatmap is a grid representing a single mobility report category. The grid cell colors represent the percent change values, which can be positive or negative. The effect of the lockdowns can be observed as activity at public location categories decreased relative to baseline. Inversely, residential location values increased relative to baseline as people sheltered in place at home.
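For illustration only, a Pandas sketch of the same grid shape is below; the actual transformation in this post was done with Excel Power Query, and the Google csv column names used here should be checked against the current file.

import pandas as pd

# the global Community Mobility Report csv as published by Google
df = pd.read_csv('Global_Mobility_Report.csv', parse_dates=['date'])

# keep Canadian provinces/territories only
canada = df[(df['country_region'] == 'Canada') & df['sub_region_1'].notna()]

# one heatmap grid per category: rows = province/territory, columns = date
workplace_grid = canada.pivot_table(
    index='sub_region_1',
    columns='date',
    values='workplaces_percent_change_from_baseline'
)
print(workplace_grid.iloc[:5, :5])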

1) Heatmap created using Excel / Power Query: For this heatmap visualization the global csv data was transformed using Excel Power Query. The Excel file has two Pivot Table and Chart combos. The Excel files and Power Query M code are available in the Github repository.

2) Heatmap created using D3.js: For this heatmap visualization the global csv data was transformed using Excel Power Query. The heatmap visualization was created using slightly modified code from ONSvisual.

Bar charts
These were created using Excel / Power Query to visualize percent change by province/territory and location category. They allow comparison between provinces by date and category. This Excel / Power Query file can also be used for analytical purposes to slice and dice the global data by date, country, sub-region 1 & 2 and category. The Excel files are available in the Github repository.

Choropleth map of Canada COVID-19 cases by health region using Leaflet and D3.js

During the early days of the 2020 COVID-19 pandemic in Canada, I wanted to get better understanding of the geographical distribution of COVID-19 cases across Canada.

At the time, government and news agencies were only mapping case counts by province. However, Canadian provinces are so big compared to their population centers that provincial counts don't accurately reflect the actual geographic distribution of cases. It would be better to use the provincial "health regions", which correspond much better to population centers.

So I set about to create for myself a choropleth map visualization by health regions.

View the finished choropleth map at the following link. The source data is updated daily each evening.
https://sitrucp.github.io/canada_covid_health_regions/index.html

I used the Leaflet.js open-source JavaScript mapping library to create the interactive choropleth map, D3.js to retrieve and transform the csv format data, and JavaScript to retrieve the JSON geographic boundary files and to manipulate and present the data with HTML and CSS.

The COVID-19 case count data are obtained in csv format from the "COVID-19 Canada Open Data Working Group", an amazing group of volunteers. They have been tirelessly collating data from the various provincial and territorial government agencies daily since early March. The group saves the collated and cleaned data as csv files in a Github repository: https://github.com/ishaberry/Covid19Canada.

The health region geographical boundary descriptions are from Statistics Canada’s Statscan ArcGIS Health region boundary Canada dataset. These had very detailed boundaries so I simplified them using QGIS which also dramatically reduced the dataset size.

However, there were some data issues that needed to be addressed first. The Statscan health regions shape file boundary names are different than those used by the provincial and territory government agencies reporting the data.

Statscan seems to use full-form "official" health region names, while the provincial and territorial government agencies use common, more familiar, short-hand names. Also, some names appear to have changed since they were recorded in the Statscan data.

Provinces may also add or remove health regions from time to time due to administrative changes or population changes etc. So either set may have health regions that the other doesn’t have.

From a data governance perspective, in a perfect world, everyone uses a single set of health region boundary names. COVID-19 reporting has made a lot of people aware of this issue which is a silver lining in the COVID-19 dark cloud!

Addressing these name differences was actually quite simple, requiring creation of a lookup table with two columns, one for each dataset, to match the names in the boundary data files to the names in the counts data file. The lookup table can then be used dynamically when getting data each time the map is refreshed. This is described in more detail in Github repository README linked below.

Code for this project is maintained in Github:  github.com/sitrucp/canada_covid_health_regions.

I also created a separate choropleth map for Montreal, where I was living at the time, which was Canada’s COVID-19 “hotspot” with about 25-30% of Canada’s total COVID-19 cases. However, the Montreal data source has since been discontinued so the map is archived now.

View archived Montreal map here:
https://sitrucp.github.io/canada_covid_health_regions/montreal/index.html

Delete Google Calendar events by date using Google Script

To quickly and easily delete Google Calendar events by date, you can use Google Apps Script (https://script.google.com) to run the following script.

Modify the script to include your Google Calendar name, which is likely your Google email address. Then modify the date variables fromDate and toDate. Note that months start from 0 (Jan).

function delete_events()
{
    var calendarName = '[email protected]';
    // for month 0 = Jan, 1 = Feb etc
    // below delete from Jan 1 2015 to Jan 1 2019
    var fromDate = new Date(2015,0,1,0,0,0); 
    var toDate = new Date(2019,0,1,0,0,0);
    var calendar = CalendarApp.getCalendarsByName(calendarName)[0];
    var events = calendar.getEvents(fromDate, toDate);
    for(var i=0; i<events.length; i++){
        var ev = events[i];
        // show event name in log
        Logger.log(ev.getTitle()); 
        ev.deleteEvent();
     }
}

This will delete your calendar events so be careful!

Check out Google Script documentation for more on how to work with Calendar.

https://developers.google.com/apps-script/advanced/calendar

OneDrive SharePoint Migration Tool (SPMT)

I have an Office 365 Business Premium account. I wanted to sync many large folders with OneDrive for Business. However, it was taking a very long time to upload. While I was searching for improvements, I saw that many other people have observed that OneDrive is much slower than Dropbox, Box, AWS, Google Files, etc. Apparently this is a 'known issue' and Microsoft is addressing it.

One of the things they have done is provide a standalone desktop application called the 'SharePoint Migration Tool (SPMT)'. While the name doesn't specifically include OneDrive, the tool is intended for use with both SharePoint and OneDrive. However, there are a few pieces of information I had to learn to use it with OneDrive, and sharing those is the purpose of this post.

First, the user who will be using SPMT must have permissions for the OneDrive for Business account you want to upload files to.

Then you need to download SPMT from here https://www.microsoft.com/en-us/download/details.aspx?id=53598

The SPMT tool can be used in any of several migration ‘scenarios’, but the one I am focusing on here is the ‘File share to OneDrive and SharePoint’ scenario. I wanted to use SPMT as a potentially faster way to upload files from my desktop computer to OneDrive for Business.

The guide for this scenario is here https://docs.microsoft.com/en-us/sharepointmigration/fileshare-to-odsp-migration-guide.

After you download, install and start SPMT, the start screen will ask you to sign in to your Office 365 account. This will be the user that has permission on the OneDrive.

After successfully signing in you will see 3 scenarios to choose from:

The two that I was interested in testing were:

  • File Share
  • CSV file for bulk migration

Note / spoiler alert
While SPMT provides much more flexibility in terms of identifying the source and target, and the capability to batch migration jobs, it was not dramatically faster than just using the OneDrive desktop sync tool. It took me a couple of hours to upload only 1,000 files totalling 3 GB.

The Microsoft documentation refers to migrations with multiple TBs of files, but my experience suggests that would be impractical if the files were coming from outside of the Microsoft cloud. Therefore, it seems that for large migrations SPMT is intended primarily for use within the Microsoft cloud.

File Share
The File share option is pretty straightforward. Use SPMT to select the local folder to upload, then manually enter the OneDrive location url to upload it to. However, there are some things to pay attention to here:

  • When you select a local (aka source) folder, only its contents will be uploaded, not the folder itself with the selected folder name.
  • The string you enter for the OneDrive (aka target) location url is not 100% intuitive or completely described in the instructions. See below.

OneDrive (aka target) location url
The exact specifications for this are not given in a way that is intuitively helpful. There are subtle differences in the url between SharePoint and OneDrive. The linked documentation does indicate these but I had to search for additional clarification which I pass on below.

If my Office 365 account domain is 'mycompany' (eg mycompany.onmicrosoft.com) and the user is 'myuser' (eg myuser@mycompany.onmicrosoft.com), then myuser's OneDrive target location url is:

https://mycompany-my.sharepoint.com/personal/myuser_mycompany_onmicrosoft_com/

You can also simply copy this from your browser address bar when you are viewing myuser’s OneDrive folder online.

CSV file for bulk migration
This option uses similar information to the above, but you put the source and target information into a CSV (or JSON) file.

CSV file format:

  • Has 6 columns:
    • Column A: Mandatory. Source folder path as a regular Windows path. Use double quotes if the path has spaces.
    • Column B: Optional. For SharePoint migrations only, not for file migrations like we are talking about here.
    • Column C: Optional. Subfolder of the source path.
    • Column D: Mandatory. OneDrive (aka target) location url. Do not include anything other than the url described above; Columns E and F are used to specify folders.
    • Column E: Mandatory. Standard value "Documents".
    • Column F: Optional. Subfolders that will be created if they are not already present under your OneDrive root url. Enter as "Folder/subfolder/". Note the use of forward slashes, and double quotes if there are any spaces.
  • The file has one row per 'job'. For example, in the File Share method above you select a single folder to migrate; that would be one row in the CSV (or JSON) file.
  • Use SPMT to upload the CSV (or JSON) file and it processes all of the migration 'jobs'.
  • However, it processes them in parallel, which seems counterproductive. I would rather it process them in sequence, one row at a time.
  • This bulk migration method would be useful for processing many folders.
  • I did not try the JSON method but would expect it to require the same information, just in the specified JSON format.

In both scenarios above, when the migration is happening you will see a screen similar to the one below.

Note that SPMT 'scans' the files to be uploaded first in any migration scenario to see if they meet specifications. There are strict file name restrictions, but it renamed 'bad' files automatically, which is nice.

Once it indicates files are being uploaded you should see corresponding folders and files in OneDrive online.

At any time during a migration you can click the "View Status" link on the form to open the Windows folder where SPMT keeps job status, including failures, in csv files. These are updated like log files so you can monitor status. This was helpful to see what failed and why.

I experienced a few quirky behaviours, such as pausing or stopping a migration freezing the SPMT app so that I had to close it using Task Manager. Also, one very weird result of closing a job was that a local file was corrupted and I could not move or delete it, with the error "windows 10 cannot delete file the tag present in the reparse point buffer". I had to run chkdsk to fix the file.

Overall, SPMT does the job advertised. As mentioned, it does not dramatically speed up uploading files from a local computer into OneDrive, but it does offer more control and flexibility than OneDrive desktop sync for large uploads with many folders and files.

AWS S3 csv file as D3 report data source

This is an example of how to read a csv file retrieved from an AWS S3 bucket as a data source for a D3 javascript visualization.

The D3 visualization would be an HTML document hosted on a web server. 

You will use the AWS SDK to get the csv file from the S3 bucket, so you need an AWS key and secret with access to that bucket, but I won't cover setting that up in this post.

The key point of this post is that the data returned by the bucket.getObject function is read into D3 using d3.csv.parse(data.Body.toString());

Another note is that d3.csv.parse is for D3 version 3; newer versions (v4+) use d3.csvParse.

Once implemented, whenever the webpage is refreshed it retrieves latest csv file from the S3 bucket and the D3 visualization is updated.

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.6.3.min.js"></script>

<script type="text/javascript">

// aws key and secret (note these should be retrieved from server not put as plain text into html code)
AWS.config.accessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.secretAccessKey = 'xxxxxxxxxxxxxxxxxxxxxxx';
AWS.config.region = 'us-east-1';

// create the AWS.Request object
var bucket = new AWS.S3();

// use AWS SDK getobject to retrieve csv file
bucket.getObject({
    Bucket: 'my-S3-bucket', 
    Key: 'myfile.csv'
}, 

// callback function to use the retrieved data
function awsDataFile(error, data) {
    if (error) {
        return console.log(error);
    }

    // this is where the magic happens: d3.csv.parse reads data.Body.toString()
    myCSVdata = d3.csv.parse(data.Body.toString());

    // now loop through data and get fields desired for visualization
    var counter = 0;
    myCSVdata.forEach(function(d) {
        d.field1 = +d.field1;
        d.field2 = +d.field2;
        countLoop = counter++;
    });

    // now you can create the rest of the D3 visualization here
    // for example like this example https://gist.github.com/d3noob/4414436

// this closes the callback function and bucket.getObject
});

</script>


Introduction to Excel Power Query aka Power BI Query

What is Power Query?

Power Query is a Microsoft tool for Excel that is used to do the following:

    • Connect to a data source
    • Extract data from source
    • Transform the extracted data
    • Load the transformed data into Excel

This functionality is called ETL (Extract, Transform and Load).

Why use Power Query?

Excel already has another tab of features called Get External Data, why not use that?

    • Excel’s other methods for connecting to and retrieving data may still be useful, but they are not able to transform the data.
    • Power Query connections are non-techie, user friendly, and provide much more versatile and robust functionality.

 

Where is Power Query in Excel?

You can find Power Query in the Data tab's Get & Transform icon group in Excel 2016.

Note that in previous versions of Excel, Power Query is an add-in that has its own tab. However, in Excel 2016 Power Query is built into Excel.

Get & Transform has 4 icons:

    • Show Queries – this hides/shows the Query pane docked on right side of Excel.
    • New Query – choose data source & modify Power Query settings and options.
    • From Table – choose data source as currently selected Excel table.
    • Recent Sources – choose from recently selected data sources.

Step 1 – Connect to a data source and retrieve data

Select the New Query icon to select which type of data you want to connect to

    • From File (csv, Excel)
    • From Database (SQL Server db, Access db)
    • From Online Services (Facebook)
    • From Other Sources (webpage, SharePoint List, ODBC connection)

Screenshot shows example of connecting to an Excel Workbook.

    • Select From File – From Workbook
    • Popup dialogue allows you to select an Excel file.

Step 2 – Transform the data – common usage

    • “Calculated columns” replace any worksheet formula requirement
    • Power Query has many more formulas than plain old Excel, including VBA type functionality
    • Power Query has been called “vlookup on steroids”
    • “Pivot” and “unpivot” data
    • Group and aggregate data into sums, min, max, averages
    • Simple to combine the contents of many files in a folder into one table.
    • Filter, sort, replace with complex logic.
    • Merge two or more separate tables of data together on matching values. If you are familiar with SQL, this is a JOIN query.
    • Append two or more separate tables of data into one combined table of data. If you are familiar with SQL, this is a UNION query.

Step 3 – Load the data into Excel

    • Power Query queries are loaded to a worksheet by default.
    • However, you can choose to save a query as a connection only by selecting Load To – Only Create Connection. This query can be used in other Queries eg in Merge or Append, and by other Excel features eg it can be used as datasource for a Pivot Table.

Step 4 – Click and refresh to retrieve new data

    • When data source(s) are updated, just click Refresh to retrieve and transform new data.
    • No copy & paste of data from other files.
    • No worry about formulas being deleted or ranges being moved.
    • All transformations are safely contained inside the Query.

Bonus: Power Query is also in Microsoft Power BI

    • Power Query has been integrated into Power BI as the “Get Data” feature.
    • Has the same functionality and features as the Power Query in Excel.
    • Same programming language ‘M’.

 

AWS RDS Postgres database external secure connection

Require an SSL Connection to a PostgreSQL DB Instance? Here are some notes from my investigations.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.SSL

There are not many great examples on that page, but take a look at this other documentation for a secure connection to a Redshift database, which has working examples:
http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html

Go to EC2 Security Groups and modify the Inbound PostgreSQL rule from the server's security group (eg something like "sg-c2a232b8") to either your current IP or to Anywhere (eg 0.0.0.0/0).

Setting it to Anywhere means both you and your EC2 application can connect to the database, but so can the rest of the world. Setting it to just your current IP means only you and anyone else on that IP (if it is a work or university IP) can connect to the database. Of course, they will still need the key, username and password, so it is reasonably secure.
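For reference, here is a minimal sketch of an SSL connection from Python using psycopg2; the endpoint, credentials and certificate file path are illustrative, and the CA bundle is downloaded from AWS.

import psycopg2

# illustrative connection values; replace with your RDS endpoint and credentials
conn = psycopg2.connect(
    host='mydbinstance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com',
    port=5432,
    dbname='mydb',
    user='myuser',
    password='mypassword',
    sslmode='verify-full',            # require SSL and verify the server certificate
    sslrootcert='rds-ca-bundle.pem'   # assumed local path to the downloaded RDS CA bundle
)

with conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone())

conn.close()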

How to use Hadoop HDFS as a datasource for Power BI

I have a local Hadoop server running on my Windows laptop. To connect Power BI to this Hadoop file system I used Hadoop’s WebHDFS REST API which is very well documented.

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html

The WebHDFS API uses a url in which you specify the Hadoop file system directory that you want to get files from, as follows. In my case I had a folder named "myhdfsfolder":

http://localhost:50070/webhdfs/v1/myhdfsfolder?op=LISTSTATUS

The myhdfsfolder folder contained a csv file which had previously been imported into Hadoop.
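As a quick sanity check outside of Power BI, the same WebHDFS endpoint can be called from Python; this is a minimal sketch, and the csv file name is an illustrative assumption.

import requests

base = 'http://localhost:50070/webhdfs/v1'

# list the files in the HDFS folder (the same url Power BI will use)
listing = requests.get(base + '/myhdfsfolder', params={'op': 'LISTSTATUS'}).json()
for f in listing['FileStatuses']['FileStatus']:
    print(f['pathSuffix'], f['type'], f['length'])

# read the csv file itself via the OPEN operation (requests follows the datanode redirect)
csv_text = requests.get(base + '/myhdfsfolder/mydata.csv', params={'op': 'OPEN'}).text
print(csv_text[:200])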

In Power BI, select “Get Data”, “Other”, then “Hadoop File (HDFS)” which will pop a window that asks you to enter a url. The url requested is the WebHDFS REST API url specified above.

After entering your url and clicking ok you should see a list of objects that are in the Hadoop directory. In my case I can see my csv file along with bunch of Hadoop log files.

Since I only want the csv file, I clicked the Power BI Edit button so I could filter the list to only my csv file. This part is just standard Power BI/Power Query stuff. If you had multiple csv files, you could filter to csv file types, or some other feature of the file name, or file attribute such as create date etc.

After filtering, only the csv file I am interested in shows in the list. There is one more step to do which is also just standard Power BI/Power Query stuff. Since the file is a binary file you first need to extract its contents. To do this, select the Content column double down arrow selector. This will extract the binary file content’s into a new table which is the data you want.

Now you can continue transforming the csv file data in Power BI/Power  Query or just load the query to start using the file’s data in Power BI data modeling and reporting!

How to use Python as a datasource for Power BI

Step 1 – Download the latest version of Power BI to get the new features (the version that came out in early August 2018).

Step 2 – Enable Python in "Options – Preview Features" by checking the "Python support" option. Close and restart Power BI.

Step 3 – Click “Get Data”, select “Other” and “Python script” and click “Connect” button.

Step 4 – Write or paste Python script in the popup Python script window and click Ok button.

I chose to use Pandas to get csv file:

import numpy as np
import pandas as pd
data = pd.read_csv('C:/mydatafile.csv', delimiter=',')

Note that if you are using Windows as I am, you have to write the file path so it can be read correctly, otherwise you will get an error. I used forward slashes, but you could also escape the backslashes.

You could also write any Python script that gets or generates data from any data source, eg a database, AWS S3 buckets, etc. Of course, this also includes transforming the data. The important thing to note is that your script must output one or more named datasets (Pandas dataframes). The named dataset(s) are what Power BI will use when you are finished coding.
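For example, a script like this minimal sketch would expose two named dataframes, sales and targets, for Power BI to pick from; the data here is generated purely for illustration.

import numpy as np
import pandas as pd

# generated data purely for illustration
sales = pd.DataFrame({
    'month': pd.date_range('2018-01-01', periods=12, freq='MS'),
    'amount': np.random.randint(100, 200, size=12)
})

# a second dataframe; Power BI lists every top-level dataframe as a selectable dataset
targets = pd.DataFrame({'month': sales['month'], 'target': 150})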

Step 5 – After clicking ok, you get the familiar Power Query data source selection window. Select Python and you will see your named data set(s). Just check the one(s) you want and click ok.

Now just continue working with Power BI as normal! You can always go back and edit the Python script by selecting "Source" in the query editor.

Comparison of my old Palm vs my new Pocket PC (from 2004)

Here is a blog post from my old WordPress site that I wrote back on July 7, 2004.

Comparison of my old Palm vs my new Pocket PC

I thought I would write this to share some of my experiences making the transition from the Palm platform to Windows CE/Pocket PC, as well as some of the things I did to my Pocket PC 2003/Dell Axim X30 to make it function better for me. At the end I get into a more polemical editorial on the looming Microsoft hegemony in the PDA OS world – a good thing, and a bad thing.

What did I use my Palm VIIx for?

  • Calendar (Desktop To Go)
  • Contacts (Desktop To Go)
  • Notes (Desktop To Go)
  • Tasks (Desktop To Go)
  • Reading eBooks (Mobi Book, AportisDoc, Palm Reader, Word To Go)
  • Documents and spreadsheets (My Documents To Go)
  • 3rd party applications, mostly productivity, eg expenses
  • Handwritten notes/graphics (Doodle)

What more can I do on my Dell Axim PPC 2003?

  • Wifi internet access
  • Wifi streaming audio
  • Wifi streaming LIVE NASA TV!! (and lots of others)
  • Play relatively sophisticated games like Quake
  • Color picture viewer
  • MP3 player (need a big SD card now)
  • Wireless synchronization of Pocket PC Outlook to PC Outlook
  • Wireless access to PC (using shared folders)
  • Control Desktop by wireless connection
  • Use Pocket PC as a remote control for TV (and anything else requiring an IR remote)
  • Use Excel to manage my list of books and daily balances and projected income
  • Bluetooth connection to cellphone to dialup internet access
  • Bluetooth to exchange photos from another person's cellphone with camera (also with IR)
  • Much more to come!

Side by Side Views

The Palm VIIx is the black colored device on the left, and the Dell Axim is the silver colored device on the right.

I never used the "Wireless" capability of my Palm VIIx. I actually became annoyed by the antenna popping up and down (it was so loose that it always moved) and activating the modem software, so I Crazy Glued it to the Palm.

But in any event, look at the size of that antenna!

Palm VIIx with wireless antenna extended.

Ok let’s get into the details

There are some very fundamentally important things that I really like about my Palm compared to my new Axim. Maybe I am used to the Palm but it got what I wanted done quickly and efficiently. I hoped that my Dell Axim would far surpass that. It hasn’t .. yet! I am continuing my search for applications, tricks, and new ways to do old things!

Anyways, after many years, I was very happy with what I was getting from using Palm. Even though the Palm VIIx is relatively stone-age in terms of hardware, size, etc, I found it, and a Palm 100 I had a few years before, to be incredibly useful. While the PPC is the more flexible platform, the Palm does things in a much more efficient, and often more powerful, way. I could treat my Palms like the bricks of plastic they were. The VIIx screen crapped out on me once, so I opened it up, took it apart, cleaned the connections, reseated the cables, and voila, it worked again. I read tons of ebooks on it. I also used it on a daily basis for appointments, expenses, tasks, notes, and Excel and Word document editing.

That said, I am overwhelmed by the advances that have been made in the less than four years between when the Palm VII came out (in 2001) and the current slate of PPCs.

Dammit, with my 312 MHz Axim X30 I can be listening to MP3s (the sound on my Axim really rocks: equalizer, extra bass) while reading Bill Clinton's My Life on Mobibook! I can play games like Quake, which was a real struggle on my old 486 50 MHz Pentium but is smooth as butter on my new Dell Axim X30. Truly amazing.

This brings to mind the exponentially increasing processing power ala Ray Kurzweil .. but dammit again, this just reinforces my consternation about why I feel a shortcoming in what I can and cannot do with my new PPC vs the Palm, given the enormous boost in hardware capabilities! Maybe it's not the hardware .. then again, why am I complaining when it is a bit slow? My Palm couldn't even multi-task, let alone exhibit such brute strength. Clearly, I am conflicted. Clearly, something is conflicted. Anyways, I got my Pocket PC a couple of weeks ago. The amazing price of the Dell Axim and its built-in WiFi & Bluetooth made it an easy choice.

After shopping around for Pocket PC PDAs and lots of reading, the only contender was the Ipaq 4150, but even with its greater supposed “form factor”, and longevity of service, etc it was almost $300 (Hello US citizens, this is Canadian Dollars) more expensive. Also, the Dell 312 Mhz cpu rates as fast, or faster in many categories than the Ipaq’s 400 Mhz. So, I fired off my money order (not only do I believe in reducing my ecological footprint, I believe in reducing my electronic footprint) by overnight courier to Dell here in Toronto.

I initially had four orders placed with Dell because the Canadian website was confusing, so I kept going back to make sure I was ordering what I wanted. A little aside here, I know a high level technology manager at Dell Canada who says the guy in charge of the Dell website is a doofus. The US site was great. Anyways, a few strategic phone calls to iron this out probably didn’t hurt in elevating my profile. Then I put “RUSH, per Customer Rep” accentuated with highlighter on the order printout attached to my money order. I was trying to push some buttons in the machine to make sure my order wouldn’t get lost in the in-basket. I think it worked. Lots of people seem to be waiting a long time for their Axims. Four days later my Axim was in a truck on its way to me. It arrived 2 days after that. Not too shabby eh?

Before getting my Dell Axim, I didn't really know much about what to expect from the Pocket PC or Windows Mobile CE OS. I assumed that the much more powerful hardware in my new Axim/Pocket PC would make everything "snappier". However, I have discovered a confounding paradox. The Palm PDAs I had let me find and open things faster, but the Axim/Pocket PC is obviously the more powerful system when it comes down to processing stuff.

For example, even though the steps are the same, opening MS Excel spreadsheets on the Palm VIIx is much faster. But calculation of formulas and the like on the Palm VIIx takes so long, I can literally tap-a-tap-tap my fingers in idleness. On the Axim/Pocket PC, it seems as fast as my desktop computer. Blink, it's done.

Let's move on to another thing, the 3rd party application factor. Those Palm programmers are everywhere. There are guys in Amsterdam, Thailand, Borneo, and Chicoutimi churning out tons of Palm applications for free. The PPC world is populated by programmers with $ signs in their eyes. Nothing wrong with making money, but what is happening, from my perspective, is that because M$ has bankrolled PPC development, most of the players involved are bigger and better funded, and expecting a better ROI, by targeting the enterprise market and the high end consumer market.

This leaves the plain vanilla user like me with diminished opportunities. I think it would be interesting to do a comparison of the demo trial length (in days) for applications developed for Palm versus PPC. I bet it would be much shorter for PPC. There would be no contest in terms of freeware or pay-as-you-like-ware. Palm would win hands down.

Now this brings my consternation about what I can actually DO with my PPC versus my Palm to the fore. Because the benchmark for ROI in the PPC application world is higher than in the Palm world, the level of application sophistication is higher. Essentially, developers want to give more, to justify higher software cost, and thus the competitive spiral elevates. It's only human nature, well almost always, because often people just don't have what it takes, and they fake it, or they stuff it up, or bloat the ware just to add more "functionality".

There is nothing wrong with this. Everybody should try to give the user the best experience. But from my perspective, as a user, I am more likely to get the application that meets my needs when 100 so-so developers make my expense program, instead of 3 super high-end developers making my program. Take note, we are talking about variety, about competition!

No doubt about it, M$ knows what it does. Bill Gates is not the richest man in the world for nothing. The company manages every fundamental aspect of its business from making sure that product reaches the loading dock so that revenue is accrued smoothly, to making sure that its platforms, such as Windows CE, are $ attractors. What does this mean?

This is a recurring theme with Win CE. I have downloaded eMbedded Visual C++ 4.0 (eVC4.exe), the Microsoft Pocket PC 2003 SDK (.msi), and TodayScreenSampleSetup.exe from Microsoft. It is clear from their application development documentation that their 80/20 rule is strategically focused on maintaining the vortex. Microsoft will develop the nexus around which dissatisfaction will bring in 3rd party developers, but only the $ seekers. The best survive, M$ consumes them, and they become part of the Hive Mind.

There is absolutely no doubt that Bill Gates' primary "directive" to Microsoft product managers is to ensure that their products leave strategic "gaps", but limited ones, so that they bring in only the high end 3rd party developers. But the directive will be explicit. LEAVE GAPS.

Bright, talented, motivated they may be, but Microsoft developers are corporate booger-heads (Scott are you reading this : )) and ultimately, they CANNOT create. They can only expand on existing corporate paradigms. 3rd parties are the only ones who can do this. This has been the strength of Mac and Palm. But alas, Mac has required propping up, by M$, as will Palm in the near future, to ensure that M$ gets the competitive motivating force it needs. Gates knows monopolies are death. It is a fine balance for Microsoft, as all the anti-competition action against them in North America and in Europe can attest.

For now, can I just ask: Where are the Linux developers for PPC hardware! Or perhaps, Palm developers of the world, learn the CE platform! You could certainly bring some sanity to the irrelevant-ware out thar.

Palm VIIx vs Dell Axim X30 (312 MHz):

  • Length of Service – Palm VIIx: 3 years; Axim: 2 weeks
  • Cost – Palm VIIx: a gift from a friend, but about US$399 new (back in the day); Axim: $464 Cdn incl. tax and shipping
  • WiFi – Palm VIIx: umm, sort of .. a 9600 baud modem, a special Palm.net email address and downloadable ‘Web Clipping Applications’; Axim: YES! (and Bluetooth too)
  • Processor – Palm VIIx: Motorola MC68EZ328 DragonBall EZ 20 MHz; Axim: 312 MHz Intel® XScale processor with WMMX
  • RAM – Palm VIIx: 8 MB (not expandable); Axim: 32 MB (buy SD cards)
  • ROM – Palm VIIx: 2 MB Flash memory; Axim: 32 MB
  • Screen – Palm VIIx: 16 gray scales; Axim: 3.5″ QVGA TFT color, 16-bit, transflective
  • Battery – Palm VIIx: two AAA batteries; Axim: 950 mAh Li-Ion rechargeable
  • OS – Palm VIIx: Palm OS Version 3.5; Axim: Windows CE 2003
  • Input – Palm VIIx: Graffiti; Axim: Graffiti (aka Block Recognizer), Keyboard, Transcriber, Letter Recognizer

 

What Have I Done to Customize My Dell Axim/Pocket PC?

Screen Protectors – My Palm VIIx was a giveaway. I treated it like a pocket book (like a paper one, i.e. I didn't give it any special attention). The result was that the screen became scratched and scuffed, but it didn't matter to me. It was a totally utilitarian thing.

So, I thought I could do the same thing with my brand new Dell Axim. Oh boy, was I wrong. After a few days of not using screen protectors and thinking I could continue as I did with my giveaway Palm, I had faint scratches in the Block Recognizer text area : ( just from the stylus! Ok, this is not good. As it happens, there seemed to be something wrong with the screen, a kind of rainbow effect appearing when viewing the screen at extreme angles. So, I called Dell, went through the motions, ended up getting a replacement, and sent the first one back. So, I had a virgin Axim screen, and then decided I would be a wuss and bought screen protectors for my new Dell Axim. They are a bit big (they were made for the Ipaq but were on sale) and I have to trim them using scissors, an exacto blade or my pocket knife. Sometimes a protector overhangs the screen and catches on things for months, but hey, it still protects the screen.

After applying the screen protector, I have an unexpected sense of comfort (you can feel the protector and it has some heft) when I stuff my Axim into my pocket (even though there are never ever any keys, coins or anything else in there .. ever!) As it turns out, I fell in love with the case that comes with the Axim. It is really great. It is tough. The colour is black and it is not big, so it hooks/clamps onto my belt really nicely. I can also just throw it into my pocket or bag. It is easy to pull the Axim out and stuff it back in. After nearly a year of use, the clip and the elastic that holds the case together are still like new. Very impressed.

Re-Mapped Buttons – I re-mapped buttons so that my new Axim resembled my old Palm. The Message (email) button is mapped to Tasks, and the Home button is mapped to Notes. Now it is like my Palm.
Re-Mapped WiFi Button – using a program called WiFiPower.exe (a hacked version of the WirelessPower.exe in the \Windows folder), you can start only the WiFi and not the BlueTooth.

Benefit? If you don’t use Bluetooth at all or not very often, not starting BT every time you need the WiFi radio might save battery power. At least only the green light flashes when you use WiFi and not the Blue! That has to save some milliamps of juice there.

Simply use the Settings\Buttons setting to remap the Wireless button to WiFiPower.exe.

X Marks the Spot – A fundamental, though well-intentioned, flaw in the whole PPC Windows CE paradigm is the concept of the X button. The X button in the top right hand corner doesn't Exit the window, it simply Minimizes it. This was done because it was deemed too slow to re-start a program compared to simply refocusing on it.

Dumb, IMHO, because given the multi-tasking propensity of the average user, you soon have 5 or 10 applications open, all sucking up valuable RAM. It wouldn't have been difficult to give users a choice. There is a program called Switcher that can close programs from the taskbar (not sure if it comes from Dell or PPC 2003).

Many 3rd party solutions are offered to do what Switcher does: Magic Button, Gigabar. So far, WisBar is my favourite. Just click the X, and it Exits the application! Wow, that is just way too easy! It even has a task selector and you can close applications from it. There are many versions of WisBar and I have no idea what the differences are. I have Version 1.4.2, which is free and does everything that I want.

Today Screen – Ok, why did Microsoft create both a "Home" and a "Today Screen"? I am baffled by this. The Today screen comes from the PC version of Outlook, but I never used it there as it was unwieldy, and those bubbles just look too goofy. Ok, so let's deal with it, because it seems we are stuck with the Today screen in the PPC/Win CE world.

How can you make the Today screen better? There are a lot of 3rd party solutions to bring application shortcuts onto the Today Screen.

  • I am using c Launch to avoid using the Home screen as a program launcher. It allows me to group shortcuts to applications however I want. I can import one or more folders of shortcuts from Windows\Start Menu\Programs\ into one group
  • Many other plugins bring program functionality onto the Today screen; for example, PocketMusic's player can be controlled from the Today screen.
  • Generally though, I have noticed that using Today screen programs slows things down. Just pick the one or two that work for you and avoid overdoing it. From my experience, you have to be careful because they use a lot of RAM and can diminish system performance. The Today screen is like a webpage with programming built in, and there are several layers of application communicating, which adds to the CPU load and slows things down.

I think, from my limited knowledge of the architecture of the Windows CE OS, that this is where it starts to fail. Some of the Windows 95 programming has been recycled to form the core parts of Win CE. Given what M$ had to work with, and given their business philosophy, i.e. to be the vortex, they have done something almost elegant. The core programming has been set up so that it can sit inside of almost any mobile or non-full-fledged hardware to do the basic stuff, while a surrounding layer of programming lets it "talk" to the user using the hardware available.

Palm is slim, spartan, bare, economical. Win CE is one size fits all, most core functionality for a wide variety of hardware. Problem is that Win CE is relying on hardware to give it the processing power to punch through all of those application layers. That is why I am finding that a 20 Mhz Palm VIIx processor could keep me happy while the 312 Mhz Intel® XScale Processor with WMMX struggles to keep up. I guess it is the same as the desktop. There is a transition to “bloatware” which gives more and more features and functionality but its increase still overwhelms the faster hardware. Hmm.

File Explorer – the one that comes with PPC 2003 isn’t so hot. Get another one. I am using Total Commander. There are others but in my short time as a PPC user, I am liking this one.
Zoom Zoom – After years of daily use of MS Excel on my old Palm, I am taken, absolutely taken with the 78% zoom View of my spreadsheets. Those numbers are so small, so clear, so .. goddammed cute!
File Sync – the one that comes with PPC 2003 is not good. The MS ActiveSync file synchronization forces you to copy the files you want synced on your PDA into a special folder on your PC. That is not cool. Why should I change my file structure just to sync with a Microsoft product? What were they thinking!

  • Get Mighty Synch which synchronizes files to/from PDA/PC without moving files to special folders.
  • Also, try Wireless synchronization with your PC
Clearing Out Unnecessary Files – I regularly rid myself of Internet Explorer cached files. Also, use Where Is My Ram to find out which folders/files are taking up room in your RAM and in your ROM. Eliminate email attachments from the attachment folder (after you save them elsewhere if you want them).
Registry Hacks – RegKing lets you make some categorized registry changes. I was using Pocket Tweak but for some reason, I have no idea why, it made the font in some places super tiny (I had to do a hard reset to fix this). So, for now, I am going to forego the few beneficial hacks it offers.
Picture Graphics Viewer – The picture/graphics viewer that comes with PPC 2003 is not very useful.

  • I have a demo of Resco File Viewer and will probably buy it soon because it is really powerful and easy to use; it is a really great product.
  • XnView Pocket is also quite good and is free, for now.
Music/MP3s – I used PocketMusic, but it installed a Today screen plugin that slowed my system down; the features other than that were good. I have yet to reinstall it and am still looking for something else. WinAmp seemed ok, although after one install it was missing a file; I will try installing it again.

I decided to remove those apps and have been using Media Player, which works even though it is quirky in that Microsoft way. MS seems to have a really different way of looking at the world compared to my perspective. That playlist thing is something that just seems contrary. WinAmp works the way I expect a music player should work. However, as Media Player is already installed and the other programs are large-ish, I will use Media Player. I did remap the buttons so that when Media Player is on top, I can adjust volume, go forward/previous, or stop/play, and they remap to their normal functions when Media Player is minimized.

eBooks – Mobibook does the job. I use MS Word for ebooks I have in Word or txt format too although scrolling is a bit of a chore.
Weather – Shane Church has built Weather Info, one of the nicest, simplest, most functional, does-what-it-says-and-no-more programs I have seen. The weather data comes from the weather.com® web service and is cached so the data will still display if the device does not have an internet connection.

BC Executive Compensation Tableau Visualization

The Province of British Columbia publishes public agency executive compensation each year. The data is grouped by:

  • Sector – Employers are categorized in one of six sectors – Colleges, Crown Corporations, Education, Health, Public Service, Universities
  • Employer – Employers belong to one Sector. There are about 190 employers such as BC Assessment Authority, BC Housing Management Commission, BC Hydro, etc
  • Employee – An employee belongs to one Employer, and the full name of each employee is published.

The employee’s salary is given as:

  • Base Salary
  • Total Compensation

The data is provided each year as pdf and csv data files. I downloaded csv data files for each year from 2012 to 2016 and combined them into one dataset.

Here is a link to the Tableau Public dashboard visualizing the data; the chart is also embedded below.

 

I used Python to combine the data. Some of the Sector values were missing spaces, one of the years had a "$" symbol in the salary column values which had to be removed, and I added a new year column based on each data file's name.

Here is the Python code used to combine and clean up the data:

import csv
import pandas as pd
from datetime import date, datetime
import glob
import os

def main():

    ## path for data files
    data_path = 'C:/bc_salary_data/'

    allFiles = glob.glob(data_path + "*.csv")
    frame = pd.DataFrame()
    list_ = []

    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None, header=0)
        df['filename'] = os.path.basename(file_).split('.')[0]
        df['Sector'] = df['Sector'].str.replace(' ','').str.replace('CrownCorporations', 'Crown Corporations').str.replace('CrownCorporation', 'Crown Corporations').str.replace('PublicService', 'Public Service')
        df['year'] = df['filename'].str[-4:]
        ## strip the "$" symbol from the salary columns (escape as '\$' or pass
        ## regex=False if your pandas version treats the pattern as a regex)
        df['Base Salary'] = df['Base Salary'].apply(str).str.replace('$', '')
        df['Total Compensation'] = df['Total Compensation'].apply(str).str.replace('$', '')
        list_.append(df)

    frame = pd.concat(list_)

    ## Create csv file to write results to
    frame.to_csv('merged.csv')

    print str(datetime.now()) + ' data files merged'

    return

if __name__ == '__main__':
    main()

Tableau Visualization

http://www.fin.gov.bc.ca/psec/disclosure/disclosure15-16.htm

 

Business Intelligence – IT Black Box Challenge

The process of engaging IT teams can be difficult to understand for many non-technical people. IT is perceived as a 'black box' because its inner workings are often not transparent.

bi-black-box-1

BI teams may not realize that their business side coworkers do not know how to engage them.

bi-black-box-2

When IT workers reach out to engage business users they may often confuse their potential customer by using too much jargon.

bi-black-box-3

Technical teams also often assume too much about non-technical co-workers' understanding of IT terms and descriptions.

bi-black-box-4

Therefore, it is important for any IT worker to start any engagement with non-technical co-workers at the most basic level of communication.

Do not assume anything. Do not use jargon. Explain your technical job title and role. Describe the process of engagement, what you will do, and what will be expected of the non-technical co-worker. Shine some light on the inside of that IT black box and make it transparent!

Twitter Followers and List Membership Tracking

I have created my own historical reporting of my Twitter account followers and list memberships. I have hosted the reporting on this open webpage.

Most people know what followers are, but list memberships aren't as well known, so here are definitions of both.

Twitter Followers – Twitter followers are other Twitter accounts who follow my account. A Twitter Follower may follow my account for a few days and then unfollow me at any time afterwards.

Twitter List Memberships – Twitter accounts can also create their own Lists and then add other Twitter accounts to the lists. For example a Twitter account may create a List called “Tweet about Big Data” to track Twitter accounts they believe “Tweet about Big Data”. Twitter accounts may add my account to their List and remove it at any time afterwards.

The Twitter data retrieval, writing to database, and data querying are all done on a web server.

In order to record changes in counts of these you need to have daily historical data. The Twitter API doesn't provide historical data, so you need to create it yourself by retrieving and saving daily data somewhere.

Three Python scripts using the Twitter API, the Tweepy Python module and the AWS SDK are scheduled to run daily using cron jobs.
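For reference, the crontab entries look roughly like this; the script names, paths and times here are placeholders of mine, not the actual ones used:

# run each Twitter job once a day, logging output for troubleshooting
15 1 * * * /usr/bin/python /home/twitter/get_followers.py >> /home/twitter/cron.log 2>&1
30 1 * * * /usr/bin/python /home/twitter/get_list_memberships.py >> /home/twitter/cron.log 2>&1
45 1 * * * /usr/bin/python /home/twitter/s3_csv_upload.py >> /home/twitter/cron.log 2>&1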

Two scripts retrieve followers and list memberships and insert the data into a PostgreSQL database. This daily retrieval builds the daily history.
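As a rough sketch of the schema the followers INSERT further below assumes, the table would look something like this (the column types are my guesses, not the actual definition):

CREATE TABLE twitter_followers (
    date date,
    follower_id varchar(30),
    name varchar(255),
    screen_name varchar(255)
);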

Another script queries the database table to create reporting datasets of new, active and dropped followers and list memberships, which are exported as csv files to an AWS S3 folder that also holds the files for an AWS S3 hosted static website.

The AWS S3 hosted static website uses the Chart.js and D3.js Javascript charting libraries to read and visualize the data. This post does not describe how to read the csv files, but I have written another post that describes using an AWS S3 csv file as a D3 report data source.

Screenshots of visualizations are shown below.

Daily Follower Counts

Weekly New and Dropped Follow Counts

Follower Duration Days

The Python code to retrieve the Twitter data, transform it and create csv files and upload them to AWS is below.

The code to retrieve Twitter followers:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get secrets
sys.path.insert(0, '/home/secrets/')
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_followers():

    twitter_followers = []
    
    for user in tweepy.Cursor(api.followers, screen_name="twitterusername").items():
        twitter_followers.append(user)
        
    for user in twitter_followers:
        query = "INSERT INTO twitter_followers \
                (date, \
                follower_id, \
                name, \
                screen_name) \
                VALUES \
                (%s, %s, %s, %s)";
        data = (today,
                str(user.id).strip(),
                str(user.name.encode('utf8','ignore')).replace(',','').strip(),
                str(user.screen_name.encode('utf8','ignore')).strip()
                )
        cursor.execute(query, data)
        conn.commit()

    conn.close()
    
    ## print job status to log
    print str(datetime.now()) + ' twitter followers'    
            
if __name__ == '__main__':
    get_followers()

The code to retrieve Twitter list memberships:

import sys, os
import csv
from datetime import datetime, date
import tweepy
from dateutil import tz
import psycopg2

## get database creds
sys.path.insert(0, '/home/secrets/') 
from secrets import secrets
from twitter_keys_tokens import twitter_api_keys

conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cursor = conn.cursor()

## twitter consumer key and secret
consumer_key = twitter_api_keys['consumer_key']
consumer_secret = twitter_api_keys['consumer_secret']

#get twitter auth
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

today = datetime.now().date()

def get_list_memberships():

    twitter_list_memberships = []
    
    for list in tweepy.Cursor(api.lists_memberships, screen_name="twitterusername").items():
        twitter_list_memberships.append(list)
        #print list.full_name
        
    for list in twitter_list_memberships:

        query = "INSERT INTO twitter_list_memberships \
                (date, \
                id_str, \
                name, \
                full_name, \
                member_count, \
                subscriber_count, \
                mode, \
                following, \
                user_screen_name) \
                VALUES \
                (%s, %s, %s, %s, %s, %s, %s, %s, %s)";
        data = (today,
                list.id_str.encode('utf8','ignore'),
                list.name.encode('utf8','ignore'),
                list.full_name.encode('utf8','ignore'),
                list.member_count,
                list.subscriber_count,
                list.mode,
                list.following,
                list.user.screen_name.encode('utf8','ignore'))
        cursor.execute(query, data)
        conn.commit()
        
    conn.close()
    
    ## print status for log
    print str(datetime.now()) + ' twitter list memberships'    
            
if __name__ == '__main__':
    get_list_memberships()

The code to create csv files and upload to AWS S3 bucket:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from datetime import datetime
from os import listdir
from os.path import isfile, join
import sys
import csv
import psycopg2
import re
from collections import Counter

upload_path = '/home/twitter/aws_s3_site/'

sys.path.insert(0, '/home/secret/')
from aws_keys import aws_keys
from secrets import secrets

## create aws S3 connection (named separately from the database connection below)
s3_conn = S3Connection(aws_keys['AWS_KEY'], aws_keys['AWS_SECRET'])
bucket = s3_conn.get_bucket('twitter-bucket', validate=False)

## create db connection
conn = psycopg2.connect(
    host = secrets['host'],
    dbname = secrets['dbname'],
    user = secrets['user'],
    password = secrets['password']
    )
cur = conn.cursor()

## get data sets from db

## followers
cur.execute ("select something here;")
followers = list(cur.fetchall())

## lists
cur.execute ("select something here;")
lists = list(cur.fetchall())

conn.close()

## write db data to csv files, save in upload folder

## followers
with open(upload_path + 'followers.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in followers:
        writer.writerow(row)

## lists
with open(upload_path + 'lists.csv', 'wb') as file:
    writer = csv.writer(file, delimiter = ',', lineterminator='\n')
    for row in lists:
        writer.writerow(row)
            	
## upload csv files to S3 twitter-bucket
    
upload_files = [f for f in listdir(upload_path) if isfile(join(upload_path, f))]

# delete existing csv keys in the bucket to reset
for key in bucket.list():
    if '.csv' in key.name:
        bucket.delete_key(key)
  
# complete uploading to AWS
for file in upload_files:
    k = Key(bucket)
    k.key = file
    k.set_contents_from_filename(upload_path + file)
    
## write success to cron log
print str(datetime.now()) + ' aws s3 twitter upload'

Installing Hadoop on Windows 8.1

It turned out to be quite simple to install Hadoop on Windows 8.1.

I have been using Hadoop virtual machine images from Cloudera, Hortonworks or MapR when I wanted to work with Hadoop on my laptop. However these VMs are big files and take up precious hard drive space, so I thought it would be nice to install Hadoop directly on my Windows 8.1 laptop.

windows81_hadoop
After some searching I found excellent instructions for installing Hadoop on Windows 8.1 on Mariusz Przydatek's blog.

Note that these instructions are not for complete beginners. There are lots of assumptions that the reader understands basic Windows environment configuration, the Java SDK, Java applications, and building binaries from source files. So beginners will also want to look at more detailed step-by-step instructions and use Mariusz's post as a guide to make sure they are doing things correctly.

The most important thing to note is that you do not need Cygwin to install Hadoop on Windows, even though other tutorials and how-to blog posts insist Cygwin is required.

Also, because you need to build Hadoop on your computer, you need MS Visual Studio; a trial version will do, since you don't need it after you build the Hadoop binaries. Other tutorials and how-to blog posts have some variation on which version of MS Visual Studio you need, but this blog makes it clear.

At a high level, the Hadoop installation follows these steps:

  • Download and extract Hadoop source files
  • Download and install packages and helper software
  • Make sure System Environment Path and Variables are correct
  • Make sure configuration files and package paths are ok
  • Execute command that uses packages and helper software to build Hadoop binaries (a sketch of this step follows the list)
  • Copy newly created Hadoop binaries into new Hadoop production folder
  • Execute command to run the new Hadoop binaries
  • Hadoop is running and ready to be used
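To give a flavour of the build step referenced above: the build is typically a Maven run from a Visual Studio / Windows SDK command prompt. A rough sketch only; the paths and versions below are illustrative and will vary by machine:

set JAVA_HOME=C:\Java\jdk1.7.0_79
set Platform=x64
mvn package -Pdist,native-win -DskipTests -Dtar

The resulting distribution under hadoop-dist\target (assuming the standard build layout) is what gets copied into the new Hadoop production folder in the next step.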

It's pretty obvious, but worth stating again, that this tutorial installs only a single-node Hadoop cluster, which is useful for learning and development. I quickly found out that I had to increase memory limits so it could successfully run medium-sized jobs.
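For the memory limits, the standard MapReduce memory settings are the usual place to start; as an illustrative sketch only (the values below are arbitrary, not the ones I used), something like this in mapred-site.xml:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>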

After Hadoop is successfully installed, you can then install Hive, Spark, and other big data ecosystem tools that work with Hadoop.

 

Vultr and Digital Ocean make it easy to get projects going fast

Recently I developed a Django web application for a friend. I wanted to be able to get it up and out onto the internet so she could view it and get others to do user testing.

Normally I host my web applications on my Webfaction account. But I wanted to keep this project separate so I went over to Digital Ocean and spun up a virtual server to host the Django web application.

We didn't have a domain name for this site yet, so we just used the Digital Ocean provided IP address to access the site, created a self-signed SSL certificate for Apache, and forced all traffic to HTTPS to enable secure browsing.
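The self-signed certificate part is the standard OpenSSL and Apache routine; a rough sketch, where the file paths and redirect target are placeholders rather than the exact configuration used:

# generate a self-signed certificate and private key, good for one year
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/apache-selfsigned.key \
    -out /etc/ssl/certs/apache-selfsigned.crt

# point the Apache SSL virtual host at the key and certificate, then in the
# port 80 virtual host force everything to HTTPS, e.g.:
# Redirect permanent "/" "https://your-droplet-ip/"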

I have used Digital Ocean for development purposes and to share projects with others from time to time. I have also used Vultr.

  • Digital Ocean – Use this link to open account and get $10 credit.
  • Vultr – use this link to open account and get $20 credit (summer 2016 promotion).

digitaloceanvultr

Both have monthly price plans that can also be pro-rated depending on how many days you use them so you can spin them up for a few minutes or days and pay accordingly.  I am using the basic plans on both which are good enough for my demonstration development web application purposes.

  • Digital Ocean – $5 plan includes 1 cpu, 512MB of RAM, 20GB SSD storage and 1000GB Bandwidth
  • Vultr – $5 plan includes 1 cpu, 768MB Memory, 15 GB SSD storage and 1000GB Bandwidth

Beware though that, unlike shared hosting services such as Webfaction, Digital Ocean and Vultr virtual servers are yours to own and manage. You get root access to the server and are responsible for managing the server, the OS, and how it is configured and secured, etc.

That is definitely something you should consider. There are lots of new things to learn. However there are tutorials and lots of help available on the internet, so I recommend at least attempting to set up either or both to learn first hand. The cost is small and the risk very low. The knowledge gained and benefits are well worth it.

Many comparison reviews seem to prefer Vultr over Digital Ocean for performance and features. However Digital Ocean was first and has done an excellent job of documenting how to get many web applications set up. But Vultr has good documentation too. You can't really go wrong with either for development projects. If you don't like one, just switch to the other!

 

 

Django form geocode submitted address to get lat, lon and postal code

One of my Django applications includes a form where a user can enter and submit a property address.

django form

The user submitting the form might not know the postal code so I left it optional. However the postal code is a key piece of information for this particular application so I wanted to ensure that I was getting it.

I also wanted to geocode the address immediately to get the address latitude and longitude so it could be shown to user on a Leaflet.js map.

There are lots of free geocoding services for low intensity usage, but I ended up using Google Geocoding, which is free under a certain usage level. You just need to create a Geocoding API project and use the credentials to set up the geocoding.

To interact with the geocoding API I tried the Python geocoding modules geopy and geocoder, but in the end just used the Python Requests module instead, as it was less complicated.

When the user clicks the Submit button, behind the scenes Requests submits the address to Google's Geocoding API and gets the JSON response containing the latitude, longitude and postal code, which are then written to the application database.

I will update the code in future to check if the user’s postal code is correct and replace it if it is incorrect. Will wait to see how the postal code accuracy looks. Making geocoding API requests too often could bump me over the free usage limit.

The Django View that contains the code described above is shown below.

## imports assumed at the top of the views module (Django 1.x era, matching the
## RequestContext usage below), plus the app's Property model and PropertyForm
import requests
from django.core.urlresolvers import reverse
from django.http import HttpResponseRedirect
from django.shortcuts import render
from django.template import RequestContext

def property_add(request):

    property_list = Property.objects.filter(user_id=request.user.id).order_by('created')

    if request.method == 'POST':
        form = PropertyForm(request.POST)
        if form.is_valid():
            new_property = form.save(commit=False)
            address = "%s, %s, %s, %s" % (new_property.address1, new_property.city, new_property.state, new_property.postcode)
            google_geocode_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'
            url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + "'" + address + "'" + '&key=' + google_geocode_key
            
            try:
                response = requests.get(url)
                geoArray = response.json()
                new_property.lat = geoArray['results'][0]['geometry']['location']['lat']
                new_property.lon = geoArray['results'][0]['geometry']['location']['lng']
                new_postcode = geoArray['results'][0]['address_components'][7]['long_name']
                new_fsa = geoArray['results'][0]['address_components'][7]['short_name'][:3]
            except:
                new_property.lat = None
                new_property.lon = None
                new_postcode = None
                new_fsa = None
           
            if new_property.postcode:
                new_property.fsa = new_property.postcode[:3]
            else:
                new_property.postcode = new_postcode
                new_property.fsa = new_fsa
           
            new_property.user_id = request.user.id
            new_property = form.save()
            return HttpResponseRedirect(reverse(property, args=(new_property.pk,)))
    else:
        form = PropertyForm()

    context_dict = {
        'form': form, 
        'property_list': property_list,
    }
        
    return render(
        request,
        'property_form.html',
        context_dict,
        context_instance = RequestContext(
            request,
            {
                'title':'Add Property',
             }
            )
    )    
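On the postal code follow-up mentioned above: rather than relying on a fixed index into address_components (which varies by address), a more robust approach is to match on each component's types list. A minimal sketch, with a helper name of my own choosing:

def get_postal_code(geo_result):
    ## geo_result is one entry from the Geocoding API 'results' list
    for component in geo_result.get('address_components', []):
        if 'postal_code' in component.get('types', []):
            return component['long_name']
    return None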

 

Leaflet.js choropleth map color by count using geoJSON datasource

I have a Django web application that needed an interactive map with shapes corresponding to Canadian postal code FSA areas that were different colors based on how many properties were in each FSA. It ended up looking something like the screenshot below.

map1

This exercise turned out to be relatively easy using the awesome open-source Javascript map library Leaflet.js.

I used this Leaflet.js tutorial as the foundation for my map.

One of the biggest challenges was finding a suitable data source for the FSAs. Chad Skelton, a (now former) data journalist at the Vancouver Sun, wrote a helpful blog post about his experience getting a suitable FSA data source. I ended up using his BC FSA data source for my map.

Statistics Canada hosts Canada Post FSA boundary files for all of Canada. As Chad Skelton notes, these have boundaries that extend out into the ocean, among other challenges.

Here is a summary of the steps that I followed to get my choropleth map:

1. Find and download FSA boundary file. See above.

2. Convert FSA boundary file to geoJSON from SHP file using qGIS.

3. Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

4. Create the Leaflet.js map in an HTML page: basically an HTML DIV that holds the map, plus a separate Javascript script that loads Leaflet.js and the FSA geoJSON boundary data and processes them to create the desired map.

Find and download FSA boundary file.

See above.

Convert FSA boundary file to geoJSON from SHP file using qGIS.

Go to http://www.qgis.org/en/site/ and download qGIS. It's free and open source.

Use qGIS to convert the data file from Canada Post or another source to geoJSON format. Lots of blog posts and documentation about how to use qGIS for this are just a Google search away.
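As an aside, if you prefer the command line, GDAL's ogr2ogr does the same SHP-to-geoJSON conversion; a rough example with placeholder file names (add -t_srs if you also need to reproject):

ogr2ogr -f GeoJSON fsa_boundaries.geojson fsa_boundaries.shp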

My geoJSON data source looked like this:


var bcData = {
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::4269" } },
    "features": [
    { "type": "Feature", "properties": { "CFSAUID": "V0A", "PRUID": "59", "PRNAME": "British Columbia \/ Colombie-Britannique" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -115.49499542, 50.780018587000029 ], [ -115.50032807, 50.77718343600003 ], [ -115.49722732099997, 50.772528975000057 ], [ -115.49321284, 50.770504059000075 ], [ -115.49393662599999, 50.768143038000062 ], [ -115.50289288699997, 50.762270941000054 ], [ -115.50846411599997, 50.754243300000041 ], [ -115.5104796, 50.753297703000044 ], [ -115.51397592099994, 50.748953800000038 ], [ -115.51861431199995, 50.745737989000077 ], [ -115.52586378899997, 50.743771099000071 ], [ -115.53026371899995, 50.74397910700003 ], [ -115.53451319199996,

 

Create Django queryset to create data source for counts of properties by FSA to be added to the Leaflet map layer.

I used a SQL query in the Django View to get count of properties by FSA.

This dataset looks like this in the template (see the fsa_array variable in the Javascript further below). These results have only one FSA; if it had more, it would have more FSA / count pairs.

Below is the code for the Django view query that creates the fsa_array FSA / counts data source.

cursor = connection.cursor()
cursor.execute(
    "select fsa, count(*) \
    from properties \
    group by fsa \
    order by fsa;")
fsas_cursor = list(cursor.fetchall())

fsas_array = [(x[0].encode('utf8'), int(x[1])) for x in fsas_cursor]
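The map Javascript below expects a fsas_array_safe variable; for completeness, one way to get the counts from the view into the template is to JSON-serialize them into the context and echo them inside the page's script block. A minimal sketch (the context variable name is an assumption of mine):

import json

## in the Django view, add the JSON-serialized FSA / count pairs to the template context
context_dict['fsas_array'] = json.dumps(fsas_array)

## then in the template's script block render it into the variable the map code reads:
## var fsas_array_safe = {{ fsas_array|safe }};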

My Javascript largely retains the Leaflet tutorial code with some modifications:

1. How the legend colors and intervals are assigned is changed but otherwise legend functions the same.

2. Significantly changed how the color for each FSA is assigned. The tutorial had the color value in its geoJSON file, so it only had to be referenced directly. My colors were coming from the View, so I changed the code to include a new function that matches FSAs in my Django view data with the geoJSON FSA boundary file and returns the appropriate color based on the Django view dataset count.


// example of what the template-rendered FSA / count data looks like (one pair)
var fsa_array = [["V3J", 19]];

var map = L.map('map',{scrollWheelZoom:false}).setView([ active_city_center_lat, active_city_center_lon], active_city_zoom);

map.once('focus', function() { map.scrollWheelZoom.enable(); });

// the actual data rendered into the template by the Django view
var fsa_array = fsas_array_safe;

L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw', {
    maxZoom: 18,
    attribution: 'Map data © OpenStreetMap contributors, CC-BY-SA, Imagery © Mapbox',
    id: 'mapbox.light'
}).addTo(map);

// control that shows state info on hover
var info = L.control();

info.onAdd = function (map) {
    this._div = L.DomUtil.create('div', 'info');
    this.update();
    return this._div;
};

info.update = function (props) {
    this._div.innerHTML = (props ?
        '' + props.CFSAUID + ' ' + getFSACount(props.CFSAUID) + ' lonely homes' 
        : 'Hover over each postal area to see lonely home counts to date.');
};

info.addTo(map);

// get color 
function getColor(n) {
    // assign a fill color based on the property count n
    return n > 30 ? '#b10026'
           : n > 25 ? '#e31a1c'
           : n > 20 ? '#fd8d3c'
           : n > 15 ? '#feb24c'
           : n > 10 ? '#fed976'
           : n > 5  ? '#ffeda0'
           : n > 0  ? '#ffffcc'
           : '#ffffff';
}

function getFSACount(CFSAUID) {
    var fsaCount;
    for (var i = 0; i < fsa_array.length; i++) {
        if (fsa_array[i][0] === CFSAUID) {
            fsaCount = ' has ' + fsa_array[i][1];
            break;
        }
    }
    if (fsaCount == null) {
         fsaCount = ' has no '; 
    }
    return fsaCount;
}

function getFSAColor(CFSAUID) {
    var color;
    for (var i = 0; i < fsa_array.length; i++) {
    if (fsa_array[i][0] === CFSAUID) {
        color = getColor(fsa_array[i][1]);
        //console.log(fsa_array[i][1] + '-' + color)
        break;
        }
    }
    return color;
}
    
function style(feature) {
    return {
        weight: 1,
        opacity: 1,
        color: 'white',
        dashArray: '3',
        fillOpacity: 0.7,
        fillColor: getFSAColor(feature.properties.CFSAUID)
    };
}

function highlightFeature(e) {
    var layer = e.target;
    layer.setStyle({
        weight: 2,
        color: '#333',
        dashArray: '',
        fillOpacity: 0.7
    });

    if (!L.Browser.ie && !L.Browser.opera) {
        layer.bringToFront();
    }

    info.update(layer.feature.properties);
}

var geojson;

function resetHighlight(e) {
    geojson.resetStyle(e.target);
    info.update();
}

function zoomToFeature(e) {
    map.fitBounds(e.target.getBounds());
}

function onEachFeature(feature, layer) {
    layer.on({
        mouseover: highlightFeature,
        mouseout: resetHighlight,
        click: zoomToFeature
    });
}

geojson = L.geoJson(bcData, {
    style: style,
    onEachFeature: onEachFeature
}).addTo(map);

var legend = L.control({position: 'bottomright'});

legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend'),
        grades = [0, 1, 5, 10, 15, 20, 25, 30],
        labels = [],
        from, to;

    for (var i = 0; i < grades.length; i++) {
        from = grades[i];
        if (i === 0) {
            var_from_to = grades[i];
            var_color = getColor(from);
        } else {
            var_from_to =  from + (grades[i + 1] ? '–' + grades[i + 1] : '+') ;
            var_color = getColor(from + 1);
        }
        
        labels.push(
            '<i style="background:' + var_color + '"></i> ' +
             var_from_to);
    }

    div.innerHTML = labels.join('<br>');
    return div;
};

legend.addTo(map);

That is pretty much all there is to creating very nice looking interactive free open-source choropleth maps for your Django website application!

Canadian Canola seed crushing more efficient at extracting canola oil

Statistics Canada regularly tweets links to various Canadian statistics. I have occasionally created quick Tableau visualizations of the data and replied with a link to my Tableau Public site.

The idea is to encourage Statistics Canada to start communicating more visually and create their own visualizations for all data. It's extra work, but the value will be realized when Statistics Canada visitors can understand the data instantly by looking at visualizations instead of munging about with boring data tables.

This particular tweet was in response to Statistics Canada data reporting on Canadian canola seed crush and canola oil and meal output. http://www5.statcan.gc.ca/cansim/a47

The data shows Canada's canola seed production and efficiency at extracting canola oil have increased significantly since 1971.

Canola seed is crushed to extract the oil and the seed meal is left over. The ratio of oil to meal is about .8 in 2016 compared to .3 in 1971. That is an impressive increase in oil extraction efficiency.

Chart.js tooltip format number with commas

Chart.js V2.0 is a useful Javascript charting library. It looks great and has a ton of features, though it is new enough that it still takes some work to figure out how to get some relatively simple things done.

In this case I wanted to format the chart's tooltips. Tooltips are the pop-ups that show when you hover the mouse over a bar or line in a chart and show the yAxis value along with any other information you want to include.

By default Chart.js tooltips do not format numbers with commas, and there is no simple option to do this.

After some Googling, I found out it requires using the Chart.js callbacks feature, which can be used to format chart elements. Note that V1 used a different method that modified a tooltip's template, but that is deprecated in V2.0.

The callback goes in the options' tooltips section. You put a function into the callback that uses a regex to insert commas.

callbacks: {
    label: function(tooltipItem, data) {
        return tooltipItem.yLabel.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
    },
},

This can be done as a global change to all charts in the page or just for a specific chart, which is what I used; a sketch of the per-chart setup is shown below.
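For the per-chart case, the callback just sits inside that chart's options. A minimal sketch; the chart type, element id and data object are placeholders, not the actual chart from this report:

var myChart = new Chart(document.getElementById('myChart'), {
    type: 'bar',
    data: chartData,  // your labels / datasets object
    options: {
        tooltips: {
            callbacks: {
                label: function(tooltipItem, data) {
                    // add thousands separators to the y value
                    return tooltipItem.yLabel.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
                }
            }
        }
    }
});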

The result is that the tooltip now has commas.

chartjs tooltip number comma



Python get image color palette

I created a web application that included screenshots of about 190 countries' national statistics agency website home pages. I created the site to host the results of comparisons of each country's website home page features.

One of the features I wanted to include was the top 5 colors used on each home page. For example here is an image of the Central Statistical Office of Zambia website home page.

zambia

I wanted a list of the top 5 colors used on the web page; in my case I wanted these as a list of rgb color values that I could save to a database and use to create a color palette image.

For example the Central Statistical Office of Zambia website home page top 5 rgb colors were:

  1. 138, 187, 223
  2. 174, 212, 235
  3. 101, 166, 216
  4. 93, 92, 88
  5. 242, 245, 247

These 5 colors were not equally distributed on the home page, and when plotted as a stacked bar plot and saved as an image, they looked like this:

zambia

How to identify top 5 colors on webpage

Identifying dominant colors in an image is a common task for a variety of use cases so there was a number of options available.

The general technique essentially involves reducing an image to a list of pixels, identifying each pixel's color and the relative proportion of that color to all other colors, and then taking the top 5 pixel counts to get the top 5 colors.

However a web page can have many similar colors, for example many shades of blue. So identification of colors involves a statistical categorization and enumeration to group similar colors into one category, for example grouping all shades of blue into one 'blue' category.

There are a couple of different statistical methods to do this categorization that are discussed below.

Modified Median Cut Quantization

This method involves counting image pixels by color and charting them on a histogram from which peaks are counted to get dominant colors.

This method is used in a Python module called color-thief-py. This module uses Pillow to process the image and applies modified median cut quantization.

In the code below I loop through a folder of screenshots, open each image and quantize it to get its dominant colors (the ColorThief import is there as an alternative, but the quantization here is done with Pillow). Then the top 5 colors' rgb strings are written into my database as a list so they can be used later.

    import os, os.path
    from PIL import Image
    import psycopg2
    from colorthief import ColorThief

    conn_string = \
        "host='localhost' \
        dbname='databasename' \
        user='username' \
        password='userpassword'" 
        
    conn = psycopg2.connect(conn_string)     
    cur = conn.cursor()

    ## dev paths
    screenshots_path = 'C:/screenshots/'

    screenshots_dir = os.listdir(screenshots_path)
    for screenshot in screenshots_dir:
        if screenshot != 'Thumbs.db':
            img = Image.open(screenshots_path + screenshot)
            width, height = img.size
            quantized = img.quantize(colors=5, kmeans=3)
            palette = quantized.getpalette()[:15]
            convert_rgb = quantized.convert('RGB')
            colors = convert_rgb.getcolors(width*height)
            ## sort by pixel count, descending, and keep just the rgb tuples
            color_str = str([x[1] for x in sorted(colors, reverse=True)])
            print screenshot + ' ' + str(img.size[1]) + ' ' + color_str
            
            cur.execute("UPDATE screenshots \
            set color_palette = %s,  \
            height = %s \
            WHERE filename like %s", \
            (str(color_str),img.size[1], '%' + screenshot + '%',))
            conn.commit()
        
    cur.close()
    conn.close()
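For comparison, calling color-thief-py directly (instead of Pillow's quantize) looks roughly like this; a minimal sketch, with a hypothetical file path:

from colorthief import ColorThief

## build a palette of the 5 dominant colors as (r, g, b) tuples
color_thief = ColorThief('C:/screenshots/zambia.png')  ## hypothetical path
palette = color_thief.get_palette(color_count=5)
print palette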

K-means clustering

Another statistical method to identify dominant colors in an image is to use k-means clustering to group image pixels by rgb values into centroids. The centroids with the highest counts are identified as the dominant colors.

This method is used in the process I found on this awesome blog post on pyimagesearch.com.

This method uses the Python module scikit-learn's KMeans to identify the dominant colors. It produced very similar results to the other method.

The code from the pyimagesearch website used the Python module matplotlib to create a plot of the palette and save it as an image. I modified the code to simply save the rgb color values as strings to insert into the database and to save the matplotlib-rendered palette plot as an image. I added plt.close() after each loop iteration to close the rendered plot after the plot image was saved, because if they aren't closed they accumulate in memory and crash the program.

# USAGE
# python color_kmeans.py --image images/jp.png --clusters 3

# import the necessary packages
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import argparse
import utils
import cv2
import os, os.path
import csv

# construct the argument parser and parse the arguments
#ap = argparse.ArgumentParser()
#ap.add_argument("-i", "--image", required = True, help = "Path to the image")
#ap.add_argument("-c", "--clusters", required = True, type = int, help = "# of clusters")
#args = vars(ap.parse_args())

screenshots_path = 'screenshots/'

screenshot_palette = list()

## Create csv file to write results to
file = csv.writer(open('palettes.csv', 'wb'))
file.writerow(['screenshot','palette'])

screenshots_dir = os.listdir(screenshots_path)
for screenshot in screenshots_dir:
    if screenshot != 'Thumbs.db':
        print screenshot

        # load the image and convert it from BGR to RGB so that
        # we can display it with matplotlib
        image = cv2.imread(screenshots_path + screenshot)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # show our image
        #plt.figure()
        #plt.axis("off")
        #plt.imshow(image)
        #plt.close()

        # reshape the image to be a list of pixels
        image = image.reshape((image.shape[0] * image.shape[1], 3))

        # cluster the pixel intensities
        clt = KMeans(n_clusters = 3)
        clt.fit(image)

        # build a histogram of clusters and then create a figure
        # representing the number of pixels labeled to each color
        hist = utils.centroid_histogram(clt)
        bar = utils.plot_colors(hist, clt.cluster_centers_)
        #color_palette = utils.plot_colors(hist, clt.cluster_centers_)
        #print color_palette
        #row = (screenshot, [tuple(x) for x in color_palette])
        #screenshot_palette.append(row)
        #print row
        
        #print row[0]
        #print row[1]
        
        # show our color bar
        plt.figure()
        plt.axis("off")
        plt.imshow(bar)
        plt.show()
        plt.savefig('palettes/' + screenshot)
        plt.close()
       
        #file.writerow([
        #        row[0].encode('utf-8', 'ignore'),
        #        row[1],
        #        ])

Excel Power Query tutorial using Canadian potato production statistics

This data comes from Statistics Canada.

1

Statistics Canada download pages often provide the opportunity to modify the data structure and content before it is downloaded.

For example, clicking on the [Add/Remove data] link provides options to select different groupings of the data, such as by province. I chose to group the data by province.

2

I also limited the content to only yield and production. There are other metrics available, though they are often just different presentations of the data, e.g. relative amounts by category, etc.

3

I selected option to download the data as a csv formatted text file.

This csv file data had to be transformed before it could be used for analysis.

Statistics Canada often includes non-data text such as report titles and descriptions in the first rows of text files and footer notes in rows below the actual data.

toptext

Before we can work with this data these title and footer rows have to be removed.

For this job I used Excel Power Query which can do the ETL. It can extract the data from the csv file, transform the data to a format that is amenable to analysis and load the data into a worksheet (or model) so it can be used for analysis.

Power Query has capabilities and features that match those of many more advanced ETL / integration software such as SSIS, Talend, Informatica, etc.

I have been encouraging and providing training to Excel users to move all of their ETL work into Power Query. It brings the advanced capabilities of these tools to the relatively non-technical desktop business user.

People too often copy and paste data from other sources into their Excel file.

With Power Query the original data file is linked in as a data source, so it remains untouched and can simply be refreshed to get new or modified data at any time, provided that the columns and content remain unchanged. This makes it easy to get updated time sequence data as it becomes available.

So back to our job. The first step was to link the csv data file downloaded from Statistics Canada into Power Query by using New Query – From File – From CSV.

01

Select the data file and click Import.

02

Here is where we begin to bump into data file format challenges. Power Query can’t automatically determine file format because there are non-data text strings at top (report titles, descriptions) and bottom (footer notes).

I would like to encourage Statistics Canada and any other data provider to avoid doing this. A data file should be in a predictable tabular (CSV, tab separated) or other format (JSON, XML) without any other content that needs to be processed out.

03

But no problem.  Power Query is equipped to deal with these situations.

Power Query assumed this is a csv file, but the top rows of titles and descriptions do not have commas, so it doesn't split the top rows into columns and as a result the split rows below are hidden.

One solution, which I used, was to go into the Power Query Source step settings and change the delimiter from Comma to Equals Sign. Since there are no equals signs, the data is not split into columns and remains in one column, from which we can remove the top and bottom non-data rows.

04

 

The Power Query Source step now looks like this.

= Csv.Document(File.Contents("C:\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252])

05

Now we can remove the non-data text from the data file. Select Remove Rows – Remove Top Rows and enter the number of rows to remove (in this case the top 7 rows are non-data rows).


= Table.Skip(#"Changed Type",7)

06

Do the same for the bottom rows but select Remove Rows – Remove Bottom Rows.

= Table.RemoveLastN(#"Removed Top Rows",12)

Now we can load the data to the worksheet and we will see one column of comma separated values which we just need to split to get nice columns of data.

07

Tell Power Query to split columns by comma.


= Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"})

08

Also tell Power Query to use the first row as headers. If the non-data text rows were not in the file, Power Query would have done this automatically.

= Table.PromoteHeaders(#"Changed Type1")

092

A key data transformation to get this data into a useful tabular format is to unpivot the years columns to rows. This is one of Power Query’s most useful transformations. Many data sources, especially in enterprise business, often have time series or other categories presented as columns but to do dimensional analysis these have to be unpivoted so they are in rows instead of columns.

Note that the Statistics Canada data formatting feature often includes an option to move time series to rows, which would be a similar transformation to this Power Query Unpivot.

Here is the pivoted data.

pivoted

And here is the unpivoted data. This now gives us unique columns of data that can be used by any analytical software.

unpivoted

 

I also did some other things to clean up the data:

  • Renaming column headers to be shorter, more readable.
  • Changing the dimension values to shorter strings.
  • Converting hundredweight values to pounds and multiplying production values by 1,000 to get full values.
  • Filtered out Newfoundland because it was missing 2013, 2014, 2015 values.
  • Removed the 2016 column which had no values.

Once loaded back to the worksheet we now have nice clean dimensional data ready to begin working with in any analytical software. This can be used as a data source for Tableau, Qlikview, Cognos, Birst, etc.


Let's take a quick look at the data using Excel Pivot Tables and Charts. Select the worksheet table and choose Insert – Pivot Table to create a new pivot table like the one below.


 

Select Insert Pivot Chart to create a new pivot chart from this pivot table.

In this case I selected a stacked bar chart that shows total Canadian potato production by province.

This does a good job of demonstrating each province's relative contribution to the total.

A stacked line chart does a better job of illustrating changes in production by provinces by year.


 

Here is a Tableau 9 report from the Power Query results.

 

Here is the complete Power Query M code, copied from the query, that gets and transforms the Statistics Canada raw csv data file into the transformed dataset ready to be used in Excel Pivot Tables/Charts, Tableau or any other reporting tool.

let
    Source = Csv.Document(File.Contents("C:\Users\bb\Documents\Dropbox\Data\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Removed Top Rows" = Table.Skip(#"Changed Type",7),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Removed Top Rows",12),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type1"),
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"2016"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each ([Geography] <> "Newfoundland and Labrador (2)")),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",{{"Attribute", "Year"}, {"Area, production and farm value of potatoes", "Dimension"}, {"Geography", "Province"}, {"Value", "Hundredweight Value"}}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Renamed Columns",{{"Hundredweight Value", type number}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type2","Average yield, potatoes (hundredweight per harvested acres) (5,6)","pounds per acre",Replacer.ReplaceText,{"Dimension"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Production, potatoes (hundredweight x 1,000)","production",Replacer.ReplaceText,{"Dimension"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)
in
    #"Added Custom"

Django recreate database table

Django’s makemigrations and migrate commands are very useful to update existing database tables to reflect model changes.

However, if you have renamed many existing table columns, makemigrations will ask you a series of 'y/N' questions about which column names have changed. This can be tedious to cycle through, especially if there are many changes.

Depending on the relationships your table has, it may be easier and quicker to:

  • Create a backup of the table by copying and renaming table or exporting table data to csv
  • Drop the table
  • Recreate table from scratch
  • Reload data into the new updated table

The question is how to recreate the table?

After you drop the table you can remove your table's model from the models.py file and then run makemigrations followed by migrate --fake, which is the special trick to get past migrate wanting your table to exist before it can delete it.

Then, after you run migrate --fake, you can put your updated model for the table back into models.py, run makemigrations and migrate again, and your new updated table will be recreated in the database, as sketched below.
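
Assuming a standard Django project layout, the full command sequence described above might look something like this sketch (the app, model and table names are hypothetical placeholders):

# 1. back up the data, then drop the table in the database (e.g. DROP TABLE myapp_mytable;)
# 2. remove (or comment out) the MyTable model from myapp/models.py, then:
python manage.py makemigrations myapp
python manage.py migrate --fake
# 3. put the updated MyTable model back into myapp/models.py, then:
python manage.py makemigrations myapp
python manage.py migrate
# 4. reload the backed-up data into the recreated table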

Then you can recover your data from the backup with SQL INSERT statements or by using a database data import feature.

 

Canadian TCS FDI Officers Twitter list member analysis

Blog post updated to add the following:

    • Updated Python “TCS Members Details” code to get additional information from the List members’ profiles.
    • New Python code “TCS Members Tweets” to download all List members’ Tweets.

————————————————–

The Canadian Trade Commissioner Service maintains a Twitter List named CDN TCS FDI Officers that has a bunch of Canadian Trade Commissioners as members. Investment Canada also has a similar list.

I was interested to learn how many people were on these lists, how long they had been on the lists, and what they were Tweeting about. To undertake this analysis, I used the Twitter API and Python Tweepy to retrieve data about list members including:

    • screen_name
    • name
    • followers_count
    • friends_count
    • statuses_count (# of tweets)
    • favourites_count
    • created_at (Twitter account create date)
    • account_age_days (calculated as # days from July 9, 2016)
    • listed_count
    • verified
    • profile_image_url
    • profile_sidebar_fill_color
    • profile_text_color
    • profile_image_url_https
    • profile_use_background_image
    • default_profile_image
    • profile_sidebar_border_color
    • profile_background_color
    • profile_link_color

You can find complete definitions of these fields over at Twitter.

The data was output as a simple csv file that I used to create a Tableau Public dashboard which is embedded below. The csv file is attached here.

The Tableau dashboard is interactive and can be sorted by any of the columns using the sort icon that appears when you hover over the top of a column, as illustrated in the screenshot below.

tcs_click_to_sort

 

 

 

 

It would be interesting to try to determine whether this Twitter activity has measurable impacts on FDI and Canadian trade. For example, perhaps foreign investment finds its way to Canada after someone reads a Tweet by one of our Trade Commissioners.

This would require that TCS maintains a CRM (client relationship manager) system and process that records lead sources.

There is some disparity in Twitter use among the CDN TCS FDI Officers list members, as shown by Tweets/Day, which is total Tweets divided by the number of days since the account was created. If there is a measurable lift in lead generation from Twitter use then this would be an actionable metric.
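
As a quick illustration of the metric (a hypothetical calculation using made-up numbers, not part of the code below):

## Tweets/Day = total tweets divided by account age in days
statuses_count = 5400     # hypothetical total tweets for a member
account_age_days = 1200   # hypothetical days since the account was created
tweets_per_day = statuses_count / float(account_age_days)
print tweets_per_day      # 4.5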

For the technically minded, the Python code is shown below. Note that you need a Twitter API account to use this code.

There is another file imported by the code (not shown here) that contains the Twitter OAuth credentials and looks like the following. You just have to replace the 'xxx…' values with your own credentials.

Here is the “TCS Member Details” code:

 

 #twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, 'path/to/folder/with/your/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need access token because not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
    #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    today = datetime.now().date()
            
    tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')

    member_details_csv = csv.writer(open('tcs_member_details.csv', 'wb'))
    member_details_csv.writerow([
        'screen_name',
        'name',
        'followers_count',
        'friends_count',
        'statuses_count',
        'favourites_count',
        'created_at',
        'account_age_days',
        'time_zone',
        'listed_count',
        'profile_image_url',
        'profile_sidebar_fill_color',
        'profile_text_color',
        'profile_image_url_https',
        'profile_use_background_image',
        'default_profile_image',
        'verified',
        'profile_sidebar_border_color',
        'profile_background_color',
        'profile_link_color'])

    members = []
    member_tweets = []
        
    for member in tcs_list_members.items():
        
        member_details_csv.writerow([
            member.screen_name.encode('utf-8', 'ignore'),
            member.name.encode('utf-8', 'ignore'),
            member.followers_count,
            member.friends_count,
            member.statuses_count,
            member.favourites_count,
            member.created_at,
            (today-member.created_at.date()).days,
            member.time_zone,
            member.listed_count,
            member.profile_image_url,
            member.profile_sidebar_fill_color,
            member.profile_text_color,
            member.profile_image_url_https,
            member.profile_use_background_image,
            member.default_profile_image,
            member.verified,
            member.profile_sidebar_border_color,
            member.profile_background_color,
            member.profile_link_color
            ])
            

It would be interesting to see what topics they Tweet about, which other Twitter users they mention, which webpages they link to, etc. So the next step is to loop through each list member's api.user_timeline to retrieve their Tweet content and do some analysis on it. For now here is the code; some analysis and visualization in Tableau will follow later.

Here is the “TCS Members’ Tweets” Python code:

    import sys, os
    import csv
    from datetime import datetime, date
    import tweepy
    from dateutil import tz

    ## get twitter auth key file
    sys.path.insert(0, '/path/to/your/folder/with/twitter_creds/')
    from ppcc_ca_app_key import keys

    ## this is consumer key and secret from the ppcc-ca app
    consumer_key = keys['consumer_key']
    consumer_secret = keys['consumer_secret']

    ## don't need to access token bc not tweeting on this timeline, just reading
    #access_token = keys['access_token']
    #access_token_secret = keys['access_token_secret']

    ## get twitter auth
    #auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    #auth.set_access_token(access_token, access_token_secret)
    #api = tweepy.API(auth)

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    today = datetime.now().date()

    def get_list_member_tweets():
        ## get invest_canada and cdn-tcs-fdi-officers list members
        tcs_list_members = tweepy.Cursor(api.list_members, 'invest_canada', 'cdn-tcs-fdi-officers')
      
        ## open csv file 
        member_tweets_csv = csv.writer(open('tcs_member_tweets.csv', 'wb'))
        
        ## write header row column names
        member_tweets_csv.writerow([
            'status_id',
            'date_time_UTC',
            'screen_name',
            'hashtags',
            'user_mentions',
            'tweet_text'
            ])
        
        ## loop through list members and get their tweets
        for member in tcs_list_members.items():
            ## get list member tweets
            member_tweets = get_member_tweets(member.screen_name)
            
            for status in member_tweets:

                ## reset hashtag and user_mention lists for each tweet so values
                ## from a previous tweet do not carry over into this row
                hashtags = []
                user_mentions = []

                ## collect tweet hashtags
                for hashtag in status.entities['hashtags']:
                    hashtags.append(hashtag['text'].encode('utf8','ignore'))

                ## collect tweet user_mentions
                for user_mention in status.entities['user_mentions']:
                    user_mentions.append(user_mention['screen_name'].encode('utf8','ignore'))

                ## write to csv file
                member_tweets_csv.writerow([
                    status.id,
                    str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)),
                    member.screen_name,
                    hashtags,
                    user_mentions,
                    status.text.replace('\n',' ').replace('\r',' ').encode('utf8','ignore')
                    ])

    def get_member_tweets(screen_name):
        ## get list member's tweets
        
        alltweets = []

        ## can only get max 200 tweets at a time
        new_tweets = api.user_timeline(screen_name = screen_name, count=200)
        alltweets.extend(new_tweets)

        ## get oldest tweet already retrieved
        ## guard against accounts with no tweets to avoid an index error
        if not alltweets:
            return alltweets
        oldest = alltweets[-1].id - 1

        ## iteratively get remaining tweets
        while len(new_tweets) > 0:
            new_tweets = api.user_timeline(screen_name = screen_name, count=200, max_id=oldest)
            alltweets.extend(new_tweets)
            oldest = alltweets[-1].id - 1
        
        ## print out member and # tweets retrieved to show progress
        print screen_name + " %s tweets downloaded" % (len(alltweets))
        
        ## return all tweets
        return alltweets

    if __name__ == '__main__':
        get_list_member_tweets()

Full code on Github:

https://github.com/sitrucp/tcs_fdi_list_twitter_data

Use Excel Power Query to scrape & combine Wikipedia tables

Power Query is a quick and easy way to scrape HTML tables on web pages. Here is a step-by-step guide to getting multiple tables from a Wikipedia article and appending them into one Power Query Excel table.

The Wikipedia article List of national and international statistical services has multiple tables with lists of countries’ statistics agencies and their website urls. These are separated by world region.

https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services

Step 1: Click the Power Query From Web menu icon and paste the url above into the URL address field and click OK.

stat_scrape_1

Step 2: Power Query finds and shows you a list of tables that are in this web page. Select the tables you want and click Load.

By default you can only select one of the tables.

stat_scrape_2

However you can check the Select multiple items checkbox and you can then select more than one of the tables.

stat_scrape_3

You can also select any of the tables to get a preview of the data they contain. After you have selected all the tables you want, click Load.

stat_scrape_4

After you click Load Power Query will create a new query for each of the selected tables.

stat_scrape_5

Step 3: Append all of the tables into one new combined dataset. You do this with the Table.Combine function, which is confusingly also called Append in the Power Query menu icons. The menu icon feature combines only two tables by default, however you can manually edit the resulting code in the formula bar to include all of the tables in the Table.Combine formula.

stat_scrape_6

After you update the Table.Combine formula you will have a new query with all of the tables combined into one dataset.

stat_scrape_7

You can refresh the queries to get any changes to the Wikipedia tables.

Here is the M code for the Power Query that gets a single Wikipedia table:

 

let
Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services")),
Data0 = Source{0}[Data]
in
Data0

Here is the M code for the Power Query that combines the tables into one dataset:

 
let
Source = Table.Combine({#"Africa[edit]",#"Americas[edit]",#"Asia[edit]",#"Europe[edit]",#"Oceania[edit]"})
in
Source

How to update Office 365 password in Power BI dataset refresh

I recently changed my Office 365 user password for an account that I was using for a Power BI Dataset Scheduled Refresh.

The result was that my Power BI Refresh failed which looked like screenshot below.

powerbi sharepoint dataset - update password-01

So all I had to do was update the authentication Power BI was using to access my Office 365 Sharepoint folder.

It was clear that the Edit credentials link was where I needed to update the password.

That link took me to the following page, where I selected oAuth2, which refers to the authentication that Power BI uses with my Office 365 user credentials.

powerbi sharepoint dataset - update password-02

Selecting oAuth2 popped up a new browser window where I could enter my Office 365 username and new password and authenticate Power BI.

powerbi sharepoint dataset - update password-1

After clicking Sign In I was returned to the Power BI page and the credential errors above were gone and I could successfully refresh the dataset from my Office 365 Sharepoint files.

powerbi sharepoint dataset - update password-4

The Refresh Schedule log showed the previous failed refresh attempt and the just completed successful refresh. Back in business!

powerbi sharepoint dataset - update password-3

How to avoid wide margins on a Power BI dashboard

A Power BI Report with multiple charts or other objects can be added to a Dashboard in Power BI Online using the pin to dashboard feature.

However this results in a dashboard with very wide margins. This is especially problematic on a mobile device as the screenshot from the Power BI Android application shows. There is a lot of wasted white space.

wide margins

The Desktop app view is a bit better but there is still a lot of white space around the edges.

wide margins desktop

 

The resolution, until the Power BI team makes the margins smaller or adds a feature to change the margin width, is to pin charts to the dashboard one by one so that they fill the full width.

The screenshot below highlights where you can pin your report to the dashboard using different pins.

You can select an individual chart's pin (the one on the right), which will give you a dashboard without the wide margins.

Using the pin on the top toolbar will add the entire report, with its multiple charts, to the dashboard, which results in the wide margins seen above.
pin to dashboard

How to use Google Adsense API to download Adsense data

Google's APIs make it easy to download Adsense (or any other Google service) data. The code below downloads Adsense data and saves the results to a csv file.

The code uses Google’s AdSense Management API, OAuth 2.0 authorization and the google-api-python-client SDK.

Overview

When you run this code for the first time it will open a web browser to get approval for the API application to access your Adsense account data.

When this approval is granted the code saves a refresh token that is used for all future executions of the code instead of requiring the web browser approval. This is handy if you want to run this code using a cron job like I do.

The code is shown below. In summary, it does the following:

  • Authenticates against API application
  • Queries the API to get the list of accounts
  • Loops through accounts
  • Returns requested metrics and dimensions from Adsense account
  • Writes results to csv file
 
import csv
from datetime import datetime, timedelta
from apiclient import sample_tools
from apiclient.discovery import build
from oauth2client import client

todaydate = datetime.today().strftime('%Y-%m-%d')

def main():    
    ## Authenticate and construct service
    scope = ['https://www.googleapis.com/auth/adsense.readonly']
    client_secrets_file = 'client_secrets.json' 
    
    ## Authenticate and construct service
    service, flags = sample_tools.init('', 'adsense', 'v1.4', __doc__, client_secrets_file, parents=[], scope=scope)
    
    ## Get accounts
    accounts = service.accounts().list().execute()
     
    try:
        ## Get account(s) data
        results = service.accounts().reports().generate(
            accountId='pub-xxxxxxxxxxxxxxxxxx',
            startDate='2012-01-01', # choose your own start date
            endDate= todaydate,
            metric=['EARNINGS', 'CLICKS','AD_REQUESTS'],
            dimension=['DOMAIN_NAME','DATE','AD_FORMAT_NAME']).execute()
    
    except client.AccessTokenRefreshError:
        print ('The credentials have been revoked or expired, please re-run the '
           'application to re-authorize')
           
    ## output results into csv
    header = ['hostname','date','type','earnings','clicks','ad_requests']
    
    with open('output_adsense.csv', 'wb') as file:
        file.write(','.join(header) + '\n')        
        for row in results.get('rows', []):  ## default to empty list if the report returned no rows
            file.write(','.join(row) + '\n')

    ## print status for log
    print str(datetime.now()) + ' adsense'
    
if __name__ == '__main__':
	main()

Create API Application, get client_secrets.json

As alluded to above, you need to create an API application. This is done in the Google Developers Console. If you haven't already signed up for this you have to do that first.

Create Project

If you haven't already, you will also have to create a Project. Your API application will live inside this Project. In the Developers Console you can create a new Project with the drop down at the top right.

create app - create project

 

Once you have a Project you can go to the Enabled APIs tab to select which Google service API(s) you want to include in the project. For this code’s purposes you can just select Adsense Management API.

Create Application – Select API Type

Use the Create credentials button to create your new application. (You are creating the credentials for your new app.)

When you click the button you get a choice of three options (see below). An important point that raises lots of questions: Adsense cannot use a Service Account to authenticate. A Service Account was what I had thought I could use to avoid user authentication, since I am running the code on my server as a cron job.

No problem though, because as you will see later, there is a one-time-only step to have the user authorize Adsense account access. From that point on, the authentication is done using a refresh token.

So my code above uses the OAuth client ID. Select that for your application too.

create app - choose auth type

Create Application – Select Application Type

Then you will be asked to choose the application type. For my purposes I did not want a web application, which would require real redirect URIs. I just wanted a simple plain app, so I chose Other.

create app - choose app type

You will then get a client secret and client id, which you can copy or retrieve later. You don't need these now because they are included in the client secret JSON file you download next.

So just change the default application name to something unique so you can identify it later.

Create Application – OAuth Consent Screen

This is not something critical but you have to provide an email and application name to show in the browser pop up for authentication.

The idea is to give a user (you) information about who is asking for permission to access your Adsense account (you are).

create app - consent screen

 

Create Application – Download Client Secret JSON File

Then click the download button to download the client_secret.json file. Once the JSON file is downloaded, click Create.

create app - download secret json

The JSON file downloads with a longer name like “client_secret-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json”. You will have to rename the JSON file to match the filename the code expects (the code above uses 'client_secrets.json').

Use Code and client_secrets.json File to Authenticate

Put the Python code above into a .py file and place it in a folder along with the client_secrets.json file.

Local server (desktop/laptop)

Run the .py file using the command line which should do the following:

  • Pop up a browser window on the Google login page asking you to sign in and allow access (or just allow access if you are already logged in).
  • The command line will advance and finish.
  • Creates a .dat file in your folder.

The .dat file name will be whatever you named your .py file. This .dat file holds your refresh token, which will be used to authenticate access in the future.

Remote Server

Copy your Python, client_secrets.json and .dat (refresh token) files to the server and run them there via cron jobs.

 

Code Variables

You may want to change the Python code to select different Adsense metrics and dimensions; I am only selecting a handful of them. You can check these out in more detail in the AdSense Management API Metrics and Dimensions documentation.
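
For example, here is a hypothetical variation of the generate() call above that pulls page views by country instead; verify the exact metric and dimension names against the AdSense Management API documentation before using them:

## hypothetical variation of the generate() call in the code above
results = service.accounts().reports().generate(
    accountId='pub-xxxxxxxxxxxxxxxxxx',
    startDate='2012-01-01',
    endDate=todaydate,
    metric=['PAGE_VIEWS', 'CLICKS', 'EARNINGS'],
    dimension=['DATE', 'COUNTRY_NAME']).execute()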

 

Create Python 3 Virtualenv on machine with Python 2 & 3 installed

I have been using Python 2.7 for most of my Python work but have a few projects where I want to use Python 3.x. In these cases I want to create a Virtualenv virtual environment with Python 3.x.

Turns out it is no problem to have Python 2.7 and Python 3.x installed on the same machine as long as they are installed in their own folders.

You just have to selectively call whichever one you want to use, and you choose which version to put in your Windows Path so that when you call 'python' at the command line you run that version by default. I have Python 2.7 in my Windows environment Path.

So to create a new Python 2.7 virtual environment just call Virtualenv normally:

c:\path\to\my\project>virtualenv venv

To create a new Python 3.x virtual environment just use the ‘-p’ switch and use the Python 3.x path and executable to have Virtualenv create new virtual environment with Python 3.x:

c:\path\to\my\project>virtualenv -p c:\python34\python.exe venv

Use OneDrive API to upload files to Office 365 Sharepoint Site

I have automated uploading files from my web site host’s server to my Office 365 Sharepoint site using scheduled cron jobs running Python scripts on my web host.

The Python scripts use Microsoft's Azure Active Directory Library (ADAL) to authenticate against Azure Active Directory (Azure AD or AAD), and the OneDrive API and Python Requests to use that authentication to upload the files to Sharepoint from my web host.

Here is the code

import adal
import urllib
import requests

## set variables
username = 'username@contoso.onmicrosoft.com' # your Office 365 user
password = 'password'
authorization_url = 'https://login.windows.net/contoso.onmicrosoft.com' # aka Authority
redirect_uri = 'https://login.microsoftonline.com/login.srf' # from Azure AD application
client_id = 'd84cbf4f-dc23-24d1-8a7d-08ff8359879a' # from Azure AD application
file_url = 'https://contoso.sharepoint.com/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content'

## use ADAL to create token response
token_response = adal.acquire_token_with_username_password(
        authorization_url,
        username,
        password
    )

## Use ADAL to create refresh token and save as text file to reuse
refresh_token = token_response['refreshToken']
refresh_token_file = open('refresh_token.txt', 'w')
refresh_token_file.write(refresh_token)
refresh_token_file.close()

## Get saved refresh token and use it to get new token response
refresh_token = open('refresh_token.txt', 'r').read()
token_response = adal.acquire_token_with_refresh_token(authorization_url, str(refresh_token))

## get access_token from token response JSON string
access_token = token_response.get('accessToken')

## create http header to send access token to authenticate
headers = {'Authorization':'BEARER ' + str(access_token)}

## example to upload file
upload_file = requests.put(file_url, data = open('myfilename.csv', 'rb'), headers=headers)

 

There are many things to consider when working with Microsoft’s APIs to work with its online services such as Office 365.

The first is how to authenticate. Microsoft is trying to move everyone to use Azure AD to do oAuth authentication. Microsoft services still have their own authentication methods, but for this exercise I used Azure AD.

The second is which API to use. Microsoft has recently released their Graph API, which is 'one endpoint to rule them all'. However Microsoft services still have their own APIs, so while the Graph API looks tempting, for this exercise I used the OneDrive API.

Azure AD Authentication

The authentication will be done in two parts.

  1. Create Azure AD application to do the authentication for the Microsoft service(s) you want to interact with.
  2. Use ADAL to interact with Azure AD to do the oAuth flow.

Setup Azure AD – create application

Microsoft provides free use of Azure AD for light authentication needs. You can register and create an account. Once you have your account you need to create a new application.

For my purposes I created an Azure AD native client application. Azure AD also has web application and web API types, but both require the user to enter a username and password in a web browser. The native client application technically requires this too, but I hacked past it by using ADAL user authentication and hard coding the username and password into the Python code. Since these are going onto my web host in a protected directory to run as cron jobs they will be safe.

I am not going to go through the details of creating an Azure AD application; there are some good blog posts and Microsoft does a good job of describing it. For example, take a look at this site, which has decent information about creating a new Azure AD application.

The Azure AD application lets you choose which Microsoft services it will be used to authenticate against. Confusingly, these are also called 'applications'. They represent Microsoft services such as Office 365 Sharepoint Online, OneNote, Power BI, etc., and are where you assign the permissions (also called 'scopes') that authentication will allow with that Microsoft service.

An Azure AD application might provide authentication for more than one Microsoft Service. But my native client application has only Windows Azure Active Directory permissions (which are there by default) and Office 365 Sharepoint Online permissions set to Read and write user files and Read and write items in all site collections.

After you have created your client application make sure to copy the client_id and resource_uri to use in code below. The client_id is automatically assigned and the resource_uri for a native client app can be any url and is just a unique identifier. I chose the Office 365 login url. The web applications need a real url because that is where the user will be prompted to enter credentials.

Azure Active Directory Library (ADAL)

Microsoft’s Azure Active Directory Library (ADAL) authentication libraries are created for developer’s to use with Azure AD. I used the ADAL Python SDK which was easily installed with pip install adal. 

The oAuth authentication flow can seem very complex but you don’t have to worry about that if you use ADAL. ADAL uses your Azure AD application credentials (client_id, resource_uri in case of native client application) to retrieve a token response which is a text string in JSON format.

This JSON string includes the actual access token that is used to authenticate accessing Sharepoint and upload the files. You can use Python to retrieve the access token (it is a Dictionary). Then you simply put the access token into a header that will be used in the Put Request as the method of passing the access token to the OneDrive API.

ADAL also takes care of refreshing tokens, which expire. In my case, where the scripts are running on the server as cron jobs, I want the token to refresh automatically. ADAL gets a refresh token that you can save to get a new access token when the previous one expires. I actually write the refresh token to a text file on the server and refresh the access token each time the code runs; I could instead refresh it only when the previous one expires.

OneDrive API

The OneDrive API has different configurations depending on whether you are using it to access a OneDrive Personal, OneDrive Business or Sharepoint Online account.

Be warned that the documentation for the OneDrive API can be very dense and there are different ways of presenting the required syntax. Of course the representations vary between the different SDKs too. Also there have been different versions of Microsoft's file storage services over the years, so I recommend focusing on the newest OneDrive API and making sure you are looking at documentation relevant to the newest version.

The Gotchas

ADAL Default Values

ADAL has default client_id and resource values that it uses for the username authentication. I changed these default values to match my Azure AD application.

Before changing these I was getting an Invalid audience Uri error

{"error":"invalid_client","error_description":"Invalid audience Uri 'https://management.core.windows.net/'."}

This error means the resource url used to create the token response was not the same as the one that the file was being uploaded to.

EDIT August 21, 2016: Microsoft has updated the ADAL library so that you can specify the client id and the resource value, because authentication against different services needs different client ids and resource endpoint urls. That means the hack I used below is no longer required. For more details see https://github.com/AzureAD/azure-activedirectory-library-for-python

In ADAL's __init__.py file, look for the _DefaultValues class at the bottom of the code and replace the default values (a sketch follows this list):

  • I changed client_id to my application’s client_id
  • I changed resource from https://management.core.windows.net/ to https://tenant.sharepoint.com/
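
A sketch of what the edited class might look like; the attribute names are assumptions based on the two values above, so check your installed ADAL version before editing:

## hypothetical sketch of the edited _DefaultValues in adal/__init__.py
class _DefaultValues:
    client_id = 'd84cbf4f-dc23-24d1-8a7d-08ff8359879a'  # your Azure AD application client_id
    resource = 'https://tenant.sharepoint.com/'         # was https://management.core.windows.net/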

The acquire_token_with_username_password function sets these parameters to None so that they fall back to the default values. The code could be changed so they accept values passed in instead.

Sharepoint Site and Folder Paths

The OneDrive API dev documentation https://dev.onedrive.com/getting-started.htm demonstrates the different service urls:

  • OneDrive – https://api.onedrive.com/v1.0
  • OneDrive for Business – https://{tenant}-my.sharepoint.com/_api/v2.0
  • SharePoint Online – https://{tenant}.sharepoint.com/{site-relative-path}/_api/v2.0

The {site-relative-path} notation indicates the Sharepoint site name. My site didn't have this because it was the default site; however, you might have to add your site-relative path.
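
For example, if the file lived in a site named 'mysite' rather than the default site, the file_url might look something like this (a hypothetical site path following the SharePoint Online pattern above):

file_url = 'https://contoso.sharepoint.com/sites/mysite/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content'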

Also the Sharepoint url for the file I was uploading looked like this:

https://contoso.sharepoint.com/Shared%20Documents/myfoldername/myfilename.csv

However you will note that the file_url in the code doesn’t make any reference to the Shared%20Documents:

https://contoso.sharepoint.com/_api/v2.0/drive/root:/myfoldername/myfilename.csv:/content

 

Careful! Don’t click “Try Power BI for free”

This was a weird quirk.

I have Power BI Free account and uploaded a report to Power BI Service. The report has dataset that gets data from a Sharepoint file.

In Power BI Service I went to the dataset's "Schedule Refresh", selected "Connect Directly", entered credentials as oAuth, then entered my Office 365 credentials. This set up the connection to the Sharepoint file successfully, and then I could switch "Keep your data up to date" to "Yes".

Then I accidentally clicked the “Try Pro for Free” button.

From that point on, every time I selected the Power BI Service dataset or the report, I got a pop up blocking message “To see this report upgrade to power bi pro”.

pro upgrade

The only way to make it stop was to switch the “Keep your data up to date” to “No”.

The only Pro feature involved is hourly updates; the free Power BI Service only allows daily updates. I hadn't selected hourly updates, so that wasn't the problem. Just some weird quirk.

The resolution was to delete the report and dataset that I had just scheduled the refresh for, upload the report again, and then redo the schedule refresh as per above (without accidentally clicking on the Try Pro for free button) to make it work again.

How to schedule Power BI dataset refresh

Do you want to create a Power BI Report that gets a daily scheduled refresh of data from a Sharepoint csv file?

The first step is to create your Power BI report in Power BI Desktop using the Sharepoint csv file as data source.

In Power BI Desktop use Get Data – File – Sharepoint Folder to connect to your Sharepoint Folder.

The resulting dataset query (Power Query) will look something like mine below. You will replace “mydomain” with your Sharepoint account name or domain.

You will also replace “datafile.csv” with your csv file name. The Power BI connection is to a Sharepoint folder which might have more than one file like I did. If you have only one file in the folder the filter will be redundant but can’t hurt.

let
Source = SharePoint.Files("https://mydomain.sharepoint.com", [ApiVersion = 15]),
#"Filtered Rows" = Table.SelectRows(Source, each ([Name] = "datafile.csv")),
#"Combined Binaries" = Binary.Combine(#"Filtered Rows"[Content]),
#"Imported CSV" = Csv.Document(#"Combined Binaries",[Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV")
in
#"Promoted Headers"

After you publish your report to your Power BI Online account you can select your newly uploaded dataset’s “Schedule Refresh” property where you can set up the refresh schedule.

schedule_refresh

 

First go to “Gateway connection”.

I selected "Connect Directly", which requires that you also enter Sharepoint credentials via the "Edit credentials" link, which pops up a web page prompting you to log into your Sharepoint account. This gives Power BI Service permission to access your Sharepoint account to refresh the file.

If you have an enterprise gateway setup you could try “Enterprise Gateway” and enter the required credentials for that.

 

gateway_connection

 

If you entered credentials correctly you should now be able to select the “Keep your data up to date” switch to “Yes”.

Then you can select which of the four 6-hour windows you want the refresh to run in. Power BI Service free accounts can do daily refreshes. Pro accounts can have hourly updates.

As an aside, be warned that if you click the "Try Pro for free" button you might get a blocking message that you are using a Pro feature. This happened to me and was clearly a quirky error. I had to delete my report and dataset, re-upload them and redo the scheduling to get rid of the error.

schedule_refresh_option

 

You can try refreshing the dataset manually (On demand) or wait for the next scheduled refresh (Scheduled) to see whether the data does refresh. You can see whether refreshes were successful and when they ran by clicking the "Refresh history" link.

refresh_history

 

 

Power BI Online – get data from Office 365 Sharepoint file

I want to create a Power BI Online report with a data source from a file on a remote web server that updates automatically so my Power BI report is always up to date.

Power BI Desktop and Online have lots of data connectors to third party ‘Online Services’ eg Salesforce, Mailchimp, Github, etc, as well as file and database connectors. But none of these help to get the file from my remote server directly.

There is no feature to connect to a file on a remote server. I could put my remote file data into a MySQL or Postgres database and Power BI could connect to those, but my remote server doesn't allow external connections to hosted databases, so that is not an option for me.

A Power BI Online report can get data from a Sharepoint site file that will update automatically on schedule.

Since I have an Office 365 E3 account, which has a Sharepoint site, I can upload my remote file to the Sharepoint site and create a Power BI Online report linked to that Sharepoint file.

I would still have to figure out how to automate uploading my remote server file to my Office 365 E3 account Sharepoint site, but I am pretty sure I can do that with the OneDrive API; more on that in Part 2.

Here is a diagram outlining what I think my solution could be.

powerbidatapath

In the meantime to test using Office 365 Sharepoint file as data source for a Power BI Online report, I created a Power BI report in Desktop with file data source from my Office 365 Sharepoint site and Published it to my Power BI Online account.

After publishing the report to my Power BI Online account, I logged into Power BI Online, opened the newly published report and went to data source options and selected ‘Schedule Refresh’ which produced screen below.

onedriverefresh

I set 'Keep your data up to date' to 'Yes' and selected 'Connect Directly', which gave me an error message telling me I had to update credentials.

Not surprisingly, the report I published to Power BI Online didn't 'remember' that I had already authorized the Power BI Desktop report to get the file from my Office 365 Sharepoint site, so I had to do it again in Power BI Online.

So I selected 'Edit Credentials' and then selected 'oAuth' as the type of credentials, which popped up the Office 365 login screen, where I entered my username and password, clicked login, and was returned back to the Power BI Online page.

The error message was gone so this must have created oAuth authentication to link the file data source from my Office 365 Sharepoint site into the data source in Power BI Online.

Then I set the automatic refresh to one of the daily 6-hour windows (below I have selected 12 pm to 6 pm) for the refresh to run (hourly refresh is a Power BI Pro feature).

The report data source now refreshes from my Sharepoint file on a daily automatic schedule, so it looks like I am halfway to my solution.

I will write another blog post detailing how I will automate moving my data file from my remote server to my Office 365 Sharepoint site. Pretty sure I will be using the OneDrive API https://dev.onedrive.com but there are other options too.

In the meantime take a look at the previous blog post summarizing the OneDrive and Sharepoint API options.

One challenge I have encountered so far is that the OneDrive Python SDK is made for web apps, and I want to set up a server app (native client app). More to come.

OneDrive API features

Microsoft has three file storage options:

    1. OneDrive Personal
    2. OneDrive Business
    3. Sharepoint

These have recently been unified into one new OneDrive API https://dev.onedrive.com and oAuth is preferred method of authentication.

However there are some key differences in how the API authenticates:

    • OneDrive Personal authenticates against oAuth account created at Microsoft Application Registration Portal using a Microsoft account (Live, Microsoft.com). Authentication url is: https://login.live.com/oauth20_authorize.srf
    • OneDrive Business and Sharepoint authenticate against oAuth account created in Azure Active Directory and must be done with Office 365 account.  Authentication url is: https://login.microsoftonline.com/common/oauth2/

You can create two types of applications that will have different methods and parameters:

    • Web application – web site based application that user can sign into. Require definition of an active redirect url and definition of client secret. Scopes or permissions are identified on the fly when authentication is made.
    • Native client application – Android, 'head-less' server app, etc. Requires only definition of a unique redirect uri and the scopes or permissions that the Office 365 account / user has available, eg read & write files, etc.

The process for authentication is similar:

    • Sign-in with user account/password or send authentication url to authentication server to get authentication code.
    • Server sends back url with authentication code.
    • Retrieve authentication code from url.
    • Send another url comprised of code and other parameters back to server to get tokens.
    • Use tokens to list, view, upload, download files etc.

There are development SDKs available for popular languages.

I was only interested in the Python SDK. Some key notes about it include:

    • It is created specifically for web applications, not native client applications. The SDK authentication method relies on using a web browser to pass urls with parameters, codes and tokens back and forth. A native client application will not use web browser. A work around was to use head-less browser but that is a bit hacky.
    • It only has OneDrive Personal authentication urls. To use this with OneDrive Business or Sharepoint these urls are easily replaced with the OneDrive Business authentication urls in two files: auth_provider.py and the onedrivesdk_helper.py.

The change to the unified OneDrive API and oAuth authentication only happened in late 2015 so this is pretty new.

There wasn't much well-developed or well-documented OneDrive Python example code available.

Note it is still also possible to work with OneDrive Business, Sharepoint and OneDrive Personal without using oAuth authentication and this new OneDrive API, by simply using urllib, requests and templating along with a hard coded Office 365 username and password in your code to authenticate.

Finally, the Microsoft Graph API can be used to interact with OneDrive Business, Sharepoint and OneDrive Personal once oAuth is set up.

 

Tracking Cuba Gooding Jr’s Twitter follower count

Happened to see Cuba Gooding Jr’s first tweet about 30 minutes or so after he created it.

Update: @cubagoodingjr is no longer active so not getting tweets from it any longer

At the time his profile said he had 559 followers. A few minutes later I refreshed his profile and saw the follower count had increased to 590 and every few minutes it kept increasing. By the end of the day he had about 4,000 followers.

I thought it would be interesting to track how his follower count changed going forward. So I used the Twitter API to get his follower count once per day, save the data, and then created a web page to visualize his follower count per day.

cubatweet

After 2 days Cuba had about 7,000 followers, which averaged out to about 175 new followers per hour. However, the number of new followers slowed down quickly to 30 or so per day, and after about 3 months he only gets about 10 new followers per day. In fact, some days he has a net loss of followers, i.e. more people unfollow him than follow him that day.

For the technically inclined, I setup an automatic retrieval of Cuba’s Tweets using Twitter’s API and the Tweepy Python module scheduled to run daily using a cron job.

The follower counts get put into a database. I created a PHP web page application to retrieve the data from the database and create a web page that uses the Google Charts API to show a simple line chart of Cuba's regularly updated follower count by day.

You can get the cron job and PHP web page code from my  Github repository. 

If you want to run this code yourself you will need a Twitter developer account and an OAuth file.
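
A minimal sketch of the daily retrieval step might look something like this (assuming Tweepy 3.x and a local SQLite database; the actual code in the repository may differ):

import sqlite3
from datetime import date
import tweepy

## authenticate with your own Twitter API credentials (placeholders here)
auth = tweepy.OAuthHandler('consumer_key', 'consumer_secret')
auth.set_access_token('access_token', 'access_token_secret')
api = tweepy.API(auth)

## get the current follower count for the account
user = api.get_user(screen_name='cubagoodingjr')

## append today's count to a local database table
conn = sqlite3.connect('followers.db')
conn.execute('CREATE TABLE IF NOT EXISTS follower_counts (day TEXT, followers INTEGER)')
conn.execute('INSERT INTO follower_counts VALUES (?, ?)', (date.today().isoformat(), user.followers_count))
conn.commit()
conn.close()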

Dell ecommerce web site scraping analysis

Once upon a time, I needed to find Dell monitor data to analyse.

A quick search brought me to their eCommerce web site which had all the monitor data I needed and all I had to do was get the data out of the website.

To get the data from the website I used Python (urllib2 and the BeautifulSoup module) to scrape the webpages and write the data to a csv file.

Based on the data I got from the site the counts of monitors by size and country are presented below.


 

However this data is probably not accurate. In fact I know it isn't. There were a surprising number of variances in the monitor descriptions, including screen size, which made it hard to get quick accurate counts. I had to do some data munging to clean up the data but there is still a bit more to do.

The surprising thing is that there do not appear to be specific data points for each of the monitor description components. This website is being generated from a data source, likely a database that contains Dell's products, and that database does not appear to have separate fields for each independent data point used to categorize and describe Dell monitors.

The reason I say this is that each monitor description is a single string of text. Within the text string are things like the monitor size, model, common name, and various other features.

These are not in the same order, and do not all have the same spelling or format, such as the use of text separators and lower or upper case.

Most descriptions are formatted like this example:

“Dell UltraSharp 24 InfinityEdge Monitor – U2417H”.

However, the many variations on this format are listed below. There is obviously no standardization in how Dell enters monitor descriptions for their ecommerce site.

  • Monitor Dell S2240T serie S 21.5″
  • Dell P2214H – Monitor LED – 22-pulgadas – 1920 x 1080 – 250 cd/m2 – 1000:1 – 8 ms – DVI-D
  • Dell 22 Monitor | P2213 56cm(22′) Black No Stand
  • Monitor Dell UltraSharp de 25″ | Monitor UP2516D
  • Dell Ultrasharp 25 Monitor – UP2516D with PremierColor
  • Dell 22 Monitor – S2216M
  • Monitor Dell UltraSharp 24: U2415
  • Dell S2340M 23 Inch LED monitor – Widescreen 60Hz Full HD Monitor

Some descriptions include the monitor size unit of measurement, usually in inches, sometimes in centimeters, and sometimes none at all.

Sometimes hyphens are used to separate description sections but other times the pipe character ( | ) is used to separate content. It's a real mishmash.

Descriptions do not have a consistent order of components. Sometimes the part number is after the monitor size, sometimes it is elsewhere.

The problem with this is that customers browsing the site will have to work harder to compare monitors taking into account these variances.

I’d bet this leads to lost sales or poorly chosen sales that result in refunds or disappointed customers.

I’d also bet that Dell enterprise customers and resellers also have a hard time parsing these monitor descriptions too.

This did affect my ability to easily get the data to do analysis of monitors by description categories because they were not in predictable locations and were presented in many different formats.

Another unusual finding was that it looks like Dell has assigned a default set of 7 monitors to a large number of two-character country codes. For example Bhutan (bt) and Bolivia (rb) both have the same 7 records, as do many others. Take a look at the count of records per country at the bottom of the page. Many countries have only 7 monitors.

Here is the step by step process used to scrape this data.

The screenshot below shows the ecommerce web site page structure. The monitor information is presented on the page in a set of nested HTML tags which contain the monitor data.

dell ecommerce screenshot

These nested HTML tags can be scraped relatively easily. A quick review revealed that the web pages contained identifiable HTML tags that held the data I needed. Those tags are named in the Python code below.

The website's url also had a consistent structure, so I could automate navigating through paged results as well as through multiple countries to get monitor data for more than one Dell country in the same session.

Below is an example of the url for the Dell Canada eCommerce web site’s page 1:

https://accessories.dell.com/sna/category.aspx?c=ca&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=1

The only two variables in the url that change for crawling purposes are:

  • The “c” variable was a 2 character country code eg “ca” = Canada, “sg” = Singapore, “my” = Malaysia, etc.
  • The “p” variable was a page number; a country's monitors are spread across pages of about 10 monitors each. No country I looked at had more than 5 pages of monitors.

Dell is a multi-national corporation so likely has many countries in this eCommerce database.

Rather than guess what they are I got a list of two character country codes from Wikipedia that I could use to create urls to see if that country has data. As a bonus the Wikipedia list gives me the country name.

The Wikipedia country code list needs a bit of clean-up. Some entries are clearly not countries but some type of administrative designation. Some countries are listed twice with two country codes; for example Argentina has “ar” and “ra”. For practical purposes, if the Dell url can't be retrieved for a country code in the list then the code just skips to the next country code.

The Python code I used is shown below. It outputs a csv file with the website data for each country with the following columns:

  • date (of scraping)
  • country_code (country code entered from Wikipedia)
  • country (country name from Wikipedia)
  • page (page number of website results)
  • desc (HTML tag containing string of text)
  • prod_name (parsed from desc)
  • size (parsed from desc)
  • model (parsed from desc)
  • delivery (HTML tag containing just this string)
  • price (HTML tag containing just this string)
  • url (url generated from country code and page)

The code loops through the list of countries that I got from Wikipedia, and within each country it also loops through the pages of results (while pagenum < 6:).

I hard coded the number of page loops to 6 as no country had more than 5 pages of results. I could have used other methods, perhaps looping until the url returned 404 or page not found, but it was easier to hard code it based on manual observation.

Dell eCommerce website scraping Python code

#-*- coding: utf-8 -*-
import urllib2
import urllib
from cookielib import CookieJar

from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

countries={
    'AC':'Ascension Island',
    'AD':'Andorra',
    'AE':'United Arab Emirates',
     ... etc
    'ZM':'Zambia',
    'ZR':'Zaire',
    'ZW':'Zimbabwe'
}

def main():

    output = list()
    todaydate = datetime.today().strftime('%Y-%m-%d')
    
    with open('dell_monitors.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['date', 'country_code', 'country', 'page', 'desc', 'prod_name', 'size', 'model', 'delivery', 'price', 'url'], delimiter = ',')
        writer.writeheader()
        
        for key in sorted(countries):
            country_code = key.lower()
            country = countries[key]
            pagenum = 1      
            while pagenum < 6:
                url = "https://accessories.dell.com/sna/category.aspx?c="+country_code+"&category_id=6481&l=en&s=dhs&ref=3245_mh&cs=cadhs1&~ck=anav&p=" + str(pagenum)
                #HTTPCookieProcessor allows cookies to be accepted and avoid timeout waiting for prompt
                page = urllib2.build_opener(urllib2.HTTPCookieProcessor).open(url).read()
                soup = BeautifulSoup(page)           
                if soup.find("div", {"class":"rgParentH"}):
                    tablediv = soup.find("div", {"class":"rgParentH"})
                    tables = tablediv.find_all('table')
                    data_table = tables[0] # outermost table parent =0 or no parent
                    rows = data_table.find_all("tr")
                    
                    for row in rows:
                        rgDescription = row.find("div", {"class":"rgDescription"})
                        rgMiscInfo = row.find("div", {"class":"rgMiscInfo"})
                        pricing_retail_nodiscount_price = row.find("span", {"class":"pricing_retail_nodiscount_price"})

                        if rgMiscInfo: 
                            delivery = rgMiscInfo.get_text().encode('utf-8')
                        else:
                            delivery = ''
                            
                        if pricing_retail_nodiscount_price:
                            price = pricing_retail_nodiscount_price.get_text().encode('utf-8').replace(',','')
                        else:
                            price = ''
                            
                        if rgDescription:
                            desc = rgDescription.get_text().encode('utf-8')
                            prod_name = desc.split("-")[0].strip()
                            try:
                                size1 = [int(s) for s in prod_name.split() if s.isdigit()]
                                size = str(size1[0])
                            except:
                                size = 'unknown'
                            try:
                                model = desc.split("-")[1].strip()
                            except:
                                model = desc
                                
                            # use the DictWriter so commas inside desc etc are quoted properly
                            writer.writerow({
                                'date': todaydate, 'country_code': country_code, 'country': country,
                                'page': pagenum, 'desc': desc, 'prod_name': prod_name, 'size': size,
                                'model': model, 'delivery': delivery, 'price': price, 'url': url
                            })
                    
                    pagenum +=1
                else:
                    #skip to next country
                    pagenum = 6 
                    continue

                
if __name__ == '__main__':
    main()


The Python code scraping output is attached here as a csv file.

The summary below lists each country code, country name and count of Dell monitor records scraped from the web page, using the country codes Wikipedia had for these countries.

af – Afghanistan – 7 records
ax – Aland – 7 records
as – American Samoa – 7 records
ad – Andorra – 7 records
aq – Antarctica – 7 records
ar – Argentina – 12 records
ra – Argentina – 7 records
ac – Ascension Island – 7 records
au – Australia – 36 records
at – Austria – 6 records
bd – Bangladesh – 7 records
be – Belgium – 6 records
bx – Benelux Trademarks and Design Offices – 7 records
dy – Benin – 7 records
bt – Bhutan – 7 records
rb – Bolivia – 7 records
bv – Bouvet Island – 7 records
br – Brazil – 37 records
io – British Indian Ocean Territory – 7 records
bn – Brunei Darussalam – 7 records
bu – Burma – 7 records
kh – Cambodia – 7 records
ca – Canada – 46 records
ic – Canary Islands – 7 records
ct – Canton and Enderbury Islands – 7 records
cl – Chile – 44 records
cn – China – 46 records
rc – China – 7 records
cx – Christmas Island – 7 records
cp – Clipperton Island – 7 records
cc – Cocos (Keeling) Islands – 7 records
co – Colombia – 44 records
ck – Cook Islands – 7 records
cu – Cuba – 7 records
cw – Curacao – 7 records
cz – Czech Republic – 6 records
dk – Denmark – 23 records
dg – Diego Garcia – 7 records
nq – Dronning Maud Land – 7 records
tp – East Timor – 7 records
er – Eritrea – 7 records
ew – Estonia – 7 records
fk – Falkland Islands (Malvinas) – 7 records
fj – Fiji – 7 records
sf – Finland – 7 records
fi – Finland – 5 records
fr – France – 17 records
fx – Korea – 7 records
dd – German Democratic Republic – 7 records
de – Germany – 17 records
gi – Gibraltar – 7 records
gr – Greece – 5 records
gl – Greenland – 7 records
wg – Grenada – 7 records
gu – Guam – 7 records
gw – Guinea-Bissau – 7 records
rh – Haiti – 7 records
hm – Heard Island and McDonald Islands – 7 records
va – Holy See – 7 records
hk – Hong Kong – 47 records
in – India – 10 records
ri – Indonesia – 7 records
ir – Iran – 7 records
ie – Ireland – 7 records
im – Isle of Man – 7 records
it – Italy – 1 records
ja – Jamaica – 7 records
jp – Japan – 49 records
je – Jersey – 7 records
jt – Johnston Island – 7 records
ki – Kiribati – 7 records
kr – Korea – 34 records
kp – Korea – 7 records
rl – Lebanon – 7 records
lf – Libya Fezzan – 7 records
li – Liechtenstein – 7 records
fl – Liechtenstein – 7 records
mo – Macao – 7 records
rm – Madagascar – 7 records
my – Malaysia – 25 records
mv – Maldives – 7 records
mh – Marshall Islands – 7 records
mx – Mexico – 44 records
fm – Micronesia – 7 records
mi – Midway Islands – 7 records
mc – Monaco – 7 records
mn – Mongolia – 7 records
mm – Myanmar – 7 records
nr – Nauru – 7 records
np – Nepal – 7 records
nl – Netherlands – 8 records
nt – Neutral Zone – 7 records
nh – New Hebrides – 7 records
nz – New Zealand – 37 records
rn – Niger – 7 records
nu – Niue – 7 records
nf – Norfolk Island – 7 records
mp – Northern Mariana Islands – 7 records
no – Norway – 19 records
pc – Pacific Islands – 7 records
pw – Palau – 6 records
ps – Palestine – 7 records
pg – Papua New Guinea – 7 records
pe – Peru – 43 records
rp – Philippines – 7 records
pi – Philippines – 7 records
pn – Pitcairn – 7 records
pl – Poland – 4 records
pt – Portugal – 7 records
bl – Saint Barthelemy – 7 records
sh – Saint Helena – 7 records
wl – Saint Lucia – 7 records
mf – Saint Martin (French part) – 7 records
pm – Saint Pierre and Miquelon – 7 records
wv – Saint Vincent – 7 records
ws – Samoa – 7 records
sm – San Marino – 7 records
st – Sao Tome and Principe – 7 records
sg – Singapore – 37 records
sk – Slovakia – 23 records
sb – Solomon Islands – 7 records
gs – South Georgia and the South Sandwich Islands – 7 records
ss – South Sudan – 7 records
es – Spain – 10 records
lk – Sri Lanka – 7 records
sd – Sudan – 7 records
sj – Svalbard and Jan Mayen – 7 records
se – Sweden – 6 records
ch – Switzerland – 21 records
sy – Syrian Arab Republic – 7 records
tw – Taiwan – 43 records
th – Thailand – 40 records
tl – Timor-Leste – 7 records
tk – Tokelau – 7 records
to – Tonga – 7 records
ta – Tristan da Cunha – 7 records
tv – Tuvalu – 7 records
uk – United Kingdom – 35 records
un – United Nations – 7 records
us – United States of America – 7 records
hv – Upper Volta – 7 records
su – USSR – 7 records
vu – Vanuatu – 7 records
yv – Venezuela – 7 records
vd – Viet-Nam – 7 records
wk – Wake Island – 7 records
wf – Wallis and Futuna – 7 records
eh – Western Sahara – 7 records
yd – Yemen – 7 records
zr – Zaire – 7 records

Grand Total – 1760 records

Power Query MySQL database connection

Excel Power Query can make a connection to a MySQL database, but it requires that you have the MySQL Connector/Net 6.6.5 for Microsoft Windows installed. Instructions for that are on the Microsoft site.

Once you have the connector you can get MySQL db table data into your Excel file using Power Query in two ways. The method you select will depend on how you want to work with the MySQL data you retrieve.

The choice is made after you have selected Power Query – From Database – From MySQL Database.

After you select From MySQL Database, you will see the MySQL Database connection popup.

Now you can make a choice. Either method requires you to specify the Server and Database. A direct connection gets you a single table (or view), but you can optionally enter a SQL Statement to effectively connect to multiple tables and views in one connection. A bonus is that the SQL query is pushed back to the server, so the Power Query client doesn’t have to do the work.

Method # 1. Native SQL query – when connecting you have the option to enter SQL, and if you enter a SQL query there you will get:

let
    Source = MySQL.Database("127.0.0.1", "database_name", [Query="select * from database_table_name"])
in
    Source

Method # 2. Power Query Navigation – instead of entering SQL, just leave that field blank, then continue on, and Power Query will present you with a list of the MySQL server’s database names (Schemas) and table names. Simply select the “Table” link in the database and table that you want and that will add a Navigation step and retrieve that table’s data.

let
    Source = MySQL.Database("127.0.0.1", "database_name"),
    database_table_name = Source{[Schema="database_name",Item="database_table_name"]}[Data]
in
    database_table_name

Either way you get the same table results in the Power Query.

Of course, if you want to join in other tables from your MySQL database(s), then method #1 will be more direct. Method #1 also assumes that you can use SQL to get what you want from the MySQL database tables.

However, you could also use method #2 to retrieve all desired tables (even from different databases on that server), and then use Merge or Append to get the desired results.

Method #2 allows you to retrieve and work with your MySQL database table data without using SQL, relying on Excel Power Query instead. That opens the door for relatively non-technical data workers to use the data, which is a pretty cool thing!

You can use these methods with MySQL databases on your local computer or on a remote computer. You just have to make sure to enter the correct server url, database name, user and password.

On occasion I have had challenges before I could get a remote MySQL database connection to work. I had to clear the Power Query cache, update Power Query, and turn off the Privacy option to make the connection work.

How to serve multiple domain name websites with one CakePHP application

Say you need to deliver multiple websites that have the same structure but are each served on a different domain name with different content.

For example, perhaps you are selling a product in many different countries and you want to have a separate website for each country.

This solution allows you to use one CakePHP application and one MySQL database to serve multiple websites each with different domain and content.

The domains need to be on a hosting account that allows multiple domain hosting and provides some way of pointing multiple domains to a single application or route.

You would create your CakePHP application as normal but modify it as per this solution so that it can identify the domain the request is coming from and then serve up content for that domain only!

The modifications are summarized here and the complete PHP code is in Github:

    • Create a table called ‘domain’ (or whatever you want, with as many columns as you need) that will hold domain specific values to be retrieved and used in the application. At a minimum it should have each site’s domain so you can do a lookup to match it to the domain being requested.
    • Add the domain table foreign key (domain_id) to other tables so that they can be filtered by domain when retrieving data. (This likely means additional steps to ensure that the foreign key is written to your tables. How you do that is up to you, but for example if you allow users to create accounts or post comments then you can simply record the domain_id, for example from your CakePHP Configure::read value, when a new user or comment is added.)
    • Create a new function (I called mine ‘getDomainSettings‘) called from the CakePHP AppController beforeFilter section. This function reads the domain name from the current site visitor request and uses it as a parameter to filter database records to just that domain’s values. I used the built-in CakePHP Configure::write variables for these domain values so they can be used throughout the CakePHP application.
    • Add optional conditions when retrieving model data using these CakePHP Configure::write variables in other CakePHP controllers to retrieve model records specific to the current domain, and modify the view presentation based on those records. There is lots of potential here; you could also swap out CSS styles or CakePHP themes based on the domain and make the sites totally different.

Always remember to start Python Virtualenv

When learning to develop Django and Flask Python applications using a virtual environment I did not know that I should activate the virtual environment before running the application. Well, more precisely, I did know the virtual env should be activated, but I assumed it would be activated somehow auto-magically.

I was using MS Visual Studio 2015 RC and its new Python Tools which work fine by the way. One can install new Python/Flask/Django/etc modules via the IDE.

This was also the first time I was using Python virtual environment. I hadn’t previously done any Python web development. I had used Python for analytical purposes in files, command line and iPython but all of these used the ‘native’ operating system Python installation.

In MS VS 2015 RC, after Python Tools are installed, there are templates to create new Django or Flask projects, and one of the steps asks if I want to install virtual env or not.

So after a bit of searching and reading I realized virtual env is the way to go. It’s better for many reasons to have a standalone Python environment for each new Python/Flask/Django project.

I just assumed that since I had created my new Python application with a virtual env that when I opened Visual Studio and started working on it, it would be in virtual environment by default somehow auto-magically.

But no, the virtual environment has to be activated manually each time the project is opened or before being able to interact with the project via web browser. So remember to activate your virtual env before running your Python/Flask/Django application.

What misled me was that running the application without first activating the virtual environment can often appear to be ok, because the native operating system Python installation has the required modules, so the application runs just fine.

But I ran into problems after installing new Python modules, only to see the application complain that they weren’t available, eg an error message in the browser and from the server in the command line saying ‘no module named xxx’. This was confusing because I was thinking, hey, I just installed that.

So remember activating the virtual env before running the Python application is required. Now it is second nature to do this.

To activate the virtual env in Windows simply navigate to the ‘Scripts’ folder in your virtual env folder and run the ‘activate.bat’ file.

In command line enter ‘activate’ in that folder. Or you can enter the full path to the activate.bat from anywhere in command line.

You can easily see if virtual env has been started because when it is you will see ‘(env)’ at the start of the command line.
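
If you want to double check from inside Python itself, here is a small sketch. It assumes the classic virtualenv tool, which sets sys.real_prefix when a virtual env is active; the newer built-in venv instead makes sys.prefix differ from sys.base_prefix.

import sys

def in_virtualenv():
    # classic virtualenv sets sys.real_prefix; venv makes sys.prefix != sys.base_prefix
    return hasattr(sys, 'real_prefix') or getattr(sys, 'base_prefix', sys.prefix) != sys.prefix

print(in_virtualenv())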

Then you can go back to your application folder with the ‘runserver.py’ (or whatever you call it) and then start the application!

Flask SQLAlchemy relationship associations and backref

Flask-SQLAlchemy relationship associations are useful to easily retrieve related records from a related Model.

In a Python Flask application I wanted to use a many-to-many relationship between Photos and Tags (eg many tags can be assigned to one or more photos).

To do this I had the following things set up in my application:

  • 3 database tables
    • Photos
    • Tags
    • Photo_Tags
  • 3 model.py sections
    • Photos
    • Tags
    • photo_tags
  • 1 views.py section ‘photo’
  • 1 template ‘photo.html’

The models looked like this:


# the association table is defined first so it can be referenced by Photos
photo_tags = db.Table('photo_tags',
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.id')),
    db.Column('photo_id', db.Integer, db.ForeignKey('photos.id'))
)

class Photos(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    filename = db.Column(db.String(100))
    tags = db.relationship('Tags', secondary=photo_tags, backref='photos')

class Tags(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tagname = db.Column(db.String(100))

The ‘tags’ line in the Photos model above defines the relationship between the Photos and Tags models. SQLAlchemy wants us to add this ‘tags’ association to only one of the two models involved (more on that below).

The ‘tags’ relationship is physically recorded in the database table ‘photo_tags’, which doesn’t get a ‘class Model’ like Photos and Tags have. Instead, it is set up as shown above as a plain table with foreign keys to the Photos and Tags tables.

Key things to note about the ‘tags’ reference in the Photos model:

  • ‘Tags’ model is related model
  • ‘photo_tags’ is referenced as the ‘secondary’ or relationship table.
  • ‘photos’ is referenced as the ‘backref’. The significance of this is that you don’t have to add a matching relationship to the Tags model. The ‘backref’ lets you reference Photos and Tags in both directions (see the short sketch after this list), eg you can use it to get
    • Tags related to a specific Photo
    • Photos related to a specific Tag
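
For example, here is a minimal sketch of the two-way lookup. It assumes the models above, a running application context, and that a photo and a tag with id 1 already exist in the database.

# Sketch only: assumes the Photos/Tags models above and some existing rows.
photo = Photos.query.get(1)
print(photo.tags)    # the Tags objects related to this photo

tag = Tags.query.get(1)
print(tag.photos)    # the backref: Photos related to this tag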

An example of a view that can use the relationship established above follows. This view retrieves all tags for a specific photo:


@app.route('/photo/<int:requested_photo_id>')
def photo(requested_photo_id=None):
    requested_photo = Photos.query.get(requested_photo_id)
    return render_template('photo.html', requested_photo=requested_photo)

Here is what happens in this view:

  • The requested_photo_id is retrieved from the url.
  • The SQLAlchemy query Photos.query.get(requested_photo_id) uses that requested_photo_id to retrieve the photo from the Photos model/MySQL database table, and its related tags.
  • The retrieved requested_photo object is passed to template ‘photo.html’ along with the related tag info.
  • Finally it is in the photo.html template where everything comes together. The template below is simple HTML page that presents the requested_photo_id and its related tags.






Photo Details

Photo ID: {{ requested_photo.id }}

Photo file name: {{ requested_photo.filename }}

{% for tag in requested_photo.tags %}

tag.id: {{ tag.id }}, tag.tagname: {{ tag.tagname }}

{% endfor %}

The requested_photo id and filename come from the Photos Model/MySQL table.

The photo’s related tags come from the requested_photo ‘tags’ association by referencing it as ‘requested_photo.tags’ (this is the awesome part) and then looping through the tags to retrieve each tag so they can be printed on webpage individually.

Using ATMs as wifi hotspot

Wirebase was a concept created in 2001, before broadband, wired or wireless, was widely available.

wirebaselogo_top
wirebaselogo_bottom

The concept was shopped around to a bunch of Vancouver area VCs and potential investors but ultimately got lost in the dot-com bubble crash.

The idea was to create ‘wirebases’ that were a modular component integrated into an automated banking machine or deployed as a stand-alone unit in a public area.

At the time I was contracting with a western Canadian credit union banking services bureau that was in the process of building its first national automated banking machine switching system, so we were privy to the details of ATMs and banking machine switching. Automated banking machines have computers inside. At the time they were regular PCs and could have peripherals connected just like any ordinary PC, eg via internal slots, etc.

The Wirebase set-up is illustrated in diagram below.

overview diagram

The Wirebase module is connected to the internet via a proprietary network solution that is similar to that used by the global ABM network.

A mobile device (Palm Pilot [LOL, remember this was 2001!!], digital camera, portable MP3 player, portable computer, etc) can connect to the Wirebase.

A copy of the website (long since gone) is maintained at the Internet Archive:

https://web.archive.org/web/20010227065841/http://wirebase.com/

Our business plan and presentation can be downloaded below:

WireBase Biz Plan Slideshow

WireBase Short Biz Plan

Get list of custom segments from Google Analytics API

This is a post on how to create custom Google Analytics Profile Segments for the purpose of removing referral spam (and there is increasingly so much of it!) from GA reporting.

However if you want to use these new Custom Segments to filter results using Google Analytics API with a Service Account there are some challenges.

If you are retrieving GA results for many web sites you need to get the GA API to loop through each site’s Views / Profiles in your GA Account to retrieve the data for each.

The challenge is that each Profile has its own Custom Segment. In order to filter out referral spam completely, two types of filters are required: the ‘exclude’ filter, which is the same for all Profiles, and the ‘include’ filter, which is specific to each Profile as it refers to the Profile’s domain.

So that makes looping through each Profile a bit more challenging. You need a dictionary of each Profile’s Custom Segment Id so it can be applied for each Profile’s data.

These Custom Segment Id’s look something like “gaid::BXxFLXZfSAeXbm4RZuFd9w”

The Custom Segment Id needs to be used in the service.data.ga().get().execute() criteria.


data = service.data().ga().get(
    ids=ids,
    start_date="2015-07-01",
    end_date="2015-07-19",
    segment="gaid::BXxFLXZfSAeXbm4RZuFd9w",
    metrics=metrics
).execute()

It wasn’t easy to find these Custom Segment Id’s. First I tried looping through the segments() as follows:


    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location,
                          service_account_email)

    segments = service.management().segments().list().execute()

    for segment in segments.get('items', []):
        print 'Segment ID ' + segment.get('id') + " - " + segment.get('name')

But that only retrieved the Standard Google Segments, not the Custom Segments; apparently retrieving Custom Segments is not possible with a Service Account.

So I found that you are able to see the Custom Segment Ids in the Query Explorer at https://ga-dev-tools.appspot.com/query-explorer/.

But while you can see the Custom Segments here it wasn’t very helpful as you have to go one by one in the Segments criteria field. If you have many sites it will be time consuming.

Then I finally found the “stand alone explorer” at the bottom of the GA API Segments documentation page.

https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/segments/list#try-it

This outputs a json file containing all of the Segment details. Unfortunately this isn’t a ready-made dictionary, as it only has the segment details, not the account id. But it does have the Custom Segment Ids, which can be used to create a manual dictionary of Account Id and Segment Id that can be used in the loop.
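
For example, a minimal sketch of that manual dictionary approach might look like the following. The profile ids and segment ids below are made-up placeholders, and the service object is the one created by get_service earlier.

# Hypothetical hand-built mapping of GA profile (view) id to its Custom Segment id,
# copied from the stand alone explorer output.
profile_segments = {
    '12345678': 'gaid::AAAAAAAAAAAAAAAAAAAAAA',
    '87654321': 'gaid::BBBBBBBBBBBBBBBBBBBBBB',
}

for profile_id, segment_id in profile_segments.items():
    data = service.data().ga().get(
        ids='ga:' + profile_id,
        start_date='2015-07-01',
        end_date='2015-07-19',
        segment=segment_id,  # apply that profile's Custom Segment
        metrics='ga:sessions'
    ).execute()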

Perhaps it might also be possible to do a reverse lookup and find the Custom Segment Id by looping through the Segments and matching on the name.

Hope that helps someone!

Display Django queryset results on Google Map

I wanted to be able to show locations by latitude and longitude with data from Django 1.8 website on a Google Map using the Google Maps API.

I used the accepted answer from a stackoverflow question to create a Google Map with multiple markers that had the store name with a link that would open that store’s details when clicked.

I did the following:

  • copied the stackoverflow solution javascript and html code into new Django template called fountain_map.html
  • created new Django view called fountain_map for that template
  • created a new urls.py entry to route the fountain_map url to the new view/template

The stackoverflow answer used Google Maps javascript that had a javascript array like this:

    var locations = [
      ['Bondi Beach', -33.890542, 151.274856, 4],
      ['Coogee Beach', -33.923036, 151.259052, 5],
      ['Cronulla Beach', -34.028249, 151.157507, 3],
      ['Manly Beach', -33.80010128657071, 151.28747820854187, 2],
      ['Maroubra Beach', -33.950198, 151.259302, 1]
    ];

However while the example has this hard coded list of locations I wanted a dynamic list populated by queryset records from the new view.

So I created a queryset in the view that retrieved the location records:

   

map_points = Fountains.objects.filter(lat__isnull=False)

Note that I filtered to only retrieve records that had a lat value so I wasn’t sending records that couldn’t be mapped.
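
For context, here is a minimal sketch of the fountain_map view. The view and template names come from the list above; the render call style and the assumption that Fountains has name, lat and lon fields are mine.

# views.py - sketch of the map view described above
from django.shortcuts import render
from .models import Fountains

def fountain_map(request):
    # only send records that actually have coordinates
    map_points = Fountains.objects.filter(lat__isnull=False)
    return render(request, 'fountain_map.html', {'map_points': map_points})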

Since the queryset object is not immediately readable by the javascript as the location variable, it needed to be transformed into a format acceptable for the javascript.

There are a couple of options:

  • Use Django’s serialization to turn it into JSON (see the short sketch after this list)
  • Loop through queryset object and manually build the array in correct format, this could be done in the view or in the template
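
Here is a small sketch of the first option, serializing the queryset to JSON in the view. It uses json plus Django’s DjangoJSONEncoder to cope with Decimal lat/lon values; the locations_json helper name and field names are assumptions.

# Sketch of option 1: hand the template a ready-made JSON string instead of a queryset.
import json
from django.core.serializers.json import DjangoJSONEncoder

def locations_json(map_points):
    # values() returns plain dicts; DjangoJSONEncoder copes with Decimal and date types
    points = list(map_points.values('name', 'lat', 'lon'))
    return json.dumps(points, cls=DjangoJSONEncoder)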

I chose to do this transformation in the template. Django’s serialization has lots of documentation and lots of SO questions and answers, but it seemed easier to do this in the template for now.

So in the template I simply looped through the map_points queryset object to create the array that the var locations required.

The javascript required the square brackets as shown in example above along with quotes around the location name.

Note that the Stack Overflow answer also has a digit as the fourth item in each record, but I excluded that in mine. I’m not sure what it was, but the user obviously wanted to show it in the marker label or something like that.

Anyways my template loop looked like this:

      var locations = [
        {% for point in map_points %}
            {% if point.lat == None %}
            {% else %}
              ['{{ point.name }}', {{ point.lat }}, {{ point.lon }}],
            {% endif %}
        {% endfor %}
        ]

You can see that I retrieved the following values for the locations array from the queryset results:

  • name (fountain name to show on marker label popup)
  • id (so that it could be used to create marker link to the Django view for that store)
  • lat
  • lon

That was all I needed to do and it gave me a Google Map showing each fountain’s location with a Google red pin marker. When a user clicked on a marker, the fountain name would show with a link to that fountain’s detail page.

google map

How to setup and and use a Twitter API application

The Twitter API allows you to read and write tweets on your own Twitter account.

But you can also create an ‘app’ that can be granted permission by other Twitter accounts to access their information and even tweet on their behalf.

If you have ever granted this permission you will see the app in your Twitter account’s Settings, Apps options where you can see what apps have been granted permissions as well as revoke permissions.

This comes in handy if you have multiple Twitter accounts and you want to use the Twitter API to tweet to all of them but do not want to create multiple Twitter apps. Instead you can create one Twitter app and grant it permission to use all of the other Twitter accounts.

This is also very handy because a Twitter app must be associated with a mobile phone number. It is rare that you will have multiple mobile phone numbers.

Note there is nothing special about the ‘dev’ account. It is a normal Twitter account; it is just the account you choose to create the Twitter app in, and that app will tweet on all of your accounts’ behalves.

Step 1. Make sure your ‘dev’ account has a mobile phone on the account otherwise you cannot proceed.

Step 2. While logged into your dev account, open your web browser in another tab to https://apps.twitter.com/ which will open the Twitter API app management page.

Step 3. If you haven’t already set up an API app there will be nothing to manage and it will simply show you a ‘create an app’ button. Click that button and follow instructions to setup the API app.

Step 4. Save your new API app’s consumer_key, consumer_secret, access_token, access_token_secret credentials for use in following steps.

Step 5. Log out of your dev account once the dev API app is setup and running properly.

Step 6. Log into another Twitter account, one that you want to add the dev account’s API app to as a 3rd party app. You will successively log into each additional Twitter account that you want to use the dev API app with. Note you do not have to log in in advance, as the code you run in the next steps will pop up a web page prompting you to enter the user and password for the Twitter account you want your new API app to work with.

Step 7. Go get this code, which will do some magic to grant permission for your new Twitter API app to tweet on behalf of your other Twitter accounts. Save this code on your computer and keep it open in your code/text editor so you can make some changes.

Step 8. Get your app’s consumer_key, consumer_secret, access_token, access_token_secret credentials that you got from Twitter API management and paste them into the appropriate place in code you just downloaded and saved.

Step 9. Now run the code from the command line. It will print a URL that contains tokens generated specifically for your app and the other Twitter account. You just have to copy and paste the url into a web browser, so copy the url from the command line window.

command-line-run-python-code1

 

Step 10. Paste it into a browser which also has your other Twitter account logged in. This is a crucial part of the process. The page that opens will ask you to click a button to authorize the dev app to use your other Twitter account. Click the Authorize App button.

paste-url-into-browser1-authorize-app

 

Step 11. After clicking the Authorize App button you will see a web page that shows a special PIN code generated just for this one-time use. I have hidden the PIN in the screenshot but you will have a number where it says “PIN number is here”. Copy this PIN.

paste-url-into-browser2-get-code

Step 12. Now go back to the command line window. The Python code prompts you to enter ‘y’ once you have copied the PIN code and then prompts you to paste/enter the PIN code. Enter ‘y’, hit enter, and paste the PIN at the next prompt.

Entering PIN code automatically registers/adds your dev API app as a 3rd party app in the Twitter account you are currently logged into and generates two strings of characters in the command line window. These are tokens you need to copy and save so you can use them later. You will need to enter them as credentials for the dev API app to read and write Tweets for the currently logged in Twitter account.

command-line-run-python-code2-paste-pin

 

If you go to your Twitter account and select Settings – Apps you will see that your dev app is now registered as an authorized app that can Tweet and read Tweets on behalf of your Twitter account.

twitter-settings-apps-new-app-just-added

 

Now it’s up to you to create Twitter API code to use these new authorizations. Try using Tweepy or any of the many other Twitter Python modules that are available. Happy Tweeting.

Your dev app can be added as an authorized app to multiple other Twitter accounts. You just have to repeat the process for each of the other Twitter accounts one by one. Simply log into each in succession (log out of the previous account and then log into the next).

You will end up with a bunch of tokens that you can use to read and tweet for all of your Twitter accounts.

So just to summarize: you have authorized one of your Twitter accounts, which has a verified mobile phone number, to read and write Tweets for any number of other Twitter accounts you have, using the Twitter API.
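
The downloaded script does the heavy lifting, but the core of the PIN-based flow it performs looks roughly like this sketch using Tweepy. Treat it as an illustration of the steps above, not the exact script.

import tweepy

# your dev app's credentials from Step 4
consumer_key = 'xxx'
consumer_secret = 'xxx'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# Step 9: generate the authorization URL to paste into the browser
print(auth.get_authorization_url())

# Steps 11-12: enter the PIN shown after clicking Authorize App
pin = raw_input('Enter the PIN: ').strip()
auth.get_access_token(pin)

# these are the two tokens to save for the currently logged in Twitter account
print(auth.access_token)
print(auth.access_token_secret)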

For more information on Twitter API check out their developer site at https://dev.twitter.com/

How to move Google Chart x-axis to top of chart

The newest Google Charts API allows you to more easily change the axis that you want the ticks and labels on.

In the older version you had to create a secondary axis with dummy data and hide it, showing only the axis with the real data on the side where you want it (eg right side instead of left, or top instead of bottom).

You will still do this in the new version but it is easier because you can use the ‘axes’ option, which is part of the ‘Dual-axis’ feature.

Note that you will continue to use the hAxis and vAxis options as normal but will also use a new option called axes in which you specify the ‘side’ you want axis on.

Also note that the axes option will be used to name the axis so the hAxis or vAxis option axis title will be ignored.

Here is an example options object showing the axes option being used to put the x-axis on the top of a horizontal bar chart:

//set chart options
var options = {
     title: '',
     width: '80%',
     height: 700,
     bar: { groupWidth: '95%' },
     legend: { position: 'none' },
     bars: 'horizontal',
     hAxis: {
          title: 'my x axis',
          minValue: 0,
          format: 'decimal',
          textPosition: 'out',
     },
     vAxis: {
          title: 'my y axis',
          textPosition: 'out'
     },
     axes: {
          x: {
               0: { side: 'top', label: 'my x axis'}
          },
          y: {
               0: { side: 'left', label: 'my y axis'}
          }
     }
};

BC Hydro’s amazing #BCStorm social media turnaround

BC Hydro made an amazing social media turnaround to communicate with customers in crisis! Go BC Hydro!

On August 29, 2015 high winds caused tree falls that took out BC Hydro power lines throughout BC’s Lower Mainland, Vancouver Island and Sunshine Coast, leaving up to 500,000 customers without electricity. Many, including me, were without power for 48 hours.

BC Hydro’s web site was their primary channel for communicating with customers about the extent of damage and expected time for repairs, but the site also went down during this time.

They had used Twitter in the past to communicate with customers, but they weren’t using it much on the first day of the crisis.

However on the second day of the crisis with 100,000+ customers still without power BC Hydro dramatically increased Twitter communication by responding directly to customer tweets about extent of damage and expected time for repairs.

To visualize this dramatic increase in BC Hydro Twitter usage I downloaded all @BCHydro tweets for August 29 and 30 from Twitter’s API using Python Tweepy and used Microsoft Power BI to create a visualization of BC Hydro tweet counts which is shown below.

Some notes on chart:

    • x axis shows date and hour of day
    • y-axis shows count of tweets
    • ‘response’ tweets are light green part of bar
    • ‘status’  tweets are dark green part of bar

You can see that on August 30 at 11 AM, about 28 hours after the storm hit, BC Hydro suddenly starts tweeting responses to customers’ questions. They continued for the next few days until their website was back up in addition to their regular ‘status’ tweets.

The chart clearly shows a very amazing social media turnaround! BC Hydro dramatically increased their use of Twitter to get customers answers and information. Go BC Hydro!

Note: the Twitter data was last updated Aug 31, 2015 at 16:45 PST.

bchtweets

 

Historically BC Hydro did not use Twitter this way.  The chart below shows BC Hydro’s tweeting before the storm. They were tweeting once or twice per day with occasional spikes to 7 tweets per day.

 

bchtweets_before

 

The ‘response‘ category includes tweets by BC Hydro responding to a customer tweet. Example ‘response’ tweet:

2015-08-30 14:26:01, @Adamhillz Crews will be on site as soon as possible. Stay back from any downed lines and call 911 if it is on fire.

The ‘status‘ category includes tweets by BC Hydro about ongoing status of repair work etc. Example ‘status’ tweet:

2015-08-30 14:28:32, Crews have been brought in from across province, including Prince George, Terrace, Kamloops, Smithers, Vernon & Vancouver Island #bcstorm

FYI if anyone is interested, here is the text file with tweet text – bchtweets

Here is the Python code used to get the BC Hydro Tweets using the Twitter API.

import tweepy
#import MySQLdb
from datetime import datetime
import random
import sys, os
import json
from dateutil import tz
from tweet_auth import * #this is another .py file with the twitter api credentials

#get twitter auth
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def main():
    with open('bchtweets.csv', 'wb') as file:
        for status in tweepy.Cursor(api.user_timeline, id='bchydro').items(1800):
            file.write(str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)) + ', ' + status.text.encode('utf8') + '\n')

if __name__ == '__main__':
    main()

Note that there is a separate file with my Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

#twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

How to create OData data source Gateway for Power BI Report

Microsoft has a free Microsoft Data Management Gateway that you can install on your desktop or server to act as a broker that creates an OData data source from a local data source. The OData source can be consumed by any application that can connect to an OData url endpoint, such as Power BI in Office 365, Microsoft’s new standalone online version of Power BI, or Excel.

For example, set up the Microsoft Data Management Gateway on your desktop, which creates a local service that connects a local SQL Server database to the Data Management Gateway cloud service in Power BI.

Remote users can then connect to Office 365 or Power BI via the internet and use the endpoint’s data with the Excel Power Query add-on.

 

Introducing Speedvisit – a quick and easy way to record visits from people

Here is a demo web application / SaaS called SpeedVisit that individuals and companies can use to quickly and easily record visits from people and capture basic customer demographic information.

Features include:

  • Groups – create a group so you can add other users, locations and referrers to it.
  • Users – add additional users to your group as you want.
  • Locations – add locations where visits will take place.
  • Referrers – add referrers to record how visitor was referred.

The definition of ‘visit’ is pretty broad. It could be customers, co-workers, classmates, attendees in many situations including:

  • Retail Stores
  • Staff Meetings
  • Convenience Stores
  • Restaurants
  • Street Vendors
  • Art Studios
  • Non-Profits
  • Community Clubs
  • Sports Events
  • Social Events
  • Trade Shows

The demographic information that you can record currently includes:

  • Gender
  • Age
  • Zipcode or postal code
  • Income
  • Prev Visit
  • Referrer
  • Location
  • Purchase
  • Email
  • Comments

The site is mobile friendly responsive design so it works on phones, tablets, laptops and desktops.

The data is totally portable with easy export of all visits to csv file. It is setup with SSL so your data will be securely transferred between your browser and the server.

This is a screenshot of the top of the visit capture page, which is the central feature of the application and workflow. Simply use the radio buttons and choosers to categorize the visitor and add the record. The page will then be ready for the next visitor.

speedvisit1

 

The data is available for others to view, including aggregated statistics, etc. This is a screenshot of the recorded visit data.

speedvisit2

You can sign into the demo account to see what a sample account and visits look like too.

How to filter referral spam from Google Analytics using API and Python

Google Analytics data has become incredibly polluted by “spam referrals” which inflate site visits with what are essentially spam advertisements delivered to you via Google Analytics.

The spammers are entirely bypassing your site and directly hitting Google’s servers pretending to be a visitor to your site. It’s a bit odd that a technological superpower like Google has fallen prey to spammers. Apparently a fix is in the works, but it feels like it’s taking way too long.

In the meantime the fix is to filter out any “visit” that doesn’t have a legitimate referrer hostname. You determine what hostnames you consider legitimate; at a minimum you want to include your own domain. You can also filter out spam visits based on their source. The source name is where the spammers advertise to you: they provide their spam domains hoping you will visit their sites. Setting up these filters can be done with Google Analytics’ built-in filters, but it takes some manual effort and ongoing updating as spammers keep changing source names.

The screenshot below shows the Google Analytics filter screen where you build filters for hostname and source using rules based filtering.

google filter

However this same rules based filtering can be done using the Google Analytics API. There is a lot of code around for you to work with and the Google documentation is pretty good. I have implemented a hostname and source filter using Python and the code below. This enables me to run the code in a scheduled job and always have filtered analytics data ready for analysis.

The “hostMatch” and “sourceExp” are the two things that filter out fake hostnames and fake visit sources respectively.

You will need to get yourself Google API access and set up OAuth (which I am not describing here). You will need the OAuth key and a secret file to authorize access to the API, and then you can use the code below.

'''access the Google Analytics API.'''
# https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults

import argparse
import csv
import re
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials
import httplib2
from oauth2client import client
from oauth2client import file
from oauth2client import tools
from datetime import datetime, timedelta

todaydate = datetime.today().strftime('%Y-%m-%d')

def get_service(api_name, api_version, scope, key_file_location,
				service_account_email):
	'''Get a service that communicates to a Google API.
	Args:
	api_name: The name of the api to connect to.
	api_version: The api version to connect to.
	scope: A list auth scopes to authorize for the application.
	key_file_location: The path to a valid service account p12 key file.
	service_account_email: The service account email address.
	Returns:
	A service that is connected to the specified API.
	'''
	# client_secrets.p12 is secrets file for analytics
	f = open(key_file_location, 'rb')
	key = f.read()
	f.close()
	credentials = SignedJwtAssertionCredentials(service_account_email, key,
	scope=scope)
	http = credentials.authorize(httplib2.Http())
	# Build the service object.
	service = build(api_name, api_version, http=http)

	return service


def get_accounts(service):
	# Get a list of all Google Analytics accounts for this user
	accounts = service.management().accounts().list().execute()

	return accounts


def hostMatch(host):
        #this is used to filter analytics results to only those that came from your hostnames eg not from a spam referral host
	hostnames="domainname1","domainname2","domainname3"

	hostExp = "(" + ")|(".join(hostnames) + ")"
	hostMatch = re.search(hostExp, host[3].lower())

	if hostMatch:
		return True
	else:
		return False


def main():

    #this is where you build your filter expression, note it similar to what you would build in Google Analytics filter feature, you can be as specific of generalized using regex as you want/need
    #ga:source filter
    sourceExp=('ga:[email protected];ga:source!@justprofit;ga:source!@rankings-analytics;ga:source!@semalt;ga:[email protected];ga:source!@anticrawler;ga:source!@best-seo-offer;ga:source!@best-seo-solution;ga:source!@buttons-for-website;ga:source!@buttons-for-your-website;ga:source!@7makemoneyonline;ga:source!@-musicas*-gratis;ga:source!@kambasoft;ga:source!@savetubevideo;ga:source!@ranksonic;ga:source!@medispainstitute;ga:[email protected];ga:source!@100dollars-seo;ga:source!@sitevaluation;ga:source!@dailyrank;ga:source!@videos-for-your-business;ga:source!@videos-for-your-business;ga:source!@success-seo;ga:source!@4webmasters;ga:source!@get-free-traffic-now;ga:source!@free-social-buttons;ga:source!@trafficmonetizer;ga:source!@traffic2money;ga:source!@floating-share-buttons;ga:source!@video--production')

    # Define the auth scopes to request.
    scope = ['https://www.googleapis.com/auth/analytics.readonly']

    #Provide service account email and relative location of your key file.
    service_account_email = 'xxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com'
    key_file_location = 'client_secrets.p12'
    #scope = 'http://www.googleapis.com/auth/analytics'

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location, service_account_email)

    #get accounts
    accounts = service.management().accounts().list().execute()
    #create list for results
    output = list()

    # loop through accounts
    for account in accounts.get('items', []):
    	account_id = account.get('id')
    	account_name = account.get('name')

    #get properties
    	properties = service.management().webproperties().list(accountId=account_id).execute()

    #loop through each account property default profileid (set in GA admin)
    #get metrics from profile/view level
    #instead of looping through all profiles/views
    	for property in properties.get('items', []):
    		data = service.data().ga().get(
    			ids='ga:' + property.get('defaultProfileId'),
    			start_date='2012-01-01',
    			end_date= todaydate, #'2015-08-05',
    			metrics = 'ga:sessions, ga:users, ga:newUsers, ga:sessionsPerUser, ga:bounceRate, ga:sessionDuration, ga:adsenseRevenue',
    			dimensions = 'ga:date, ga:source, ga:hostname',
                max_results = '10000',
    			filters = sourceExp # the filters from above 
    		).execute()


    		for row in data.get('rows', '1'):
    			results = account_name, row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9]
    			output.append(results)
	#print output
		#count of response rows
        #print account_name, data['itemsPerPage'], len(data['rows'])

    #here is the hostname filter call to function above
    hostFilter = [host for host in output if hostMatch(host)==True]

    with open('output_analytics.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['account', 'date', 'source', 'hostname', 'sessions', 'users', 'newUsers', 'sessionsPerUser', 'bounceRate', 'sessionDuration',  'adsenseRevenue'], delimiter = ',')
        writer.writeheader()
        for line in hostFilter:
			file.write(','.join(line) + '\n')
            #print>>file, ','.join(line)

if __name__ == '__main__':
	main()

Tableau vizualization of Toronto Dine Safe data

The City of Toronto’s open data site includes the results of the city’s regular food and restaurant inspections. This data was as of August 2014.

The interactive chart below allows filtering by review criteria and can be zoomed into to view more detail and specific locations.

 

The data file for Dine Safe contained about 73,000 rows and was in XML format. In order to work with it I transformed it to csv format.

from xml.dom.minidom import parse
from csv import DictWriter
 
fields = [
	'row_id',
	'establishment_id',
	'inspection_id',
	'establishment_name',
	'establishmenttype',
	'establishment_address',
	'establishment_status',
	'minimum_inspections_peryear',
	'infraction_details',
	'inspection_date',
	'severity',
	'action',
	'court_outcome',
	'amount_fined'
]
 
doc = parse(file('dinesafe.xml'))
 
writer = DictWriter(file('dinesafe.csv', 'w'), fields)
writer.writeheader()
 
row_data = doc.getElementsByTagName('ROWDATA')[0]
 
for row in row_data.getElementsByTagName('ROW'):
	row_values = dict()
	for field in fields:
		text_element = row.getElementsByTagName(field.upper())[0].firstChild
		value = ''
		if text_element:
			value = text_element.wholeText.strip()
			row_values[field] = value
	writer.writerow(row_values)

This data was not geocoded so I had to do that before it could be mapped. I wanted to use a free geocoding service but they have limits on the number of records that can be geocoded per day. I used MapQuest’s geocoding API with a Python library so I could automate the geocoding in daily batched jobs and maximize the free daily geocoding.

The Dine Safe data file addresses needed some cleaning up so that the geocoding service could read them. For example, street name variations needed to be conformed to something that MapQuest would accept. This was a manual batch find and replace effort. If I was automating this I would use a lookup table of street name variations and replace them with an accepted spelling/format.

#Uses MapQuest's Nominatim mirror.

import anydbm
import urllib2
import csv
import json
import time

# set up the cache. 'c' means create if necessary
cache = anydbm.open('geocode_cache', 'c')

# Use MapQuest's open Nominatim server.
# https://developer.mapquest.com/web/products/open/nominatim
API_ENDPOINT = 'http://open.mapquestapi.com/nominatim/v1/search.php?format=json&q={}'

def geocode_location(location):
    '''
    Fetch the geodata associated with the given address and return
    the entire response object (loaded from json).
    '''
    if location not in cache:
        # construct the URL
        url = API_ENDPOINT.format(urllib2.quote(location))
        
        # load the content at the URL
        print 'fetching %s' % url
        result_json = urllib2.urlopen(url).read()
        
        # put the content into the cache
        cache[location] = result_json
        
        # pause to throttle requests
        time.sleep(1)
    
    # the response is (now) in the cache, so load it
    return json.loads(cache[location])

if __name__ == '__main__':
    # open the input and output file objects
    with open('dinesafe.csv') as infile, open('dinesafe_geocoded.csv', 'w') as outfile:
      
        # wrap the files with CSV reader objects.
        # the output file has two additional fields, lat and lon
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['lat', 'lon'])
        
        # write the header row to the output file
        writer.writeheader()
        
        # iterate over the file by record 
        for record in reader:
            # construct the full address
            address = record['establishment_address']
            address += ', Toronto, ON, Canada'
            
            # log the address to the console
            print address
            
            try:
                # Nominatim returns a list of matches; take the first
                geo_data = geocode_location(address)[0]
                record['lat'] = geo_data['lat']
                record['lon'] = geo_data['lon']
            except IndexError:
                # if there are no matches, don't raise an error
                pass
            writer.writerow(record)

After the Dine Safe data was geocoded so that it had two new columns, one for latitude and another for longitude, all that was left to do was bring the data into Tableau and create the Tableau map visualization which is shown below.

Single CakePHP application & MySQL database to serve multiple domains

I have a bunch of websites that have the same structure and content that vary only by the category for each site. So content for all sites can reside in a single database schema and be filtered to specific categories for each site.

In order to do this I created a custom CakePHP application with one MySQL database to serve multiple public facing websites each with their own unique domain.

In order to make this work you need a hosting account that will allow you to map website serving directories to domain names. Different hosting providers may have different methods of doing this.

The secret to make this work is to have the application parse the domain name that is calling the application and then use that to filter the database records for that domain category.

In CakePHP you do this by creating a function in the AppController that checks which domain the visitor is browsing to and then retrieves that domain’s specific values from the special table to filter the CakePHP application. You will need to call this function in the AppController beforeFilter function. I have included an example function named ‘getDomainSettings‘ below.

It will also be very helpful to make a table that contains meta data for each domain such as keywords, analytics tracking code, etc. I didn’t include a different CSS file for each domain, but you could add a CSS file variable to refer to a separate CSS file for each domain.

 

Check Github repository for more information:

https://github.com/sitrucp/single_cakephp_mysql_multiple_domain_app

<?php

App::uses('Controller', 'Controller');

class AppController extends Controller {
    
    public function beforeFilter() {
        $this->getDomainSettings();
    }
    
    //function which reads settings from domain table  
    function getDomainSettings(){
        //get current domain from server name as variable
        $domain = preg_replace('/^www\./', '', $_SERVER['SERVER_NAME']);
        //Load domain model data
        $this->loadModel('Domain');
        //retrieve only current domain table record
        $domain_settings = $this->Domain->find('all', array('conditions' => 
        array('Domain.domain' => $domain)));  
        foreach($domain_settings as $value){
            //create Configure::Write variables from domain record to use elsewhere in application
            Configure::write('domain', $value['Domain']['domain']);
            Configure::write('domain_id', $value['Domain']['id']);
            Configure::write('ga_code', $value['Domain']['ga_code']);
            Configure::write('meta_title', $value['Domain']['meta_title']);
            Configure::write('meta_keywords', $value['Domain']['meta_keywords']);
            Configure::write('meta_description', $value['Domain']['meta_description']); 
            
            //etc retrieve as many domain specific values as required from database
        }
    }
}

?>

How a new salmon stream was carved from the land

Back in the days when I was working in environmental science, the consulting firm I was working for was doing storm water drainage planning in Surrey, BC, Canada.

The City of Surrey was planning for a new sub-division. The new sub-division would have lots of new hardened surfaces eg roads, roofs, and lawns instead of natural scrub and tree cover, which would require more frequent and higher volume water drainage to be channeled somewhere.

Our company proposed diverting this new water flow under a road and into a city right of way. The right of way was a huge section of land that was covered in natural grassland, dense shrubs, trees and a marshy area. The proposed new watercourse would flow about 1000 metres through the property and drain into the Serpentine River. Normally stormwater drainage engineers would simply channel drainage into ditches and/or covered storm water drainage that would eventually reach the Serpentine River anyways.

The new channel was designed to mimic a natural watercourse that may have been created naturally in this field given the field’s topography, soil type and a water flow rate and volume that our engineers estimated from new sub-division.

Creating a new watercourse is a mix of science, art, physics and engineering. Water has a predictable pattern of flowing over a surface given the surface topography, soil type and  water flow rate and volume. Try putting some drops of water on one end of cookie baking sheet and then slowly lift that side up. The water drops will flow down the cookie sheet and form an s-shaped path. The higher you lift it the more straight the pattern. The same thing happens in nature.

There are other features of natural watercourses that need to be considered. The watercourse will have specific curve measurements in its s-shape. The size of the curve, the depth of the outside bank, and the gradient of the stream bed from the inside bank to the outside bank vary with flow.

One of the goals for this new watercourse was to provide natural habitat for fish and wildlife. So we wanted to include features that would mimic what could happen naturally such as providing stream cover in form of undercut banks, overhanging trees, grass, shrubs.

Also, because we didn’t want nature to dramatically alter the stream’s course, which is a natural occurrence, we wanted to ensure that the shape would resist erosion. We provided bank protection in the form of large tree root wads and other placements of bank hardening.

Our company had a specialist who had created scores of similar water courses. The advice I got was to use a large engineering plan drawing of the field and start sketching out the water course.

So I literally took pencil to drawings and started drawing what I thought the stream would look like. Then we worked backwards, accounting for soil type, expected water flow, curve of bend, depth of flow.

I left the company before the stream was constructed but the eventual shape was pretty close to what we had drawn. As construction progressed modifications were required to accommodate soil composition variances and some of the bank hardening was not used due to concerns about its long term integrity.

However within 2 years there was fish and wildlife habitat where there was none before. That is pretty cool!

Here is a Google Maps aerial photo of the new stream. Below is a screenshot with a dotted line showing where the stream is located.

new steam

Tableau vizualization of Vancouver Car2Go vehicle locations

I happened to have a dataset of Car2Go vehicle locations in Vancouver and used Tableau to plot them by latitude and longitude.


There are a lot of observations over the time the data was collected. There is a limited number of physical parking spots, so cars are repeatedly parked in locations that are very close to other car locations. This makes for a very dense plot.

But if you zoom into the view the detail will resolve. For example, below is a screenshot zoomed into downtown Vancouver. You can very clearly see that cars are parked along city streets. Fixed parking spots are identified by a darker color.


car2go vehicle locations


How to visualize a work schedule using a Qlikview Gantt chart

This is a Qlikview solution I posted on Qlikview’s forum in response to a user request for displaying employee work schedules in a Gantt chart style presentation.

https://community.qlik.com/thread/51554

The solution included setting up the data for the Qlikview chart in a table format as shown below.

Original:
Load WorkDate,
    Employee
    From [C:\BI\schedule demo - data.xls]
    (biff, embedded labels);

Set IntervalStart = 700;
Set IntervalEnd = 800;

//Loop through each employee record to get hour intervals
For i = 1 to 11
Original2:
    Load $(i) as rowNum,
        WorkDate,
        Employee,
        DayStart,
        DayEnd,
        $(IntervalStart) & '-' & $(IntervalEnd) as Interval,
          //Check if the hour interval is within the employee's start/end and give it "1" (so it can be conditionally formatted later)
          //I didn't do the extra check to see if the interval is a break interval but you could just add an additional If to IntervalValue
        If($(IntervalStart) >= DayStart and $(IntervalEnd) <= DayEnd, 1, 0) as IntervalValue
        From [C:\BI\schedule demo - data.xls]
        (biff, embedded labels);
    Let IntervalStart = IntervalStart + 100;
    Let IntervalEnd = IntervalEnd + 100;
Next i

The data for this table could come from any data source eg Excel, csv file, database query results, etc. In this example it is an Excel .xls file.

qlik solution data table

Build a Qlikview pivot table with WorkDate, Employee and Interval as dimensions and IntervalValue as the expression. Then apply conditional background color formatting to the expression so that cells with a value of 1 are green. Note that the “0” values still show in my example, but you could apply conditional text formatting to the 0’s (eg turn them white) to hide them.

 The pivot table can be changed to show daily employee coverage as well.


Many additional things could be added to this solution. For example, if you are tracking employee schedules by another category such as event or department, these can be added to the data and then used to slice and dice the pivot tables by these additional dimensions.

How a UK tent rental company used Google Analytics and Tableau to improve sales

A UK wedding marquee rental website’s Google Analytics data was analysed using Tableau Public. The client’s target geographic area is Northwest UK, centered around Manchester.

It was interesting to see a significant number of site visitors from Pakistan, India and the Philippines.

A bit of customer research revealed that these site visitors are friends and family helping with a UK wedding.

The client did have first-hand information that his customers have family members offshore who helped with wedding planning. But getting hard data from the website analytics, and seeing it clearly highlighted in the Tableau analysis, prompted the client to direct sales and marketing efforts at these offshore Pakistani, Indian and Philippine audiences, and also to specifically target advertising to these demographic groups inside the UK.

The result was increased bookings and a lift in word of mouth advertising within these demographic groups.

Well done analytical entrepreneur. Yes, analytics can be that easy and effective. Just use the tools, do the work, and listen to the analysis!

Time based SLA KPI created using MS Access VBA

A client had a time-based activity with a start and end datetime. The client wanted to be able to calculate a service level agreement (SLA) metric based on the duration of time to complete the activity.

The SLA included the requirement that time only be counted in the metric if it was during work hours, defined as 8 AM to 4 PM, excluding weekends and holidays. Holidays were identified by listing them manually in a database table (shown below).

The SLA metric was calculated for each activity as follows:

SLA Actual = End Datetime – Start Datetime, expressed as a decimal number of hours, excluding weekends, holidays and non-work hours.

This SLA metric really changed people’s view of the policy. Without specific data, people’s perceptions of the time elapsed from start to end varied wildly.

For example, if an activity started at 3:55 PM on the Friday before a long weekend and finished the following Tuesday at 10 AM, it would accrue only 5 minutes on Friday and 2 hours on Tuesday, for a total of 2 hours and 5 minutes, or 2.083 hours. However, before the SLA metric existed, some perceived the elapsed time to be as much as 5 days of processing time. Using the SLA to analyse activity by the people and departments involved, activity category, etc. was invaluable in identifying business process, training and communication issues and improvements.

The system used was an MS Access database.

The metric was to be calculated and presented to the user in an Access form so agents could see their real-time performance against the SLA. The form auto-refreshed so the metric (and others) would stay current.

The SLA metric also had to be calculated in MS Access queries to generate SLA metric values for real-time and historical reporting and analysis.

The function was included in the required queries, and a current SLA metric was calculated whenever the query was run, normally when a report was refreshed.

To meet both of these requirements, a VBA function was created that could be called from both an Access form and an Access query. The function’s VBA code is shown below.

The function output is a decimal value representing the number of hours. You can get an hour and minute value from the decimal hours by using a modulo calculation.
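As a quick illustration of that conversion (shown in Python here rather than VBA), the 2.083-hour example from above breaks down like this:

# Split a decimal hours value into whole hours and minutes
decimal_hours = 2.083

hours = int(decimal_hours)                 # whole hours -> 2
minutes = round((decimal_hours % 1) * 60)  # fractional part as minutes -> 5 (0.083 * 60 is about 5)

print(hours, "h", minutes, "min")          # 2 h 5 min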

A challenge with using the function in queries was that reports often included long time spans, so potentially 100,000+ records could be included in a query, and since the function runs on each record it would be very slow. To address this, the function was only applied to the more recent records, with a business rule assuming that older records would not have a changing SLA eg they would be complete.

This VBA code and process can be reused for other similar purposes as long as you pass the function the required start and end datetime values as parameters and provide a holiday table, which is a simple two-column table (see below). You could modify the hard-coded workday start and end hours to suit your business needs.

Option Compare Database
Option Explicit
'*** This is the code that is used to help calculate SLA actual hours.  The code calculates
'*** the net hours (total hours excluding weekends, holidays & hours outside of workday (8am-4pm)
'*** for the following:
'*** - net decision time (hours from IC Date/Time aka Application Received Date to Decision Date/Time)
'*** - net event time (for each event, hours from Start Date/Time to End Date/Time)
'*** The SLA actual hours = Net decision time - Sum(net event time) is calculated later
'*** in a set of queries.

Function NetHours(dteStart As Date, dteEnd As Date) As Single

    Dim GrossDays As Integer
    Dim WorkingDays As Integer
    Dim OneDayHours As Single

    Dim fixedEnd As Date
    Dim fixedStart As Date
    Dim dteStartEnd As Date
    Dim dteEndStart As Date

    Dim StartDayHours As Single
    Dim EndDayHours As Single

    Dim intCount As Integer
    Dim dteCurrDate As Date

    dteStartEnd = DateValue(dteStart) + TimeValue("4:00pm")
    dteEndStart = DateValue(dteEnd) + TimeValue("8:00am")

    'If dteStart or dteEnd before 8am, use 8 am, if after 4 pm use 4 pm otherwise use dteStart or dteEnd
    If dteStart > DateValue(dteStart) + TimeValue("8:00am") And dteStart < DateValue(dteStart) + TimeValue("4:00pm") Then
            fixedStart = dteStart
        Else
            Select Case dteStart
            Case Is <= DateValue(dteStart) + TimeValue("8:00am")
                fixedStart = DateValue(dteStart) + TimeValue("8:00am")
            Case Is >= DateValue(dteStart) + TimeValue("4:00pm")
                fixedStart = DateValue(dteStart) + TimeValue("4:00pm")
            End Select
    End If

    If dteEnd > DateValue(dteEnd) + TimeValue("8:00am") And dteEnd < DateValue(dteEnd) + TimeValue("4:00pm") Then
        fixedEnd = dteEnd
    Else
        Select Case dteEnd
        Case Is <= DateValue(dteEnd) + TimeValue("8:00am")
            fixedEnd = DateValue(dteEnd) + TimeValue("8:00am")
        Case Is >= DateValue(dteEnd) + TimeValue("4:00pm")
            fixedEnd = DateValue(dteEnd) + TimeValue("4:00pm")
        End Select
    End If

    'Calculate hours on 1st day but check if they are weekends or holidays first
    'and if hours aren't within workday then assign hours
    StartDayHours = 0
    If Weekday(dteStart, vbMonday) <= 5 And IsNull(DLookup("[Holiday_Date]", "A_Holidays", _
    "[Holiday_Date] = " & Format(dteStart, "\#mm\/dd\/yyyy\#;;;\N\u\l\l"))) Then
        StartDayHours = (DateDiff("n", fixedStart, dteStartEnd)) / 60
    Else
        StartDayHours = 0
    End If

    'Calculate hours on last day but check if they are weekends or holidays first
    'and if hours aren't within workday then assign hours
    EndDayHours = 0
    If Weekday(dteEnd, vbMonday) <= 5 And IsNull(DLookup("[Holiday_Date]", "A_Holidays", _
    "[Holiday_Date] = " & Format(dteEnd, "\#mm\/dd\/yyyy\#;;;\N\u\l\l"))) Then
        EndDayHours = (DateDiff("n", dteEndStart, fixedEnd)) / 60
    Else
        EndDayHours = 0
    End If

    'Count workdays excluding weekends, holidays and first and last date
    WorkingDays = 0
    If DateDiff("d", dteStart, dteEnd) > 1 Then
        intCount = 0
        dteCurrDate = DateValue(dteStart)
        Do While dteCurrDate < DateValue(dteEnd)
            If Weekday(dteCurrDate, vbMonday) <= 5 And dteCurrDate <> DateValue(dteStart) And dteCurrDate <> DateValue(dteEnd) _
            And IsNull(DLookup("[Holiday_Date]", "A_Holidays", _
            "[Holiday_Date] = " & Format(dteCurrDate, "\#mm\/dd\/yyyy\#;;;\N\u\l\l"))) Then
                intCount = intCount + 1
            Else
                intCount = intCount
            End If
            dteCurrDate = dteCurrDate + 1
        Loop
        WorkingDays = intCount
    Else
        WorkingDays = 0
    End If

    'Calculate gross # days between start and end
    GrossDays = 0
    GrossDays = DateDiff("d", dteStart, dteEnd)

    'Calculate hours between start and end times on same day
    OneDayHours = 0
    'If fixedStart < fixedEnd Then
    If Weekday(dteStart, vbMonday) <= 5 And IsNull(DLookup("[Holiday_Date]", "A_Holidays", _
    "[Holiday_Date] = " & Format(dteStart, "\#mm\/dd\/yyyy\#;;;\N\u\l\l"))) Then
        OneDayHours = (DateDiff("n", fixedStart, fixedEnd)) / 60
    Else
        OneDayHours = 0
    End If
    'Else
    '    OneDayHours = 0
    'End If

    'Depending on # GrossDays then give final NetHours result
    NetHours = 0
    Select Case GrossDays
        Case 0
        'Start and end time on same day
            NetHours = OneDayHours
        Case 1
        'Start and end time on consecutive days
            NetHours = NetHours + StartDayHours
            NetHours = NetHours + EndDayHours
        Case Is > 1
        'Start and end time on non consecutive days
            NetHours = WorkingDays * 8
            NetHours = NetHours + EndDayHours
            NetHours = NetHours + StartDayHours
    End Select

End Function


The holidays table had these two columns: one for the holiday date and another for the holiday name (the name is not used by the function; it just identifies the date).

Holiday_Date Holiday
01/01/2008 New Year’s Day
01/02/2008 Family Day (Markham)
21/03/2008 Good Friday
19/05/2008 Victoria Day
01/07/2008 Canada Day
04/08/2008 Civic Day
01/09/2008 Labour Day
13/10/2008 Thanksgiving Day
11/11/2008 Remembrance Day
25/12/2008 Christmas
26/12/2008 Boxing Day
01/01/2009 New Year’s Day
10/04/2009 Good Friday
18/05/2009 Victoria Day
01/07/2009 Canada Day
03/08/2009 Civic Holiday
07/09/2009 Labour Day
12/10/2009 Thanksgiving Day
11/11/2009 Remembrance Day
25/12/2009 Christmas
28/12/2009 Boxing Day lieu
01/01/2010 New Year’s Day
02/04/2010 Good Friday
24/05/2010 Victoria Day
01/07/2010 Canada Day
02/08/2010 Civic Holiday
06/09/2010 Labour Day
11/10/2010 Thanksgiving Day
11/11/2010 Remembrance Day
27/12/2010 Christmas Day lieu
28/12/2010 Boxing Day lieu
03/01/2011 New Year’s Day lieu
22/04/2011 Good Friday
23/05/2011 Victoria Day
01/07/2011 Canada Day
01/08/2011 Civic Holiday
05/09/2011 Labour Day
10/10/2011 Thanksgiving Day
11/11/2011 Remembrance Day
26/12/2011 Christmas Day lieu
27/12/2011 Boxing Day lieu

Visualization of Toronto’s 311 contact centre open data

This is based on the City of Toronto’s Open Data website 311 call performance data for the 4 years 2010, 2011, 2012 and 2013.

The data is provided by the City in a Google Spreadsheet and has been downloaded and saved here in Excel 2013 format: 311_daily_contact_centre_data.

I used Excel pivot tables and charts for analysis and visualization.

The call volume has remained relatively consistent year over year from 2010 to 2013. The chart below shows all daily calls from 2010 to 2013. On average there are about 5,000 calls per day. There are evident seasonal variations, with peaks in summer, and a few big spikes, notably one at the end of December 2013 where volume spiked to over 20,000 calls per day. Not sure what that was.

311-chart-by day

Weekend call volume is dramatically lower than weekday call volume. That indicates that 311 calls are largely business related.

311-chart-wkday vs wkend
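If you wanted to reproduce that comparison in code rather than Excel pivot tables, a minimal pandas sketch is below. The file is the Excel download mentioned above; the date and call volume column names are assumptions for illustration, not the actual column names in the City's data.

import pandas as pd

# Load the daily 311 data (column names are assumed for illustration)
daily = pd.read_excel("311_daily_contact_centre_data.xlsx")
daily["Date"] = pd.to_datetime(daily["Date"])

# Flag weekends (Saturday=5, Sunday=6) and compare average daily call volume
daily["day_type"] = daily["Date"].dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
print(daily.groupby("day_type")["Calls Offered"].mean())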

The 311 line keeps a fairly consistent average call talk time of 230 seconds, or about 4 minutes, as represented by the black line in the chart below.

The average speed of answer metric varies quite a bit (red line in chart below). We can see that answer time follows call volume.

When the call center gets more calls, it takes longer to answer them. This indicates the call center has the same number of agents available regardless of season, day of week or special event. It is probably too expensive and/or challenging to hire staff to work part time or on call for these surge events. There are also some anomalously high call answer times that might be due to understaffing or equipment failures.

The grouping of calls, talk times and answer times by month in the chart below may obfuscate daily variations. Also, daily outliers may skew the monthly totals, but the month-over-month view does a good job of showing trends.

311-chart-by day bar

The call center metrics around call abandonment are important measures. We will see near the end of the post how abandonment is used to create a ‘service level’ metric. The chart below shows a breakdown of how many calls are actually answered and connected to a 311 call center agent.

  • Only about 75% of the total calls that come into the 311 call centre are actually answered (blue part of bar).
  • The remaining 25% of the calls are abandoned by the caller.
    • On average, 15% are abandoned within 30 seconds (green part of bar). These people are busy, they won’t wait, and leave the hold queue relatively quickly.
    • On average, 10% wait longer than 30 seconds before hanging up (red part of bar). These people invested their time in waiting on hold.

311-call relative volume

As mentioned above, the call center creates a ‘service level’ metric that is a percentage value based on abandoned calls and call answer time. When there are no abandoned calls the service level approaches 100%. However, the Toronto 311 call center has not hit 100% very often, as shown by the orange line in the chart below, which is the average service level percent by month. In fact it has never been 100% for any month over the 4 years.

service level

Another way to look at service level is to count how many days fall into each service level category. The chart below groups daily service levels into the following categories: 0-25%, 25-50%, 50-75% and 75-100%. Only roughly 70% of the total days in 2012 and 2013 were in the 75-100% service level category.

service level category year
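For reference, the same daily service level bucketing could be done with a short pandas sketch like the one below (the original analysis used Excel pivot tables; the file and column names here are assumptions).

import pandas as pd

# Load the daily data and bucket each day's service level into the four categories
daily = pd.read_excel("311_daily_contact_centre_data.xlsx")
daily["Year"] = pd.to_datetime(daily["Date"]).dt.year

bins = [0, 25, 50, 75, 100]
labels = ["0-25%", "25-50%", "50-75%", "75-100%"]
daily["SL category"] = pd.cut(daily["Service Level"], bins=bins, labels=labels, include_lowest=True)

# Count of days per service level category, per year
print(daily.groupby(["Year", "SL category"]).size().unstack())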

Yet another way to look at service level is the daily attainment frequency over the 4 years. The chart below shows the service level daily frequency. For example, a service level of 100% was attained on only 4 days from 2010 to 2013. This view provides more granularity on call center service levels.

service level daily freq

This analysis suggests that the call center needs to increase its service level metric. Since service level is largely dependent on call abandonment, something needs to be done to decrease the number of calls sent to hold, which results in 25% of callers waiting but then hanging up. What can a call center do about this?

  • Increasing the number of agents answering calls is the most obvious action. The challenge is that they would need to be part-time or shift workers to handle spikes in call volume.
  • Reduce call talk time so agents can answer more calls, reducing the number of calls sent to hold and abandoned. This can be done by improving training or making information more accessible to the agents so they can answer questions faster and more efficiently.
  • Instead of putting callers on hold, offer them alternative information channels, perhaps an interactive automated voice system or an interactive website.
  • Analyse the reasons people are calling 311 and endeavour to provide that information proactively in other channels eg City websites, billboards, or other public communication channels.

The 311 call center data is available at the City of Toronto Open Data website with their catalogue name “311 Contact Centre Performance Metrics”.

Introducing Cardivvy – a website showing Car2Go real time car locations, parking and service areas

Car2Go provides developer access to their real-time vehicle locations, parking availability, and service area boundaries for all of their city locations around the world.

I used the Car2Go API to power a website I created for my own use after struggling with the official Car2Go map on their site when using my mobile phone. I wanted a way to find cars by street location.

The site includes links to view vehicle locations, parking lot locations and service area boundaries on a Google Map. You can click into each city, view available cars alphabetically by street location or parking lot location, and then click through to a car’s current location on a Google Map. Each city’s Car2Go service area can also be viewed on a Google Map.

I am waiting for ZipCar to get their API up and running, which should be available in Fall 2015; then I will integrate it into the cardivvy site too so I can see where both services’ cars are available.

This is an example of the Car2Go service area boundary for the Vancouver area.

car2gomap

LinkedIn ‘People You May Know’ web scraping and analysis

A while back LinkedIn sneakily vacuumed up all of my contacts from my phone via the Android Cardmunch app. Turns out Cardmunch is owned by LinkedIn. There must be fine print somewhere that indicates they do this but I sure didn’t see it.

Many of my contacts that were imported were out of date and in most cases not someone I wanted to be LinkedIn with. Some had actually passed away.

It took a mini-campaign of customer service interaction to get LinkedIn to delete the imported contacts.

Anyways I discovered this had happened when suddenly large numbers of my contacts started showing up in my LinkedIn’s “People You May Know” page.

The PYMK page is a LinkedIn feature that identifies 1,000 people LinkedIn thinks you may know. LinkedIn identifies people you may know by matching contacts they vacuum up from everyone’s address books. They probably also do matching of people on company name, profession, city, LinkedIn groups, etc too.

When LinkedIn finally agreed to delete the contacts I monitored the PYMK page to make sure they were doing it and that it was permanent.

My monitoring was a mix of manual work and automation. I regularly manually downloaded and saved the PYMK webpage and extracted the names on the page to see if my stolen contacts were still on the page. The contacts were removed very quickly (thank you LinkedIn : )) but I continued downloading the PYMK page because I was curious to see how the names would change over time. I ended up downloading the page 29 times over a 3 month period.

I used Python and BeautifulSoup to process the downloaded PYMK html pages and scrape them for the data I wanted.

I used Excel add-in Power Query to shape the data and Excel Pivot tables and charts for the visualizations.

After I downloaded a new page I would run the code on a folder containing the PYMK web page files to produce a data file for analysis. I just wanted to see that my 2,000 imported contacts were deleted. Finally after a few weeks they were gone.

Here are some of the results.

Over the 3 month period about 6,300 unique people were on my PYMK page at least once.

The data I have is incomplete because it wasn’t a daily sample of the PYMK page. I downloaded the pages only 29 times over a 3 month period of time.

Even so it does give some relative information about people’s appearances on my PYMK page.

People’s appearances were not a contiguous series of days. There were gaps in appearances. LinkedIn appears to swap people in and out over a duration of days.

A Gantt chart style visualization made the pattern of people’s appearances obvious. The screenshot below shows an overview of a huge 6,300-row Gantt chart that has one unique person per row. The columns are the 29 downloads. The records are sorted descending by 1st date of appearance eg so the most recent are on top.

The pink cells indicate that a person appeared on that downloaded PYMK page. Blue cells are where they did not appear on the page. At a quick glance you can easily see the regular patterns of appearances on the PYMK web page.

Over time, eg going from the bottom of the chart to the top (it is sorted by the date people were added to the page, with the most recently added on top), you can see people are always being introduced to the PYMK page. Some people added in the past continue to appear on the page and some appear a few times, never to reappear.

The varying patterns indicate that the methodology LinkedIn uses to select people to show on my PYMK page changed over time.

The two big columns of pink at the very bottom are the 2,000 people that were imported from my contact book. Most appeared in only the first few downloads; then LinkedIn deleted them so they never appear again.

Gantt chart style presentation of all 6,300 people (one unique person per row). Records are sorted descending by 1st date of appearance eg so most recent are on top.
Click on image to open in new tab to view full size.

pymk_by_day_added
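The appearance matrix behind that Gantt-style chart is straightforward to build from the scraper output. Below is a minimal pandas sketch, assuming the scraped records were saved to a hypothetical pipe-delimited file with one row per (download file, person), as produced by the script later in this post.

import pandas as pd

# Scraper output: one row per person per downloaded page, pipe-delimited (file | name | title | connections)
records = pd.read_csv("pymk_records.txt", sep=r"\s*\|\s*", engine="python",
                      names=["file", "name", "title", "connections"])

# 1 if a person appeared in a given download, 0 otherwise -- the pink/blue matrix
appearances = pd.crosstab(records["name"], records["file"]).clip(upper=1)
print(appearances.shape)  # roughly (6,300 people, 29 downloads)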

Using the 1st and last day people appeared in the 29 downloads, I could calculate a ‘duration’ for each person. These durations are shown in a frequency distribution chart below.

duration freq

Many people appeared over most of the 3 months. About 50% remained on the page for more than 2 months. This would have changed had the sampling continued, eg more people might have remained on the page that long too.

However, the relative distribution does indicate that there is a split between people that stay on page and those that are just passing through.

The bulk of the people who appear only once on the PYMK page are the 2,000 contacts that were imported and then deleted. Some of these appeared in the first PYMK page download and never again; some appeared in one or two subsequent downloads until they were all finally deleted.

What is interesting is that LinkedIn was not able to match these contacts to me using their other methods. That hints that the basic mechanism behind the PYMK matching is simple contact name matching between LinkedIn accounts.

Of the 29 downloaded PYMK pages, most people were on the page fewer than 9 times, as shown in the frequency distribution below. Daily sampling would likely see these counts increase, though I expect the relative distribution would be similar.

days freq

I created a ‘presence’ metric: the number of days a person appeared relative to the total number of days from their 1st appearance to their last appearance. This is shown in the frequency distribution below. The big spike at 100% is the imported contacts, which showed up in only one download (and then were deleted from LinkedIn forever).

presence freq

Daily sampling would have seen the distribution shift to the right towards 100%. I guess that the peak of the distribution would shift up to around 30% eg most people appear about 30% of the time from when they 1st appear to their last appearance.
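A minimal sketch of how the duration and presence metrics could be computed from the appearance data is below. The small inline DataFrame stands in for the real data, which would be derived from the scraper output with a date parsed from each downloaded file.

import pandas as pd

# One row per (person, download date) appearance; the sample rows are illustrative only
appearances = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "download_date": pd.to_datetime(["2015-01-01", "2015-01-10", "2015-02-01", "2015-01-05"]),
})

per_person = appearances.groupby("name")["download_date"].agg(["min", "max", "count"])

# Duration: days between first and last appearance (inclusive)
per_person["duration_days"] = (per_person["max"] - per_person["min"]).dt.days + 1

# Presence: appearances relative to the span between first and last appearance
per_person["presence"] = per_person["count"] / per_person["duration_days"]
print(per_person)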

The process used to scrape the downloaded PYMK web pages was the following:

  1. Go to the LinkedIn PYMK page and scroll down until all 1000 contacts are shown. The page has ‘infinite scrolling’ that pages through the entire 1000 contacts incrementally. I couldn’t find an easy way to get Python to do this for me automatically, which would have been nice.
  2. Save the resulting web page as an html file on my computer.
  3. Run the Python script with BeautifulSoup (below) to get the list of names.
  4. Compare the list of names day to day to see the changes made.

Python code to get names from a single html file:

from bs4 import BeautifulSoup

# Parse one saved PYMK page and print each person's name and headline
soup = BeautifulSoup(open("pymk.htm"))
for card in soup.findAll('a', 'title'):
    for title in card.find_all('span', 'a11y-headline'):
        print str(card['title'].encode("utf-8")) + ' | ' + str(title.string.encode("utf-8"))


Python code to get names from a folder of html files:

import os
from bs4 import BeautifulSoup

# Loop through every saved PYMK page in a folder and print one record per person:
# file name | person name | headline | shared connections count
path = 'C:/PYMK Page Downloads/'
for file in os.listdir(path):
    file_path = os.path.join(path, file)
    soup = BeautifulSoup(open(file_path))
    for card in soup.findAll('div', 'entityblock profile'):
        name2 = card.find("a", {"class": "title"})['title']
        name = name2.encode("utf-8")
        title2 = card.find("span", {"class": "a11y-headline"})
        title = title2.text.encode("utf-8")
        connections2 = card.find("span", {"class": "glyph-text"})
        if connections2 is None:
            connections = 0
        else:
            connections = connections2.text.encode("utf-8")
        print str(file) + ' | ' + str(name) + ' | ' + str(title) + ' | ' + str(connections)


Visualizing human arterial blood flow with a D3.js Sankey chart

This Sankey chart represents human arterial blood flow from the heart down into the smallest named arteries.

A Sankey chart visualizes directional connections between nodes. In this case the nodes are artery names and the connections are flow of blood through the arteries.

The full sized chart is huge, so first take a look at the small size view below to get oriented.

arterial

The data used to create the visualization can also be viewed in tabular format here.

Two screenshots below, from the visualization’s top level branches, highlight where blood flows from heart to body and lungs. The grey connections represent blood flow. The colored ‘nodes’ represent the arteries.

You are able to click on the nodes which will highlight the connections and blood flow. This is a great way to see how blood flows from artery to artery.

The data was sourced from this UAMS web page 
http://anatomy.uams.edu/anatomyhtml/arteries_alpha.html.

Quite a bit of cleaning was required to get the data ready to be used by D3 to create the Sankey chart. Some of the data transformations required are described below, with a small scripted sketch of the resulting node/link structure after the list:

    • Edited artery names to remove commas, eg “alveolar, anterior superior” became “anterior superior alveolar”. This was done using Excel and VBA.
    • Created new table rows to link arteries with their sources, eg Left Ventricle is the source for the Aorta.
    • Created new table rows for both right and left arterial pairs. This was done by duplicating the entire table to create two sets (one for left, one for right), then manually editing to remove arteries where there is no pair, eg the Aorta.
    • Occasionally modified an artery name by adding its source artery name to differentiate it from others, eg Sigmoid Ascending Br. instead of Ascending Br. This was done because sankey.js requires unique node names and some arteries had the same name in the data source, eg Ascending Br.
    • Created new table rows to include arterial branches. Used Excel and VBA to split comma-separated branches into new rows.
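As a rough illustration of where that cleaning ends up (not the actual script used, and with a made-up handful of rows), here is a minimal Python sketch that turns artery/source pairs into the nodes/links structure the D3 Sankey plugin expects:

# Minimal sketch: build a d3-sankey style nodes/links structure from
# (source, target) artery pairs. The sample rows below are illustrative only.
import json

pairs = [
    ("Left Ventricle", "Ascending Aorta"),
    ("Ascending Aorta", "Aortic Arch"),
    ("Aortic Arch", "Brachiocephalic Trunk"),
]

# Node names must be unique; map each name to an index
names = []
for source, target in pairs:
    for name in (source, target):
        if name not in names:
            names.append(name)

nodes = [{"name": name} for name in names]
links = [
    {"source": names.index(source), "target": names.index(target), "value": 10}
    for source, target in pairs
]

print(json.dumps({"nodes": nodes, "links": links}, indent=2))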

Some interesting observations about the artery data used to create the chart, and possibilities to extend the data and chart:

    • There are missing arteries or errors in naming and linkages so please don’t study for your anatomy exam using this : )
    • The Sankey chart termination nodes are major arteries. However, the UAMS arterial tables indicate that ‘unnamed branches’ are actually the terminating nodes. If I included these, I would have had to create unique names for each ‘unnamed’ arterial reference but I was too lazy to do that 🙂
    • The termination nodes are assumed to be capillaries eg other than the ‘unnamed’ arteries, the blood flows into capillaries, then goes back to heart via the Venous system.
    • I was surprised to learn that in reality there can be a large number of ‘arterial anastomoses’ where hierarchically unrelated arteries join together. I didn’t include any of these.
    • The ‘other half’ of the body’s blood flow, eg the venous system, is not included in this chart. One day I might revisit this and include the venous system.
    • It would be cool to add accurate blood flow ‘value’ to each artery-source connection which would be an additional column in the data table for blood flow volume for each artery.
      • The Sankey chart would represent these volumes by varying artery-source connection thickness.
      • These would be proportionate values. For example the blood flow from Left Ventricle to Aorta (Ascending Aorta) would be 100% or whatever relative numerical value we could assign (I used 10). From that point, say 30% goes to brachiocephalic trunk and 70% continues into Aortic Arch and down into the body, and so on with each further branching into other arteries.
    • It would be even more interesting to setup a dynamic system that changes these blood flow volumes based on heart rate.
      • Could also represent blood flow changes, for example, if a femoral artery were cut, what would be the effect of flow on rest of system?
      • The Sankey chart values could change to reflect these changes. We could change the artery-source link color eg red (increase), green (normal) or yellow (decrease) to indicate the increase/decrease in blood flow resulting from these system perturbations.

The Sankey chart was created using the D3 JavaScript library’s Sankey plugin.

This arterial Sankey chart includes some additional D3 JavaScript that modifies the chart formatting and adds features to highlight node links and to allow drag-and-drop node re-positioning. You can view the JavaScript for my Sankey chart simply by viewing the Sankey chart’s page source.

The code used to create this is available in this Github repository.