Excel Power Query tutorial using Canadian potato production statistics

This data comes from Statistics Canada.


Statistics Canada download pages often provide the opportunity to modify the data structure and content before downloading.

For example, clicking the [Add/Remove data] link provides options to select different groupings of the data, such as by province. I chose to group the data by province.


I also limited the content to only yield and production. Other metrics are available, though they are often just different presentations of the same data, e.g. relative amounts by category.


I selected the option to download the data as a CSV-formatted text file.

This CSV data had to be transformed before it could be used for analysis.

Statistics Canada often includes non-data text, such as report titles and descriptions, in the first rows of its text files, and footnotes in rows below the actual data.

[screenshot: title and footnote rows in the raw CSV file]

Before we can work with this data, these title and footnote rows have to be removed.

For this job I used Excel Power Query, which can do the whole ETL job: extract the data from the CSV file, transform it into a format amenable to analysis, and load it into a worksheet (or data model) so it can be used for analysis.

Power Query has capabilities and features that match those of many more advanced ETL/integration tools such as SSIS, Talend, Informatica, etc.

I have been encouraging Excel users, and providing training, to move all of their ETL work into Power Query. It brings the advanced capabilities of these tools to the relatively non-technical desktop business user.

People too often copy and paste data from other sources into their Excel files.

With Power Query the original data file is a linked data source: it remains untouched and can simply be refreshed at any time to pick up new or modified data, provided the columns and content remain unchanged. This makes it easy to get updated time-series data as it becomes available.

So back to our job. The first step was to link Power Query to the CSV data file downloaded from Statistics Canada, using New Query – From File – From CSV.


Select the data file and click Import.


Here is where we begin to bump into data file format challenges. Power Query can’t automatically determine the file format because there are non-data text strings at the top (report titles, descriptions) and bottom (footnotes) of the file.

I would encourage Statistics Canada and any other data provider to avoid doing this. A data file should be in a predictable tabular format (CSV, tab-separated) or another standard format (JSON, XML), without any other content that needs to be processed out.


But no problem. Power Query is equipped to deal with these situations.

Power Query assumed this is a CSV file, but the top rows of titles and descriptions do not contain commas, so it does not split them into columns, and as a result the columns of the split data rows below are hidden.

The solution I used was to go into the Power Query Source step settings and change the delimiter from Comma to Equals Sign. Since the file contains no equals signs, the data is not split into columns and remains in one column, from which we can remove the top and bottom non-data rows.


The Power Query Source step now looks like this.

= Csv.Document(File.Contents("C:\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252])


Now we can remove the non-data text from the data file. Select Remove Rows – Remove Top Rows and enter the number of rows to remove (in this case the top 7 rows are non-data rows).


= Table.Skip(#"Changed Type",7)


Do the same for the bottom rows but select Remove Rows – Remove Bottom Rows.

= Table.RemoveLastN(#"Removed Top Rows",12)

If we loaded the data to the worksheet now, we would see one column of comma-separated values, which we just need to split to get proper columns of data.


Tell Power Query to split the column by the comma delimiter.


= Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"})


Also tell Power Query to use the first row as headers. If the non-data text rows were not in the file, Power Query would have done this automatically.

= Table.PromoteHeaders(#"Changed Type1")


A key transformation to get this data into a useful tabular format is to unpivot the years columns to rows. This is one of Power Query’s most useful transformations. Many data sources, especially in enterprise business, present time series or other categories as columns, but for dimensional analysis these have to be unpivoted so they are in rows instead of columns.

Note that Statistics Canada’s data formatting options also include moving time series to rows, which would be a similar transformation to this Power Query unpivot.
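
In this query the unpivot is a single step; here it is as it appears in the complete M code at the end of this post:

= Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value")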

Here is the pivoted data.

[screenshot: pivoted data]

And here is the unpivoted data. This now gives us unique columns of data that can be used by any analytical software.

[screenshot: unpivoted data]

I also did some other things to clean up the data:

  • Renamed column headers to be shorter and more readable.
  • Changed the dimension values to shorter strings.
  • Converted hundredweight values to pounds (1 hundredweight = 100 pounds), and multiplied production values by a further 1,000, since the source reports production as hundredweight × 1,000 (see the step after this list).
  • Filtered out Newfoundland because it was missing 2013, 2014, and 2015 values.
  • Removed the 2016 column, which had no values.
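
The pounds conversion is handled with a custom column; here is that step as it appears in the complete query below:

= Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)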

Once loaded back to the worksheet, we have nice clean dimensional data ready to work with in any analytical software. This can be used as a data source for Tableau, QlikView, Cognos, Birst, etc.


Let’s take a quick look at the data using Excel pivot tables and charts. Select the worksheet table, then select Insert – Pivot Table to create a new pivot table like the one below.


Select Insert – Pivot Chart to create a new pivot chart from this pivot table.

In this case I selected a stacked bar chart that shows total Canadian potato production by province.

This does a good job of demonstrating each province’s relative contribution to the total.

A stacked line chart does a better job of illustrating changes in production by province over the years.


Here is a Tableau 9 report from the Power Query results.

[Tableau report]

Here is the complete Power Query M code, copied from the query, that gets and transforms the Statistics Canada raw CSV data file into the data set ready to be used in Excel pivot tables/charts, Tableau, or any other reporting tool.

let
    // read the raw csv as a single text column, using the equals-sign delimiter trick described above
    Source = Csv.Document(File.Contents("C:\Users\bb\Documents\Dropbox\Data\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    // remove the report title rows above and the footnote rows below the data
    #"Removed Top Rows" = Table.Skip(#"Changed Type",7),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Removed Top Rows",12),
    // split the single column on commas and promote the first row to headers
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type1"),
    // drop the empty 2016 column and the incomplete Newfoundland rows
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"2016"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each ([Geography] <> "Newfoundland and Labrador (2)")),
    // unpivot the year columns to rows, then rename and retype the columns
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",{{"Attribute", "Year"}, {"Area, production and farm value of potatoes", "Dimension"}, {"Geography", "Province"}, {"Value", "Hundredweight Value"}}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Renamed Columns",{{"Hundredweight Value", type number}}),
    // shorten the dimension labels
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type2","Average yield, potatoes (hundredweight per harvested acres) (5,6)","pounds per acre",Replacer.ReplaceText,{"Dimension"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Production, potatoes (hundredweight x 1,000)","production",Replacer.ReplaceText,{"Dimension"}),
    // convert hundredweight to pounds (1 hundredweight = 100 pounds; production is reported as hundredweight x 1,000)
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)
in
    #"Added Custom"

Use Excel Power Query to scrape & combine Wikipedia tables

Power Query is a quick and easy way to scrape HTML tables from web pages. Here is a step-by-step guide to getting multiple tables from a Wikipedia article and appending them into one Power Query Excel table.

The Wikipedia article List of national and international statistical services has multiple tables listing countries’ statistics agencies and their website URLs, one table per world region.

https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services

Step 1: Click the Power Query From Web menu icon, paste the URL above into the URL address field, and click OK.


Step 2: Power Query finds and shows you a list of tables that are in this web page. Select the tables you want and click Load.

By default you can only select one of the tables.


However, if you check the Select multiple items checkbox, you can then select more than one of the tables.


You can also select any of the tables to get a preview of the data they contain. After you have selected all the tables you want, click Load.


After you click Load, Power Query will create a new query for each of the selected tables.


Step 3: Append all of the tables into one new combined dataset. You do this with the Table.Combine function, which is (confusingly) also called Append in the Power Query menu. The menu command combines only two tables by default; however, you can simply edit the resulting code in the formula bar to include all of the tables in the Table.Combine formula, as shown below.
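
For example, the generated step might start with just two of the region queries; editing the list to include all five (using the query names Power Query generated from the page) combines everything:

= Table.Combine({#"Africa[edit]", #"Americas[edit]"})

becomes

= Table.Combine({#"Africa[edit]", #"Americas[edit]", #"Asia[edit]", #"Europe[edit]", #"Oceania[edit]"})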


After you update the Table.Combine formula you will have a new query with all of the tables combined into one dataset.


You can refresh the queries to get any changes to the Wikipedia tables.

Here is the M code for the query that gets the first Wikipedia table:

let
    Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services")),
    Data0 = Source{0}[Data]
in
    Data0

Here is the M code for the query that combines the tables into one dataset:
let
    Source = Table.Combine({#"Africa[edit]", #"Americas[edit]", #"Asia[edit]", #"Europe[edit]", #"Oceania[edit]"})
in
    Source

How to schedule Power BI dataset refresh

Do you want to create a Power BI report that gets a daily scheduled refresh of data from a SharePoint CSV file?

The first step is to create your Power BI report in Power BI Desktop using the SharePoint CSV file as the data source.

In Power BI Desktop use Get Data – File – SharePoint Folder to connect to your SharePoint folder.

The resulting dataset query (Power Query) will look something like mine below. You will replace “mydomain” with your SharePoint account name or domain.

You will also replace “datafile.csv” with your CSV file name. The Power BI connection is to a SharePoint folder, which might contain more than one file, as mine did. If there is only one file in the folder the filter is redundant, but it can’t hurt.

let
    Source = SharePoint.Files("https://mydomain.sharepoint.com", [ApiVersion = 15]),
    #"Filtered Rows" = Table.SelectRows(Source, each ([Name] = "datafile.csv")),
    #"Combined Binaries" = Binary.Combine(#"Filtered Rows"[Content]),
    #"Imported CSV" = Csv.Document(#"Combined Binaries",[Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV")
in
    #"Promoted Headers"

After you publish your report to your Power BI Online account, you can select your newly uploaded dataset’s “Schedule Refresh” property, where you can set up the refresh schedule.


First go to “Gateway connection”.

I selected “Connect Directly”, which requires that you also enter SharePoint credentials via the “Edit credentials” link; this pops up a web page prompting you to log in to your SharePoint account. Logging in gives the Power BI service permission to access your SharePoint account to refresh the file.

If you have an enterprise gateway set up, you could try “Enterprise Gateway” and enter the required credentials for that.


If you entered the credentials correctly, you should now be able to set the “Keep your data up to date” switch to “Yes”.

Then you can select which of the four 6-hour windows you want the refresh to run in. Power BI Service free accounts can do daily refreshes; Pro accounts can have hourly updates.

As an aside, be warned that if you click the “Try Pro for free” button you might get a blocking message that you are using a Pro feature. This happened to me and was clearly a quirky error: I had to delete my report and dataset, re-upload them, and redo the scheduling to get rid of the error.


You can try refreshing the dataset manually (On demand) or wait for the next scheduled refresh (Scheduled) to see if the data does refresh. You can see whether refreshes were successful, and when they ran, by clicking the “Refresh history” link.


Power Query MySQL database connection

Excel Power Query can make a connection to a MySQL database, but it requires the MySQL Connector/Net 6.6.5 for Microsoft Windows. Instructions for that are on the Microsoft site.

Once you have the connector, you can get MySQL table data into your Excel file using Power Query in two ways. The method you select depends on how you want to work with the MySQL data you retrieve.

The choice is made after you have selected Power Query – From Database – From MySQL Database.

After you select From MySQL Database, you will see the MySQL Database connection popup.

Now you can make a choice. Either method requires you to specify the Server and Database; that alone makes a direct connection to a single table (or view), but you can optionally enter a SQL statement to effectively connect to multiple tables and views in one connection. A bonus is that the SQL query is pushed down to the server, so the Power Query client doesn’t have to do the work.

Method #1: Native SQL query – when connecting, you have the option to enter SQL, and if you enter a SQL query there you will get:

let
    Source = MySQL.Database("127.0.0.1", "database_name", [Query="select * from database_table_name"])
in
    Source
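
The SQL statement can be any valid MySQL query, including one that joins several tables; the result comes back as a single Power Query table. Here is a minimal sketch, using hypothetical orders and customers tables:

let
    // hypothetical join query; it is pushed down to and executed by the MySQL server
    Source = MySQL.Database("127.0.0.1", "database_name", [Query="select o.order_id, c.customer_name, o.order_total from orders o join customers c on o.customer_id = c.customer_id"])
in
    Source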

Method #2: Power Query Navigation – instead of entering SQL, just leave that field blank and continue on; Power Query will present you with a list of the MySQL server’s database names (schemas) and table names. Simply select the “Table” link for the database and table you want; that adds a Navigation step and retrieves the table’s data.

let
    Source = MySQL.Database("127.0.0.1", "database_name"),
    database_table_name = Source{[Schema="database_name",Item="database_table_name"]}[Data]
in
    database_table_name

Either way you get the same table results in Power Query.

Of course, if you want to join in other tables from your MySQL database(s), then method #1 will be more direct. Method #1 also assumes that you can use SQL to get what you want from the MySQL database tables.

However, you could also use method #2 to retrieve all of the desired tables (even from different databases on that server) and then use Merge or Append to get the desired results, as sketched below.
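
Here is a minimal sketch of that approach, assuming two hypothetical tables, customers and orders, in the same database; the Merge command generates a Table.NestedJoin step like this:

let
    Source = MySQL.Database("127.0.0.1", "database_name"),
    // retrieve two tables via navigation (table and column names are hypothetical)
    customers = Source{[Schema="database_name",Item="customers"]}[Data],
    orders = Source{[Schema="database_name",Item="orders"]}[Data],
    // merge: join each order to its customer on customer_id
    merged = Table.NestedJoin(orders, {"customer_id"}, customers, {"customer_id"}, "customer", JoinKind.LeftOuter)
in
    merged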

Method #2 allows you to retrieve and work with your MySQL table data without using SQL, relying on Excel Power Query instead. That opens the door for relatively non-technical data workers to use the data, which is a pretty cool thing!

You can use these methods with MySQL databases on your local computer or on a remote computer. You just have to make sure to enter the correct server URL, database name, user, and password.

On occasion I have had challenges getting a remote MySQL database connection to work. I had to clear the Power Query cache, update Power Query, and turn off the Privacy option before the connection would work.

BC Hydro’s amazing #BCStorm social media turnaround

BC Hydro made an amazing social media turnaround to communicate with customers in crisis! Go BC Hydro!

On August 29, 2015, high winds caused tree falls that took out BC Hydro power lines throughout BC’s Lower Mainland, Vancouver Island, and Sunshine Coast, leaving up to 500,000 customers without electricity. Many, including me, were without power for 48 hours.

BC Hydro’s web site was their primary channel for communicating with customers about the extent of damage and expected repair times, but the site also went down during this time.

They had used Twitter in the past to communicate with customers, but they weren’t using it much on the first day of the crisis.

However, on the second day of the crisis, with 100,000+ customers still without power, BC Hydro dramatically increased its Twitter communication, responding directly to customer tweets about the extent of damage and expected repair times.

To visualize this dramatic increase in BC Hydro’s Twitter usage, I downloaded all @BCHydro tweets for August 29 and 30 from the Twitter API using Python and Tweepy, and used Microsoft Power BI to create the visualization of BC Hydro tweet counts shown below.

Some notes on the chart:

    • the x-axis shows date and hour of day
    • the y-axis shows the count of tweets
    • ‘response’ tweets are the light green part of each bar
    • ‘status’ tweets are the dark green part of each bar

You can see that on August 30 at 11 AM, about 28 hours after the storm hit, BC Hydro suddenly started tweeting responses to customers’ questions. They continued for the next few days, alongside their regular ‘status’ tweets, until their website was back up.

The chart clearly shows an amazing social media turnaround: BC Hydro dramatically increased their use of Twitter to get customers answers and information. Go BC Hydro!

Note: the Twitter data was last updated Aug 31, 2015 at 16:45 PST.

[chart: BC Hydro tweet counts by hour, Aug 29–31, 2015]

Historically, BC Hydro did not use Twitter this way. The chart below shows BC Hydro’s tweeting before the storm: once or twice per day, with occasional spikes to 7 tweets per day.

[chart: BC Hydro tweets per day before the storm]

The ‘response‘ category includes tweets by BC Hydro responding to a customer tweet. Example ‘response’ tweet:

2015-08-30 14:26:01, @Adamhillz Crews will be on site as soon as possible. Stay back from any downed lines and call 911 if it is on fire.

The ‘status‘ category includes tweets by BC Hydro about ongoing status of repair work etc. Example ‘status’ tweet:

2015-08-30 14:28:32, Crews have been brought in from across province, including Prince George, Terrace, Kamloops, Smithers, Vernon & Vancouver Island #bcstorm

FYI if anyone is interested, here is the text file with tweet text – bchtweets

Here is the Python code used to get the BC Hydro Tweets using the Twitter API.

import tweepy
from dateutil import tz
from tweet_auth import *  # a separate .py file with the Twitter API credentials (see below)

# get twitter auth
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def main():
    # write the most recent 1800 @bchydro tweets to a csv file as: local timestamp, tweet text
    with open('bchtweets.csv', 'wb') as file:
        for status in tweepy.Cursor(api.user_timeline, id='bchydro').items(1800):
            # convert the UTC timestamp to Pacific time, then drop the tzinfo for clean output
            local_time = status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)
            file.write(str(local_time) + ', ' + status.text.encode('utf8') + '\n')

if __name__ == '__main__':
    main()

Note that there is a separate file with my Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

#twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'