Using Excel Power Query to analyse data from Our World In Data

Our World In Data (OWID) has been doing a hero’s job of collating the world’s covid vaccine distribution and administration data. They generate additional metrics and allow site visitors to analyse them in awesome interactive visualizations.

However, I wanted to do some analysis that went beyond what was possible on the site. No problem, because the OWID team also provides the data as csv files on a Github page.

These csv files can be easily downloaded from Github and you are then able to use any data transformation or visualization tool to do any analysis you need.

However if you need to to do frequent analysis because the vaccine data is updated often twice per day you will want to be able to dynamically refresh the data without having to download updated csv files from Github.

There are many tools and methods that can meet the requirement for  automated dynamic data refresh, but in this blog post I want to quickly highlight how Excel users can use the Excel Power Query Get Data->From Other Sources->From Web feature to link to the csv files in Github. After setting up the link, it only takes a simple click to refresh to get latest csv data from OWID’s Github repository.

First you need to get the url to the csv file on Github.

The easiest way to get correct url is to open the csv file in Github in Raw format by clicking on the file name and then clicking on the Raw button on top right of page. This will open the file in native “raw” format. Then you can simply copy the url from browser location bar which is what you want.

However, you can also manually create a url to any Github repository file as follows below.

Let’s use this url to the vaccinations.csv file in OWID’s Github repository below.

https://raw.githubusercontent.com/owid/COVID-19-data/master/public/data/vaccinations/vaccinations.csv

The url can be separated into the following parts:

    • Base url: https://raw.githubusercontent.com
    • Repository owner: owid
    • Repository name: COVID-19-data
    • Repository branch: master
    • Repository base element: public
    • Repository folders: data/vaccinations
    • Repository file name: vaccinations.csv

You should be able to manually create your own url using these components for any file in any Github repository.

Once you have your url, you can test if it is correct by simply copy pasting it into a browser location bar and verifying the url is good and retrieves the data you are looking for.

Once you have verified the url is correct, simply go to your Excel file and use the Excel Power Query Get Data feature and select From Other Sources and then From Web options and paste your url into the dialogue space and continue to load the file.

Once the query has been saved and data is loaded you can simply click and refresh it anytime to  retrieve the most recent csv file from the OWID Github repository!

Heat maps of Canadian activity changes due to COVID-19 using Google Community Mobility Reports

During the 2020 COVID-19 pandemic in Canada I wanted to get better understanding of the geographical distribution of COVID-19 related activity changes across Canada.

Google has helpfully provided freely available global “Community Mobility Reporting” which shows Google location history change compared to baseline by country, and country sub-regions. These provide changes in activity by location categories: Workplace, Retail & Recreation, Transit Stations, Grocery & Pharmacy and Parks locations, and Residential locations. For Canada it is available by province. As of April 19, data contained daily values from Feb 15 to Apr 11.

The Community Mobility Reporting data is available as a single csv file for all countries at Google Community Mobility Report site. In addition, Google provides feature to filter for specific country or country sub regions eg state or provinces, etc and download resulting PDF format.

As the COVID-19 lockdowns occurred across Canada you would expect that people were less likely to be in public spaces and more likely to be at home. The Community Mobility Reporting location history allows us to get some insight into whether or not this happened, and if it did, to what degree and how this changed over time.

I used the Community Mobility Report data to create a D3.js heat map visualization which is described in more detail below and in this Github repository.

I also created an Excel version of this heat map visualization using Pivot Table & Chart plus conditional formatting. This Excel file, described in more detail below, is available in the Github repository.

More detail and screenshots of visualizations is provided below:

Heatmaps
Heatmaps are grids where columns represent date and rows province/territory. Each heatmap is a grid representing a single mobility report category. The grid cell colors represent value of percent change which could be positive or negative. Changes can be observed as lockdowns occurred where locations in public areas decreased relative to baseline. Inversely, residential location increased relative to baseline as people sheltered in place at their homes.

1) Heatmap created using Excel / Power Query: For this heatmap visualization the global csv data was transformed using Excel Power Query. The Excel file has two Pivot Table and Chart combos. The Excel files and Power Query M Code are in the repository. Excel files are available in Github repository.

2) Heatmap created using D3.js: For this heatmap visualization the global csv data was transformed using Excel Power Query. The heatmap visualization was created using slightly modified code from ONSvisual.

Bar charts
These were created using Excel to visualize percent change by Province/Territory and location category using Excel / Power Query. These allow comparison between provinces by date and category. This Excel / Power Query file can be used for analytical purposes to slice and dice global data by date, country, sub region 1 & 2 and category. Excel files are available in Github repository.

Introduction to Excel Power Query aka Power BI Query

What is Power Query?

Power Query is an Microsoft tool for Excel that is used to do the following:

  • Connect to a data source
  • Extract data from source
  • Transform the extracted data
  • Load the transformed data into Excel

This functionality is called ETL (Extract, Transform and Load).

Why use Power Query?

Excel already has another tab of features called Get External Data, why not use that?

  • Excel’s other methods for connecting to and retrieving data may still be useful, but they are not able to transform the data.
  • Power Query connections are non-techie, user friendly, and provide much more versatile and robust functionality.

Where is Power Query in Excel?

You can find Power Query in the Data tab Get & Transform icon group in Excel 2016

Note that in previous versions of Excel, Power Query is an add-in that has its own tab. However, in Excel 2016 Power Query is built into Excel.

Get & Transform has 4 icons

  • Show Queries – this hides/shows the Query pane docked on right side of Excel.
  • New Query – choose data source & modify Power Query settings and options.
  • From Table – choose data source as currently selected Excel table.
  • Recent Sources – choose from recently selected data sources.

Step 1 – Connect to a data source and retrieve data

Select the New Query icon to select which type of data you want to connect to

  • FromFile (csv, Excel)
  • FromDatabase (SQL Server db, Access db)
  • FromOnline Services (Facebook)
  • From Other Sources (webpage, SharePoint List, ODBC connection)

Screenshot shows example of connecting to an Excel Workbook.

  • Select From File – From Workbook
  • Popup dialogue allows you to select an Excel file.

Step 2 – Transform the data – common usage

  • “Calculated columns” replace any worksheet formula requirement
  • Power Query has many more formulas than plain old Excel, including VBA type functionality
  • Power Query has been called “vlookup on steroids”
  • “Pivot” and “unpivot” data
  • Group and aggregate data into sums, min, max, averages
  • Simple to combine many files contents in a folder into one table.
  • Filter, sort, replace with complex logic.
  • Merge two or more separate tables of data together on matching values. If you are familiar with SQL, this is a JOIN query.
  • Append two or more separate tables of data into one combined table of data. If you are familiar with SQL, this is a UNION query.

Step 3 – Load the data into Excel

  • Power Query queries are loaded to a worksheet by default.
  • However, you can choose to save a query as a connection only by selecting Load To – Only Create Connection. This query can be used in other Queries eg in Merge or Append, and by other Excel features eg it can be used as datasource for a Pivot Table.

Step 4 – Click and refresh to retrieve new data

  • When data source(s) are updated, just click Refresh to retrieve and transform new data.
  • No copy & paste of data from other files.
  • No worry about formulas being deleted or ranges being moved.
  • All transformations are safely contained inside the Query.

Power Query is in Microsoft Power BI

  • Power Query has been integrated into Power BI as the “Get Data” feature.
  • Has the same functionality and features as the Power Query in Excel.
  • Same programming language ‘M’.

Excel Power Query tutorial using Canadian potato production statistics

This data comes from Statistics Canada.

1

Statistics Canada download pages often provide the opportunity to modify the data structure and content before it is downloaded.

For example clicking on the [Add/Remove data] link provides options select different groupings of data by provinces. I choose to group data by provinces.

2

I also limited the content to only yield and production. There are other metrics available though they are often differing presentations of the data eg relative amounts by category etc.

3

I selected option to download the data as a csv formatted text file.

This csv file data had to be transformed before it could be used for analysis.

Statistics Canada often includes non-data text such as report titles and descriptions in the first rows of text files and footer notes in rows below the actual data.

toptext

Before we can work with this data these title and footer rows have to be removed.

For this job I used Excel Power Query which can do the ETL. It can extract the data from the csv file, transform the data to a format that is amenable to analysis and load the data into a worksheet (or model) so it can be used for analysis.

Power Query has capabilities and features that match those of many more advanced ETL / integration software such as SSIS, Talend, Informatica, etc.

I have been encouraging and providing training to Excel users to move all of their ETL work in to Power Query. It brings the advanced capabilities of these tools to the relatively non technical desktop business user.

People too often copy and paste data from other sources into their Excel file.

With Power Query the original data file is a data source, and is linked into the Power Query, thus remains untouched and can simply be refreshed to get new or modified data at any time, providing that the columns and content remain unchanged. This makes it easy to get updated time sequence data as it comes available.

So back to our job. The first step was to link to the csv data file downloaded from Statistics Canada into Power Query by using New Query – From File – From CSV

01

Select the data file and click Import.

02

Here is where we begin to bump into data file format challenges. Power Query can’t automatically determine file format because there are non-data text strings at top (report titles, descriptions) and bottom (footer notes).

I would like to encourage Statistics Canada and any other data provider to avoid doing this. A data file should be in a predictable tabular (CSV, tab separated) or other format (JSON, XML) without any other content that needs to be processed out.

03

But no problem.  Power Query is equipped to deal with these situations.

Power Query assumed this is csv file but the top rows of titles and descriptions do not have commas so it doesn’t  split top rows into columns and as a result the split rows below are hidden.

One solution which I used was to go into the Power Query Source step settings to change the delimiter from Comma to Equals Sign. Since there are no equals signs the data is not split into columns and remains in one column that we can remove top and bottom non-data rows.

04

 

The Power Query Source step now looks like this.

= Csv.Document(File.Contents("C:\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252])

05

Now we can remove the non-data text from the data file. Select the Remove Rows – Remove Top Rows and enter the number of rows to remove (in this case it is top 7 rows are non-data rows).


= Table.Skip(#"Changed Type",7)

06

Do the same for the bottom rows but select Remove Rows – Remove Bottom Rows.

= Table.RemoveLastN(#"Removed Top Rows",12)

Now we can load the data to the worksheet and we will see one column of comma separated values which we just need to split to get nice columns of data.

07

Tell Power Query to split columns by comma.


= Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"})

08

Also tell Power Query to use first row as headers. If the non-data text rows were not in the file Power Query would have done this automatically.

= Table.PromoteHeaders(#"Changed Type1")

092

A key data transformation to get this data into a useful tabular format is to unpivot the years columns to rows. This is one of Power Query’s most useful transformations. Many data sources, especially in enterprise business, often have time series or other categories presented as columns but to do dimensional analysis these have to be unpivoted so they are in rows instead of columns.

Note that the Statistics Canada data formatting feature often includes moving time series to rows which would be similar transformation to this Power Query Unpivot.

Here is the pivoted data.

pivoted

And here is the unpivoted data. This now gives us unique columns of data that can be used by any analytical software.

unpivoted

 

I also did some other things to clean up the data:

  • Renaming column headers to be shorter, more readable.
  • Changing the dimension values to shorter strings.
  • Changing hundredweight values to pounds multiplying production values by 1000 to get full values.
  • Filtered out Newfoundland because it was missing 2013, 2014, 2015 values.
  • Removed the 2016 column which had no values.

Once loaded back to worksheet we now have nice clean dimensional data ready for us to begin working with in any analytical software. This can be used as data source for Tableau, Qlikview, Cognos, Birst, etc.

5

Lets take a quick look at the data using Excel Pivot Tables and Charts. Select this worksheet table and select Insert – Pivot Table to create a new pivot table like the one below.

7

 

Select Insert Pivot Chart to create new pivot chart from this pivot table.

In this case I selected a stacked bar chart that show total Canadian potato production by province.

This does a good job of demonstrating relative contribution to total by province.

9
A stacked line chart does a better job of illustrating changes in production by provinces by year.

91

 

Here is a Tableau 9 report from the Power Query results.

 

Here is the complete Power Query M-code copied from the Power Query that gets and transforms the Statistics Canada raw csv data file into the transformed data set ready to be used in Excel Pivot Tables/Charts and Tableau or any other reporting tool.

let
    Source = Csv.Document(File.Contents("C:\Users\bb\Documents\Dropbox\Data\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Removed Top Rows" = Table.Skip(#"Changed Type",7),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Removed Top Rows",12),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type1"),
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"2016"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each ([Geography] <> "Newfoundland and Labrador (2)")),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",{{"Attribute", "Year"}, {"Area, production and farm value of potatoes", "Dimension"}, {"Geography", "Province"}, {"Value", "Hundredweight Value"}}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Renamed Columns",{{"Hundredweight Value", type number}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type2","Average yield, potatoes (hundredweight per harvested acres) (5,6)","pounds per acre",Replacer.ReplaceText,{"Dimension"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Production, potatoes (hundredweight x 1,000)","production",Replacer.ReplaceText,{"Dimension"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)
in
    #"Added Custom"

Use Excel Power Query to scrape & combine Wikipedia tables

Power Query is quick and easy way to scrape HTML tables on web pages. Here is step by step on getting multiple tables from Wikipedia article and appending them into one Power Query Excel table.

The Wikipedia article List of national and international statistical services has multiple tables with lists of countries’ statistics agencies and their website urls. These are separated by world region.

https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services

Step 1: Click the Power Query From Web menu icon and paste the url above into the URL address field and click OK.

stat_scrape_1

Step 2: Power Query finds and shows you a list of tables that are in this web page. Select the tables you want and click Load.

By default you can only select one of the tables.

stat_scrape_2

However you can check the Select multiple items checkbox and you can then select more than one of the tables.

stat_scrape_3

You can also select any of the tables to get a preview of the data they contain. After you have selected all the tables you want, click Load.

stat_scrape_4

After you click Load Power Query will create a new query for each of the selected tables.

stat_scrape_5

Step 3: Append all of the tables into one new combined dataset. You do this with the Table.Combine feature which is confusingly also called Append in the Power Query menu icons.  The menu icon feature combines only two tables by default, however you can simply manually edit the resulting code in the address bar to include all of the tables in the Table.Combine formula.

stat_scrape_6

After you update the Table.Combine formula you will have a new query with all of the tables combined into one dataset.

stat_scrape_7

You can refresh the queries to get any changes to the Wikipedia tables.

Here is M code for the Power Query that gets the Wikipedia table

 

let
Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services")),
Data0 = Source{0}[Data]
in
Data0

Here is M code for the Power Query that combines the tables into one dataset:

 
let
Source = Table.Combine({#"Africa[edit]",#"Americas[edit]",#"Asia[edit]",#"Europe[edit]",#"Oceania[edit]"})
in
Source