Canadian Government First Nations long-term water advisory data

The Government of Canada is working with Canadian First Nations communities to end long-term drinking water advisories, meaning advisories that have been in effect for more than 12 months. The effort dates from November 2015 and includes projects to build or renew drinking water infrastructure in these communities.

The First Nations drinking water advisories were a big part of the 2021 Canadian federal election campaigns. The current government had promised in 2015 to resolve all of the advisories, and the opposition criticized it for not yet having achieved that goal. However, communication about the actual number and status of the advisory projects was poor, so I wanted to find some data and learn the actual status for myself.

TLDR: the results can be seen on a Github.io hosted page. A screenshot is provided below. The code to retrieve and transform the data and create that web page is in my Github repository https://github.com/sitrucp/first_nations_water.

Finding the data I needed took only a few minutes of Google searching, which led to an Indigenous Services Canada website that maps the water advisories by First Nations community, with counts of advisories, project status, advisory dates, and project type.

The web page also includes a link to download the map data, but after a bit of scraping and inspection I learned that the downloadable map data is generated by the page's JavaScript from another, larger data source referenced in the map page code: https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt.

This larger data source is a text file containing JSON-format records for all Canadian First Nations communities, many thousands of records in total. As a side note, this larger file appears to be an official Canadian government dataset used on other government websites. The map data is limited to only about 160 First Nations communities with drinking water issues.

So rather than download the 160-record map page data, I retrieved the larger JSON-format text file and used logic similar to the map web page code to extract the 160 records. This was done in two steps.

Step 1: retrieve the JSON-format text file using a Python script, water_map_data.py, which downloads and saves the data file locally. I may yet automate this as a scheduled task so the map gets regularly updated data as advisory statuses change over time.
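
The retrieval is simple. Below is a minimal sketch of what water_map_data.py does, assuming only the requests library; the local file name is my own illustrative choice, not necessarily what the repository uses.

import requests

# Indigenous Services Canada map data source (same URL as referenced above)
URL = ("https://www.sac-isc.gc.ca//DAM/DAM-ISC-SAC/DAM-WTR/STAGING/"
       "texte-text/lTDWA_map_data_1572010201618_eng.txt")

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly if the download did not succeed

# Save the raw text locally so step 2 can process it
with open("lTDWA_map_data_eng.txt", "w", encoding="utf-8") as f:
    f.write(response.text)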

Step 2: process the saved data file and present it in an HTML web page as Plotly.js charts and an HTML table, using JavaScript in the file first_nations_water.js.

Finally, I also separately created an Excel file with a Pivot Table and Chart that you can download and use to do your own analysis. Download this Excel file from the Github repository. The file contains an Excel Power Query link to the larger JSON text file described above, so you can simply refresh the query to get the latest data from the Indigenous Services Canada website.

Heat maps of Canadian activity changes due to COVID-19 using Google Community Mobility Reports

During the 2020 COVID-19 pandemic in Canada I wanted to get a better understanding of the geographical distribution of COVID-19 related activity changes across Canada.

Google has helpfully provided freely available global “Community Mobility Reports” which show changes in Google location history compared to a baseline, by country and country sub-region. These give changes in activity by location category: Workplaces, Retail & Recreation, Transit Stations, Grocery & Pharmacy, Parks, and Residential. For Canada the data is available by province. As of April 19, the data contained daily values from Feb 15 to Apr 11.

The Community Mobility Reports data is available as a single csv file for all countries at the Google Community Mobility Report site. In addition, Google provides a feature to filter for a specific country or sub-region (e.g., state or province) and download the result in PDF format.
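
For example, loading the single global csv and filtering it down to the Canadian provincial rows takes only a few lines of pandas. This is a hedged sketch; the column names assume the schema of the 2020 csv release.

import pandas as pd

df = pd.read_csv("Global_Mobility_Report.csv", low_memory=False)

# Keep Canada only; rows with a sub_region_1 value are the provinces
canada = df[(df["country_region"] == "Canada") & df["sub_region_1"].notna()]

# One column per location category, e.g. percent change at workplaces
print(canada[["sub_region_1", "date",
              "workplaces_percent_change_from_baseline"]].head())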

As the COVID-19 lockdowns occurred across Canada you would expect people to be less likely to be in public spaces and more likely to be at home. The Community Mobility Reports location history allows us to get some insight into whether or not this happened, and if it did, to what degree and how it changed over time.

I used the Community Mobility Report data to create a D3.js heat map visualization which is described in more detail below and in this Github repository.

I also created an Excel version of this heat map visualization using Pivot Table & Chart plus conditional formatting. This Excel file, described in more detail below, is available in the Github repository.

More detail and screenshots of the visualizations are provided below:

Heatmaps
Heatmaps are grids where columns represent dates and rows represent provinces/territories. Each heatmap is a grid for a single mobility report category. The cell colors represent the percent change value, which can be positive or negative. The lockdowns are clearly visible: activity at public locations decreased relative to baseline, while, inversely, residential activity increased relative to baseline as people sheltered in place at home.
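
The heatmaps themselves were built in Excel and D3.js as described below, but the structure is easy to sketch in Python with pandas and matplotlib. This is an illustrative sketch only, again assuming the 2020 csv schema.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Global_Mobility_Report.csv", low_memory=False)
canada = df[(df["country_region"] == "Canada") & df["sub_region_1"].notna()]

# Rows = province, columns = date, values = percent change for one category
grid = canada.pivot_table(index="sub_region_1", columns="date",
                          values="workplaces_percent_change_from_baseline")

fig, ax = plt.subplots(figsize=(12, 5))
im = ax.imshow(grid.values, aspect="auto", cmap="RdBu", vmin=-80, vmax=80)
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels(grid.index)
ax.set_title("Workplaces: % change from baseline")
fig.colorbar(im, ax=ax, label="% change")
plt.tight_layout()
plt.show()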

1) Heatmap created using Excel / Power Query: for this heatmap visualization the global csv data was transformed using Excel Power Query. The Excel file has two Pivot Table and Chart combos. The Excel files and Power Query M code are available in the Github repository.

2) Heatmap created using D3.js: for this heatmap visualization the global csv data was also transformed using Excel Power Query. The heatmap visualization was created using slightly modified code from ONSvisual.

Bar charts
These were created using Excel / Power Query to visualize percent change by province/territory and location category, allowing comparison between provinces by date and category. The Excel / Power Query file can also be used to slice and dice the global data by date, country, sub-regions 1 & 2, and category. The Excel files are available in the Github repository.
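
The Excel file does this slicing with pivot tables; for readers who prefer code, here is a rough pandas equivalent that unpivots the six category columns into rows so any combination of date, region and category can be grouped. Column names again assume the 2020 csv schema.

import pandas as pd

df = pd.read_csv("Global_Mobility_Report.csv", low_memory=False)

# Unpivot the "..._percent_change_from_baseline" columns into rows so that
# date / country / sub-region / category can all be used as dimensions
categories = [c for c in df.columns
              if c.endswith("_percent_change_from_baseline")]
long_df = df.melt(
    id_vars=["country_region", "sub_region_1", "sub_region_2", "date"],
    value_vars=categories,
    var_name="category",
    value_name="percent_change",
)

# e.g. average change by province and category for Canada
canada = long_df[long_df["country_region"] == "Canada"]
print(canada.groupby(["sub_region_1", "category"])["percent_change"].mean())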

Excel Power Query tutorial using Canadian potato production statistics

This data comes from Statistics Canada.

Statistics Canada download pages often provide the opportunity to modify the data structure and content before it is downloaded.

For example, clicking on the [Add/Remove data] link provides options to select different groupings of the data. I chose to group the data by province.

I also limited the content to only yield and production. There are other metrics available, though they are often different presentations of the same data, e.g., relative amounts by category.

I selected the option to download the data as a csv-formatted text file.

This csv file data had to be transformed before it could be used for analysis.

Statistics Canada often includes non-data text such as report titles and descriptions in the first rows of text files and footer notes in rows below the actual data.

[Screenshot: non-data title and description rows at the top of the csv file]

Before we can work with this data these title and footer rows have to be removed.

For this job I used Excel Power Query, which can do the ETL: extract the data from the csv file, transform it into a format amenable to analysis, and load it into a worksheet (or data model) so it can be used for analysis.

Power Query has capabilities and features that match those of more advanced ETL / integration software such as SSIS, Talend, Informatica, etc.

I have been encouraging Excel users, and providing training, to move all of their ETL work into Power Query. It brings the advanced capabilities of these tools to the relatively non-technical desktop business user.

People too often copy and paste data from other sources into their Excel file.

With Power Query the original data file is a data source that is linked into the query and remains untouched; it can simply be refreshed to get new or modified data at any time, provided the columns and content remain unchanged. This makes it easy to get updated time series data as it becomes available.

So back to our job. The first step was to link the csv data file downloaded from Statistics Canada into Power Query using New Query – From File – From CSV.

Select the data file and click Import.

Here is where we begin to bump into data file format challenges. Power Query can’t automatically determine the file format because there are non-data text strings at the top (report titles, descriptions) and bottom (footer notes).

I would like to encourage Statistics Canada and any other data provider to avoid doing this. A data file should be in a predictable tabular (CSV, tab separated) or other format (JSON, XML) without any other content that needs to be processed out.

But no problem.  Power Query is equipped to deal with these situations.

Power Query assumed this is a csv file, but the top rows of titles and descriptions do not contain commas, so it doesn’t split the top rows into columns, and as a result the split rows below are hidden.

One solution, which I used, was to go into the Power Query Source step settings and change the delimiter from Comma to Equals Sign. Since the data contains no equals signs, it is not split into columns and remains in one column from which we can remove the top and bottom non-data rows.

The Power Query Source step now looks like this.

= Csv.Document(File.Contents("C:\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252])

Now we can remove the non-data text from the data file. Select Remove Rows – Remove Top Rows and enter the number of rows to remove (in this case the top 7 rows are non-data rows).

= Table.Skip(#"Changed Type",7)

Do the same for the bottom rows but select Remove Rows – Remove Bottom Rows.

= Table.RemoveLastN(#"Removed Top Rows",12)

Now we could load the data to the worksheet, but we would see one column of comma-separated values, which we first need to split to get nice columns of data.

Tell Power Query to split the column by commas.

= Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"})

Also tell Power Query to use the first row as headers. If the non-data text rows had not been in the file, Power Query would have done this automatically.

= Table.PromoteHeaders(#"Changed Type1")

A key transformation to get this data into a useful tabular format is to unpivot the year columns to rows. This is one of Power Query’s most useful transformations. Many data sources, especially in enterprise business, present time series or other categories as columns, but for dimensional analysis these have to be unpivoted into rows.
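
The corresponding step in the complete M-code at the end of this post is:

= Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value")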

Note that the Statistics Canada data formatting feature often includes moving time series to rows, which would be a similar transformation to this Power Query unpivot.

Here is the pivoted data.

[Screenshot: pivoted data]

And here is the unpivoted data. This now gives us unique columns of data that can be used by any analytical software.

[Screenshot: unpivoted data]

I also did some other things to clean up the data:

  • Renaming column headers to be shorter and more readable.
  • Changing the dimension values to shorter strings.
  • Converting hundredweight values to pounds (1 hundredweight = 100 pounds) and multiplying production values by a further 1,000, since production is reported in thousands of hundredweight (see the M-code step after this list).
  • Filtering out Newfoundland because it was missing 2013, 2014 and 2015 values.
  • Removing the 2016 column, which had no values.
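
The unit conversion is handled in the Added Custom step of the M-code at the end of this post:

= Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)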

Once loaded back to the worksheet we have nice clean dimensional data ready to work with in any analytical software. It can be used as a data source for Tableau, Qlikview, Cognos, Birst, etc.

Let’s take a quick look at the data using Excel Pivot Tables and Charts. Select the worksheet table and select Insert – Pivot Table to create a new pivot table like the one below.

Select Insert – Pivot Chart to create a new pivot chart from this pivot table.

In this case I selected a stacked bar chart that shows total Canadian potato production by province.

This does a good job of demonstrating relative contribution to total by province.

A stacked line chart does a better job of illustrating changes in production by provinces by year.

I also created a Tableau 9 report from the Power Query results.

Here is the complete Power Query M-code, copied from the query, that gets and transforms the Statistics Canada raw csv data file into a transformed data set ready to be used in Excel Pivot Tables/Charts, Tableau, or any other reporting tool.

let
    Source = Csv.Document(File.Contents("C:\Users\bb\Documents\Dropbox\Data\statscan data\potatoes\cansim-0010014-eng-1573729857763215673.csv"),[Delimiter="=", Encoding=1252]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Removed Top Rows" = Table.Skip(#"Changed Type",7),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Removed Top Rows",12),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Bottom Rows","Column1",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type1"),
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"2016"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each ([Geography] <> "Newfoundland and Labrador (2)")),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Filtered Rows", {"Geography", "Area, production and farm value of potatoes"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",{{"Attribute", "Year"}, {"Area, production and farm value of potatoes", "Dimension"}, {"Geography", "Province"}, {"Value", "Hundredweight Value"}}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Renamed Columns",{{"Hundredweight Value", type number}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type2","Average yield, potatoes (hundredweight per harvested acres) (5,6)","pounds per acre",Replacer.ReplaceText,{"Dimension"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Production, potatoes (hundredweight x 1,000)","production",Replacer.ReplaceText,{"Dimension"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Pounds", each if [Dimension] = "pounds per acre" then [Hundredweight Value] * 100 else if [Dimension] = "production" then [Hundredweight Value] * 100 * 1000 else null)
in
    #"Added Custom"

Visualization of Toronto’s 311 contact centre open data

This is based on the City of Toronto’s Open Data website 311 call performance data for the four years 2010 to 2013.

The data is provided by the City in a Google Spreadsheet and has been downloaded and saved here in Excel 2013 format: 311_daily_contact_centre_data.

I used Excel pivot tables and charts for analysis and visualization.

Call volume has remained relatively consistent year over year from 2010 to 2013. The chart below shows all daily calls over that period. On average there are about 5,000 calls per day. Seasonal variations are evident, with peaks in summer, along with a few big spikes, notably one at the end of December 2013 when volume spiked to over 20,000 calls per day. I’m not sure what caused that.

[Chart: 311 daily call volume, 2010–2013]

Weekend call volume is dramatically lower than weekday volume, which suggests that 311 calls are largely business related.

[Chart: weekday vs weekend call volume]

The 311 line keeps a fairly consistent average call talk time of 230 seconds, or about 4 minutes, as represented by the black line in the chart below.

The average speed of answer metric varies quite a bit (red line in chart below). We can see that answer time follows call volume.

When the call centre gets more calls it takes longer to answer them. This suggests the call centre has the same number of agents available regardless of season, day of week, or special events; it is probably too expensive and/or too challenging to hire staff to work part time or on call for these surge events. There are also some anomalously high call answer times that might be due to understaffing or equipment failures.

Grouping calls, talk times and answer times by month in the chart below may obscure daily variations, and daily outliers may skew the monthly totals, but the month-over-month view does a good job of showing trends.

[Chart: monthly calls, talk times and answer times]

The call centre metrics around call abandonment are important measures. We will see near the end of the post how these are used to create a ‘service level’ metric. The chart below shows a breakdown of how many calls are actually answered and connected to a 311 call centre agent.

  • Only about 75% of the total calls that come into the 311 call centre are actually answered (blue part of bar).
  • The remaining 25% of the calls are abandoned by the caller.
    • On average, 15% are abandoned within 30 seconds (green part of bar). These people are busy, they won’t wait, and leave the hold queue relatively quickly.
    • On average, 10% wait longer than 30 seconds before hanging up (red part of bar). These people invested their time in waiting on hold.

[Chart: relative volume of answered and abandoned calls]

As mentioned above, the call centre creates a ‘service level’ metric, a percentage based on abandoned calls and call answer time. When there are no abandoned calls the service level approaches 100%. However, the Toronto 311 call centre has not hit 100% very often, as shown by the orange line in the chart below, which is the average service level percent by month. In fact it was never 100% for any month over the four years.

[Chart: average service level percent by month]
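
The data does not document the exact service level formula, so as an illustration only, here is a minimal Python sketch assuming a common industry definition: the percentage of calls answered within a target time, ignoring calls abandoned quickly. The function name, inputs and thresholds are my assumptions, not the City’s.

# Hypothetical daily service level; all names and thresholds are assumed,
# not taken from the City's dataset
def service_level(answered_within_target: int,
                  total_answered: int,
                  abandoned_after_30s: int) -> float:
    demand = total_answered + abandoned_after_30s
    if demand == 0:
        return 100.0
    return 100.0 * answered_within_target / demand

# Example: 3,500 of 4,000 answered calls met the target and 400 callers
# abandoned after waiting more than 30 seconds
print(f"{service_level(3500, 4000, 400):.1f}%")  # 79.5%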

Another way to look at service level is to count how many days fell into each service level category. The chart below groups daily service levels into the following categories: 0-25%, 25-50%, 50-75% and 75-100%. Only roughly 70% of the days in 2012 and 2013 were in the 75-100% service level band.

[Chart: days per year by service level category]

Yet another way to look at service level is its daily attainment frequency over the four years, shown in the chart below. For example, a service level of 100% was attained on only 4 days from 2010 to 2013. This view provides more granularity on call centre service levels.

[Chart: service level daily attainment frequency]

This analysis suggests that the call centre needs to increase its service level metric. Since service level is largely dependent on call abandonment, something needs to be done to decrease the number of calls sent to hold, which results in 25% of callers waiting and then hanging up. What can a call centre do about this?

  • Increase the number of agents answering calls, the most obvious action. The challenge is that they would need to be part-time or shift workers to handle spikes in call volume.
  • Reduce call talk time so agents can answer more calls, reducing calls sent to hold and abandoned calls. This can be done by improving training or making information more accessible to agents so they can answer questions faster and more efficiently.
  • Instead of putting callers on hold, offer them alternative information channels, perhaps an automated interactive voice system or an interactive website.
  • Analyze the reasons people call 311 and endeavour to provide that information proactively in other channels, e.g., City websites, billboards, or other public communication channels.

The 311 call centre data is available at the City of Toronto Open Data website under the catalogue name “311 Contact Centre Performance Metrics”.