OneDrive API features

Microsoft has three file storage options:

    1. OneDrive Personal
    2. OneDrive Business
    3. Sharepoint

These have recently been unified into one new OneDrive API https://dev.onedrive.com and oAuth is preferred method of authentication.

However there are some key differences how the API:

    • OneDrive Personal authenticates against oAuth account created at Microsoft Application Registration Portal using a Microsoft account (Live, Microsoft.com). Authentication url is: https://login.live.com/oauth20_authorize.srf
    • OneDrive Business and Sharepoint authenticate against oAuth account created in Azure Active Directory and must be done with Office 365 account.  Authentication url is: https://login.microsoftonline.com/common/oauth2/

You can create two types of applications that will have different methods and parameters:

    • Web application – web site based application that user can sign into. Require definition of an active redirect url and definition of client secret. Scopes or permissions are identified on the fly when authentication is made.
    • Native client application – Android, ‘head-less’ server app, etc. Requires only definition of an unique redirect uri and scopes or permissions that Office 365 account / user have available eg read & write files, etc.

The process for authentication is similar:

    • Sign-in with user account/password or send authentication url to authentication server to get authentication code.
    • Server sends back url with authentication code.
    • Retrieve authentication code from url.
    • Send another url comprised of code and other parameters back to server to get tokens.
    • Use tokens to list, view, upload, download files etc.

There are development SDKs available for popular languages.

I was only interested in thePython SDK . Some key notes about it include:

    • It is created specifically for web applications, not native client applications. The SDK authentication method relies on using a web browser to pass urls with parameters, codes and tokens back and forth. A native client application will not use web browser. A work around was to use head-less browser but that is a bit hacky.
    • It only has OneDrive Personal authentication urls. To use this with OneDrive Business or Sharepoint these urls are easily replaced with the OneDrive Business authentication urls in two files: auth_provider.py and the onedrivesdk_helper.py.

The change to the unified OneDrive API and oAuth authentication only happened in late 2015 so this is pretty new.

There weren’t many well developed or well documented OneDrive Python example code available.

Note it is still also possible to work with OneDrive Business, Sharepoint and OneDrive Personal without using oAuth authentication and this new OneDrive API by simply using urllib, request and templating along with hard coded Office 365 username and password in your code to authenticate.

Finally Microsoft Graph API can be used to interact with OneDrive Business, Sharepoint and OneDrive Personal  once oAuth is setup.

 

How to filter referral spam from Google Analytics using API and Python

Google Analytics data has become incredibly polluted by “spam referrals” which inflate site visits with what are essentially spam advertisements delivered to you via Google Analytics.

The spammers are entirely bypassing your site and directly hitting Google’s servers pretending to be a visitor to your site. Its a bit odd that a technological superpower like Google has fallen prey to spammers. Apparently a fix is in the works but it feels like its taking way too long.

In the meantime the fix is to filter out any “visit” that doesn’t have a legitimate referrer hostname. You determine what hostnames you find legitimate. At a minimum you want to include your domain. You can also filter out spam visits based on where their source. The source name is the where the spammers advertise to you by giving their spam domains hoping you will visit their sites. Setting up these filters can be done in Google Analytics built-in filters and it takes some manual effort and some ongoing updating as spammers keep changing source names.

The screenshot below shows the Google Analytics filter screen where you build filters for hostname and source using rules based filtering.

google filter

However this same rules based filtering can be done using the Google Analytics API. There is a lot of code around for you to work with and Google documentation is pretty good. I have implemented a hostname and source filter using Python and the code below. This enables me to download run the code in scheduled job and always have analytics data for analysis.

The “hostMatch” and “sourceExp” are the two things that filter out fake hostnames and fake visit source respectively.

You will need to get yourself Google API access and setup the OAuth (which I am not describing here). You will need the OAuth key and a secret file to authorize access to the API then you can use the code below.

'''access the Google Analytics API.'''
# https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults

import argparse
import csv
import re
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials
import httplib2
from oauth2client import client
from oauth2client import file
from oauth2client import tools
from datetime import datetime, timedelta

todaydate = datetime.today().strftime('%Y-%m-%d')

def get_service(api_name, api_version, scope, key_file_location,
				service_account_email):
	'''Get a service that communicates to a Google API.
	Args:
	api_name: The name of the api to connect to.
	api_version: The api version to connect to.
	scope: A list auth scopes to authorize for the application.
	key_file_location: The path to a valid service account p12 key file.
	service_account_email: The service account email address.
	Returns:
	A service that is connected to the specified API.
	'''
	# client_secrets.p12 is secrets file for analytics
	f = open(key_file_location, 'rb')
	key = f.read()
	f.close()
	credentials = SignedJwtAssertionCredentials(service_account_email, key,
	scope=scope)
	http = credentials.authorize(httplib2.Http())
	# Build the service object.
	service = build(api_name, api_version, http=http)

	return service


def get_accounts(service):
	# Get a list of all Google Analytics accounts for this user
	accounts = service.management().accounts().list().execute()

	return accounts


def hostMatch(host):
        #this is used to filter analytics results to only those that came from your hostnames eg not from a spam referral host
	hostnames="domainname1","domainname2","domainname3"

	hostExp = "(" + ")|(".join(hostnames) + ")"
	hostMatch = re.search(hostExp, host[3].lower())

	if hostMatch:
		return True
	else:
		return False


def main():

    #this is where you build your filter expression, note it similar to what you would build in Google Analytics filter feature, you can be as specific of generalized using regex as you want/need
    #ga:source filter
    sourceExp=('ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]*-gratis;ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]')

    # Define the auth scopes to request.
    scope = ['https://www.googleapis.com/auth/analytics.readonly']

    #Provide service account email and relative location of your key file.
    service_account_email = 'xxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com'
    key_file_location = 'client_secrets.p12'
    #scope = 'http://www.googleapis.com/auth/analytics'

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location, service_account_email)

    #get accounts
    accounts = service.management().accounts().list().execute()
    #create list for results
    output = list()

    # loop through accounts
    for account in accounts.get('items', []):
    	account_id = account.get('id')
    	account_name = account.get('name')

    #get properties
    	properties = service.management().webproperties().list(accountId=account_id).execute()

    #loop through each account property default profileid (set in GA admin)
    #get metrics from profile/view level
    #instead of looping through all profiles/views
    	for property in properties.get('items', []):
    		data = service.data().ga().get(
    			ids='ga:' + property.get('defaultProfileId'),
    			start_date='2012-01-01',
    			end_date= todaydate, #'2015-08-05',
    			metrics = 'ga:sessions, ga:users, ga:newUsers, ga:sessionsPerUser, ga:bounceRate, ga:sessionDuration, ga:adsenseRevenue',
    			dimensions = 'ga:date, ga:source, ga:hostname',
                max_results = '10000',
    			filters = sourceExp # the filters from above 
    		).execute()


    		for row in data.get('rows', '1'):
    			results = account_name, row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9]
    			output.append(results)
	#print output
		#count of response rows
        #print account_name, data['itemsPerPage'], len(data['rows'])

    #here is the hostname filter call to function above
    hostFilter = [host for host in output if hostMatch(host)==True]

    with open('output_analytics.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['account', 'date', 'source', 'hostname', 'sessions', 'users', 'newUsers', 'sessionsPerUser', 'bounceRate', 'sessionDuration',  'adsenseRevenue'], delimiter = ',')
        writer.writeheader()
        for line in hostFilter:
			file.write(','.join(line) + '\n')
            #print>>file, ','.join(line)

if __name__ == '__main__':
	main()

Tableau vizualization of Toronto Dine Safe data

The City of Toronto’s open data site includes the results of the city’s regular food and restaurant inspections. This data was as of August 2014.

The interactive chart below allows filtering by review criteria and can be zoomed into to view more detail and specific locations.

 

The data file for Dine Safe contained about 73,000 rows and was in XML format. In order to work with it I transformed it to csv format.

from xml.dom.minidom import parse
from csv import DictWriter
 
fields = [
	'row_id',
	'establishment_id',
	'inspection_id',
	'establishment_name',
	'establishmenttype',
	'establishment_address',
	'establishment_status',
	'minimum_inspections_peryear',
	'infraction_details',
	'inspection_date',
	'severity',
	'action',
	'court_outcome',
	'amount_fined'
]
 
doc = parse(file('dinesafe.xml'))
 
writer = DictWriter(file('dinesafe.csv', 'w'), fields)
writer.writeheader()
 
row_data = doc.getElementsByTagName('ROWDATA')[0]
 
for row in row_data.getElementsByTagName('ROW'):
	row_values = dict()
	for field in fields:
		text_element = row.getElementsByTagName(field.upper())[0].firstChild
		value = ''
		if text_element:
			value = text_element.wholeText.strip()
			row_values[field] = value
	writer.writerow(row_values)

This data was not geocoded so I had to do that before it could be mapped. I wanted to use a free geocoding service but they have limits on the number of records that could be geooded per day. I used MapQuest’s geocoding API using a Python library that would automate the geocoding in daily batched job so that I could maximize free daily geocoding.

The Dine Safe data file addresses needed some cleaning up so that the geocoding service could read them. For example street name variations needed to be conformed to something that MapQuest would accept. This was a manual batch find and replace effort. If I was automating this I would use a lookup table of street name variations and replace them with accepted spellling/format.

#Uses MapQuest's Nominatim mirror.

import anydbm
import urllib2
import csv
import json
import time

# set up the cache. 'c' means create if necessary
cache = anydbm.open('geocode_cache', 'c')

# Use MapQuest's open Nominatim server.
# https://developer.mapquest.com/web/products/open/nominatim
API_ENDPOINT = 'http://open.mapquestapi.com/nominatim/v1/search.php?format=json&q={}'

def geocode_location(location):
    '''
    Fetch the geodata associated with the given address and return
    the entire response object (loaded from json).
    '''
    if location not in cache:
        # construct the URL
        url = API_ENDPOINT.format(urllib2.quote(location))
        
        # load the content at the URL
        print 'fetching %s' % url
        result_json = urllib2.urlopen(url).read()
        
        # put the content into the cache
        cache[location] = result_json
        
        # pause to throttle requests
        time.sleep(1)
    
    # the response is (now) in the cache, so load it
    return json.loads(cache[location])

if __name__ == '__main__':
    # open the input and output file objects
    with open('dinesafe.csv') as infile, open('dinesafe_geocoded.csv', 'w') as outfile:
      
        # wrap the files with CSV reader objects.
        # the output file has two additional fields, lat and lon
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['lat', 'lon'])
        
        # write the header row to the output file
        writer.writeheader()
        
        # iterate over the file by record 
        for record in reader:
            # construct the full address
            address = record['establishment_address']
            address += ', Toronto, ON, Canada'
            
            # log the address to the console
            print address
            
            try:
                # Nominatim returns a list of matches; take the first
                geo_data = geocode_location(address)[0]
                record['lat'] = geo_data['lat']
                record['lon'] = geo_data['lon']
            except IndexError:
                # if there are no matches, don't raise an error
                pass
            writer.writerow(record)

After the Dine Safe data was geocoded so that it had two new columns, one for latitude and another for longitude, all that was left to do was bring the data into Tableau and create the Tableau map visualization which is shown below.

Introducing Cardivvy – a website showing Car2Go real time car locations, parking and service areas

Car2Go provides developer access to their real-time vehicle location and parking availability, and service area boundaries for all of their city locations around the world.

I was using the Car2Go API to power a website created for my own use after struggling with the official Car2Go map on their site when using my mobile phone. I wanted a way to find cars by street locations.

The site includes link to get vehicle location, parking lot location and service area boundaries on a Google Map. You can click into each city, view available cars alphabetically by street location or parking lot location, and then click through to car current location on a Google Map. Each city’s Car2Go service area can be viewed on a Google Map.

I am waiting for ZipCar to get their API up and running which should be available Fall 2015, then I will integrate that into the cardivvy site too so I can see where both cars are available.

This is example of Car2Go service area boundary for Vancouver area.

car2gomap