Flask SQLAlchemy relationship associations and backref

Flask-SQLAlchemy relationship associations make it easy to retrieve related records from a related model.

In a Python Flask application I wanted to use a many-to-many relationship between Photos and Tags (e.g. many tags assigned to one or more photos).

To do this I had the following things set up in my application:

  • 3 database tables
    • Photos
    • Tags
    • Photo_Tags
  • 3 model.py sections
    • Photos
    • Tags
    • photo_tags
  • 1 views.py section ‘photo’
  • 1 template ‘photo.html’

The models looked like this:

photo_tags = db.Table('photo_tags',
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.id')),
    db.Column('photo_id', db.Integer, db.ForeignKey('photos.id'))
)

class Photos(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    filename = db.Column(db.String(100))
    tags = db.relationship('Tags', secondary=photo_tags, backref='photos')

class Tags(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    tagname = db.Column(db.String(100))

The ‘tags’ relationship in the Photos model above is what ties the Photos and Tags models together (note that the ‘photo_tags’ table is defined first because the relationship refers to it). SQLAlchemy wants this association added to only one of the two models involved (more on that below).

The ‘tags’ relationship is physically recorded in the ‘photo_tags’ database table, which doesn’t get a model class the way Photos and Tags do. Instead, it is set up as shown above as a plain association table with foreign keys to the Photos and Tags tables.

Key things to note about the ‘tags’ reference in the Photos model:

  • ‘Tags’ is the related model.
  • ‘photo_tags’ is referenced as the ‘secondary’, or association, table.
  • ‘photos’ is the ‘backref’. The significance of this is that you don’t have to add a matching relationship to the Tags model. The backref lets you traverse the relationship in both directions (see the sketch after this list), e.g. you can get
    • Tags related to a specific Photo
    • Photos related to a specific Tag
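
Here is a minimal sketch, using hypothetical example data, of how records can be associated and then traversed in both directions through this relationship:

photo = Photos(filename='sunset.jpg')
tag1 = Tags(tagname='sunset')
tag2 = Tags(tagname='beach')

# appending tags to the photo populates the photo_tags association table
photo.tags.append(tag1)
photo.tags.append(tag2)
db.session.add(photo)
db.session.commit()

# forward direction: tags for a photo
print photo.tags    # list of Tags records associated with this photo

# reverse direction via the backref: photos for a tag
print tag1.photos   # list of Photos records associated with this tag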

An example of a view that can use the relationship established above follows. This view retrieves all tags for a specific photo:

@app.route('/photo/<int:requested_photo_id>')
def photo(requested_photo_id=None):
    requested_photo = Photos.query.get(requested_photo_id)
    return render_template('photo.html', requested_photo=requested_photo)

Here is what happens in this view:

  • The requested_photo_id is retrieved from the URL.
  • The SQLAlchemy query Photos.query.get(requested_photo_id) uses that requested_photo_id to retrieve the photo from the Photos model/MySQL database table, along with its related tags.
  • The retrieved requested_photo object is passed to the template ‘photo.html’ along with the related tag info.
  • Finally, it is in the photo.html template where everything comes together. The template below is a simple HTML page that presents the requested photo and its related tags.





<html>
<body>

<h1>Photo Details</h1>

<p>Photo ID: {{ requested_photo.id }}</p>

<p>Photo file name: {{ requested_photo.filename }}</p>

{% for tag in requested_photo.tags %}
<p>tag.id: {{ tag.id }}, tag.tagname: {{ tag.tagname }}</p>
{% endfor %}

</body>
</html>

The requested_photo id and filename come from the Photos Model/MySQL table.

The photo’s related tags come from the ‘tags’ association by referencing it as ‘requested_photo.tags’ (this is the awesome part) and then looping through the tags so each one can be printed on the webpage individually.

Get list of custom segments from Google Analytics API

This is a post on how to create custom Google Analytics Profile Segments for the purpose of removing referral spam (and there is increasingly so much of it!) from GA reporting.

However, if you want to use these new Custom Segments to filter results retrieved through the Google Analytics API with a Service Account, there are some challenges.

If you are retrieving GA results for many websites, you need the GA API to loop through each site’s Views/Profiles in your GA account to retrieve the data for each.

The challenge is that each Profile has its own Custom Segment. In order to filter out referral spam completely, two types of filters are required: an ‘exclude’ filter, which is the same for all Profiles, and an ‘include’ filter, which is specific to each Profile because it refers to the Profile’s domain.

So that makes looping through each Profile a bit more challenging. You need a dictionary of each Profile’s Custom Segment Id so it can be applied for each Profile’s data.

These Custom Segment Id’s look something like “gaid::BXxFLXZfSAeXbm4RZuFd9w”

The Custom Segment Id needs to be used in the service.data().ga().get().execute() criteria.

data = service.data().ga().get(
    ids=ids,
    start_date='2015-07-01',
    end_date='2015-07-19',
    segment='gaid::BXxFLXZfSAeXbm4RZuFd9w',
    metrics=metrics
).execute()

It wasn’t easy to find these Custom Segment Id’s. First I tried looping through the segments() as follows:

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location,
                          service_account_email)

    segments = service.management().segments().list().execute()

    for segment in segments.get('items', []):
      print 'Segment ID ' + segment.get('id') + ' - ' + segment.get('name')

But that only retrieved the standard Google Segments, not the Custom Segments; apparently retrieving Custom Segments this way is not possible with a Service Account.

So I found that you are able to see the Custom Segment Ids in the Query Explorer at https://ga-dev-tools.appspot.com/query-explorer.

But while you can see the Custom Segments there, it isn’t very helpful because you have to look them up one by one in the Segments criteria field. If you have many sites this is time consuming.

Then I finally found the “stand alone explorer” at the bottom of the GA API Segments documentation page.

https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/segments/list#try-it

This outputs a JSON file containing all of the Segment details. Unfortunately it isn’t usable as a ready-made dictionary because it only has the segment details, not the account id. But it does have the Custom Segment Ids, which can be used to create a manual dictionary of Account Id to Segment Id that can be used in the loop (see the sketch below).

It might also be possible to do a reverse lookup and find a Custom Segment Id by looping through the Segments and matching on the segment name.
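
For what it’s worth, here is a minimal sketch of that manual-dictionary approach. The account ids and the second segment id are hypothetical placeholders, and it assumes service, ids and metrics are already set up as in the snippets above:

# hypothetical mapping of GA account id -> custom segment id,
# maintained by hand from the stand-alone explorer's JSON output
custom_segments = {
    '12345678': 'gaid::BXxFLXZfSAeXbm4RZuFd9w',
    '87654321': 'gaid::AAaBBcCdDEeFfGgHhIiJjK',
}

accounts = service.management().accounts().list().execute()

for account in accounts.get('items', []):
    segment_id = custom_segments.get(account.get('id'))
    if segment_id is None:
        continue  # no custom segment recorded for this account

    data = service.data().ga().get(
        ids=ids,
        start_date='2015-07-01',
        end_date='2015-07-19',
        segment=segment_id,  # this Profile's custom segment
        metrics=metrics
    ).execute()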

Hope that helps someone!

BC Hydro’s amazing #BCStorm social media turnaround

BC Hydro made an amazing social media turnaround to communicate with customers in crisis! Go BC Hydro!

On August 29, 2015, high winds caused tree falls that took out BC Hydro power lines throughout BC’s Lower Mainland, Vancouver Island and Sunshine Coast, leaving up to 500,000 customers without electricity. Many, including me, were without power for 48 hours.

BC Hydro’s web site was their primary channel for communicating with customers about the extent of damage and expected time for repairs, but the site also went down during this time.

They had used Twitter in the past to communicate with customers, but they weren’t using it much on the first day of the crisis.

However, on the second day of the crisis, with 100,000+ customers still without power, BC Hydro dramatically increased its Twitter communication by responding directly to customer tweets about the extent of damage and expected time for repairs.

To visualize this dramatic increase in BC Hydro Twitter usage, I downloaded all @BCHydro tweets for August 29 and 30 from Twitter’s API using Python and Tweepy, and used Microsoft Power BI to create the visualization of BC Hydro tweet counts shown below.

Some notes on the chart:

    • x-axis shows date and hour of day
    • y-axis shows count of tweets
    • ‘response’ tweets are the light green part of each bar
    • ‘status’ tweets are the dark green part of each bar

You can see that on August 30 at 11 AM, about 28 hours after the storm hit, BC Hydro suddenly started tweeting responses to customers’ questions, in addition to their regular ‘status’ tweets. They continued doing so for the next few days until their website was back up.

The chart clearly shows an amazing social media turnaround! BC Hydro dramatically increased their use of Twitter to get customers answers and information. Go BC Hydro!

Note: the Twitter data was last updated Aug 31, 2015 at 16:45 PST.

[Chart: @BCHydro tweet counts by hour after the storm, split into ‘response’ and ‘status’ tweets]

Historically, BC Hydro did not use Twitter this way. The chart below shows BC Hydro’s tweeting before the storm: they were tweeting once or twice per day, with occasional spikes to 7 tweets per day.

[Chart: @BCHydro tweet counts per day before the storm]

The ‘response‘ category includes tweets by BC Hydro responding to a customer tweet. Example ‘response’ tweet:

2015-08-30 14:26:01, @Adamhillz Crews will be on site as soon as possible. Stay back from any downed lines and call 911 if it is on fire.

The ‘status‘ category includes tweets by BC Hydro about ongoing status of repair work etc. Example ‘status’ tweet:

2015-08-30 14:28:32, Crews have been brought in from across province, including Prince George, Terrace, Kamloops, Smithers, Vernon & Vancouver Island #bcstorm
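
One simple way to reproduce the ‘response’ vs ‘status’ split (this is an assumption about how to categorize, not necessarily how the chart above was built) is to treat any tweet that starts with an @mention as a response. A minimal sketch, assuming a CSV of ‘timestamp, tweet text’ lines like the one produced by the download script further below:

from collections import Counter

counts = Counter()
with open('bchtweets.csv', 'rb') as f:
    for line in f:
        timestamp, text = line.split(', ', 1)
        # replies to customers start with an @mention; everything else is a status update
        category = 'response' if text.startswith('@') else 'status'
        counts[(timestamp[:13], category)] += 1  # bucket by date and hour

for (hour, category), count in sorted(counts.items()):
    print hour, category, count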

FYI if anyone is interested, here is the text file with tweet text – bchtweets

Here is the Python code used to get the BC Hydro Tweets using the Twitter API.

import tweepy
#import MySQLdb
from datetime import datetime
import random
import sys, os
import json
from dateutil import tz
from tweet_auth import * #this is another .py file with the twitter api credentials

#get twitter auth
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def main():
    with open('bchtweets.csv', 'wb') as file:
        for status in tweepy.Cursor(api.user_timeline, id='bchydro').items(1800):
            file.write(str(status.created_at.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/Los_Angeles')).replace(tzinfo=None)) + ', ' + status.text.encode('utf8') + '\n')

if __name__ == '__main__':
    main()

Note that there is a separate file with my Twitter OAuth credentials that looks like the following. You just have to replace the ‘xxx…’ with your credentials.

#twitter api oauth credentials - replace with yours
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

How to filter referral spam from Google Analytics using API and Python

Google Analytics data has become incredibly polluted by “spam referrals” which inflate site visits with what are essentially spam advertisements delivered to you via Google Analytics.

The spammers entirely bypass your site and directly hit Google’s servers while pretending to be visitors to your site. It’s a bit odd that a technological superpower like Google has fallen prey to spammers. Apparently a fix is in the works, but it feels like it’s taking way too long.

In the meantime the fix is to filter out any “visit” that doesn’t have a legitimate referrer hostname. You decide which hostnames you consider legitimate; at a minimum you want to include your own domain. You can also filter out spam visits based on their source. The source name is where the spammers advertise to you: they supply their spam domains hoping you will visit their sites. These filters can be set up with Google Analytics’ built-in filters, but that takes some manual effort and ongoing updating as spammers keep changing source names.

The screenshot below shows the Google Analytics filter screen where you build filters for hostname and source using rules based filtering.

[Screenshot: Google Analytics filter screen for hostname and source rules]

However, this same rules-based filtering can be done using the Google Analytics API. There is a lot of code around for you to work with and Google’s documentation is pretty good. I have implemented a hostname and source filter using Python in the code below. This enables me to run the code as a scheduled job and always have clean analytics data for analysis.

The “hostMatch” and “sourceExp” pieces are what filter out fake hostnames and fake visit sources respectively; the general shape of these filters is sketched below.
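
As a rough illustration (the domain names here are hypothetical placeholders, not real spam domains): a Core Reporting API v3 filter string chains conditions with ‘;’ for AND, ‘!~’ means “does not match regular expression”, and the hostname check is just a regex built from your own domains:

# hypothetical ga:source exclude filter: drop visits whose source matches spam domains
sourceExp = 'ga:source!~spam-domain-1\.com;ga:source!~spam-domain-2\.com'

# hypothetical hostname include regex: keep only visits to your own domains
hostnames = 'mydomain1.com', 'mydomain2.com'
hostExp = '(' + ')|('.join(hostnames) + ')'  # "(mydomain1.com)|(mydomain2.com)"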

You will need to get yourself Google API access and set up the OAuth credentials (which I am not describing here). Once you have the key/secrets file that authorizes access to the API, you can use the code below.

'''access the Google Analytics API.'''
# https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults

import argparse
import csv
import re
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials
import httplib2
from oauth2client import client
from oauth2client import file
from oauth2client import tools
from datetime import datetime, timedelta

todaydate = datetime.today().strftime('%Y-%m-%d')

def get_service(api_name, api_version, scope, key_file_location,
				service_account_email):
	'''Get a service that communicates to a Google API.
	Args:
	api_name: The name of the api to connect to.
	api_version: The api version to connect to.
	scope: A list auth scopes to authorize for the application.
	key_file_location: The path to a valid service account p12 key file.
	service_account_email: The service account email address.
	Returns:
	A service that is connected to the specified API.
	'''
	# client_secrets.p12 is secrets file for analytics
	f = open(key_file_location, 'rb')
	key = f.read()
	f.close()
	credentials = SignedJwtAssertionCredentials(service_account_email, key,
	scope=scope)
	http = credentials.authorize(httplib2.Http())
	# Build the service object.
	service = build(api_name, api_version, http=http)

	return service


def get_accounts(service):
	# Get a list of all Google Analytics accounts for this user
	accounts = service.management().accounts().list().execute()

	return accounts


def hostMatch(host):
        #this is used to filter analytics results to only those that came from your hostnames eg not from a spam referral host
	hostnames="domainname1","domainname2","domainname3"

	hostExp = "(" + ")|(".join(hostnames) + ")"
	hostMatch = re.search(hostExp, host[3].lower())

	if hostMatch:
		return True
	else:
		return False


def main():

    #this is where you build your filter expression; it is similar to what you would build in the Google Analytics filter feature, and you can be as specific or generalized using regex as you want/need
    #ga:source filter
    sourceExp=('ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]*-gratis;ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]')

    # Define the auth scopes to request.
    scope = ['https://www.googleapis.com/auth/analytics.readonly']

    #Provide service account email and relative location of your key file.
    service_account_email = 'your-service-account@your-project.iam.gserviceaccount.com'
    key_file_location = 'client_secrets.p12'
    #scope = 'http://www.googleapis.com/auth/analytics'

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location, service_account_email)

    #get accounts
    accounts = service.management().accounts().list().execute()
    #create list for results
    output = list()

    # loop through accounts
    for account in accounts.get('items', []):
    	account_id = account.get('id')
    	account_name = account.get('name')

    #get properties
    	properties = service.management().webproperties().list(accountId=account_id).execute()

    #loop through each account property default profileid (set in GA admin)
    #get metrics from profile/view level
    #instead of looping through all profiles/views
    	for property in properties.get('items', []):
    		data = service.data().ga().get(
    			ids='ga:' + property.get('defaultProfileId'),
    			start_date='2012-01-01',
    			end_date= todaydate, #'2015-08-05',
    			metrics = 'ga:sessions, ga:users, ga:newUsers, ga:sessionsPerUser, ga:bounceRate, ga:sessionDuration, ga:adsenseRevenue',
    			dimensions = 'ga:date, ga:source, ga:hostname',
                max_results = '10000',
    			filters = sourceExp # the filters from above 
    		).execute()


    		for row in data.get('rows', []):
    			results = account_name, row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9]
    			output.append(results)
	#print output
		#count of response rows
        #print account_name, data['itemsPerPage'], len(data['rows'])

    #here is the hostname filter call to function above
    hostFilter = [host for host in output if hostMatch(host)==True]

    with open('output_analytics.csv', 'wb') as file:
        writer = csv.DictWriter(file, fieldnames = ['account', 'date', 'source', 'hostname', 'sessions', 'users', 'newUsers', 'sessionsPerUser', 'bounceRate', 'sessionDuration', 'adsenseRevenue'], delimiter = ',')
        writer.writeheader()
        for line in hostFilter:
            file.write(','.join(line) + '\n')
            #print>>file, ','.join(line)

if __name__ == '__main__':
	main()

Tableau visualization of Toronto Dine Safe data

The City of Toronto’s open data site includes the results of the city’s regular food and restaurant inspections. This data was current as of August 2014.

The interactive chart below allows filtering by review criteria and can be zoomed in to view more detail and specific locations.

[Interactive Tableau map of Dine Safe inspection results]

The data file for Dine Safe contained about 73,000 rows and was in XML format. In order to work with it I transformed it to CSV format.

from xml.dom.minidom import parse
from csv import DictWriter
 
fields = [
	'row_id',
	'establishment_id',
	'inspection_id',
	'establishment_name',
	'establishmenttype',
	'establishment_address',
	'establishment_status',
	'minimum_inspections_peryear',
	'infraction_details',
	'inspection_date',
	'severity',
	'action',
	'court_outcome',
	'amount_fined'
]
 
doc = parse(file('dinesafe.xml'))
 
writer = DictWriter(file('dinesafe.csv', 'w'), fields)
writer.writeheader()
 
row_data = doc.getElementsByTagName('ROWDATA')[0]
 
for row in row_data.getElementsByTagName('ROW'):
	row_values = dict()
	for field in fields:
		text_element = row.getElementsByTagName(field.upper())[0].firstChild
		value = ''
		if text_element:
			value = text_element.wholeText.strip()
		row_values[field] = value
	writer.writerow(row_values)

This data was not geocoded, so I had to do that before it could be mapped. I wanted to use a free geocoding service, but they have limits on the number of records that can be geocoded per day. I used MapQuest’s geocoding API with a Python script that automated the geocoding in daily batched jobs so that I could maximize the free daily geocoding.

The addresses in the Dine Safe data file needed some cleaning up so that the geocoding service could read them. For example, street name variations needed to be conformed to something that MapQuest would accept. This was a manual batch find-and-replace effort. If I were automating this I would use a lookup table of street name variations and replace them with an accepted spelling/format, as sketched below.
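
For illustration, here is a minimal sketch of that lookup-table idea (the variations and replacements shown are hypothetical examples, not the actual list I used). The geocoding script itself follows after it.

import re

# hypothetical lookup table of street name variations -> accepted form
street_variations = {
    r'\bAve\.?\b': 'Avenue',
    r'\bSt\.?\b': 'Street',
    r'\bBlvd\.?\b': 'Boulevard',
    r'\bRd\.?\b': 'Road',
}

def clean_address(address):
    '''Replace known street name variations with forms the geocoder accepts.'''
    for pattern, replacement in street_variations.items():
        address = re.sub(pattern, replacement, address)
    return address

print clean_address('123 Main St W')   # 123 Main Street W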

#Uses MapQuest's Nominatim mirror.

import anydbm
import urllib2
import csv
import json
import time

# set up the cache. 'c' means create if necessary
cache = anydbm.open('geocode_cache', 'c')

# Use MapQuest's open Nominatim server.
# https://developer.mapquest.com/web/products/open/nominatim
API_ENDPOINT = 'http://open.mapquestapi.com/nominatim/v1/search.php?format=json&q={}'

def geocode_location(location):
    '''
    Fetch the geodata associated with the given address and return
    the entire response object (loaded from json).
    '''
    if location not in cache:
        # construct the URL
        url = API_ENDPOINT.format(urllib2.quote(location))
        
        # load the content at the URL
        print 'fetching %s' % url
        result_json = urllib2.urlopen(url).read()
        
        # put the content into the cache
        cache[location] = result_json
        
        # pause to throttle requests
        time.sleep(1)
    
    # the response is (now) in the cache, so load it
    return json.loads(cache[location])

if __name__ == '__main__':
    # open the input and output file objects
    with open('dinesafe.csv') as infile, open('dinesafe_geocoded.csv', 'w') as outfile:
      
        # wrap the files with CSV reader objects.
        # the output file has two additional fields, lat and lon
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['lat', 'lon'])
        
        # write the header row to the output file
        writer.writeheader()
        
        # iterate over the file by record 
        for record in reader:
            # construct the full address
            address = record['establishment_address']
            address += ', Toronto, ON, Canada'
            
            # log the address to the console
            print address
            
            try:
                # Nominatim returns a list of matches; take the first
                geo_data = geocode_location(address)[0]
                record['lat'] = geo_data['lat']
                record['lon'] = geo_data['lon']
            except IndexError:
                # if there are no matches, don't raise an error
                pass
            writer.writerow(record)

After the Dine Safe data was geocoded so that it had two new columns, one for latitude and another for longitude, all that was left to do was to bring the data into Tableau and create the map visualization shown below.