Google Analytics data has become heavily polluted by “spam referrals”, which inflate site visit counts with what are essentially spam advertisements delivered to you through Google Analytics.
The spammers bypass your site entirely and hit Google’s servers directly, pretending to be a visitor to your site. It’s a bit odd that a technological superpower like Google has fallen prey to spammers. Apparently a fix is in the works, but it feels like it’s taking way too long.
In the meantime, the fix is to filter out any “visit” that doesn’t have a legitimate hostname. You decide which hostnames are legitimate; at a minimum, include your own domain. You can also filter out spam visits based on their source. The source name is where the spammers advertise to you: they put their spam domains there, hoping you will visit their sites. These filters can be set up with Google Analytics’ built-in filters, but that takes some manual effort and ongoing updating, because spammers keep changing source names.
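As a rough illustration of the hostname rule, the sketch below keeps only visits whose hostname is one of your own domains or a subdomain of one. The names `LEGIT_HOSTNAMES` and `is_legit_hostname`, and the example domains and rows, are placeholders made up for this example; they are not part of Google Analytics.

```python
# Hypothetical whitelist: replace with the hostnames you consider legitimate.
LEGIT_HOSTNAMES = ("example.com", "shop.example.com")


def is_legit_hostname(hostname, allowed=LEGIT_HOSTNAMES):
    """Return True if hostname equals an allowed host or is a subdomain of one."""
    host = hostname.strip().lower()
    return any(host == name or host.endswith("." + name) for name in allowed)


# Each tuple is (source, hostname); the values are invented for illustration.
visits = [
    ("google", "example.com"),
    ("spam-site.test", "(not set)"),
    ("spam-site.test", "example.com.spam-site.test"),
    ("newsletter", "www.example.com"),
]

# Keep only the visits that were recorded against one of your own hostnames.
clean = [visit for visit in visits if is_legit_hostname(visit[1])]
print(clean)
```

Matching on the exact name or a dot-separated suffix is stricter than a plain substring search, which would also accept a spam hostname that merely contains your domain name.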
The screenshot below shows the Google Analytics filter screen, where you build filters for hostname and source using rule-based filtering.
However, the same rule-based filtering can be done through the Google Analytics API. There is plenty of example code around to work from, and Google’s documentation is pretty good. I have implemented a hostname and source filter in Python, using the code below. This lets me download the data by running the code as a scheduled job, so I always have analytics data ready for analysis.
“hostMatch” and “sourceExp” are the two pieces that filter out fake hostnames and fake visit sources, respectively.
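To show how a source filter expression of this kind can be put together, here is a minimal sketch that builds it from a list of spam sources instead of typing it by hand. It assumes the Core Reporting API v3 filter syntax, in which `!@` means "does not contain substring", `;` combines clauses with AND, and commas, semicolons and backslashes inside a value must be backslash-escaped. The helper names `escape_filter_value` and `build_source_filter` and the spam sources are hypothetical.

```python
def escape_filter_value(value):
    """Backslash-escape characters that are reserved in a filter expression."""
    # Escape the backslash first so later escapes are not doubled.
    for char in ("\\", ",", ";"):
        value = value.replace(char, "\\" + char)
    return value


def build_source_filter(spam_sources):
    """Join one 'ga:source!@<text>' clause per spam source with ';' (AND)."""
    clauses = ["ga:source!@" + escape_filter_value(s) for s in spam_sources]
    return ";".join(clauses)


# Made-up spam sources, for illustration only.
spam_sources = ["spam-one.example", "free-buttons.example"]
print(build_source_filter(spam_sources))
```

Keeping the spam sources in a list (or a text file) makes the ongoing updates a one-line change rather than an edit to a long string.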
You will need to get Google API access and set up OAuth (which I am not describing here). You will need the OAuth key and a secret file to authorize access to the API; then you can use the code below.
'''Access the Google Analytics API.'''
# https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults
# Written for Python 2 and oauth2client releases before 2.0, which still
# provide SignedJwtAssertionCredentials.

import argparse
import csv
import re
from datetime import datetime, timedelta

import httplib2
from apiclient.discovery import build
from oauth2client import client
from oauth2client import file
from oauth2client import tools
from oauth2client.client import SignedJwtAssertionCredentials

todaydate = datetime.today().strftime('%Y-%m-%d')


def get_service(api_name, api_version, scope, key_file_location,
                service_account_email):
    '''Get a service that communicates to a Google API.

    Args:
        api_name: The name of the api to connect to.
        api_version: The api version to connect to.
        scope: A list of auth scopes to authorize for the application.
        key_file_location: The path to a valid service account p12 key file.
        service_account_email: The service account email address.

    Returns:
        A service that is connected to the specified API.
    '''
    # client_secrets.p12 is the secrets file for analytics
    with open(key_file_location, 'rb') as f:
        key = f.read()

    credentials = SignedJwtAssertionCredentials(service_account_email, key,
                                                scope=scope)
    http = credentials.authorize(httplib2.Http())

    # Build the service object.
    service = build(api_name, api_version, http=http)
    return service


def get_accounts(service):
    # Get a list of all Google Analytics accounts for this user
    accounts = service.management().accounts().list().execute()
    return accounts


def hostMatch(host):
    # Used to filter analytics results to only those that came from your
    # hostnames, e.g. not from a spam referral host.
    hostnames = "domainname1", "domainname2", "domainname3"
    hostExp = "(" + ")|(".join(hostnames) + ")"
    # re.search matches anywhere in the string, so these act as substrings.
    return re.search(hostExp, host.lower()) is not None


def main():
    # This is where you build your filter expression. It is similar to what
    # you would build in the Google Analytics filter feature; you can be as
    # specific or as general with regex as you want or need.
    # ga:source filter
    sourceExp = ('ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];'
                 'ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]*-gratis;'
                 'ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];'
                 'ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];'
                 'ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected];ga:[email protected]')

    # Define the auth scopes to request.
    scope = ['https://www.googleapis.com/auth/analytics.readonly']

    # Provide service account email and relative location of your key file.
    service_account_email = '[email protected]erviceaccount.com'
    key_file_location = 'client_secrets.p12'
    # scope = 'http://www.googleapis.com/auth/analytics'

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location,
                          service_account_email)

    # Get accounts
    accounts = service.management().accounts().list().execute()

    # Create list for results
    output = list()

    # Loop through accounts
    for account in accounts.get('items', []):
        account_id = account.get('id')
        account_name = account.get('name')

        # Get properties
        properties = service.management().webproperties().list(
            accountId=account_id).execute()

        # Loop through each account property's default profile id (set in GA
        # admin) and get metrics at the profile/view level, instead of
        # looping through all profiles/views.
        for prop in properties.get('items', []):
            data = service.data().ga().get(
                ids='ga:' + prop.get('defaultProfileId'),
                start_date='2012-01-01',
                end_date=todaydate,  # '2015-08-05',
                metrics='ga:sessions, ga:users, ga:newUsers, ga:sessionsPerUser, ga:bounceRate, ga:sessionDuration, ga:adsenseRevenue',
                dimensions='ga:date, ga:source, ga:hostname',
                max_results='10000',
                filters=sourceExp  # the filters from above
            ).execute()

            # Each row holds the 3 dimensions followed by the 7 metrics.
            for row in data.get('rows', []):
                results = (account_name, row[0], row[1], row[2], row[3],
                           row[4], row[5], row[6], row[7], row[8], row[9])
                output.append(results)
                # print output

            # Count of response rows
            # print account_name, data['itemsPerPage'], len(data['rows'])

    # Here is the hostname filter call to the function above; the hostname
    # is the fourth column of each result (index 3).
    hostFilter = [host for host in output if hostMatch(host[3])]

    with open('output_analytics.csv', 'wb') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['account', 'date', 'source', 'hostname', 'sessions', 'users', 'newUsers', 'sessionsPerUser', 'bounceRate', 'sessionDuration', 'adsenseRevenue'], delimiter=',')
        writer.writeheader()
        for line in hostFilter:
            csvfile.write(','.join(line) + '\n')
            # print>>csvfile, ','.join(line)


if __name__ == '__main__':
    main()
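The last step of the script, writing the filtered rows to a CSV file, could also be sketched with `csv.writer`, which quotes any field that contains a comma (an account name, for example). This is only an alternative sketch, written for Python 3; the function name `write_rows`, the sample row and the output file name are made up for illustration.

```python
import csv

FIELDNAMES = ['account', 'date', 'source', 'hostname', 'sessions', 'users',
              'newUsers', 'sessionsPerUser', 'bounceRate', 'sessionDuration',
              'adsenseRevenue']


def write_rows(path, rows, fieldnames=FIELDNAMES):
    """Write a header line followed by one CSV line per row tuple."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(fieldnames)
        writer.writerows(rows)


# One invented row, in the same column order as FIELDNAMES.
sample = [('My Account, Inc.', '20150805', 'google', 'example.com',
           '12', '10', '4', '1.2', '50.0', '340.0', '0.0')]
write_rows('output_analytics_sample.csv', sample)
```

Joining fields with `','` by hand works as long as no value contains a comma; letting the `csv` module do the writing avoids that edge case.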