An Alternative to Google Analytics BigQuery Export Using R and Google Tag Manager

Google Analytics has proven to be one of the most influential tools ever created for marketing analysis.  Google is unrelenting in its pursuit of innovation for Google Analytics, and that innovation shows in the number of companion tools it has built for analysts.  From Google Sheets to BigQuery to Google Data Studio, these complementary tools are a great aid for dealing with the wealth of data that can be mined from Google Analytics.  One of the little-known yet game-changing tools available for use with Google Analytics data is the Google Analytics BigQuery Export.

Google Analytics Premium BigQuery Export

This tool, which is only available to users of Google Analytics' premium product, is in essence a raw data export of a website's Google Analytics data.  It frees any analyst with a decent knowledge of SQL from the shackles of Google's canned reports, and it allows much more robust logic when building reports.  For instance, if an analyst wanted a report covering all users who viewed a particular page during their session and returned to the site within 6 days, they would be limited only by their knowledge of SQL and their ability to fork out the $150K Google charges for its premium product!

Google Analytics does not provide data at the user level out of the box.  However, with the aid of a process outlined in Simo Ahava's tremendously useful blog, you can use Google Tag Manager to pull Google's anonymous user identifier (the Client ID) out of the _ga cookie and feed it back into Google Analytics as a custom dimension or event.  This gives an analyst the ability to report on activity at the individual user level.

Remember: passing personally identifiable information (PII), such as email addresses, to Google Analytics is a violation of the terms of service, so make sure you never send it.

Below are the steps I use for passing pageview data, along with user and session identifiers, to the GTM data layer for logging in Google Analytics.  You could use a slightly different process to pass ecommerce, event, goal, and custom metric/dimension data as well; I'll cover that in a later post.  I'll assume that the reader has already tagged all of their pages with a Google Tag Manager container, but if not, start by reading this post and make sure to tag your pages.


  1. Create a Custom Dimension by going to the admin page in your Google Analytics view. Under “Custom Definitions”, select “Custom Dimensions”, then create a dimension and call it “Client ID” (or whatever name you prefer). This dimension should have a scope of “Session”.  Make note of the dimension index (you’ll need to enter it later).
  2. Create a Custom JavaScript Variable in Google Tag Manager and give it a title such as {{Set Client ID in Dimension 1}}.
    Here is the code:
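    A minimal sketch of that variable, following Simo Ahava's customTask approach (in GTM the body is pasted as an anonymous function; the name and dimension index below are placeholders for illustration):

```javascript
// customTask sketch: copies the Universal Analytics clientId into a
// custom dimension on every hit. In GTM, paste this as an anonymous
// function() { ... } — the name here exists only so it can run outside GTM.
function setClientIdCustomTask() {
  var customDimensionIndex = 1; // set this to the index from step 1

  return function (model) {
    // model is the tracker model GA hands to customTask; read the
    // clientId and write it into the chosen custom dimension.
    model.set('dimension' + customDimensionIndex, model.get('clientId'));
  };
}
```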

    Make sure to assign the correct index to the customDimensionIndex variable.  If you’ve completed this step correctly, you will be able to see the Client ID being passed under whichever custom dimension you set up, using the Google Analytics Debugger tool.

    Client ID being passed into dimension 1
  3. If everything shows up, move back to Google Tag Manager and edit the pageview tag for your site. Under “Fields to Set”, type “customTask” and under “Value” use the dropdown to select the variable we created in step 2, {{Set Client ID in Dimension 1}}.

That concludes the first part of the process.  Once you’ve reached this step, you could technically start playing with the user and session Client ID dimension in Google Analytics’ custom reports.

Pull Client ID Custom Dimension Data

So we’ve tagged our site to send user and session data to Google Analytics and have dealt with sampling; now for the fun part.  This script pulls page URLs and user and session IDs by date, based on the dimensions detailed above.
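A sketch of that pull using the googleAnalyticsR package (the package choice is an assumption, and the view ID and date range are placeholders; dimension1 is the Client ID dimension created earlier):

```r
# Assumed approach: googleAnalyticsR with anti-sampling enabled.
# The view ID and dates below are placeholders — substitute your own.
library(googleAnalyticsR)

ga_auth()  # authenticate against your Google account

client_id_data <- google_analytics(
  viewId      = "XXXXXXXX",
  date_range  = c("2017-01-01", "2017-01-31"),
  metrics     = "pageviews",
  dimensions  = c("date", "pagePath", "dimension1"),  # dimension1 = Client ID
  anti_sample = TRUE  # splits the query into smaller date ranges to avoid sampling
)
```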

Run some other scripts

Using some other scripts, an analyst can answer a number of further questions, such as how long it takes a new user to become a repeat user.  These scripts rely heavily on data.table syntax instead of base R.  Please take a look at my prior post on data.table to learn why I use it.

Data Returned

Then run the rest:
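A data.table sketch of that calculation (the column names and toy data here are hypothetical; in practice the input would be the session-level pull shown earlier, with one row per Client ID and session date):

```r
library(data.table)

# Toy stand-in for the session-level pull: one row per session.
sessions <- data.table(
  client_id = c("a", "a", "b", "b", "c"),
  date = as.Date(c("2017-01-01", "2017-01-05",
                   "2017-01-02", "2017-01-10",
                   "2017-01-03"))
)

# Order each user's sessions chronologically, then take the gap
# between the first and second session per user (NA for one-time users).
setorder(sessions, client_id, date)
gaps <- sessions[, .(days_to_return = as.numeric(date[2] - date[1])),
                 by = client_id]

mean(gaps$days_to_return, na.rm = TRUE)  # → 6
```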

This will return the average number of days it takes a new user to become a repeat user.

calculate number of days for new user to return

There are a number of other uses for the Client ID data in GA.  For instance, a marketer might want to do some attribution modelling, or a content manager might want to know whether viewing an article in one session affects subsequent sessions.

One consideration around doing this type of analysis is scale.  Most smaller websites won’t pose an issue, but some larger sites (like the one I currently work on) will.  Pulling an individual, non-aggregated row for every session, page, or event can yield some extremely large datasets.  In that case, it makes sense to send the data to a cloud data warehouse such as BigQuery.  Want to learn more about using R to solve this?  Stay tuned…

Which is Faster? R Data Manipulation with data.table vs dplyr

I started my voyage into learning R by taking Datacamp’s online courses.  After finishing courses on data manipulation in both base R and dplyr, I stumbled upon a course on the data.table library.  I was taken aback a bit after learning that data.table uses a different syntax than base R.  This was unnerving, as I didn’t know what I would gain from learning data manipulation in yet another syntax.  My skepticism, however, changed to optimism once I began working on a rather large dataset a few weeks later.  This dataset (a 43M-row set of email opens and clickthroughs) took something like 30-40 minutes to read into R using the base read.csv function.  Instead, I tried the fread function in data.table.  Lo and behold, what took 30-40 minutes using base R took about 5 minutes using fread.  Here’s timing data for a 3M-row text file:
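A small, self-contained sketch of the comparison (the synthetic file below is far smaller than the 43M-row set, so absolute times will differ, but the gap between the two readers is still visible):

```r
library(data.table)

# Build a synthetic CSV to read back in, then time the two readers.
csv <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1e6, x = rnorm(1e6)), csv)

system.time(df <- read.csv(csv))  # base R reader
system.time(dt <- fread(csv))     # data.table's reader, typically much faster
```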

85% improvement in median performance using fread

The speed improvements are not limited to reading data; manipulations are also faster with data.table.  Here are two functions written to group and count rows using the world cities population dataset.
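A reconstruction of the two functions (the names dt_func and df_func match those discussed below; I'm assuming the world.cities dataset from the maps package):

```r
library(data.table)
library(dplyr)
library(maps)  # provides the world.cities dataset

cities_dt <- as.data.table(world.cities)

# Count cities per country with data.table's terse .N-by-group idiom...
dt_func <- function() cities_dt[, .N, by = country.etc]

# ...and the same aggregation expressed in dplyr's pipeline style.
df_func <- function() {
  world.cities %>%
    group_by(country.etc) %>%
    summarise(n = n())
}
```

Timing each with, say, microbenchmark::microbenchmark(dt_func(), df_func()) gives the kind of median comparison reported below.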

83% increase in median performance using data.table

Lastly, as you can see in the functions above, the data.table function (dt_func) is less verbose than the dplyr function (df_func).  One reason for this is that dplyr is designed to be easily read and shared from one programmer to another, a benefit that matters less when code isn’t shared.  Nevertheless, once I learned the data.table syntax, I preferred it over the dplyr syntax.  This seems to be the case for many programmers with a previous foundation in SQL.

Datacamp’s data.table tutorial explains the data.table – SQL similarity

While learning a new syntax can be a tough task, I have to admit that the extra work of learning data.table is worth it.

If you haven’t had a chance to read my last post on using R with Google Analytics data, please take a look.  Also, if you have any comments, questions or if simply want to call me crazy, drop a comment below.