4 Skills Analysts Won’t Learn from Courses and Books


I have had a pretty long career in analytics, at least long in comparison to other analysts (13+ years).  I would not have been in this industry for as long as I have if there weren't aspects of it that I enjoy.  I could probably spend multiple posts on the positive aspects of working in the analytics industry (demand, teamwork, pay, etc.).  However, with every yin there is a yang. In this post I will discuss four skills that analysts don't learn from courses and books on analytics. I'm not arguing that traditional education can't help an analyst learn new techniques, but some skills can only be mastered through experience.


Get Sponsorship Early

In my opinion, the first 30-90 days can make or break a role.  A strong start can be the difference between a great engagement with a company and an eventual early exit.  Make sure that you get sponsorship in the early days of the role. I would define sponsorship as an endorsement from, or a close relationship with, a senior employee (your manager does not count). This could be as simple as meeting with a senior employee and discussing the issues surrounding his or her job and how your skill set can assist.  This can be a tough task, especially for an introvert like myself, but the payoff can be tremendous. In most cases, these individuals have no idea that the data or the skills that could help them even exist.  Their lives don't revolve around SQL and statistics, but the analyst's life does. Grab a cup of coffee, a lunch, a shot of whiskey, whatever it takes to cultivate those relationships.

Analytics is a Mix of Science and Art

I have learned over the years that analytics is a mix of science and art.  There are a number of technical disciplines that must be learned. The usual suspects that most programs encourage analysts to learn are programming (SQL, R, Python) and statistics.  However, the science that they don't expect you to learn is... psychology. I was fortunate enough to start my career at an organization that understood that psychology was integral to creating and optimizing the web experience.  Web content, be it an ecommerce site, an institutional site or a blog, has a main goal in mind: to educate visitors and to convince them that the site, service or product offers something they lack.  There is no better way to convince them than having a deep understanding of their wants, needs and behavior. This type of thinking has been important to my career and to how I break down analytical problems, especially problems dealing with web analytics.  The art comes in how you solve the issues once you've broken them down: the decisions made in creating optimizations, designs and new features. The analyst must have a big part in those decisions, because the analyst has the best understanding of the underlying data.

An Analyst’s Ears are Their Biggest Assets

Earlier in the post, I detailed how finding a sponsor is one of the most important things for an analyst to do in the first days of a role.  The main point is that, if the analyst isn't communicating with other individuals in the organization, they will get lost in anonymity. Anonymity is not the place for an analyst.  Analysts should field questions, analysts should be considered experts and (most of all) analysts should be relied upon to help steer the strategy of the organization. With that said, the analyst should take the initiative in understanding the issues (both current and potential) of the organization.  This is done by keeping an ear out for the concerns of the individuals in the organization. This is done by attending meetings planned for groups that have no obvious connection with analytics, like marketing, advertising, operations and executive teams. Practice active listening when speaking with colleagues. Only after their concerns are understood can the analyst really become an asset to an organization.

Analysts Must Have Thick Skin and Unwavering Honesty

It is no secret that some of the most productive people across a number of industries have mastered the skill of effectively saying "no".  These individuals understand that they cannot perform at their best if they are bogged down by inefficient processes and unfruitful tasks.  Analysts must take this concept a bit further: they must be honest with their colleagues, even when they have to deliver bad news.  This is especially true for analysts who work in marketing.

Digital Ad Spending in the United States

eMarketer estimates that digital ad spending will hit $172 billion (with a "B") in the U.S. by 2021.  In other words, a lot of money gets spent on advertising in the U.S., with the majority going to Google and Facebook.  Almost every marketing organization contributes some part of this spending, and with that comes added scrutiny of marketing budgets and performance.  The marketing analyst is on the front line in answering questions about how well budgets are spent. They have visibility into campaign spend and the skills to analyze campaign performance.  That can make the job very hard, as they will sometimes have to deliver bad news about a campaign's performance. The best analysts are honest about the performance of marketing campaigns while giving good advice on corrective measures.  They stay in front of issues and are ready to answer questions before they are asked. They know when a campaign is going sideways before their colleagues do.

I hope this helps in understanding some of the lesser-known skills of an effective analyst.  It can be a challenging career, but one with numerous benefits!


Google Analytics Premium vs Adobe Analytics Comparison

Over the years, most analysts, especially those dealing with the operations of websites and/or mobile apps, will find themselves in one of two situations. Some will face the dilemma of which web analytics tool to choose.  The others will simply wish they had the ability to make that choice themselves.  In my career, which spans years of using different tools for a number of different purposes, I'd rather have to make a choice than not have one.  This is especially true when choosing analytics tools.  My aim in this post is not to catalog every difference between Adobe Analytics and Google Analytics; there are scores of blog posts that tackle that very task.  For example, EDUCBA's blog post on Adobe Analytics vs Google Analytics is a comprehensive and up-to-date comparison that is worth referring to when making a choice.  The EDUCBA post also improves on most comparisons I've read, which tend to be biased in some way.  For instance, most of the comparisons that lean toward Google were written by agencies that happen to be analytics partners for Google (providing services such as Google Analytics Premium implementation, training or sales), while comparisons that lean toward Adobe tend to be outdated or based on edge cases that most large organizations will never encounter.  My aim in this post is to touch on some of the topics usually not covered in comparisons, such as Tag Management, Raw Data Export and Default Metrics, from an outside perspective.

Before moving forward, please note that I do not currently work for, nor have I ever worked for, Adobe, Google or any company compensated by either of them. This is my own personal opinion and not the view of either organization. Now that that's out of the way, let's begin.

Tag Management

You may be thinking that Tag Management is a topic most comparison posts already cover.  You'd be right; however, I don't think most of those posts account for the sheer importance an analyst should place on Tag Management.  Most of them give it the same emphasis as Reporting or Cost.  This is poor analysis, as a web analytics engagement will only perform as well as its implementation.  Issues such as poor pageview tagging, inconsistent marketing parameters and spotty success metrics will sink an analysis in no time.  Tag Management should be weighted higher than most topics when evaluating analytics providers.

When to choose Adobe over Google

  1. Adobe should be considered if you have an extremely hard-to-track website, like one built in Flash.  With that said, if you have a Flash website in 2019, you probably have other issues.
  2. You need a large number of custom dimensions (say, greater than 200) and you represent one of the unicorn organizations with the time, resources and sanity to actually maintain over 200 dimensions.

When to choose Google over Adobe

If you don't fit into the two categories listed above, I find it hard not to advise using Google.  Google Tag Manager has the rare combination of being intuitive and powerful while also looking pretty slick.  The interface is easy to navigate, and features like Lookup Tables make working with large numbers of tags easy.  One of the other major differences between Google Tag Manager and Adobe DTM / Launch is support for third-party tags.  With GTM, you can place tags for other tools such as Clicktale, CrazyEgg and LinkedIn Insight directly in the container, without troubling your IT resources with a request to add more JavaScript.  I haven't worked with an organization that didn't use Google Tag Manager (even the organizations that had Adobe Analytics installed).

Export of Raw Data

This is another topic that I've never seen covered in other comparisons.  Today, marketing organizations must be able to join their web analytics data with other sources, such as CRM data, to create the much-sought-after omni-channel view of the customer. I haven't worked for an organization that didn't warehouse and report on raw data, especially web analytics data.  One example of this type of analysis would be clickstream data including both user and session IDs.  Neither Google nor Adobe Analytics provides this type of data in the interface (at least not for every session).  For this type of analysis, web analytics providers must export data at a level more granular than what the interface provides.


Adobe provides an API for accessing data from the interface, and a number of dimensions and metrics, including all user-generated props and eVars, are available through it.  With that said (at the time of this blog post), not all dimensions are available, and some absolute must-have dimensions, such as visit start timestamps and hit timestamps, are missing from the API.  This has been a deal breaker for many of the organizations I've worked for.  I also find the documentation to be pretty underwhelming.

On the other hand, Google has multiple APIs for solving different tasks involving Google Analytics data.  The Reporting API handles most general data requests, the Multi-Channel Funnels API exports data for marketing attribution reporting, the Real Time Reporting API exports data at a real-time pace, and there is even a Management API for easily granting and terminating data access (very helpful for large organizations).  The documentation for each of these APIs is extensive, always seems to be up to date, and the APIs perform well.

Large Data Export

Not all data export issues can be solved with an API.  Very large exports need dedicated tools to work expediently.  Adobe has Data Feeds, which allow users to schedule data exports from the interface.  Feeds can be sent to either an FTP account or an S3 bucket for convenient loading into AWS data warehousing tools such as Redshift or Athena.  All dimensions available in the interface are available in Data Feeds, including some dimensions not exposed in the API, like visit start timestamps and hit timestamps.  Data is exported at the hit level to give the most granular view of activity possible.

Now, for the bad points.  Almost no data cleaning is done before the data is received, and each individual day arrives in over 8 different files (a data file and various "lookup" files).  Any organization that wants to use this data will have to spend resources on a round of cleaning, manipulation and merging before the data is usable.  After moving from an organization that dealt with Google data to one that dealt with Adobe data, this was one of the most frustrating "surprises" I had to deal with.  The export jobs are also painfully slow: if a job is scheduled to export historical data, each daily file is sent only once every 30 minutes, which means pulling one year's worth of data takes over 7 days! Adobe has recently released a new feature for querying Adobe Analytics data called Query Service. However, when I inquired about using the service, I was told that my company would have to incur extra fees to gain access. This was... disappointing.

Google's solution exports data to BigQuery, its cloud data warehouse, and is included in the service if you are paying for Google Analytics Premium.  The BigQuery Export can be configured directly in the Google Analytics interface.

Google Analytics BigQuery Export Interface

Once configured, up to 13 months of historical data pre-populates in your BigQuery account, and future data is populated daily as it arrives.  The daily data is all contained in concise single daily tables, with no need to join to lookup tables to select other attributes.  Lastly, the data is contained in nested session-level arrays, and (while the query structure has a bit of a learning curve) breaking down reports at the user, session or hit level is pretty straightforward.  The same can't be said for Adobe Analytics Data Feed data, which is so raw that it's difficult to know how to query it correctly.  In my opinion, data export is a huge issue for Adobe Analytics.
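As an illustration of that nested structure, a standard SQL query like the following (the project and dataset names are placeholders for your own export) unnests the hits array to report at the page level:

```sql
-- Pageviews per page for one day of GA BigQuery Export data.
-- `my-project.my_dataset` is a placeholder for your export dataset.
SELECT
  h.page.pagePath AS page,
  COUNT(*) AS pageviews
FROM
  `my-project.my_dataset.ga_sessions_20190101`,
  UNNEST(hits) AS h
WHERE
  h.type = 'PAGE'
GROUP BY page
ORDER BY pageviews DESC;
```

Each `ga_sessions_` daily table holds one row per session, with user fields like `fullVisitorId` at the top level and individual hits nested inside.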

Default Reports, Metrics and Dimensions

As previously mentioned, the behavior of Data Feeds in Adobe Analytics was a frustrating surprise once I moved from an organization using Google Analytics to one using Adobe Analytics. Another frustrating surprise was the lack of canned metrics and dimensions. Adobe prides itself on being highly customizable and able to fit any type of website. While that can't be refuted, it is also one of Adobe's biggest flaws: because of this level of freedom, Adobe lacks a good tagging framework that can be applied to the majority of websites. Google, on the other hand, has frameworks, developed by its engineers, that work well with most sites. These include the utm marketing parameter framework for populating marketing reporting, the event tagging structure, the site search structure and the ecommerce tagging structure. Google has done the dirty work of finding frameworks that have worked for other companies in the past and posted them directly in the documentation. The same applies to the reports, metrics and dimensions themselves. Here's a short breakdown of some very useful features that are "out of the box" in Google Analytics (Free or Premium) after implementing only the pageview tag and tagging marketing initiatives with simple utm parameters:

| Feature | Type | Google | Adobe |
| --- | --- | --- | --- |
| Multi-Session Attribution | Report | Yes | Prime |
| New vs Repeat | Dimension | Yes | No |
| Campaign Cost Reporting (Google Ads) | Metric | Yes | No |
| Site Speed Reporting | Report | Yes | No |
| Demographics (Google Ad Based) | Report | Yes | No |
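For reference, a landing page URL tagged with the utm framework looks something like this (the parameter values are hypothetical):

```
https://www.example.com/landing-page?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale
```

Google Analytics parses utm_source, utm_medium and utm_campaign automatically to populate the Acquisition reports, with no extra tagging work.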

Custom Reporting and Dashboards

Custom reporting is another topic that is normally glossed over in comparison articles and blog posts. Historically, Adobe has not provided much in the way of dashboards, other than the rather simple "reportlets" displayed on the homepage after logging into the interface. These widgets provide some simple data in a table format, but no graphs for understanding trends or differences visually. This changed recently with Adobe's release of Analysis Workspace. Analysis Workspace allows users to create dashboards with a number of different visualization types, such as tables, line graphs, bar charts, scatter plots and maps. While this functionality is a welcome sight for any analyst looking for a good high-level view of site performance, some features of the reporting make me scratch my head. I can't think of a day in my career as an analyst when I didn't do some sort of comparison, especially when evaluating the performance of a site over a particular time period. Therefore, I was more than miffed when I noticed the number of steps required to compare date ranges in Analysis Workspace. To do a comparison, I have to select a date range that includes both of the ranges I'm comparing (strange, especially for year-over-year data), create two separate date ranges for the ranges being compared (much like creating segments), and then create a table report listing both date ranges.

Date Range Builder is part of the multi-step process for comparing dates in Adobe

Also, as far as I know, the comparison is limited to the table view and is not available for other visualizations such as bar charts and line graphs. This process is more cumbersome than I'd expect, which leads me to dashboarding in Google. Google provides some useful visualizations on the homepage of the reporting interface, as well as custom dashboard functionality. There, I'm able to create a dashboard with up to 12 widgets using visualizations such as tables, line graphs, bar charts and maps. Widgets can also include a real-time user count broken down by a number of dimensions, such as marketing channel and page. However, the real differentiator between Google's and Adobe's dashboarding is how easily date comparisons can be done in Google. Google provides the same date dropdown in the dashboard as it does in all other reporting, letting me easily select the date ranges I'd like to compare without the extra steps required in Adobe. The comparison also applies to all of the visualizations (except real-time, of course).

Date dropdown in Google Analytics
Bar chart from Google Analytics custom dashboard

There are some really nice advantages to using Adobe Analytics over Google Analytics. Adobe offers much more in the way of customization and has dedicated account services, while Google Analytics Premium provides SLAs and some services but lacks true assigned account representation. With that said, as I've mentioned throughout this post, I find the lack of structure, the lack of expected default metrics and dimensions (bounce rate, new vs repeat, etc.) and the cumbersome raw data export to be some pretty large issues with Adobe Analytics. While many other comparison posts end with something to the effect of "it's up to organizations themselves to decide which tool to use", I have a hard time not endorsing Google.


Which is Faster?: R Data Manipulation with data.table vs dplyr

I started my voyage into learning R by taking Datacamp's online courses.  After finishing courses on data manipulation in both base R and dplyr, I stumbled upon a course on the data.table library.  I was taken aback a bit after learning that data.table uses a different syntax than base R.  This was unnerving, as I didn't know what I would gain from learning data manipulation in yet another syntax.  My skepticism, however, changed to optimism once I began working on a rather large dataset a few weeks later.  This dataset (a 43M-row set of email opens and clickthroughs) took something like 30-40 minutes to read into R using the base read.csv function.  Instead, I tried the fread function from data.table.  Lo and behold, what took 30-40 minutes using base R took about 5 minutes using fread.  Here's timing data for a 3M-row text file:

fread for r
85% improvement in median performance using fread
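A minimal way to reproduce this kind of comparison yourself, using a small generated CSV in place of the large dataset (and assuming the data.table package is installed):

```r
library(data.table)

# Write a sample CSV to a temporary file (stand-in for the large dataset).
csv_path <- tempfile(fileext = ".csv")
sample_df <- data.frame(id = 1:100000, value = rnorm(100000))
write.csv(sample_df, csv_path, row.names = FALSE)

# Time base R's read.csv against data.table's fread.
base_time  <- system.time(df_base <- read.csv(csv_path))
fread_time <- system.time(df_fast <- fread(csv_path))

print(base_time["elapsed"])
print(fread_time["elapsed"])
```

The gap widens dramatically as files grow, since fread memory-maps the file and parses columns in parallel.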

The speed improvements are not just limited to reading data.  Manipulations are also faster using data.table.  Here are 2 functions written to group and count rows using the world cities population dataset.
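A hedged reconstruction of those two functions (with a small inline sample standing in for the world cities dataset) might look like this:

```r
library(data.table)
library(dplyr)

# Small stand-in for the world cities population dataset.
cities <- data.frame(
  country = c("US", "US", "FR", "FR", "JP"),
  city    = c("New York", "Chicago", "Paris", "Lyon", "Tokyo")
)

# dplyr version: group by country and count rows.
df_func <- function(df) {
  df %>% group_by(country) %>% summarise(n = n())
}

# data.table version: the same grouping and count, more tersely.
dt_func <- function(df) {
  dt <- as.data.table(df)
  dt[, .N, by = country]
}

df_func(cities)
dt_func(cities)
```

Both return one row per country with a row count; the data.table `.N` idiom does in one expression what dplyr spreads across a pipeline.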

data.table performance
83% increase in median performance using data.table

Lastly, as you can see in the functions above, the data.table function (dt_func) is less verbose than the dplyr function (df_func).  One reason for this is that dplyr is designed to be easily read and shared from one programmer to another; however, not every programmer needs to share their code.  In any case, once I learned the data.table syntax, I preferred it over the dplyr syntax.  This seems to be the case for many programmers with a prior foundation in SQL.

datacamp data.table SQL similarity
Datacamp’s data.table tutorial explains the data.table – SQL similarity

While learning a new syntax can be a tough task, I have to admit that the extra work of learning data.table is worth it.

If you haven’t had a chance to read my last post on using R with Google Analytics data, please take a look.  Also, if you have any comments, questions or if simply want to call me crazy, drop a comment below.

Finding a Marketing Mix with Google Analytics Multi Channel Funnels and R

Google Analytics Multi Channel Analysis

Online marketing channels such as Paid Search and Display Advertising are used by scores of organizations to improve outreach.  Being able to improve your visibility by simply purchasing traffic is very useful.  What is not useful, at least for most marketing organizations, is coming to grips with whether the money for that traffic was spent as efficiently as possible.  Most organizations simply use the Acquisition reporting in Google Analytics to learn how much traffic their marketing campaigns generate (bad).  Some will even venture to see how many conversions or how much revenue the campaigns produce (better, but still bad).

Only the savvy organization will employ the technique known as multi-session marketing analysis.  This technique uses user and session data to analyze the activity of users across multiple sessions, improving on the simplistic "last click" attribution model used in the regular Google Analytics Acquisition reporting.  Google provides some reporting tools for this type of analysis in the Multi-Channel Funnels and Attribution reports found under the "Conversions" tab in Google Analytics.  The Model Comparison Tool can even be used to compare different models (e.g. "last click", "first click", etc.).

multi session marketing analysis
Model Comparison Tool in Google Analytics

The unfortunate thing about using these models is that there is no such thing as a one-size-fits-all model for analyzing a site's marketing channels.  Each site has its own flavor and its own user base with its own marketing behavior.

To deal with this issue, an organization can either pay for Google Analytics Premium (which uses a machine learning algorithm to predict the best model to use) or run a probabilistic model on its GA data. A Markov chain is one of the easier models to use, and it is a model used by a number of marketing attribution consultancies.

I'm no statistician, but I believe the simplest way to explain a Markov chain is that it is a way of describing the probability of events based on the immediately preceding event state. In this case each event is a session with an assigned medium, and the result is a re-assignment of value based on each channel's probability of leading to conversion.  Read more on Markov chains here.

Markov Chain for Marketing Analysis
Markov Chain Illustration

The good news for those of us without an advanced math degree is that there is an R package called ChannelAttribution which allows us to run the Markov chain model on data pulled directly from the Google Analytics API. It also compares the Markov model to heuristic models such as last touch, first touch and linear without much fuss.  There is a great tutorial on using ChannelAttribution on the Lunametrics blog.  The original script (thanks to Kaelin Harmon!) is a bit verbose, but it works very well.
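The full script isn't reproduced here, but a minimal sketch of the approach, using a hand-built path dataset in place of the Google Analytics API pull (the paths and counts are made up, and the ChannelAttribution package is assumed to be installed), might look like this:

```r
library(ChannelAttribution)

# Hypothetical multi-session paths: each row is a sequence of channel
# touches, with counts of converting and non-converting paths.
paths <- data.frame(
  path              = c("paid_search > email > direct",
                        "display > paid_search",
                        "email > direct",
                        "display"),
  total_conversions = c(120, 45, 80, 10),
  total_null        = c(400, 300, 150, 900)
)

# Heuristic models (first touch, last touch, linear) for comparison.
h_models <- heuristic_models(paths, var_path = "path",
                             var_conv = "total_conversions")

# Markov chain model, re-assigning conversion credit probabilistically.
m_model <- markov_model(paths, var_path = "path",
                        var_conv = "total_conversions",
                        var_null = "total_null", order = 1)

h_models
m_model
```

Each result is a data frame with one row per channel, which can then be merged and plotted to compare the models.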

This produces a dataframe and a plot which compares each of the heuristic models (last touch, first touch, linear) and the Markov Model.

Multi Session Analysis in R
Heuristic Models vs Markov Model

If you liked this post, you might want to take a look at my last post on using the GA API and R as an alternative to Google Analytics Premium or just leave me a comment below.

An Alternative to Google Analytics BigQuery Export Using R and Google Tag Manager

Google Analytics has proven to be one of the most influential tools ever created for marketing analysis.  Google is pretty unrelenting in its pursuit of innovation for Google Analytics, and that innovation shows in the number of other tools they've built for analysts.  From Google Sheets to BigQuery to Google Data Studio, these complementary tools are a great aid in dealing with the wealth of data that can be mined from Google Analytics.  One of the little-known yet game-changing tools available for use with Google Analytics data is the BigQuery Export.

Google Analytics Premium BigQuery Export

This tool, which is only available to users of Google Analytics' premium product, is in essence a raw data export of a website's Google Analytics data.  It frees any analyst with a decent knowledge of SQL from the shackles of Google's canned reports and allows for much more robust reporting logic.  For instance, if an analyst wanted a report of all users who viewed a particular page during their session and returned to the site within 6 days, they would be limited only by their knowledge of SQL and their ability to fork out the $150K Google charges for the premium product!

Google Analytics does not provide data at the user level out of the box; however, with the aid of a process outlined in Simo Ahava's tremendously useful blog, you can use Google Tag Manager to pull Google's user identifier (also known as the Client ID) out of the cookie and feed it back to Google Analytics in a custom dimension or event.  This gives an analyst the ability to report on activity at the individual user level.

Remember: passing personally identifiable information, such as email addresses, to Google Analytics is a violation of the terms of service, so never pass it along, even if you have it.

Below are the steps I use for passing pageview data along with user and session data to the GTM data layer for logging in Google Analytics.  You could technically use a slightly different process to pass ecommerce, event, goal and custom metric/dimension data as well; I'll cover that in a later post.  I'll assume that the reader has already tagged all of their pages with a Google Tag Manager container, but if not, start by reading this post and make sure to tag your pages.


  1. Create a Custom Dimension by going to the admin page of your Google Analytics view. Under "Custom Definitions", select "Custom Dimensions" and create a dimension called "Client ID" (or whatever name you prefer) with a scope of "Session".  Make note of the dimension index (you'll need to enter it later).
  2. Create a Custom JavaScript Variable in Google Tag Manager and give it a title such as {{Set Client ID in Dimension 1}}.
    Here is the code:
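The original snippet was embedded separately; a reconstruction of the customTask pattern popularized by Simo Ahava looks roughly like this (in GTM the body is an anonymous function, and the dimension index here is an assumption to be matched to your own setup):

```javascript
// GTM Custom JavaScript Variable: returns a customTask function that
// copies the Google Analytics Client ID into a custom dimension.
function setClientIdCustomTask() {
  var customDimensionIndex = 1; // set this to the index noted in step 1
  return function (model) {
    // model is the tracker model GA hands to customTask; clientId is the
    // anonymous ID stored in the _ga cookie.
    model.set('dimension' + customDimensionIndex, model.get('clientId'));
  };
}
```

The returned inner function is what Google Analytics invokes on every hit, so the Client ID rides along with each pageview.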

    Make sure to set the correct index in the customDimensionIndex variable.  If you've completed this step correctly, you will be able to see the Client ID being passed under whichever custom dimension you set up, using the Google Analytics Debugger tool.

    Client ID being passed into dimension 1
  3. If everything shows up, move back to Google Tag Manager and edit the pageview tag for your site. Under "Fields to Set", type "customTask" and under "Value" use the dropdown to select the variable created in step 2, {{Set Client ID in Dimension 1}}.

That concludes the first part of the process.  Once you've reached this step, you could technically start playing with the user and session Client ID dimension in Google Analytics' custom reports.

Pull Client ID Custom Dimension Data

So we've tagged our site to send user and session data to Google Analytics and have dealt with sampling; now for the fun part.  This script pulls page URLs, user and session IDs by date based on the dimensions detailed above.

Run some other scripts

Using some other scripts, an analyst can answer a number of other questions, like how long it takes a new user to become a repeat user.  These scripts rely heavily on data.table syntax instead of base R.  Please take a look at my prior post on using data.table to learn why I do so.
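The scripts themselves were embedded separately; a hedged sketch of the core calculation, using a hand-built clickstream in place of the GA export, might look like this:

```r
library(data.table)

# Hypothetical session-level data: one row per (client_id, session date).
sessions <- data.table(
  client_id    = c("a", "a", "b", "b", "c"),
  session_date = as.Date(c("2019-01-01", "2019-01-05",
                           "2019-01-02", "2019-01-10",
                           "2019-01-03"))
)

# For each user, days between their first and second session.
setorder(sessions, client_id, session_date)
return_days <- sessions[, .(days_to_return = as.numeric(
  session_date[2] - session_date[1])), by = client_id]

# Average days for a new user to become a repeat user
# (users with only one session drop out as NA).
mean(return_days$days_to_return, na.rm = TRUE)
```

With the sample data above, users "a" and "b" return after 4 and 8 days, so the average is 6 days.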

Data Returned

Then run the rest:

This will return the average number of days it takes a new user to become a repeat user.

calculate number of days for new user to return

There are a number of other uses for the Client ID data in GA.  For instance, a marketer might want to do some attribution modeling, or a content manager might want to know whether viewing an article in one session affects subsequent sessions.

One consideration around this type of analysis is scale.  Most smaller websites won't pose an issue, but some larger sites (like the one I currently work on) will.  Pulling an individual non-aggregated row for every session, page or event can yield some extremely large datasets.  In that case, it makes sense to send the data to a cloud data warehouse such as BigQuery.  Want to learn how to use R to solve this?  Stay tuned...

3 Ways To Analyze Google Analytics Data in R with RGA and ggplot2

In my opinion, Google Analytics is the single most influential development in marketing analytics ever.  Quantcast estimates that 70% of its top 10,000 websites have GA installed.  Google has shown a relentless drive to improve the product over the years, and its free price tag ensures access for almost anyone who runs a website.  With that said, Google Analytics is a service, and no service (great or lacking) is without flaws.  One of GA's hidden advantages is a robust API, which allows users to build some of the features missing from the standard interface.  I wanted to cover some of the ways a user can use R to fill in features not available in GA.

In order to use any of these techniques, you will have to install R as well as the rga and dplyr packages, which are available on CRAN.  Other packages used include ggplot2 for visualization, scales, lubridate and zoo.  Use the script below to install.
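The install script was embedded separately; a minimal version covering the packages named above would be:

```r
# Install the packages used throughout these examples.
install.packages(c("rga", "dplyr", "ggplot2", "scales", "lubridate", "zoo"))
```

Run it once; after that, each script only needs the relevant library() calls.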

  1. Event Conversion Rate Script

    One of my gripes with Google Analytics is that the Top Events report includes total event counts but no conversion metric.  If you are using the Google Tag Manager click-listening technique to add events to your site, you could add a bit of custom JavaScript to pass an impression for the same element; however, in many cases a simple total event count over the pageview count will suffice.  Here's a script that grabs that simple metric:
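The original script was embedded separately; a simplified sketch of the approach (assuming an authorized rga session, with a placeholder view ID and pagePath standing in for the content group breakdown) might look like this:

```r
library(rga)
library(dplyr)

# Authorize against the Google Analytics API (opens a browser on first run).
rga.open(instance = "ga")

view_id <- "ga:12345678"  # placeholder: your GA view ID
start   <- "2019-01-01"
end     <- "2019-01-31"

# Event counts by event parameters and page.
events <- ga$getData(view_id, start.date = start, end.date = end,
                     metrics = "ga:totalEvents",
                     dimensions = "ga:eventCategory,ga:eventAction,ga:eventLabel,ga:pagePath")

# Pageviews by page, to use as the denominator.
pages <- ga$getData(view_id, start.date = start, end.date = end,
                    metrics = "ga:pageviews",
                    dimensions = "ga:pagePath")

# Join and compute a simple events-per-pageview "conversion" rate.
event_rates <- events %>%
  inner_join(pages, by = "pagePath") %>%
  mutate(eventRate = totalEvents / pageviews)
```

The two-query-then-join shape avoids mixing event and pageview metrics in a single request, where the counts would not line up cleanly.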

    Sample output row: Social Link | /3 | Homepage | 9.3333

    This gives all event parameters (Category, Action and Label) as well as the page URL and content group 1, allowing the user to easily aggregate pages if they are passing content groups.  I strongly encourage using content groupings.

  2. Analyze Acquisition Mediums with ggplot2

    Google Analytics has some good embedded graphs for analyzing traffic mediums, and the advent of Google Data Studio gives users even more flexibility; however, sites with large numbers of marketing mediums (10+) pose issues for these tools.  Using ggplot2 in R, a user can create what analysts call "small multiples": a series of similar graphs or charts using the same scale and axes, allowing them to be easily compared.  Below is a script that returns small multiples for a year-over-year comparison of marketing mediums.
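The script itself was embedded separately; a hedged sketch using made-up session counts shows the facet_wrap approach behind small multiples:

```r
library(ggplot2)

# Hypothetical monthly sessions by marketing medium for two years.
traffic <- expand.grid(month  = 1:12,
                       year   = factor(c(2017, 2018)),
                       medium = c("organic", "cpc", "email", "referral"))
set.seed(42)
traffic$sessions <- round(runif(nrow(traffic), 1000, 5000))

# One panel per medium, identical axes everywhere: "small multiples".
p <- ggplot(traffic, aes(x = month, y = sessions, color = year)) +
  geom_line() +
  facet_wrap(~ medium) +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Year over Year Sessions by Medium")

print(p)
```

Swapping the made-up data frame for a GA API pull (medium, month, year, sessions) gives the year-over-year comparison described above.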

    Small Multiples using ggplot2 and R
  3. Analyze Product Performance, Content Groups or Other Categories with ggplot2

    A user could also use the previous script for small multiples to learn about other categorical data like revenue by product:
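    Adapting the same sketch only requires swapping the metric and the faceting dimension (again, the view ID is a placeholder and `scales` is assumed to be installed):

```r
library(dplyr)
library(ggplot2)

# Monthly revenue by product
products <- ga$getData(
  ids        = "ga:XXXXXXXX",
  start.date = "2017-01-01",
  end.date   = "2017-12-31",
  metrics    = "ga:itemRevenue",
  dimensions = "ga:yearMonth,ga:productName"
)

products %>%
  mutate(date = as.Date(paste0(yearMonth, "01"), format = "%Y%m%d")) %>%
  ggplot(aes(x = date, y = itemRevenue)) +
  geom_line() +
  facet_wrap(~ productName) +                  # one panel per product
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Revenue by Product", x = NULL, y = "Revenue")
```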

    If you haven’t had a chance yet, please read my post on why an analyst should learn R.

    Have any questions or comments?  Let me know in the comments section.


6 Reasons Why An Analyst Should Learn R: Why Learn R for Data Analysis?

I had some very specific reasons for deciding to learn R for data analysis.  One of my original reasons was to be able to work directly with APIs, and while that reason is valid, in hindsight, I have uncovered a number of other reasons to learn R.  I wanted to write about my 6 best reasons for learning R.  I’m not writing this post to pontificate on the differences between R and Python.  I think both languages have distinct advantages.  As R was written with statistical analysis and data visualization in mind, it was the language I (an analyst) decided to learn first.  Also, please keep in mind that these are reasons surrounding how to best manipulate, analyze and learn from data, not to simply “get a job.”  So, here they are:

1. Reproducible Analysis

Download a set of data from a data source, open the file in Excel, manipulate the data in Excel, etc.  This was my normal flow of analysis when Excel was my main analysis tool.  R simplifies, or simply removes, most of this workflow.  R analyses can be written, saved and re-run in the future.  A single well-written script can handle the tasks of pulling data from a local file source or an API, munging the data, producing an analysis, displaying a visualization and exporting a file.  A well-written script can also be passed from one analyst to another with little accompanying assistance.  Packages such as rmarkdown can also aid in the process of creating reproducible reports by allowing users to create documents, such as HTML files, for the presentation of their analysis.
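As a sketch of that end-to-end workflow (the file name and column names here are hypothetical):

```r
# One rerunnable script: pull, munge, visualize, export.
library(dplyr)
library(ggplot2)

sales <- read.csv("sales.csv")                  # hypothetical local data source

monthly <- sales %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%
  group_by(month) %>%
  summarise(revenue = sum(revenue))             # munge: aggregate to monthly

ggplot(monthly, aes(month, revenue, group = 1)) +
  geom_line()                                   # visualize
ggsave("monthly_revenue.png")                   # export the chart

write.csv(monthly, "monthly_summary.csv", row.names = FALSE)  # export the data
```

Rerunning the script next month against a refreshed `sales.csv` reproduces the whole analysis with no manual spreadsheet steps.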

Data Analysis in R Markdown


2. R doesn’t have the 1M row limit Excel has

When I began my career as an analyst, Excel 2003 was the spreadsheet tool of choice for data analysis, and back then Excel had a row limit of 65K rows.  Excel 2007 extended the row limit to 1M rows; however, this limit is still a bit modest for many analysts.  With R, row limits are bounded only by the data you have and the hardware you use.  The largest dataset I have loaded into memory and manipulated was around 43M rows (Windows 10 desktop, i5 processor, 8 GB of RAM), and I am sure I could load much larger datasets.  Keep in mind that as the dataset grows, so does the load on your resources.  With this in mind, certain packages in R help deal with large datasets, whether locally or via cloud computing (data.table, bigmemory, sparklyr, etc.).
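Seeing this for yourself takes one line of base R — here, 10M rows of random numbers, roughly ten times Excel's ceiling:

```r
# 10M rows, two numeric columns, built entirely in memory
df <- data.frame(x = runif(1e7), y = rnorm(1e7))

nrow(df)                                 # 10000000
format(object.size(df), units = "MB")    # roughly 150 MB in RAM
```

Two numeric columns of 10M doubles cost about 160 million bytes, which is why RAM, not an arbitrary row cap, is the real constraint.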

10M rows of random numbers

3. Large Community of Users

There are a number of quantitative measures that prove the proliferation of R: thousands of packages on CRAN, heavy Stack Overflow activity, and local user groups around the world.  All of that translates into a vast amount of support when you get stuck.

4. R is a language written specifically for statistical analysis and data visualization.

The R language is widely used among statisticians and data miners for developing statistical software as well as creating data visualizations.

New York Short Term Rental Data in Shiny for R

5. RStudio is an excellent IDE.

RStudio for Data Analysis

Many other programming languages have scores of Integrated Development Environments, or IDEs (PyCharm and Spyder for Python, Visual Studio for .NET, PhpStorm for PHP, etc.); some languages have 10 or 20 of them.  Making a choice among that many IDEs can be problematic, as some contain features that others do not, and some are not open source and require payment for maintenance.  Much like R itself, RStudio was built with a focus on statistical computing.  Unlike most IDEs, which are designed with general programming in mind, RStudio’s workflow is meant for the analyst or statistician.  RStudio is also one of the only IDEs for R, so there are loads of resources available on using it.  RStudio is open source as well.

6. Because R was written specifically for statistical analysis, a number of machine learning and data science algorithms are also available for R.

This makes R a good springboard for analysts wanting to get started in machine learning and data science.
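For example, k-means clustering ships with base R’s stats package — no extra installs needed (the iris dataset is built in):

```r
# Cluster the four iris measurements into 3 groups with base R's kmeans()
set.seed(42)                                # make the clustering reproducible
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Compare discovered clusters against the actual species labels
table(fit$cluster, iris$Species)
```

Packages like caret and randomForest extend this foundation, but the point is that an analyst already comfortable with R data frames can run a first machine learning model in a handful of lines.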

Bonus! – R is open source.

With some ingenuity and research, R can do many of the same things as Excel, SAS and Tableau, or it can be a great complement to these tools.  Also, because R is open source, it is completely extensible and can be distributed and changed to suit a user’s needs.