I had some very specific reasons for deciding to learn R for data analysis. One of my original reasons was to be able to work directly with APIs, and while that reason is valid, in hindsight, I have uncovered a number of other reasons to learn R. I wanted to write about my 6 best reasons for learning R. I’m not writing this post to pontificate on the differences between R and Python. I think both languages have distinct advantages. As R was written with statistical analysis and data visualization in mind, it was the language I (an analyst) decided to learn first. Also, please keep in mind that these are reasons surrounding how to best manipulate, analyze and learn from data, not to simply “get a job.” So, here they are:
1. Reproducible Analysis
Download a set of data from a data source, open the file in Excel, manipulate the data in Excel…etc. This was my normal flow of analysis when Excel was my main analysis tool. R simplifies or simply removes most of this workflow. R analyses can be written, saved and re-run in the future. A single well-written script can handle the tasks of pulling data from a local file or an API, munging the data, producing an analysis, displaying a visualization and exporting a file. A well-written script can also be passed from one analyst to another with little accompanying assistance. Packages such as rmarkdown (R Markdown) also help with reproducible reporting by letting users generate documents, such as HTML files, that present their analysis.
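A minimal sketch of such a script, using only base R, might look like the following. The built-in `mtcars` dataset and the temporary file paths are stand-ins chosen for illustration; a real script would point at your own data source.

```r
# Illustrative reproducible workflow (base R only).
# Assumption: mtcars written to a temp CSV stands in for a downloaded file.
input_path <- tempfile(fileext = ".csv")
write.csv(mtcars, input_path, row.names = TRUE)

# 1. Pull the data from a local file
cars <- read.csv(input_path, row.names = 1)

# 2. Munge: treat cylinders as a category, convert US mpg to km per litre
cars$cyl <- factor(cars$cyl)
cars$kpl <- cars$mpg * 0.425144

# 3. Analyze: mean fuel economy per cylinder group
summary_tbl <- aggregate(kpl ~ cyl, data = cars, FUN = mean)

# 4. Visualize: write a boxplot to a PDF file
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)
boxplot(kpl ~ cyl, data = cars,
        xlab = "Cylinders", ylab = "Fuel economy (km per litre)")
dev.off()

# 5. Export the summary table
output_path <- tempfile(fileext = ".csv")
write.csv(summary_tbl, output_path, row.names = FALSE)
```

Re-running the script repeats every step in the same order, which is what makes the analysis reproducible and easy to hand to another analyst.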
2. R doesn’t have the 1M row limit Excel has
When I began my career as an analyst, Excel 2003 was the spreadsheet tool of choice for data analysis. Back then, Excel had a limit of 65,536 rows (about 65K). Excel 2007 extended the limit to 1,048,576 rows (about 1M), but this is still a bit modest for many analysts. With R, the number of rows you can work with is limited only by the data you have and the hardware you use, chiefly available memory. The largest dataset I have loaded into memory and manipulated was around 43M rows (Windows 10 desktop, i5 processor, 8 GB of RAM), and I am sure I could load much larger datasets. Keep in mind that as the dataset grows, so does the load on your resources. For that reason, certain R packages help deal with large datasets, whether locally or via cloud computing (data.table, bigmemory, sparklyr, etc.).
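To make the point concrete, here is a small base R sketch that builds a data frame with more rows than an Excel sheet can hold, checks its memory footprint and summarizes it. The column names and the simulated values are made up for illustration.

```r
# Excel's per-sheet row limit since Excel 2007
excel_row_limit <- 1048576

# Simulate a dataset with 2 million rows (hypothetical columns)
n <- 2e6
set.seed(42)
big <- data.frame(
  id    = seq_len(n),
  group = sample(letters[1:5], n, replace = TRUE),
  value = rnorm(n)
)

# More rows than a single Excel sheet can hold
nrow(big) > excel_row_limit

# Approximate memory used by the data frame
format(object.size(big), units = "MB")

# A grouped summary still runs on an ordinary desktop
group_means <- tapply(big$value, big$group, mean)
group_means
```

For reading large files from disk, the `fread()` function in data.table is a commonly used faster alternative to `read.csv()`.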
3. Large Community of Users
- There are an estimated 2 million users of R around the world
- CRAN (the Comprehensive R Archive Network) currently lists over 10K packages for R
- stackoverflow.com lists over 180K questions currently tagged for R
As you can see, a number of quantitative measures show how widely R is used and how much support is available.
4. R is a language written specifically for statistical analysis and data visualization.
The R language is widely used among statisticians and data miners for developing statistical software as well as creating data visualizations.
5. RStudio is an excellent IDE.
Many other programming languages have a large number of Integrated Development Environments or IDEs (PyCharm for Python, Spyder for Python, Visual Studio for .NET, PhpStorm for PHP, etc.). Some languages have 10 or 20 IDEs. Choosing among them can be problematic, as some IDEs contain features that others do not. Some IDEs are not open source and require payment. Much like R, RStudio was built with a focus on statistical computing. Unlike most IDEs, which are designed with general programming in mind, RStudio’s workflow is meant for the analyst or statistician. RStudio is one of the few IDEs for R, so there are loads of resources available on using it. RStudio is also open source.
6. Because R was written specifically for statistical analysis, a number of machine learning and data science algorithms are also available for R.
This makes R a good springboard for analysts wanting to get started in machine learning and data science.
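As a small taste of what that can look like, the sketch below fits a linear regression on a training split and checks its error on held-out rows, using only functions that ship with R. The choice of the built-in `mtcars` data, the predictors and the 24/8 split are arbitrary assumptions for illustration.

```r
# Illustrative train/test workflow with base R (stats package)
set.seed(1)

# Hold out 8 of the 32 rows in mtcars for testing
train_idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit a linear model: fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = train)

# Predict on unseen rows and compute root mean squared error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

The same fit/predict pattern carries over to many modeling packages on CRAN, which is part of what makes the jump to machine learning manageable.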
Bonus! – R is open source.
With some ingenuity and research, R can do many of the same things as Excel, SAS and Tableau, or it can be a great complement to these tools. Also, because R is open source, it is completely extensible and can be freely distributed and changed to suit a user’s needs.