Choosing R or Python for Data Analysis? An Infographic
by Karlijn Willems
Wondering whether you should use R or Python for your next data analysis post? Check our infographic "Data Science Wars: R vs Python".
I think you'll agree with me if I say:
It's HARD to know whether to use Python or R for data analysis. And this is especially true if you're a newbie data analyst looking for the right language to start with.
It turns out that there are many good resources that can help you to figure out the strengths and weaknesses of both languages. They often go into great detail, and provide a tailored answer to questions such as "What should I use for Machine Learning?", or "I need a fast solution, should I go for Python or R?".
In today's post, I present our new infographic "Data Science Wars: R vs Python", which highlights in great detail the differences between these two languages from a data science point of view. So the next time you are debating R vs Python for machine learning, statistics, or maybe even the Internet of Things, you can have a look at the infographic and find the answer. ...
R or Python? Why not both? Using Anaconda Python within R with {reticulate}
R! related
Blog: R Documentation and Learning Resources
IDE and GUI
- RStudio is an IDE for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R. RStudio Webinars
- Rcmdr is a GUI of R.
- esquisse is a great addin for ggplot2
- Deducer is a good but relatively old GUI for exploring data, like JMP, with ggplot2 behind it (Plot Builder). JGR is a Java GUI for R (ggplot2 is much easier with JGR and Deducer). To use Deducer, run install.packages(c("JGR", "Deducer", "DeducerExtras")), then library(JGR) and JGR(); then, in the JGR console, load Deducer by going to 'Packages & Data' > 'Package Manager' and selecting Deducer and DeducerExtras. More: all-in-one installing JGR and Deducer. Notes: 1) set the Java location using options(java.home="xxx/Java/"). 2) rJava ver 9.6 needs to run under 32-bit R! on my computers.
- GrapheR (pdf) is another GUI for drawing customized graphs without knowing any R commands.
- Tessera - Open source environment for deep analysis of large complex data (Divide and recombine)
- The application Bio7 is an integrated development environment for ecological modelling and contains powerful tools for model creation, scientific image analysis (ImageJ) and statistical analysis.
Communication between SAS and R
Graphical parameters
R! How can I include Greek letters in my plot labels?
Revolutions: How to make a heat map in R, Superheat: supercharged heatmaps for R
Chart Chooser improves Excel and PowerPoint charts. There is an R! version of Chart Chooser (not many charts on the site, but the idea is great)
Packages (rdrr.io, CRAN, Rdocumentation, Inside-R, Quick-R, Bioconductor)
- CRAN Task View organizes the packages into different groups such as Graphics, Survival Analysis, etc.
- margins: An introduction to 'margins'
- Transition from Excel to R!:
- DT: An R interface to the DataTables library
- excelR: An R interface to jExcel library to create web-based interactive tables and spreadsheets compatible with Excel or any other spreadsheet software
- DTedit: Editable DataTables for shiny apps
- rhandsontable is a htmlwidget based on the handsontable.js library.
- formattable is designed for applying formatting on vectors and data frames
- rpivotTable is a R wrapper for the great library pivottable
- Zelig is a general purpose statistics program for estimating, interpreting, and presenting results from any statistical method. It turns the power of R, with its free-ranging syntax, diverse examples, and documentation written for different audiences, into the same three commands and consistent documentation for every method
- Report and documentation
- The knitr package was designed to generate dynamic reports with R. Chunk options.
- rmarkdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.
- bookdown facilitates writing books and long-form articles/reports with R Markdown. Several very good free books are published on the bookdown website
- Programming
- Devtools makes it so easy to build a package that it becomes your default way to organise code, data and documentation.
- Data I/O and manipulation
- haven allows you to load foreign data formats (SAS, SPSS and Stata).
- xlsx gives programmatic control of Excel files using R.
- read.table() reads a file in table format and creates a data frame from it.
- arrow supports analyzing large, multi-file datasets and working with individual Parquet and Feather files.
- MMWRweek: Convert dates to MMWR Day, Week, and Year
- dlookr: Diagnose, explore and transform data
- foreign reads data stored by other statistical systems.
- dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. Introduction to dplyr. Notes: "Both data.table and dplyr were able to reduce the problem to less than a few seconds. If you’re looking for pure speed data.table is the clear winner. However, it is my understanding that data.table‘s syntax can be frustrating, so if you’re already used to the ‘Hadley ecosystem’ of packages, dplyr is a formitable alternative, even if it is still in the early stages."
- tidyr with spread() and gather(), a reframing of reshape2, is a new package that makes it easy to “tidy” your data.
- magrittr provides a forward pipe operator. magrittr: Simplifying R code with pipes. History of Magrittr.
- Aggregation and restructuring: the reshape2 package with the melt() or cast() functions. Transpose using the t() function; aggregate data using the aggregate() function.
- lubridate is good for manipulating time and date.
- Stringr makes it easier to work with strings (article).
- Quick-R examples of outputting R data.
- RStudio addins: datapasta, ggthemeassist
- pdftools: Text Extraction, Rendering and Converting of PDF Documents (example)
- Survey related packages
- GIS
- Data Visualization and Graphics
- ggplot2 (document) (You can build the book here: ggplot2: Elegant graphics for data analysis) is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. Cookbook for R is a great book. Tutorial websites: Tutorial I, Tutorial II. A good book website of Guide to Create Beautiful Graphics in R (Cheatsheet: Be Awesome in ggplot2). Beautiful plotting in R: A ggplot2 cheatsheet (pdf). Tutorial video (Part I and Part II) by Roger Peng
- A conversation with Hadley Wickham 2014 - Eduardo
- ggthemes is a package including some extra geoms, scales, and themes styles (the Economist, Excel, Stata, etc.) for ggplot2.
- GGally is designed to be a helper to ggplot2. It contains templates for different plots to be combined into a plot matrix, a parallel coordinate plot function, as well as a function for making a network plot.
- gtable is a package of tools to make it easier to work with "tables" of grobs, which is internally used by ggplot2. I used the gtable::cbind(..., size = "max", z = NULL) or gtable::rbind(..., size = "max", z = NULL) to match the size of plots.
- ggrepel is a ggplot2 extension to avoid overlapping text labels.
- Walker (2014). International population pyramids with ggplot2
- ggforce accelerates 'ggplot2'
- ggraph is an extension of ggplot2 tailored at plotting graph-like data structures (graphs, networks, trees, hierarchies...)
- WVPlots Provides examples of ggplot plots that can be generated from a standard calling interface
- classifierplots generates a visualization of binary classifier performance as a grid of diagnostic plots with just one function call
- ggalt: make a dumbbell plot in ggplot2
- ggcharts aims to get you to your desired plot faster.
- Plotly: ggplotly() from the plotly package is a powerful tool to convert ggplot2 plots into interactive, online ggplot2 charts with D3.js. Carson Sievert used this converter to recreate Hadley Wickham's entire ggplot2 documentation (here). Click-drag to zoom, shift-click to pan, double-click to autoscale. It's really amazing. Plotly Tutorial: Plotly and R
- grid and gridExtra are useful combined with ggplot2, for example:
- ggvis new package for data visualization. Like ggplot2, it is built on concepts from the grammar of graphics, but it also adds interactivity, a new data pipeline, and it renders in a web browser.
- vcd for visualizing categorical data. Working with categorical data with R and the vcd and vcdExtra packages and Visualizing Categorical Data with SAS and R by Michael Friendly.
- Distribution visualization: vioplot for violin plot (examples of boxplot and violin plot). Hintze (1998). Violin Plots: A Box Plot-Density Trace Synergism. Violin plot using ggplot2. Boxplots and beyond: Part I, II: asymmetry, III: violin plots, IV: beanplots.
- venneuler is my favorite package for making area-proportional Venn and Euler diagrams simply and quickly. I use it to create diagrams, then use Inkscape to modify them. A more complicated package is VennDiagram. Wilkinson (2012) introduced his venneuler package in the article "Exact and Approximate Area-proportional Circular Venn and Euler Diagrams". Chen (2011) compared some R! packages and programs in the article "VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R" (Create a Venn diagram using MS Excel)
- venn = venneuler(c(a1c=.15, fpg=.11, ogtt=.13, 'a1c&fpg'=.1, 'fpg&ogtt'=.13, 'a1c&ogtt'=.4, 'a1c&ogtt&unknow'=.1))
- plot(venn)
- Vennerable package provides routines to compute and plot Venn diagrams, including the classic two- and three-circle diagrams but also a variety of others with different properties such as rectangular or square Venn diagram and for up to seven sets. It's not available at CRAN. You can install this package using: install.packages("Vennerable", repos = "http://R-Forge.R-project.org")
- googleVis - Google Motion Charts with R!
- colourlovers provides access to the COLOURlovers API, which offers color inspiration and color palettes
- Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data.
- scatterplot3d provides routines for the visualization of multivariate data in a three dimensional space.
- iPlots is a package for the R! which provides high interaction statistical graphics, written in Java.
- rcharts is an R package to create, customize and publish interactive javascript visualizations from R using a familiar lattice style plotting interface.
- wordcloud2 and wordcloud create word clouds for data visualization. Using R to create a Word Cloud from a PDF Document. wordcloud2: probably the best word cloud solution at present
- David Smith (2017).Packages to simplify mapping in R
- magick is a package of wrapping up of ImageMagick which is an image processing library
- networkD3: D3 JavaScript Network Graphs from R, good at Sankey Diagram
- Missing Imputation
- Statistics/Machine Learning
- MLR – Machine Learning in R and MLR3, a successor of MLR. Here is the mlr3 book
- tidymodels: a combination of multiple packages: parsnip, recipes, etc.
- tidyverse is a collection of popular packages for data munging from Hadley Wickham including ggplot2, dplyr, tidyr, readr, purrr, tibble, etc.
- bayestestR provides a comprehensive and consistent set of functions to analyze and describe posterior distributions generated by a variety of models objects, including popular modeling packages such as rstanarm, brms or BayesFactor.
- rms is a package that accompanies the book Regression Modeling Strategies
- caret is a set of functions that attempt to streamline the process for creating predictive models. Here are caret webinar and slides
- splines is now a part of the R! standard/add-on packages/bundle. splines::ns() generates a basis matrix for natural cubic splines. Bendix adapted ns() as Ns() in his Epi package, which uses the smallest and the largest of the supplied knots as boundary knots. You can download the latest version on his website here.
- lavaan: latent variable analysis
- msm::deltamethod: delta method, FAQ UCLA
- nls can be used to determine the nonlinear (weighted) least-squares estimates of the parameters of a nonlinear model, which is similar to 'nl' in Stata and 'proc nlin' in SAS
- rpart is for the recursive partitioning for classification, regression and survival trees.
- Hmisc: Harrell Miscellaneous
- convey estimates measures of poverty, inequality, and wellbeing using complex survey data
- deltavar() is a function in the emdbook package (the support package for Bolker's book), which calculates delta-method-based variances for functions with any number of parameters
- deltavar(log(A/B),meanval=c(A=0.8,B=12),Sigma=matrix(c(0.1,0,0,8.0),nrow=2)) ---> 0.2118056 = (1/A)^2*Var(A)+(1/B)^2*Var(B), which is the variance (take the square root for the standard error)
- deltavar(log(A/B),meanval=c(A=0.8,B=12),Sigma=c(0.1,8.0)) -> 0.2118056
- MASS is a classic package for Venables and Ripley's Modern Applied Statistics with S
- Bolstad is a set of R functions and data sets for the book Introduction to Bayesian Statistics, Bolstad, W.M. (2016). The "bayes.lin.reg" function may be used for combining two estimates, as in meta-analysis.
- Psych: Using R for psychological research
- Electronic Health/Medical Record (EHR) related
- coder: Deterministic Categorization of Items Based on External Code Data by rOpenSci (blog)
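The delta-method figure quoted for deltavar() in the list above can be checked by hand in base R. The following is only a minimal sketch, assuming independent A and B with the means and variances from that example; the names `meanval`, `Sigma`, `grad`, `delta_var`, and `delta_se` are illustrative and are not part of emdbook.

```r
# Delta-method variance of g(A, B) = log(A / B), computed by hand in base R.
# The inputs are the values used in the deltavar() example above.
meanval <- c(A = 0.8, B = 12)
Sigma   <- diag(c(0.1, 8.0))   # Var(A) = 0.1, Var(B) = 8, zero covariance

# Symbolic gradient of log(A / B), evaluated at the means
g    <- quote(log(A / B))
grad <- sapply(c("A", "B"), function(v) eval(D(g, v), as.list(meanval)))

# First-order delta method: Var(g) is approximately grad' Sigma grad
delta_var <- as.numeric(t(grad) %*% Sigma %*% grad)
delta_se  <- sqrt(delta_var)   # standard error is the square root of the variance

print(delta_var)   # 0.2118056
print(delta_se)
```

Because the covariance is zero here, the quadratic form reduces to (1/A)^2*Var(A) + (1/B)^2*Var(B), which is why the result matches the variance reported in the list.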
R functions/commands and keyboard shortcuts
R! is powerful and has rich packages and functions. It's impossible to build a list of functions/shortcuts to fit the purposes of all. Below are some functions/shortcuts related to my projects.
- Cheatsheets, The R Guide, R Reference Card
- Help functions: help()/?, apropos() finds all objects matching a pattern, find() shows the locations (packages) of found objects, methods(), example(), demo(), vignette(), args()
- Housekeeping functions: getwd(), setwd(), rm(list=ls()) removes all objects in the R environment, source("myRscript.r") runs the R code in the "myRscript.r" file, fix() modifies the original object, while edit() edits an object and returns a new object, download.file() downloads a file from the Internet, attach()/detach() objects, search() shows the current search path in order, install.packages(), update.packages(), remove.packages(), getOption("defaultPackages"), which can be changed by setting the option in startup code (e.g. in ~/.Rprofile), .libPaths()
- Numeric/character functions: length(), seq(), rep(), cut(), pretty(), cat(), substr(), grep(), sub(), strsplit(), paste(), toupper(), tolower()
- Data functions: read.table(), head(), tail(), str(), class(), length(), dim(), nrow(), ncol(), names(), levels(), length(), c(), cbind(), rbind(), append(), rep(), rev(), sort(), unique()
- Type functions: "is." for checking or "as." for conversion + numeric(), character(), vector(), matrix(), data.frame(), factor(), logical(), integer(). For example: is.numeric(), as.numeric()
- Mathematical functions: abs(), sqrt(), log(), log(x, base=n), log10(), exp(), prod(), factorial(), choose(), ceiling(), floor(), solve(), trunc(), round(), signif(), cos(), sin(), tan(), acos()
- Statistical functions: mean(), median(), sd(), var(), mad(), quantile(), range(), sum(), diff(), min(), max(), scale(), fivenum(), cumsum(), cumprod(), cummax(), cummin(), cor(), colSums(), rowSums(), colMeans(), rowMeans()
- Probability functions: the form is [d][p][q][r]distribution(). d, p, q, r are for (d)ensity, cumulative (p)robability/distribution function, (q)uantile function, and (r)andom generation, respectively. The distribution types can be: (norm)al, (beta), (binom)ial, (chisq)uared, (exp)onential, (logis)tic, (multinom)ial, (n)egative (binom)ial, (pois)son, (f), (gamma), (t), (unif)orm, etc. For example: dnorm(), pnorm(), qnorm(), rnorm()
- Statistical modeling functions
- Model functions: lm(), glm(), nls(), nls2(), lme() / nlme()
- Symbol formulas (y ~ A + B + C): ":" is for an interaction term, "*" is for complete interaction, "^" is for crossing to a specified degree, "." is a placeholder for all other variables except the dependent variable, "-" removes a variable from the equation, "-1" suppresses the intercept, "I()" has elements within the parentheses interpreted arithmetically
- Post-estimation functions: coef(), confint(), resid(), fitted(), summary(), predict(), deviance(), print(), plot(), formula(), anova(obj1, obj2), AIC(), vcov()
- Contrast functions: contr.helmert(), contr.poly(), contr.sum(), contr.treatment(), contr.SAS()
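As a rough illustration of the probability-function naming, the formula symbols, and the post-estimation functions listed above, here is a short sketch using only base R and the built-in mtcars data. The model choices (mpg regressed on wt and hp) and the object names are arbitrary assumptions made for the example.

```r
# d/p/q/r naming, shown for the normal distribution
d0 <- dnorm(0)        # density at 0
p  <- pnorm(1.96)     # cumulative probability P(X <= 1.96)
q  <- qnorm(0.975)    # quantile function, the inverse of pnorm()
set.seed(1)
x  <- rnorm(5)        # five random draws

# Formula symbols, using the built-in mtcars data
fit_add   <- lm(mpg ~ wt + hp, data = mtcars)       # main effects only
fit_int   <- lm(mpg ~ wt * hp, data = mtcars)       # wt + hp + wt:hp
fit_noint <- lm(mpg ~ wt - 1, data = mtcars)        # "-1" suppresses the intercept
fit_sq    <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # I() keeps ^ arithmetic

# Post-estimation functions
coef(fit_int)                 # estimated coefficients
confint(fit_add)              # confidence intervals
head(resid(fit_add))          # first few residuals
anova(fit_add, fit_int)       # compare nested models
AIC(fit_add, fit_int)         # information criteria
pred <- predict(fit_add, newdata = data.frame(wt = 3, hp = 110))
```

Note that `wt * hp` expands to `wt + hp + wt:hp`, so `fit_int` has four coefficients, including the intercept.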
RStudio shortcuts (you can modify them: Tools -> Modify Keyboard Shortcuts...)
- Alt + Shift + K: Show a Quick Reference
- Alt + -: Insert assignment operator "<-"
- Ctrl + Shift + M: Insert pipe operator "%>%" (I changed it to Ctrl + Shift + P)
- Ctrl + Alt + I: Insert chunk (R Notebook/Markdown)
- Ctrl + 1: Move cursor to source Editor window
- Ctrl + 2: Move cursor to Command window
- Ctrl + 3: Move cursor to Help window
- Ctrl + 4: Move cursor to History window
- Ctrl + 5: Move cursor to File window
- Ctrl + 6: Move cursor to Plots window
- ...
Choice of analytical language
I have mainly used three statistical languages, Stata, R, and SAS, for many years and for different purposes. My relative usage of the three has shifted from SAS-Stata-R to SAS-R-Stata and then to Stata-R-SAS. I am sometimes asked to recommend the better analytic language, which is always a hard and complicated question for me. I came across a blog post by Curtis Miller that is very thoughtful and helpful for making this kind of choice: "On Programming Languages; Why My Dad Went From Programming to Driving a Bus". Hopefully his story can help you make your own decision.
Interview with J.J. Allaire - the founder of RStudio
by Joseph Rickert
Welcome to “R Views”, the new R Community blog from RStudio. For this first post, I sat down with J.J. Allaire, RStudio’s founder and CEO, to discuss RStudio’s history, its mission and JJ’s vision for its future. In a short time, we touched on a wide range of subjects including RStudio’s business, the growth of the R language, the importance of the R Consortium to the R Community and J.J.’s advice to anyone coming to R for the first time. We hope you enjoy this “snapshot” of RStudio’s place in the R world. full text
You can also read a Chinese version here.
R! Books
- Oscar: Big Book of R collection
- RStudio: Cheatsheets
- W.N. Venables. An Introduction to R
- Hadley Wickham & Garrett Grolemund (2017). R for Data Science
- Hadley Wickham. Advanced R
- Winston Chang: R Graphics Cookbook, 2nd
- Christoph Hanck: Introduction to econometrics with R
- Neale Batra: R for applied epidemiology and public health
- David Dalpiaz: Applied Statistics with R (HTML version) GitHub
- Colin Gillespie. Efficient R programming
- Hadley Wickham (2015). R Packages
- Yihui Xie. bookdown: Authoring Books and Technical Documents with R Markdown
- Yihui Xie. R Markdown: The Definitive Guide
- Julia Silge: Text Mining with R
- Patrick Burns (2011). The R Inferno
- Daniel Navarro. Learning Statistics with R
- Trevor Hastie, Robert Tibshirani, Gareth James, Daniela Witten: An Introduction to Statistical Learning, with Applications in R, 2nd ed. (pdf), with the excellent self-paced video training course (here are the 15 hours of training videos compiled by the Data School (YouTube))
- Trevor Hastie (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Norman Matloff. The Art of R Programming (part)
- Michael Crawley (2007). The R Book
- Bolker (2007). Ecological models and data in R (2007 draft). Appendix (w/ delta method)
- Winston Chang. Cookbook for R
- Verzani. simpleR - Using R for Introductory Statistics
- Kerns. Introduction to Probability and Statistics Using R
- Peng. R Programming for Data Science, The Art of Data Science, Exploratory Data Analysis with R
- Yakir. Introduction to Statistical Thinking (With R, Without Calculus)
- Aragón. Population Health Data Science with R
- 赵鹏, 谢益辉, 黄湘云 现代统计图形 (Modern Statistical Graphics)
Where/how to configure R start-up environment
Several approaches can be used to customize the R working environment (options, library directories, etc.) at R start-up:
- Modify R's own profile file directly. The "Rprofile.site" file is under the directory ".\R directory name\etc\" (run "R.home()" or "Sys.getenv("R_HOME")" to find the R installation directory). At startup, R sources the "Rprofile.site" file, then looks for a user-defined ".Rprofile" file in the current working directory (run "getwd()" to find it) or, failing that, in the user's home directory (run "path.expand("~")" to find it). You can edit the "Rprofile.site" file or create a ".Rprofile" file to customize the startup. For more information see Initialization at start of an R session and Customizing startup. I am using R-Portable and prefer to create a ".Rprofile" in the same directory as the "R-Portable.exe" file. That way, I don't need to dig deep and edit R's original settings.
- to list all the options that can be set, run "names(options())"
- to show the value of an item, run "options("option name")", for example:
- "options("digits")" shows "$digits, [1] 7", which means numbers will be shown with 7 significant digits.
- "options("defaultPackages")" shows the packages attached by default when R starts up
- to modify the value of an option item, run "options(xxx=yyy)", for example:
- "options(digits=15)" changes the number of digits to 15. Note: this sets the number of significant digits shown, not the number of decimal places. To control decimal places, use something like "round(4/3, digits=2)" for 2 decimal places; unfortunately this cannot be set in "options()".
- to set the directory of a personal R library, create a ".Rprofile" file in the working directory, include ".libPaths(c(.libPaths(),"c:/myRlib directory name"))", and save it.
- or, edit "Rprofile.site" and add the line: ".libPaths(c(.libPaths(),"c:/myRlib directory name"))"
- When using RStudio as the IDE, modify the options file ("Options.R") under ".\Rstudio directory name\R\". The option settings there override the option settings in both R profiles, "Rprofile.site" and ".Rprofile".
- to set the directory of a personal R library, edit the file "Options.R", add the line: ".libPaths(c(.libPaths(),"c:/myRlib directory name"))", then save "Options.R".
- or, to use ".Rprofile", the file needs to be in the working directory used when not in a project (to set this master working directory in the RStudio GUI: Tools -> Global Options... -> change "Default working directory (when not in a project):"). You can also change R.home() under "R version:".
- By the way, the options and the directory of package library can also be changed after the start-up of R.
- de Vries (2015). Best practices for handling packages in R projects
- Gillespie. R startup
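The start-up settings described above could be sketched roughly as follows. This is a hedged illustration only: the lines are written so they can be run as an ordinary script rather than placed in a real ".Rprofile", and `my_lib` is a hypothetical personal library directory (a temporary folder here, standing in for something like "c:/myRlib directory name").

```r
# Lines of this kind could go in a ".Rprofile"; here they run as a plain script.
# my_lib is a hypothetical personal library directory (temp folder for the demo).
my_lib <- file.path(tempdir(), "myRlib")
dir.create(my_lib, showWarnings = FALSE, recursive = TRUE)

# Append the personal library to the library search path.
# .libPaths() silently drops directories that do not exist,
# which is why the directory is created first.
.libPaths(c(.libPaths(), my_lib))

# options(digits = ) controls significant digits, not decimal places
old <- options(digits = 15)        # options() returns the previous values
print(4 / 3)                       # printed with up to 15 significant digits
print(round(4 / 3, digits = 2))    # 1.33 (two decimal places via round())
options(old)                       # restore the previous setting
```

Keeping the previous values in `old` and restoring them is only needed in a script like this one; in an actual ".Rprofile" the `options()` call would normally be left in effect for the session.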