February 2, 2015

Should you teach Python or R for data science?

Last week, I published a post titled "Lessons learned from teaching an 11-week data science course," detailing my experiences and recommendations from teaching General Assembly's 66-hour introductory data science course.

In the comments, I received the following question:

I'm part of a team developing a course, with NSF support, in data science. The course will have no prerequisites and will be targeted for non-technical majors, with a goal to show how useful data science can be in their own area. Some of the modules we are developing include, for example, data cleansing, data mining, relational databases and NoSQL data stores. We are considering as tools the statistical environment R and Python and will likely develop two versions of this course. For now, we'd appreciate your sense of the relative merits of those two environments. We are hoping to get a sense of what would be more appropriate for computer and non computer science students, so if you have a sense of what colleagues that you know would prefer, that also would be helpful.

That's an excellent question! It doesn't have a simple answer (in my opinion) because both languages are great for data science, but one might be better than the other depending upon your students and your priorities.

At General Assembly in DC, we currently teach the course entirely in Python, though we used to teach it in both R and Python. I also mentor data science students in R, and I'm a teaching assistant for online courses in both R and Python. I enjoy using both languages, though I have a slight personal preference for Python specifically because of its machine learning capabilities (more details below).

Here are some questions that might help you (as educators or curriculum developers) to assess which language is a better fit for your students:

Do your students have experience programming in other languages?

If your students have some programming experience, Python may be the better choice because its syntax is more similar to other languages, whereas many programmers find R's syntax unintuitive. If your students don't have any programming experience, I think the two languages have an equivalent learning curve, though many people would argue that Python is easier to learn because its code reads more like regular human language.

Do your students want to go into academia or industry?

In academia, especially in the field of statistics, R is much more widely used than Python. In industry, the data science trend is slowly moving from R towards Python. One contributing factor is that companies using a Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in "productionizing" a data scientist's work.

Are you teaching "machine learning" or "statistical learning"?

The line between these two terms is blurry, but machine learning is concerned primarily with predictive accuracy over model interpretability, whereas statistical learning places a greater priority on interpretability and statistical inference. To some extent, R "assumes" that you are performing statistical learning and makes it easy to assess and diagnose your models. scikit-learn, by far the most popular machine learning package for Python, is more concerned with predictive accuracy. (For example, scikit-learn makes it very easy to tune and cross-validate your models and switch between different models, but makes it much harder than R to actually "examine" your models.) Thus, R is probably the better choice if you are teaching statistical learning, though Python also has a nice package for statistical modeling (Statsmodels) that duplicates some of R's functionality.

Do you care more about the ease with which students can get started in machine learning, or the ease with which they can go deeper into machine learning?

In R, getting started with your first model is easy: read your data into a data frame, use a built-in model (such as linear regression) along with R's easy-to-read formula language, and then review the model's summary output. In Python, getting started can be much more challenging simply because there are so many choices to make: How should I read in my data? Which data structure should I store it in? Which machine learning package should I use? What type of objects does that package allow as input? What shape should those objects be in? How do I include categorical variables? How do I access the model's output? (Et cetera.) Because Python is a general purpose programming language whereas R specializes in a smaller subset of statistically oriented tasks, those tasks tend to be easier to do (at least initially) in R.
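To make those choices concrete, here is a minimal sketch of one possible set of answers, assuming pandas for the data structure and scikit-learn for the model. The data, column names, and values are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; in practice this would usually come from a file.
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2400, 3000],
    "city": ["A", "B", "A", "B", "A", "B"],
    "price": [150, 230, 260, 360, 400, 520],
})

# scikit-learn expects numeric inputs, so the categorical column is converted
# to a 0/1 indicator column (one level is dropped to serve as the baseline).
X = pd.get_dummies(df[["sqft", "city"]], columns=["city"], drop_first=True, dtype=float)
y = df["price"]

model = LinearRegression()
model.fit(X, y)

# The fitted values are attributes of the model object, not a summary table.
coefficients = dict(zip(X.columns, model.coef_))
print(model.intercept_, coefficients)
```

Each commented step corresponds to one of the questions above, which is part of why the first model takes more effort in Python than in R.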

However, once you have mastered the basics of machine learning in Python (using scikit-learn), I find that machine learning is actually a lot easier in Python than in R. scikit-learn provides a clean and consistent interface to tons of different models. It provides you with many options for each model, but also chooses sensible defaults. Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly. It is also actively being developed.

In R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be under active development. (caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, but it's nowhere near as elegant a solution as scikit-learn.) In summary, machine learning in R tends to be a more tiresome experience than machine learning in Python once you have moved beyond the basics. As such, Python may be a better choice if students are planning to go deeper into machine learning.
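As a rough illustration of what that consistent interface buys you, the sketch below cross-validates three unrelated scikit-learn models in a single loop. The synthetic data and the particular models and settings are my own choices for the example, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so the sketch is self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Every estimator exposes the same fit/predict interface, so one loop can
# cross-validate all of them without any model-specific code.
scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5)  # R^2 is the default for regressors
    scores[name] = cv_scores.mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

Adding a fourth model means adding one line to the dictionary, rather than learning a new package.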

Do your students care about learning a "sexy" language?

R is not a sexy language. It feels old, and its website looks like it was created around the time the web was invented. Python is the "new kid" on the data science block, and has far more sex appeal. From a marketing perspective, Python may be the better choice simply because it will attract more students.

How computer savvy are your students?

Installing R is a simple process, and installing RStudio (the de facto IDE for R) is just as easy. Installing new packages or upgrading existing packages from CRAN (R's package repository) is a trivial process within RStudio, and even installing packages hosted on GitHub is a simple process thanks to the devtools package.

By comparison, Python itself may be easy to install, but installing individual Python packages can be much more challenging. In my classroom, we encourage students to use the Anaconda distribution of Python, which includes nearly every Python package we use in the course and has a package management system similar to CRAN. However, Anaconda installation and configuration problems are still common in my classroom, whereas these problems were much rarer when using R and RStudio. As such, R may be the better choice if your students are not computer savvy.

Is data cleaning a focus of your course?

Data cleaning (also known as "data munging") is the process of transforming your raw data into a more meaningful form. I find data cleaning to be easier in Python because of its rich set of data structures, as well as its far superior implementation of regular expressions (which are often necessary for cleaning text).
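As a small example of the kind of text cleaning I have in mind, the sketch below uses Python's built-in re module to normalize inconsistently formatted phone numbers. The sample values and the clean_phone helper are hypothetical.

```python
import re

# Hypothetical raw values, as they might appear in a hand-entered column.
raw_phones = ["(202) 555-0143", "202.555.0178", " 202 555 0199 ", "n/a"]

def clean_phone(value):
    """Return a 10-digit phone number as 'XXX-XXX-XXXX', or None if invalid."""
    digits = re.sub(r"\D", "", value)  # strip every non-digit character
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

cleaned = [clean_phone(p) for p in raw_phones]
print(cleaned)
# ['202-555-0143', '202-555-0178', '202-555-0199', None]
```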

Is data exploration a focus of your course?

The pandas package in Python is an extremely powerful tool for data exploration, though its power and flexibility can also make it challenging to learn. R's dplyr is more limited in its capabilities than pandas (by design), though I find that its more focused approach makes it easier to figure out how to accomplish a given task. In addition, dplyr's syntax is more readable and thus is easier for me to remember. Although it's not a clear differentiator, I would consider R a slightly easier environment for getting started in data exploration due to the ease of learning dplyr.
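For a sense of what exploration looks like in pandas, here is a minimal sketch that filters, groups, aggregates, and sorts a made-up data frame. It is only one of several ways to write the same thing in pandas, which is part of what makes the package harder to learn.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "bird"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 0.5],
})

# Filter rows, group, aggregate, and sort: roughly the pandas counterpart of
# a dplyr pipeline using filter(), group_by(), summarise(), and arrange().
summary = (
    df[df["weight_kg"] > 1.0]
    .groupby("species")["weight_kg"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```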

Is data visualization a focus of your course?

R's ggplot2 is an excellent package for data visualization. Once you understand its core principles (its "grammar of graphics"), it feels like the most natural way to build your plots, and it becomes easy to produce sophisticated and attractive plots. Matplotlib is the de facto standard for scientific plotting in Python, but I find it tedious both to learn and to use. Alternatives like Seaborn and pandas plotting still require you to know some Matplotlib, and the alternative that I find most promising (ggplot for Python) is still early in development. Therefore, I consider R the better choice for data visualization.

Is Natural Language Processing (NLP) part of your curriculum?

Python's Natural Language Toolkit (NLTK) is a mature, well-documented package for NLP. TextBlob is a simpler alternative, spaCy is a brand new alternative focused on performance, and scikit-learn also provides some supporting functionality for text-based feature extraction. In comparison, I find R's primary NLP framework (the tm package) to be significantly more limited and harder to use. Even if there are additional R packages that can fill in the gaps, there isn't one comprehensive package that you can use to get started, and thus Python is the better choice for teaching NLP.
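As one example of the supporting functionality in scikit-learn, the sketch below turns two made-up sentences into a document-term matrix using CountVectorizer with its default settings. It is a minimal illustration, not a full NLP workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents for illustration.
docs = [
    "Python is great for text",
    "R is great for statistics",
]

# By default, CountVectorizer lowercases the text and keeps only tokens of
# two or more characters, so the single-letter token "R" is dropped.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

The resulting matrix has one row per document and one column per term, which is the numeric input format that scikit-learn's models expect.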

If you are a data science educator, or even just a data scientist who uses R or Python, I'd love to hear from you in the comments! On which points above do you agree or disagree? What are some important factors that I have left out? What language do you teach in the classroom, and why?

I look forward to this conversation!
