March 2, 2014
Useful resources for learning data science fundamentals
This post is a collection of resources that I found particularly useful when I was learning the fundamentals of data science.
- Beginner's guide to R (tutorial): This six-part tutorial can be worked through in a day, and hits the sweet spot for beginners by giving you enough information to understand what you are doing without overwhelming you with details. It covers setup, reading in data, basic data analysis, visualization (with ggplot2!), syntax quirks, and useful resources.
- R programming for those coming from other languages (article): If you know another programming language, this article gives some helpful context for key areas in which R is different.
- Computing for Data Analysis (4-week online course): An excellent and thorough introduction to R by Roger Peng of Johns Hopkins. Dr. Peng has a deep understanding of the language, uses good coding practices, and provides a good balance of theory and practice. His lecture videos are packed with information and I highly recommend them. (This course is unlikely to be offered again, having been split into multiple courses for Coursera's upcoming Data Science Specialization program.)
- Introduction to ggplot2 (tutorial): This one-page tutorial teaches the fundamentals of the ggplot2 package in a thoughtful order and includes a ton of useful example graphics.
- Learn Python (tutorial): Codecademy's popular introduction to Python contains 20+ modules and 200+ interactive exercises. Although it's geared towards novice programmers and thus glosses over details that I would have found helpful, it is still a useful first course in Python.
- Google's Python class (tutorial): A bundle of written materials, video lectures, and programming assignments from an introductory two-day Python class at Google. It was a good follow-up to the Codecademy course, providing less breadth than Codecademy but more depth on the most important Python topics.
- Sams Teach Yourself SQL in 10 Minutes (book): A concise, well-written introduction to SQL that can easily be worked through in a day. The majority of the book focuses on retrieving, sorting, filtering, summarizing, and joining data, which are the most important SQL operations for data scientists.
  - Download the practice database: All of the example code in the book uses a database provided by the author, available for download in a dozen popular formats.
  - I ran the example code with SQLite, which has some significant advantages over other database engines (especially for getting started with SQL): it's free, serverless, and requires no configuration.
- Try-SQL Editor (playground): If you know some SQL and just need a place to practice your queries, this is a lightweight web application that allows you to run queries on a toy database and reset it at any time.
- Statistical Learning (9-week online course): Taught by Trevor Hastie and Rob Tibshirani of Stanford using their new "Introduction to Statistical Learning" textbook. The course covers a wide range of supervised learning methods and a few unsupervised learning methods. For each method, they explain the math and concepts behind it and then work through an example implementation in R. Although this course skewed a bit heavy on math and light on application for my taste, the textbook is fantastic and they are clearly masters of this material.
- Machine Learning applications (links): This is a curated list of links to news articles and research papers about how machine learning has been used to solve interesting, real-world problems.
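
To make the SQL items above a bit more concrete, here is a minimal sketch of what "serverless, no configuration" looks like in practice, using Python's built-in sqlite3 module. The `customers` and `orders` tables and their rows are made up for illustration (they are not the book's practice database), and the two queries touch the operations mentioned above: retrieving, filtering, sorting, joining, and summarizing.

```python
import sqlite3

# An in-memory SQLite database: no server to start, nothing to configure.
conn = sqlite3.connect(":memory:")

# Hypothetical toy tables, invented for this example only.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana', 'Boston'), (2, 'Ben', 'Chicago'), (3, 'Cy', 'Boston');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 15.0), (4, 3, 50.0);
""")

# Retrieve, filter (WHERE), and sort (ORDER BY).
boston = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name DESC",
    ("Boston",),
).fetchall()

# Join the two tables and summarize (SUM with GROUP BY) spending per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
""").fetchall()

print(boston)
print(totals)
conn.close()
```

Swapping `":memory:"` for a file path would keep the data on disk between runs, which is closer to how a downloaded practice database would be used.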
Git, Git Bash, and GitHub:
- Git and GitHub for Beginners (video series): I created these videos to provide beginners with an approachable introduction to Git and GitHub, and how to use them together.
- Pro Git (online book): The first three chapters provided a thoughtful introduction to Git. It was the only Git resource I found that taught the concepts (as well as the code) in an approachable and logical way.
- Simple guide to forks in GitHub and Git (article): This is a short post I wrote to explain forking, the most fundamental GitHub concept, in two simple diagrams.
- GitHub Bootcamp (articles): A concise walk-through of GitHub basics, with well-explained Git code.
Data Science (general):
- General Assembly Data Science course (11-week in-person course): I'm taking this course now, and it covers a ton of data science topics in both R and Python.
- Coursera Data Science Specialization (series of nine month-long online courses): I have just begun taking the first few courses.
- Analyzing the Analyzers (short e-book): The best overview I have read of data science roles, skillsets, and career paths.
Please let me know if you have any questions or suggestions!