How to get better at data science
Recently, I finished teaching General Assembly's 11-week data science course for the fourth time. The goal of the course is to enable students to apply the entire data science workflow (using Python) to problems that interest them: forming a question, gathering and cleaning data, exploring and visualizing the data, building and evaluating machine learning models, and communicating results. The typical student is a working professional with some experience handling data, limited programming experience, and basic statistical knowledge. (Here are the course materials, here's why we teach Python, and here are my lessons learned from the first time I taught the course.)
At the end of every course, the most common question I receive from students is this:
How can I continue to improve my data science skills?
Below is the advice I give to my students. How would you answer this question?
My advice to students
Here is my best advice for getting better at data science: Find "the thing" that motivates you to practice what you learned and to learn more, and then do that thing. That could be personal data science projects, Kaggle competitions, online courses, reading books, reading blogs, attending meetups or conferences, or something else.
If you create your own data science projects, I'd encourage you to share them on GitHub and include writeups. That will help to show others that you know how to do proper data science.
Kaggle competitions are a great way to practice data science without coming up with the problem yourself. Don't worry about how high you place; just focus on learning something new with every competition. Read the forums as much as you can, because you'll learn a lot, but not at the expense of working on the competition yourself. Also, keep in mind that you won't be practicing important parts of the data science workflow, namely generating questions, gathering data, and communicating results.
There are many online courses to consider, and new ones are being created all the time:
- Coursera's Data Science Specialization is 9 courses, plus a capstone project. There is a lot of overlap with General Assembly's course, and course quality varies, but you would definitely learn a lot of R.
- Coursera's Machine Learning is Andrew Ng's highly regarded course. It goes deeper into many topics we covered, and covers many topics we didn't. Keep in mind that it focuses only on machine learning (not the entire data science workflow), the programming assignments use MATLAB/Octave, and it requires some understanding of linear algebra. Browse these lecture notes (compiled by a student) for a preview of the course.
- Stanford's Statistical Learning also covers some topics that we did not. It focuses on teaching machine learning at a conceptual (rather than mathematical) level, when possible. The course may be offered again in 2016, but the real gem from the course is the book and videos (linked below).
- Caltech's Learning from Data teaches machine learning at a theoretical and conceptual level. The lectures and slides are excellent. The homework assignments are not interactive, and the course does not use a specific programming language.
- Udacity's Data Analyst Nanodegree looks promising, but I don't know anyone who has done it.
- edX's Introduction to Computer Science and Programming Using Python is apparently an excellent course if you want to get better at programming in Python.
- CourseTalk is useful for reading reviews of online courses.
- I also teach my own online courses, which will range in level from beginner to advanced. (Subscribe to my email newsletter to be notified when courses are announced.)
- Some additional courses are listed in the Additional Resources section of the course repository.
Here is just a tiny selection of books:
- An Introduction to Statistical Learning with Applications in R is my favorite book on machine learning because of the thoughtful way in which the material is presented. The Statistical Learning course linked above uses it as the course textbook, and the related videos are available on YouTube.
- Elements of Statistical Learning is by the same authors. It covers a wider variety of topics, and in greater mathematical depth.
- Python for Data Analysis was written by the creator of pandas, and is especially useful if you want to go deeper into pandas and NumPy.
- Python Machine Learning is coming out in October 2015. The author, Sebastian Raschka, is an excellent writer and has a deep understanding of both machine learning and scikit-learn, so I expect it will be worth reading.
There is an overwhelming number of data science blogs and newsletters. Data Elixir is the best newsletter, though the O'Reilly Data Newsletter and Python Weekly are also good. Notable blogs include: Practical Business Python (accessible Python content), Simply Statistics (a bit more academic), FastML (machine learning content), Win-Vector blog (great data science advice), FiveThirtyEight (data journalism), and Data School (my blog).
Some notable data science conferences are KDD, Strata, PyCon, PyData, and SciPy. (You should also search for data-related meetups in your local community!)
If you want to go full-time with your data science education, read this guide to data science bootcamps. Or, check out this massive list of colleges and universities with data science-related degrees.
What's your advice?
I'd love to hear from you in the comments, whether it's to share an additional resource or piece of advice, to discuss one of my recommendations, or just to let me know that you found something useful here!