How to launch your data science career

Welcome! If you're interested in the exciting world of data science, but don't know where to start, Data School is here to help.


Step 0: Figure out what you need to learn

Data science can be an overwhelming field. Many people will tell you that you can't become a data scientist until you master the following: statistics, linear algebra, calculus, programming, databases, distributed computing, machine learning, visualization, experimental design, clustering, deep learning, natural language processing, and more. That's simply not true.

So, what exactly is data science? It's the process of asking interesting questions, and then answering those questions using data. Generally speaking, the data science workflow looks like this:

This workflow doesn't necessarily require advanced mathematics, a mastery of deep learning, or many of the other skills listed above. But it does require knowledge of a programming language and the ability to work with data in that language. And although you need mathematical fluency to become really good at data science, you only need a basic understanding of mathematics to get started.

It's true that the other specialized skills listed above may one day help you to solve data science problems. However, you don't need to master all of those skills to begin your career in data science. You can begin today, and I'm here to help you!


Step 1: Get comfortable with Python

Python and R are both great choices as programming languages for data science. R tends to be more popular in academia, and Python tends to be more popular in industry, but both languages have a wealth of packages that support the data science workflow. I've taught data science in both languages, and generally prefer Python. (Here's why.)

You don't need to learn both Python and R to get started. Instead, you should focus on learning one language and its ecosystem of data science packages. If you've chosen Python (my recommendation), you may want to consider installing the Anaconda distribution because it simplifies the process of package installation and management on Windows, macOS, and Linux.

You also don't need to become a Python expert to move on to step 2. Instead, you should focus on mastering the following: data types, data structures, imports, functions, conditional statements, comparisons, loops, and comprehensions. Everything else can wait until later!
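As a rough self-check, here is a minimal sketch that touches each of the topics listed above in one short script. The records and the age_group function are made up for illustration, and the script uses only the Python standard library. If you can read it comfortably, you're probably in good shape.

```python
# Imports: pull in functionality from the standard library
from statistics import mean

# Data types and data structures: a list of dictionaries (made-up records)
people = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": 19},
    {"name": "Caz", "age": 52},
]


# Functions, conditional statements, and comparisons
def age_group(age):
    """Return a coarse label for an age in years (hypothetical cutoffs)."""
    if age < 21:
        return "younger"
    elif age < 50:
        return "middle"
    else:
        return "older"


# Loops: count how many people fall into each group
counts = {}
for person in people:
    group = age_group(person["age"])
    counts[group] = counts.get(group, 0) + 1

# Comprehensions: build a new list in a single expression
names_over_21 = [p["name"] for p in people if p["age"] > 21]

# A generator expression passed straight to a function
average_age = mean(p["age"] for p in people)

print(counts)
print(names_over_21)
print(average_age)
```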

If you're not sure whether you know "enough" Python, scan through my Python Quick Reference. If most of that material is familiar to you, you can move on to step 2!

If you're looking for a course to help you learn Python, here are a few recommendations:


Step 2: Learn data analysis, manipulation, and visualization with pandas

For working with data in Python, you should learn how to use the pandas library.

pandas provides a high-performance data structure (called a "DataFrame") that is suitable for tabular data with columns of different types, similar to an Excel spreadsheet or SQL table. It includes tools for reading and writing data, handling missing data, filtering data, cleaning messy data, merging datasets, visualizing data, and so much more. In short, learning pandas will significantly increase your efficiency when working with data.
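To make that concrete, here is a small sketch of a few of those tasks: handling missing data, filtering, merging, and summarizing. The two tables and their column names are invented for this example. In practice you would usually load data from a file (for instance with pd.read_csv) rather than typing it in.

```python
import pandas as pd

# Made-up data: columns of different types, with one missing amount
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ana", "ben", "ana", "caz"],
    "amount": [20.0, float("nan"), 35.5, 12.0],
})
customers = pd.DataFrame({
    "customer": ["ana", "ben", "caz"],
    "city": ["Lima", "Oslo", "Pune"],
})

# Handle missing data: treat the missing amount as 0 (an assumption
# made for this example; the right choice depends on your data)
orders["amount"] = orders["amount"].fillna(0)

# Filter rows: keep only orders of at least 15
large_orders = orders[orders["amount"] >= 15]

# Merge datasets: attach each customer's city to their orders
merged = orders.merge(customers, on="customer", how="left")

# Summarize: total order amount per city
totals = merged.groupby("city")["amount"].sum()

print(large_orders)
print(totals)
```

Each step returns a new DataFrame or Series, so you can inspect the intermediate results as you go.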

However, pandas includes an overwhelming amount of functionality, and (arguably) provides too many ways to accomplish the same task. Those characteristics can make it challenging to learn pandas and to discover best practices.

That's why I created a pandas course (7 hours) that teaches the pandas library from the ground up. Each video answers a question using a real dataset, and the datasets are posted online so you can follow along at home.

"Your videos are extremely helpful. I like that you use actual data sets and try a lot of different applications of the concept being discussed rather than just overly simplistic examples. Your content has helped me immensely!" - Sean Montague

If you're already an intermediate pandas user, you may want to learn my top 25 pandas tricks or learn about best practices with pandas.

If you would prefer a non-video resource for learning pandas, here are my recommended resources.


Step 3: Learn machine learning with scikit-learn

For machine learning in Python, you should learn how to use the scikit-learn library.

Building "machine learning models" to predict the future or automatically extract insights from data is the sexy part of data science. scikit-learn is the most popular library for machine learning in Python, and for good reason:

However, machine learning is still a highly complex and rapidly evolving field, and scikit-learn has a steep learning curve. That's why I created a free scikit-learn course (4 hours), which will help you to gain a thorough grasp of both machine learning fundamentals and the scikit-learn workflow. The series doesn't presume any familiarity with machine learning or advanced mathematics. (You can find all of the code from the course on GitHub.)
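If you're curious what the scikit-learn workflow looks like in code, here is a minimal sketch. The dataset (iris, which ships with scikit-learn) and the model (K-nearest neighbors with 5 neighbors) are my choices for this illustration and are not necessarily what the course uses. The same pattern of splitting the data, fitting a model, predicting, and evaluating applies to most scikit-learn estimators.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset: X holds the features, y the labels
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Choose a model and fit it to the training data
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict labels for the held-out data and measure accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {accuracy:.2f}")
```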

"Your videos are absolutely incredible. I have just completed the course on Machine Learning with Python and I can say I understood every single thing thanks to your excellent teaching style and skills." - Guillaume B

Once you've finished the course, you should consider enrolling in my follow-up course, Building an Effective Machine Learning Workflow with scikit-learn.

If you would prefer a non-video resource for learning scikit-learn, I recommend either Python Machine Learning (Amazon / GitHub) or Introduction to Machine Learning with Python (Amazon / GitHub).


Step 4: Understand machine learning in more depth

Machine learning is a complex field. Although scikit-learn provides the tools you need to do effective machine learning, it doesn't directly answer many important questions:

If you want to become great at machine learning, you need to be able to answer those questions, which requires both experience and further study. Here are some resources to help you along that path:


Step 5: Keep learning and practicing

Here is my best advice for improving your data science skills: Find "the thing" that motivates you to practice what you've learned and to learn more, and then do that thing. That could be personal data science projects, Kaggle competitions, online courses, reading books, reading blogs, attending meetups or conferences, or something else!

Your data science journey has only begun! There is so much to learn in the field of data science that it would take more than a lifetime to master. Just remember: You don't have to master it all to launch your data science career. You just have to get started!


Join Data School (for free!)

My name is Kevin Markham, and I'm the founder of Data School. I'd be honored if you would join the Data School community by subscribing to the email newsletter:

As a subscriber, you'll receive priority access to my online courses and live webcasts, and you'll get notified about new Data School tutorials and videos.

Have a question? Feel free to email me: kevin@dataschool.io. I read every email!

Thank you so much for reading!