Data science best practices with pandas (video tutorial)
The pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the pandas library make it challenging to discover the best way to accomplish any given task.
In this in-depth tutorial, which I presented at PyCon 2019, you'll use pandas to answer questions about a real-world dataset. Through each exercise, you'll learn important data science skills as well as "best practices" for using pandas. By the end of the tutorial, you'll be more fluent at using pandas to correctly and efficiently answer your own data science questions.
This is an intermediate-level tutorial, so if you're new to pandas, I recommend starting with my other video series: Easier data analysis with pandas.
If you want to follow along with the exercises at home, you can download the dataset and notebook from GitHub.
Here are some of the topics covered in the video:
- adjusting for bias in your dataset
- handling missing values
- choosing an appropriate plot
- customizing your plot
- using the datetime data type
- filtering using loc versus query
- using multiple aggregation functions
- checking for small sample sizes
- method chaining
- verifying your results using random samples
- evaluating a "stringified" Python container
- applying a custom function to a Series
- writing lambda functions
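To give a flavor of a few of these topics, here is a minimal sketch. It uses a tiny made-up DataFrame, not the dataset from the tutorial, and the column names (event, published, views, tags) and the count_tags helper are hypothetical. It is one reasonable way to do each task, not necessarily the approach shown in the video.

```python
import ast

import pandas as pd

# Hypothetical toy data; this is NOT the dataset used in the tutorial.
df = pd.DataFrame({
    "event": ["talk A", "talk B", "talk C", "talk D"],
    "published": ["2019-05-01", "2019-05-02", "2019-06-10", "2019-06-11"],
    "views": [1200, 300, 4500, 800],
    "tags": ["['python', 'pandas']", "['data']", "[]", "['python']"],
})

# Using the datetime data type: convert strings so the .dt accessor works.
df["published"] = pd.to_datetime(df["published"])
df["month"] = df["published"].dt.month

# Filtering using loc versus query: both select rows with more than 500 views.
with_loc = df.loc[df["views"] > 500]
with_query = df.query("views > 500")

# Evaluating a "stringified" Python container: ast.literal_eval parses the
# string into a real list without executing arbitrary code.
df["tags"] = df["tags"].apply(ast.literal_eval)


# Applying a custom function to a Series (count_tags is a hypothetical helper).
def count_tags(tag_list):
    return len(tag_list)


df["num_tags"] = df["tags"].apply(count_tags)

# The same computation written as a lambda function.
num_tags_lambda = df["tags"].apply(lambda tags: len(tags))

# Using multiple aggregation functions, written as a method chain. Including
# "count" puts the sample size next to each mean, so small groups are visible.
summary = (
    df.groupby("month")["views"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(summary)
```

Showing the count alongside the mean is a cheap way to check for small sample sizes before trusting an aggregate.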
Let me know if you have any questions. I'm happy to answer them!
P.S. If you like this video, you should check out my interactive pandas course, Analyzing Police Activity with pandas.
NEW VIDEO: Learn how to write better, more efficient #pandas code 🐼
Download the dataset to follow along with the exercises:
Become more fluent at using pandas to answer your own #DataScience questions! #Python pic.twitter.com/eTNKuNihif
Kevin Markham (@justmarkham), May 23, 2019