Here's what I want in a reference guide:

- **High-quality examples** that show the simplest possible usage of a given feature
- **Explanatory comments**, and descriptive variable names that eliminate the need for some comments
- Presented as a **single script (or notebook)**, so that I can keep it open and search it when needed
- **Code that can be run** from top to bottom, with the relevant objects defined nearby

This is **not** written as a full-fledged Python tutorial, though I ordered the topics such that you can read it like a tutorial (i.e., each topic depends only on material preceding it).

The guide was written using Python 2 but is **fully compatible** with Python 3. Relevant differences between Python 2 and 3 are noted throughout the guide.

You can view it as a Python script on GitHub. It's also **embedded** below this blog post.

You can view it as a Jupyter notebook on nbviewer.

If you want to **save a copy** of either the script or the notebook, just clone or download the GitHub repository.

Click to jump to the relevant section of the script or the notebook:

- Imports (script, notebook)
- Data Types (script, notebook)
- Math (script, notebook)
- Comparisons and Boolean Operations (script, notebook)
- Conditional Statements (script, notebook)
- Lists (script, notebook)
- Tuples (script, notebook)
- Strings (script, notebook)
- Dictionaries (script, notebook)
- Sets (script, notebook)
- Defining Functions (script, notebook)
- Anonymous (Lambda) Functions (script, notebook)
- For Loops and While Loops (script, notebook)
- Comprehensions (script, notebook)
- Map and Filter (script, notebook)

If you like the general format of this guide, but need **more explanation of each topic**, I highly recommend reading the Appendix of Python for Data Analysis. It presents the essentials of the Python language in a clear and focused manner.

If you are looking for a resource that will help you to **learn Python from scratch**, this is my list of recommended resources.

If there's a **topic or example** you'd like me to add to this guide, or you notice a **mistake**, please create a GitHub issue or let me know in the comments section below!

Finally! Created a Jupyter notebook version of my #Python Quick Reference and made compatible with Python 2 & 3: https://t.co/zJbhhmhBZm

— Kevin Markham (@justmarkham) October 13, 2016

I recently launched a video series about "pandas", a popular Python library for **data analysis, manipulation, and visualization**. But for those of you who want to learn pandas and prefer the written word, I've compiled my list of **recommended resources:**

Intro to pandas data structures: This is the first post in Greg Reda's classic three-part pandas tutorial (part 2, part 3). It's highly readable, presents the "right" level of detail for a pandas beginner, and includes lots of useful examples.

Introduction to Pandas / Data Wrangling with Pandas / Plotting with Pandas: Three **extremely long** (but well-written) Jupyter notebooks from Chris Fonnesbeck's Advanced Statistical Computing course at Vanderbilt University (my alma mater!). If you want to go deep into the details and learn about many powerful pandas features, these notebooks are for you.

Python for Data Analysis: This book was written by the creator of pandas, Wes McKinney, back in 2012. It covers IPython, NumPy, and pandas, and also includes an excellent appendix of "Python Language Essentials". It's still probably the best pandas book out there, though it might be worth waiting to buy until **early 2017**, when the second edition is released. (Wes is currently accepting suggestions for the book!)

Common Excel Tasks Demonstrated in Pandas: If you're coming from an Excel background, this post (and part 2) may help you to build a mental model for how pandas "thinks". It's from Chris Moffitt's excellent blog, Practical Business Python.

Translating SQL to pandas: This Jupyter notebook from Greg Reda may be helpful if you are transitioning from SQL to pandas. (Here's the related video presentation.)

Modern Pandas: This is a recent seven-part series by Tom Augspurger, a contributor to pandas, primarily targeting **intermediate pandas users** who want to make their code more modern and idiomatic.

If you prefer reading **code snippets** (rather than articles or books) to learn a language, you might like Mark Graph's 10-page Cheat sheet to the pandas DataFrame object or Chris Albon's Data Wrangling code samples.

All of the code from my pandas video series is available for you to browse, in a well-commented Jupyter notebook.

What excellent pandas resources did I miss? Let me know in the comments section below!

**P.S.** Want to be the first to know when I launch an **online course about pandas?** Subscribe to the Data School newsletter.

Top 8 resources for learning #Python pandas: https://t.co/jIyGuuqCsy featuring @wesmckinn @fonnesbeck @gjreda @TomAugspurger @chrisalbon ...

— Kevin Markham (@justmarkham) May 17, 2016

**Summary:** If you're working with data in Python, learning pandas will make your life easier! I love teaching pandas, and so I created a video series targeted at beginners. There are currently 30 videos in the series.

pandas is a powerful, open source Python library for **data analysis, manipulation, and visualization**. If you're working with data in Python and you're not using pandas, you're probably working too hard!

**There are many things to like about pandas:** It's well-documented, has a huge amount of community support, is under active development, and plays well with other Python libraries (such as matplotlib, scikit-learn, and seaborn).

**There are also things you might not like:** pandas has an overwhelming amount of functionality (so it's hard to know where to start), and it provides too many ways to accomplish the same task (so it's hard to figure out the best practices).

That's why I created this series. I've been using and teaching pandas for a long time, and so I know how to explain pandas in a way that is **understandable to novices**.

You don't need to have **any pandas experience** to benefit from this series, but you do need to know the basics of Python.

In each video, I answer a question from one of my students using a real dataset. Since I've posted the data online, and pandas can read files directly from a URL, **you can follow along with every video at home!**

Every video in the series is embedded below. There are currently 30 videos in the series, but more may be added in the future. (Subscribe on YouTube for notifications.)

There's also a well-commented Jupyter notebook containing the code from every video, and a GitHub repository containing all of the datasets.

Do you have a question about pandas, or a task you would like to accomplish? **Let me know in the comments section!**

Just launched a new pandas Q&A video series! New videos every Tues/Thurs, 30+ videos planned https://t.co/7ZAguJKZzR pic.twitter.com/RfygnF1sEQ

— Kevin Markham (@justmarkham) April 8, 2016

- What is pandas? (Introduction to the Q&A series) (6:24)
- How do I read a tabular data file into pandas? (8:54)
- How do I select a pandas Series from a DataFrame? (11:10)
- Why do some pandas commands end with parentheses (and others don't)? (8:45)
- How do I rename columns in a pandas DataFrame? (9:36)
- How do I remove columns from a pandas DataFrame? (6:35)
- How do I sort a pandas DataFrame or a Series? (8:56)
- How do I filter rows of a pandas DataFrame by column value? (13:44)
- How do I apply multiple filter criteria to a pandas DataFrame? (9:51)
- Your pandas questions answered! (9:06)
- How do I use the "axis" parameter in pandas? (8:33)
- How do I use string methods in pandas? (6:16)
- How do I change the data type of a pandas Series? (7:28)
- When should I use a "groupby" in pandas? (8:24)
- How do I explore a pandas Series? (9:50)
- How do I handle missing values in pandas? (14:27)
- What do I need to know about the pandas index? (Part 1) (13:36)
- What do I need to know about the pandas index? (Part 2) (10:38)
- How do I select multiple rows and columns from a pandas DataFrame? (21:46)
- When should I use the "inplace" parameter in pandas? (10:18)
- How do I make my pandas DataFrame smaller and faster? (19:05)
- How do I use pandas with scikit-learn to create Kaggle submissions? (13:25)
- More of your pandas questions answered! (19:23)
- How do I create dummy variables in pandas? (13:13)
- How do I work with dates and times in pandas? (10:20)
- How do I find and remove duplicate rows in pandas? (9:47)
- How do I avoid a SettingWithCopyWarning in pandas? (13:29)
- How do I change display options in pandas? (14:55)
- How do I create a pandas DataFrame from another object? (14:25)
- How do I apply a function to a pandas Series or DataFrame? (17:57)
- **Bonus:** Your pandas questions answered! (webcast) (1:56:01)

pandas is a full-featured Python library for data analysis, manipulation, and visualization. This video series is for anyone who wants to work with data in Python, regardless of whether you are brand new to pandas or have some experience. Each video will answer a student question about pandas using a real dataset, which is available online so you can follow along!

"Tabular data" is just data that has been formatted as a table, with rows and columns (like a spreadsheet). You can easily read a tabular data file into pandas, even directly from a URL! In this video, I'll walk you through how to do that, including how to modify some of the default arguments of the read_table function to solve common problems.
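Here's a minimal sketch of that workflow. The order data and column names below are invented, and an inline string stands in for the URL used in the video; note that in current pandas, `read_csv` (with a `sep` argument) covers the same ground as `read_table`:

```python
from io import StringIO
import pandas as pd

# Simulated file contents; in the video, pandas reads directly from a URL
data = StringIO("order_id|item_name|price\n1|Chips|2.39\n2|Salsa|3.39")

# sep overrides the default delimiter; header=0 uses the first row as column names
orders = pd.read_csv(data, sep="|", header=0)
print(orders.shape)  # (2, 3)
```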

DataFrames and Series are the two main object types in pandas for data storage: a DataFrame is like a table, and each column of the table is called a Series. You will often select a Series in order to analyze or manipulate it. In this video, I'll show you how to select a Series using "bracket notation" and "dot notation", and will discuss the limitations of dot notation. I'll also demonstrate how to create a new Series in a DataFrame.
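A small illustration of the two notations (the city data here is invented):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Austin", "Dallas"], "population": [950000, 1300000]})

# Bracket notation always works
pop_bracket = df["population"]

# Dot notation is equivalent, but fails for column names containing spaces,
# names that clash with DataFrame attributes, and when creating a new column
pop_dot = df.population

# Creating a new Series in the DataFrame requires bracket notation
df["label"] = df["city"] + " metro"

print(pop_bracket.equals(pop_dot))  # True
```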

To access most of the functionality in pandas, you have to call the methods and attributes of DataFrame and Series objects. In this video, I'll discuss some common methods and attributes, and show you how to tell the difference between them. (Hint: It's all about the parentheses!)

You will often want to rename the columns of a DataFrame so that their names are descriptive, easy to type, and don't contain any spaces. In this video, I'll demonstrate three different strategies for renaming columns so that you can choose the best strategy to fit your particular situation.
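A sketch of three common renaming strategies (the column names below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Order Id": [1, 2], "Item Name": ["Chips", "Salsa"]})

# Strategy 1: rename specific columns with a dictionary
df1 = df.rename(columns={"Order Id": "order_id", "Item Name": "item_name"})

# Strategy 2: overwrite all of the column names at once
df2 = df.copy()
df2.columns = ["order_id", "item_name"]

# Strategy 3: transform the existing names (e.g., replace spaces, lowercase)
df3 = df.copy()
df3.columns = df3.columns.str.replace(" ", "_").str.lower()

print(list(df1.columns))  # ['order_id', 'item_name']
```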

If you have DataFrame columns that you're never going to use, you may want to remove them entirely in order to focus on the columns that you do use. In this video, I'll show you how to remove columns (and rows), and will briefly explain the meaning of the "axis" and "inplace" parameters.

pandas allows you to sort a DataFrame by one of its columns (known as a "Series"), and also allows you to sort a Series alone. The sorting API changed in pandas version 0.17, so in this video, I'll demonstrate both the "old way" and the "new way" to sort. I'll also show you how to sort a DataFrame by multiple columns at once!
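The "new way" (pandas 0.17+) looks like this; the movie data is invented:

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A", "B", "C"],
                       "duration": [120, 90, 120],
                       "rating": [7.1, 8.4, 6.9]})

# Sort a Series alone
sorted_durations = movies["duration"].sort_values()

# Sort the whole DataFrame by one column (sort_values replaced the pre-0.17 sort)
by_duration = movies.sort_values("duration")

# Sort by multiple columns: duration first, then rating as a descending tie-breaker
by_both = movies.sort_values(["duration", "rating"], ascending=[True, False])

print(by_both["title"].tolist())  # ['B', 'A', 'C']
```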

Let's say that you only want to display the rows of a DataFrame which have a certain column value. How would you do it? pandas makes it easy, but the notation can be confusing and thus difficult to remember. In this video, I'll work up to the solution step-by-step using regular Python code so that you can truly understand the logic behind pandas filtering notation.
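The step-by-step logic looks something like this (invented data):

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A", "B", "C"], "duration": [120, 90, 200]})

# Comparing a Series to a scalar produces a boolean Series...
is_long = movies["duration"] >= 120  # [True, False, True]

# ...which, passed inside brackets, keeps only the rows marked True
long_movies = movies[is_long]

# Usually written in a single step:
long_movies = movies[movies["duration"] >= 120]
print(long_movies["title"].tolist())  # ['A', 'C']
```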

Let's say that you want to filter the rows of a DataFrame by multiple conditions. In this video, I'll demonstrate how to do this using two different logical operators. I'll also explain the special rules in pandas for combining filter criteria, and end with a trick for simplifying chained conditions!
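In sketch form (invented data), the special rules are: use `&` and `|` rather than Python's `and`/`or`, and parenthesize each condition:

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A", "B", "C", "D"],
                       "genre": ["Crime", "Drama", "Crime", "Action"],
                       "duration": [120, 90, 200, 150]})

# & combines conditions; each condition needs its own parentheses
long_crime = movies[(movies["duration"] >= 120) & (movies["genre"] == "Crime")]

# Trick for chained "or" conditions on a single column: isin
crime_or_drama = movies[movies["genre"].isin(["Crime", "Drama"])]

print(long_crime["title"].tolist())      # ['A', 'C']
print(crime_or_drama["title"].tolist())  # ['A', 'B', 'C']
```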

In this video, I'm answering a few of the pandas questions I've received in the YouTube comments:

- When reading from a file, how do I read in only a subset of the columns or rows?
- How do I iterate through a Series or a DataFrame?
- How do I drop all non-numeric columns from a DataFrame?
- How do I know whether I should pass an argument as a string or a list?

When performing operations on a pandas DataFrame, such as dropping columns or calculating row means, it is often necessary to specify the "axis". But what exactly is an axis? In this video, I'll help you to build a mental model for understanding the axis parameter so that you will know when and how to use it.

pandas includes powerful string manipulation capabilities that you can easily apply to any Series of strings. In this video, I'll show you how to access string methods in pandas (along with a few examples), and then end with two bonus tips to help you maximize your efficiency.
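A quick sketch of the `.str` accessor (the item names below are invented):

```python
import pandas as pd

orders = pd.DataFrame({"item_name": ["Chips and Salsa", "Steak Burrito"]})

# String methods live under the .str accessor and apply element-wise
upper = orders["item_name"].str.upper()
has_chips = orders["item_name"].str.contains("Chips")

# Methods chain, just like on plain Python strings
cleaned = orders["item_name"].str.replace(" and ", " & ").str.lower()

print(cleaned.tolist())  # ['chips & salsa', 'steak burrito']
```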

Have you ever tried to do math with a pandas Series that you thought was numeric, but it turned out that your numbers were stored as strings? In this video, I'll demonstrate two different ways to change the data type of a Series so that you can fix incorrect data types. I'll also show you the easiest way to convert a boolean Series to integers, which is useful for creating dummy/indicator variables for machine learning.
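A minimal sketch of both conversions, using `astype` on invented data:

```python
import pandas as pd

df = pd.DataFrame({"price": ["2.39", "3.39"], "in_stock": [True, False]})

# Convert strings to floats so that math works
df["price"] = df["price"].astype(float)

# Converting a boolean Series to integers gives 0/1 dummy values
df["in_stock"] = df["in_stock"].astype(int)

print(round(df["price"].mean(), 2))  # 2.89
print(df["in_stock"].tolist())       # [1, 0]
```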

The pandas "groupby" method allows you to split a DataFrame into groups, apply a function to each group independently, and then combine the results back together. This is called the "split-apply-combine" pattern, and is a powerful tool for analyzing data across different categories. In this video, I'll explain when you should use a groupby and then demonstrate its flexibility using four different examples.
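The split-apply-combine pattern in miniature (the drinks data here is a tiny invented stand-in for the dataset used in the video):

```python
import pandas as pd

drinks = pd.DataFrame({"continent": ["EU", "EU", "AS", "AS"],
                       "beer_servings": [250, 150, 30, 50]})

# Split by continent, apply mean to each group, combine the results
means = drinks.groupby("continent")["beer_servings"].mean()

# agg applies several functions to each group at once
summary = drinks.groupby("continent")["beer_servings"].agg(["count", "min", "max"])

print(means["EU"])  # 200.0
```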

When you start working with a new dataset, how should you go about exploring it? In this video, I'll demonstrate some of the basic tools in pandas for exploring both numeric and non-numeric data. I'll also show you how to create simple visualizations in a single line of code!

Most datasets contain "missing values", meaning that the data is incomplete. Deciding how to handle missing values can be challenging! In this video, I'll cover all of the basics: how missing values are represented in pandas, how to locate them, and options for how to drop them or fill them in.
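The basic toolkit, sketched on invented data:

```python
import numpy as np
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", None, "Holyoke"],
                    "Shape": ["TRIANGLE", "OVAL", np.nan]})

# Missing values are represented as NaN; isnull marks them True
missing_per_column = ufo.isnull().sum()

# dropna removes rows containing any missing value...
complete_rows = ufo.dropna(how="any")

# ...while fillna replaces the missing values instead
filled = ufo.fillna("UNKNOWN")

print(len(complete_rows))  # 1
```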

The DataFrame index is core to the functionality of pandas, yet it's confusing to many users. In this video, I'll explain what the index is used for and why you might want to store your data in the index. I'll also demonstrate how to set and reset the index, and show how that affects the DataFrame's shape and contents.

In part two of our discussion of the index, we'll switch our focus from the DataFrame index to the Series index. After discussing index-based selection and sorting, I'll demonstrate how automatic index alignment during mathematical operations and concatenation enables us to easily work with incomplete data in pandas.

Have you ever been confused about the "right" way to select rows and columns from a DataFrame? pandas gives you an incredible number of options for doing so, but in this video, I'll outline the current best practices for row and column selection using the loc, iloc, and ix methods.
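(Note that `ix` has since been removed from pandas; `loc` and `iloc` remain the recommended methods.) A quick sketch on invented data:

```python
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", "Willingboro", "Holyoke"],
                    "State": ["NY", "NJ", "CO"]})

# loc selects by label; label slices are INCLUSIVE of the endpoint
first_two = ufo.loc[0:1, ["City", "State"]]

# iloc selects by integer position; position slices are exclusive, like Python
last_city = ufo.iloc[-1, 0]

print(first_two.shape)  # (2, 2)
print(last_city)        # Holyoke
```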

We've used the "inplace" parameter many times during this video series, but what exactly does it do, and when should you use it? In this video, I'll explain how "inplace" affects methods such as "drop" and "dropna", and why it is always False by default.

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.

Have you been using scikit-learn for machine learning, and wondering whether pandas could help you to prepare your data and export your predictions? In this video, I'll demonstrate the simplest way to integrate pandas into your machine learning workflow, and will create a submission for Kaggle's Titanic competition in just a few lines of code!

In this video, I'm answering a few of the pandas questions I've received in the YouTube comments:

- Could you explain how to read the pandas documentation?
- What is the difference between ufo.isnull() and pd.isnull(ufo)?
- Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?
- How do I randomly sample rows from a DataFrame?

If you want to include a categorical feature in your machine learning model, one common solution is to create dummy variables. In this video, I'll demonstrate three different ways you can create dummy variables from your existing DataFrame columns. I'll also show you a trick for simplifying your code that was introduced in pandas 0.18.
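A sketch with `get_dummies` (the columns mimic the Titanic data, but the rows are invented):

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", "S"]})

# get_dummies creates one 0/1 column per category;
# drop_first drops the redundant baseline column
sex_dummies = pd.get_dummies(train["Sex"], drop_first=True)

# Passing the whole DataFrame (the pandas 0.18 trick) encodes every
# listed column at once, with prefixed column names
encoded = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)

print(list(encoded.columns))  # ['Sex_male', 'Embarked_S']
```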

Let's say that you have dates and times in your DataFrame and you want to analyze your data by minute, month, or year. What should you do? In this video, I'll demonstrate how you can convert your data to "datetime" format, enabling you to access a ton of convenient attributes and perform datetime comparisons and mathematical operations.
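In sketch form (the timestamps below are invented):

```python
import pandas as pd

ufo = pd.DataFrame({"Time": ["6/1/1930 22:00", "4/18/2000 19:30"]})

# Convert the string column to datetime64 to unlock the .dt accessor
ufo["Time"] = pd.to_datetime(ufo["Time"])

years = ufo["Time"].dt.year                    # convenient attributes: year, month, hour, ...
recent = ufo[ufo["Time"] >= "1999-01-01"]      # datetime comparisons
span = ufo["Time"].max() - ufo["Time"].min()   # datetime math yields a Timedelta

print(years.tolist())  # [1930, 2000]
```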

During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I'll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.
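The two key methods, sketched on invented data:

```python
import pandas as pd

users = pd.DataFrame({"age": [20, 20, 25],
                      "zip_code": ["10001", "10001", "60611"]})

# duplicated marks rows that repeat an earlier row (keep='first' by default)
dupe_mask = users.duplicated()

# drop_duplicates removes them; subset restricts the comparison to chosen columns
deduped = users.drop_duplicates()
by_zip = users.drop_duplicates(subset="zip_code", keep="last")

print(dupe_mask.tolist())  # [False, True, False]
print(len(deduped))        # 2
```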

If you've been using pandas for a while, you've likely encountered a SettingWithCopyWarning. The proper response is to modify your code appropriately, not to turn off the warning! In this video, I'll show you two common scenarios in which this warning arises, explain why it's occurring, and then demonstrate how to address it.

Have you ever wanted to change the way your DataFrame is displayed? Perhaps you needed to see more rows or columns, or modify the formatting of numbers? In this video, I'll demonstrate how to change the settings for five common display options in pandas.

Have you ever needed to create a DataFrame of "dummy" data, but without reading from a file? In this video, I'll demonstrate how to create a DataFrame from a dictionary, a list, and a NumPy array. I'll also show you how to create a new Series and attach it to the DataFrame.
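A sketch of all three constructions (the data is invented):

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become the column names
df_dict = pd.DataFrame({"id": [100, 101], "color": ["red", "blue"]})

# From a list of lists: supply the column names separately
df_list = pd.DataFrame([[100, "red"], [101, "blue"]], columns=["id", "color"])

# From a NumPy array
df_arr = pd.DataFrame(np.random.rand(3, 2), columns=["height", "weight"])

# Attach a new Series to an existing DataFrame (aligned on the index)
df_dict["quantity"] = pd.Series([5, 7])

print(df_dict.shape)  # (2, 3)
```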

Have you ever struggled to figure out the differences between apply, map, and applymap? In this video, I'll explain when you should use each of these methods and demonstrate a few common use cases. Watch the end of the video for three important announcements!
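Roughly, the division of labor looks like this (invented data; note that in pandas 2.1+, `DataFrame.map` supersedes `applymap`):

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female"],
                      "Age": [22.0, 38.0],
                      "Fare": [7.25, 71.28]})

# map: substitute each value in a Series using a dictionary (or function)
train["Sex_num"] = train["Sex"].map({"female": 0, "male": 1})

# Series.apply: apply any function element-wise to a Series
train["Age_int"] = train["Age"].apply(int)

# DataFrame.apply: apply a function along an axis (here, the max of each column)
col_max = train[["Age", "Fare"]].apply(max, axis=0)

print(train["Sex_num"].tolist())  # [1, 0]
```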

During this two-hour webcast, I answered 45 viewer questions about pandas, the leading Python library for data analysis, exploration, and manipulation. View the complete list of questions on Crowdcast.

**P.S.** Want to be the first to know when I launch an **online course about pandas?** Subscribe to the Data School newsletter.

- **March:** While learning version control, I get frustrated by the lack of clear and accessible information about some important Git and GitHub concepts. Once I figure it out, I write the "missing page" of GitHub's documentation about forks and pull requests.
- **April:** Coursera invites me to be a Teaching Assistant for "The Data Scientist's Toolbox," the first course in their Data Science Specialization. I decide that the course videos leave out some essential information about Git, and so I create a 36-minute video series to share with the students. I return as a volunteer Teaching Assistant for the next 16 course sessions, and my videos are viewed 350,000 times.
- **August:** As an Expert in Residence for the Data Science course at General Assembly (GA), I teach a lesson on how to use R's dplyr package for data exploration and manipulation. I decide that a wider audience would benefit from the lesson, and record a longer tutorial for YouTube. (I record a follow-up tutorial in March 2015, and both tutorials are later featured by Kaggle.)
- **September:** I realize that the excellent videos from Stanford's Statistical Learning course are on YouTube, but are nearly impossible to find. I catalog the videos on my blog, and come up with a "must-click" title: In-depth introduction to machine learning in 15 hours of expert videos. It remains my most viewed post (and most popular tweet), and has been on the R-bloggers list of "most visited articles of the week" every week for the past 18 months.
- **November:** I'm now an Instructor for GA's Data Science course, and I teach a lesson on the challenging topic of ROC curves and Area Under the Curve. I convert that lesson into my first (and only) animated video, which later becomes surprisingly popular.
- **December:** I finish teaching the Data Science course, and publish the 66 hours of course content on GitHub so that others can benefit. (It's still my most popular GitHub repository, though truthfully, the latest version of the repository is much more refined.)

- **January:** I publish a 4000-word essay on data science instruction. A commenter asks a question about teaching Python or R for data science, and I respond with a (controversial) post, which garners some incredibly thoughtful debate in the comments section.
- **February:** Kaggle invites me to guest blog for them on a topic of my choosing. I volunteer to create a series of video tutorials on machine learning using Python's scikit-learn. During the eight months that follow, I spend hundreds of hours creating a 4-hour video series with companion blog posts for Kaggle. The Data School blog is very quiet during this time! :)
- **October:** I launch my first live online course, Machine Learning with Text in Python, in order to provide a classroom-like experience to students worldwide who are unable to attend my in-person courses.
- **December:** Google begins using my definition of "confusion matrix" in the snippet at the top of their search results, quoting from my simple guide to confusion matrix terminology.

**March:**I announce another session of Machine Learning with Text in Python starting April 9, having expanded the course content from 6 hours to 15 hours. (If you're interested in the course, you can watch a video Q&A for more information.)

**So what's next for Data School?** My plan is to launch additional online courses this year, while continuing to create lots of new content for the Data School blog and YouTube channel. I've got lots of exciting ideas for what to create next, but feel free to let me know your suggestions in the comments section!

**How can you help?** It would mean a lot to me if you would take 60 seconds right now and share this page with a friend, colleague, or group who might be interested in Data School. Let them know I have an email newsletter, and make sure you are also subscribed! By growing the Data School community, you're enabling me to focus full-time on Data School and create more high-quality content.

Thanks so much for being part of the Data School community. Here's to many more great years ahead!

Top #datascience content from two years of Data School: https://t.co/xS1rkJwszd What's your favorite post or video? pic.twitter.com/P9R2wn9UKq

— Kevin Markham (@justmarkham) March 25, 2016

Here are just a few of the attributes of logistic regression that make it **incredibly popular**: it's fast, it's highly interpretable, it doesn't require input features to be scaled, it doesn't require any tuning, it's easy to regularize, and it outputs well-calibrated predicted probabilities.

But despite its popularity, it is often misunderstood. Here are a few common questions about logistic regression:

- Why is it called "logistic regression" if it's used for **classification**?
- Why is it considered a **linear model**?
- How do you interpret the **model coefficients**?

As a teacher, I've found that my best lessons are the ones in which I explain a topic step-by-step in the way that I wish it had been taught to me. **I struggled when I was learning logistic regression**, which is why I'm so pleased to have written a lesson that may help you to grasp this challenging topic.

In order to give you **additional context for the lesson**, I created this guide that includes suggested prerequisites, a practical exercise, and a lengthy set of additional resources to allow you to go deeper into this topic.

Please note that the lesson code is written in Python, and so you will get the most out of it if you are a user of Python and scikit-learn. However, most components of this guide cover conceptual or mathematical material, and should be useful to all readers **regardless of programming background**.

I'd love to hear from you in the comments below! What questions do you have about logistic regression? Is this kind of guide helpful to you for learning a new topic? Are there **other guides** you would like me to create?

My in-depth guide to one of the most popular #machinelearning classifiers: https://t.co/fftGNRBzMD pic.twitter.com/521JwdvXkN

— Kevin Markham (@justmarkham) March 10, 2016

**Mathematical terminology:**

- Watch Rahul Patwari's videos on probability (5 minutes) and odds (8 minutes).
- Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln). Then, review this brief summary of exponential functions and logarithms.

**Machine learning:**

- Browse through my introductory slides on machine learning to make sure you are clear on the difference between regression and classification problems.
- Read Sebastian Raschka's overview of the supervised learning process for a look at the typical steps used to solve a classification problem.

**Linear regression:**

- Read my linear regression lesson notebook to ensure you are familiar with its form and interpretation, since the logistic regression lesson will build upon it. Alternatively, watch The Easiest Introduction to Regression Analysis (14 minutes).
- Setosa has an interactive visualization that may also help you to grasp linear regression.

**scikit-learn (optional):**

- For a walkthrough of the classification process using Python's scikit-learn library, watch videos 3 and 4 (35 minutes) from my scikit-learn video series. (Here are the associated notebooks.)

My logistic regression lesson notebook covers the following topics using the glass identification dataset:

- Refresh your memory on how to do linear regression in scikit-learn
- Attempt to use linear regression for classification
- Show you why logistic regression is a better alternative for classification
- Brief overview of probability, odds, e, log, and log-odds
- Explain the form of logistic regression
- Explain how to interpret logistic regression coefficients
- Demonstrate how logistic regression works with categorical features
- Compare logistic regression with other models
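As a small taste of the lesson, here's a minimal sketch of fitting logistic regression in scikit-learn and interpreting a coefficient. The data below is synthetic (the glass identification dataset itself is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data: one feature that truly drives the class
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# The coefficient is the change in the log-odds for a one-unit feature increase;
# exponentiating it gives the multiplicative change in the odds
log_odds_change = model.coef_[0][0]
odds_multiplier = np.exp(log_odds_change)

# predict_proba returns probabilities; predict applies a 0.5 cutoff
probs = model.predict_proba(X)[:, 1]
print(log_odds_change > 0)  # True: the feature increases the odds of class 1
```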

As a way to practice applying what you've learned, participate in Kaggle's introductory Titanic competition and use logistic regression to predict passenger survival. Kaggle links to helpful tutorials for Python, R, and Excel, and their Scripts feature lets you run Python and R code on the Titanic dataset from within your browser.

- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a "math-ier" explanation of logistic regression, read Sebastian Raschka's overview of logistic regression. He also provides the code for a simple logistic regression implementation in Python, and he has a section on logistic regression in his machine learning FAQ.
- For more guidance in interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation on probability calibration explains what it means for a predicted probability to be calibrated, and my blog post on click-through rate prediction with logistic regression explains why calibrated probabilities are useful in the real world.

- If you're a scikit-learn user, it's worth reading the user guide and class documentation for logistic regression to understand the particulars of its implementation.
- If you'd like to improve your logistic regression model through regularization, read part 5 of my regularization lesson notebook.

- Choosing a Machine Learning Classifier is a short and highly readable comparison of logistic regression, Naive Bayes, decision trees, and Support Vector Machines.
- Supervised learning superstitions cheat sheet is a more thorough comparison of those classifiers, and includes links to lots of useful resources.
- Comparing supervised learning algorithms is a comparison table I created that includes both classification and regression models.
- Classifier comparison is scikit-learn's visualization of classifier decision boundaries.
- An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
- These lecture slides compare the inner workings of logistic regression and Naive Bayes, and this paper by Andrew Ng compares the performance of logistic regression and Naive Bayes across a variety of datasets.

At the end of every course, the most common question I receive from students is this:

How can I continue to improve my data science skills?

Below is the advice I give to my students. **How would you answer this question?**

Here is my best advice for **getting better at data science**: Find "the thing" that motivates you to practice what you learned and to learn more, and then do that thing. That could be personal data science projects, Kaggle competitions, online courses, reading books, reading blogs, attending meetups or conferences, or something else.

If you create your own **data science projects**, I'd encourage you to share them on GitHub and include writeups. That will help to show others that you know how to do proper data science.

**Kaggle competitions** are a great way to practice data science without coming up with the problem yourself. Don't worry about how high you place, just focus on learning something new with every competition. Spend as much time as possible reading the forums, because you'll learn a lot, but don't spend time in the forums at the expense of working on the competition yourself. Also, keep in mind that you won't be practicing important parts of the data science workflow, namely generating questions, gathering data, and communicating results.

There are many **online courses** to consider, and new ones being created all the time:

- Coursera's Data Science Specialization is 9 courses, plus a capstone project. There is a lot of overlap with General Assembly's course, and course quality varies, but you would definitely learn a lot of R.
- Coursera's Machine Learning is Andrew Ng's highly regarded course. It goes deeper into many topics we covered, and covers many topics we didn't. Keep in mind that it focuses only on machine learning (not the entire data science workflow), the programming assignments use MATLAB/Octave, and it requires some understanding of linear algebra. Browse these lecture notes (compiled by a student) for a preview of the course.
- Stanford's Statistical Learning also covers some topics that we did not. It focuses on teaching machine learning at a conceptual (rather than mathematical) level, when possible. The course may be offered again in 2016, but the real gem from the course is the book and videos (linked below).
- Caltech's Learning from Data teaches machine learning at a theoretical and conceptual level. The lectures and slides are excellent. The homework assignments are not interactive, and the course does not use a specific programming language.
- Udacity's Data Analyst Nanodegree looks promising, but I don't know anyone who has done it.
- Thinkful's Data Science in Python course may be a good way to practice our course material with guidance from an expert mentor.
- edX's Introduction to Computer Science and Programming Using Python is apparently an excellent course if you want to get better at programming in Python.
- CourseTalk is useful for reading reviews of online courses.
- I also teach my own online courses, which will range in level from beginner to advanced. (Subscribe to my email newsletter to be notified when courses are announced.)
- Some additional courses are listed in the Bonus Resources section of the course repository.

Here is just a tiny selection of **books**:

- An Introduction to Statistical Learning with Applications in R is my favorite book on machine learning because of the thoughtful way in which the material is presented. The Statistical Learning course linked above uses it as the course textbook, and the related videos are available on YouTube.
- Elements of Statistical Learning is by the same authors. It covers a wider variety of topics, and in greater mathematical depth.
- Python for Data Analysis was written by the creator of Pandas, and is especially useful if you want to go deeper into Pandas and NumPy.
- Python Machine Learning is coming out in October 2015. The author, Sebastian Raschka, is an excellent writer and has a deep understanding of both machine learning and scikit-learn, so I expect it will be worth reading.

There are an overwhelming number of data science **blogs and newsletters**. If you want to read just one site, DataTau is the best aggregator. Data Elixir is the best newsletter, though the O'Reilly Data Newsletter and Python Weekly are also good. Other notable blogs include: no free hunch (Kaggle's blog), The Yhat blog (lots of Python and R content), Practical Business Python (accessible Python content), Simply Statistics (a bit more academic), FastML (machine learning content), Win-Vector blog (great data science advice), FiveThirtyEight (data journalism), and Data School (my blog).

If you prefer **podcasts**, I don't have any personal recommendations, though this list gives a nice summary of seven data science podcasts that the author recommends.

Some notable data science **conferences** are KDD, Strata, PyCon, PyData, and SciPy. (You should also search for data-related **meetups** in your local community!)

If you want to go **full-time** with your data science education, read this guide to data science bootcamps, and this other guide which also includes part-time and online programs. Or, check out this massive list of colleges and universities with data science-related degrees.

I'd love to hear from you in the comments, whether it's to share an additional resource or piece of advice, to discuss one of my recommendations, or just to let me know that you found something useful here!

P.S. Want to take an **online data science course** taught by me? Please subscribe to the Data School newsletter to gain priority access to my upcoming courses!

Finished a #DataScience or #machinelearning course and don't know what to do next? My advice to students: http://t.co/CQFbUhW1Po

— Kevin Markham (@justmarkham) September 8, 2015

In the data science course that I teach for General Assembly, we spend a lot of time using scikit-learn, Python's library for machine learning. I love teaching scikit-learn, but it has a **steep learning curve**, and my feeling is that there are not many scikit-learn resources that are targeted towards **machine learning beginners**. Thus I decided to create a series of scikit-learn video tutorials, which I launched in April in partnership with Kaggle, the leading online platform for competitive data science!

The series now contains **nine video tutorials** totaling **four hours**. My goal with this series is to help motivated individuals to gain a thorough grasp of both machine learning fundamentals and the scikit-learn workflow. **I don't presume any familiarity with machine learning**, which is why the first video focuses exclusively on answering the question, "What is machine learning, and how does it work?" And although the series does assume that you have some familiarity with Python, the second video contains my suggested resources for learning Python if you're just getting started with the language.

I've embedded the video playlist below, or you can watch it on YouTube. I've also listed the agenda for each video, along with links to the blog post and **Jupyter Notebook** associated with each video. (My GitHub repository contains all of the Notebooks, which may be useful as reference material!)

I hope you enjoy the series, and welcome your **comments and questions**. Please subscribe to my YouTube channel to be notified when new videos are released!

**What is machine learning, and how does it work?** (video, notebook, blog post)

- What is machine learning?
- What are the two main categories of machine learning?
- What are some examples of machine learning?
- How does machine learning "work"?

**Setting up Python for machine learning: scikit-learn and IPython Notebook** (video, notebook, blog post)

- What are the benefits and drawbacks of scikit-learn?
- How do I install scikit-learn?
- How do I use the IPython Notebook?
- What are some good resources for learning Python?

**Getting started in scikit-learn with the famous iris dataset** (video, notebook, blog post)

- What is the famous iris dataset, and how does it relate to machine learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- What are scikit-learn's four key requirements for working with data?
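As a quick sketch of the loading step (my own minimal example, not the exact code from the video):

```python
from sklearn.datasets import load_iris

# Load the iris dataset, which ships with scikit-learn
iris = load_iris()

# X is the feature matrix: 150 observations, 4 features
# y is the response vector: 150 class labels (0, 1, or 2)
X = iris.data
y = iris.target

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```

This already satisfies scikit-learn's key requirements: features and response are separate, numeric, and stored as NumPy arrays of the expected shapes.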

**Training a machine learning model with scikit-learn** (video, notebook, blog post)

- What is the K-nearest neighbors classification model?
- What are the four steps for model training and prediction in scikit-learn?
- How can I apply this pattern to other machine learning models?
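The four-step pattern looks roughly like this (a minimal sketch using K-nearest neighbors on iris; the unseen observation is made up for illustration):

```python
# Step 1: import the model class you want to use
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Step 2: instantiate the model, setting any tuning parameters
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the model to the data (the "training" step)
knn.fit(X, y)

# Step 4: predict the response for a new, unseen observation
prediction = knn.predict([[3, 5, 4, 2]])
print(prediction)
```

The same import/instantiate/fit/predict pattern applies to nearly every scikit-learn model; only the class name and tuning parameters change.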

**Comparing machine learning models in scikit-learn** (video, notebook, blog post)

- How do I choose which model to use for my supervised learning task?
- How do I choose the best tuning parameters for that model?
- How do I estimate the likely performance of my model on out-of-sample data?
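A hedged sketch of the train/test split procedure for estimating out-of-sample performance (in current scikit-learn versions the splitter lives in `sklearn.model_selection`; older releases used `sklearn.cross_validation`):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 40% of the observations to simulate out-of-sample data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# Train only on the training set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Testing accuracy estimates likely performance on unseen data
y_pred = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```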

**Data science pipeline: pandas, seaborn, scikit-learn** (video, notebook, blog post)

- How do I use the pandas library to read data into Python?
- How do I use the seaborn library to visualize data?
- What is linear regression, and how does it work?
- How do I train and interpret a linear regression model in scikit-learn?
- What are some evaluation metrics for regression problems?
- How do I choose which features to include in my model?

**Cross-validation for parameter tuning, model selection, and feature selection** (video, notebook, blog post)

- What is the drawback of using the train/test split procedure for model evaluation?
- How does K-fold cross-validation overcome this limitation?
- How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
- What are some possible improvements to cross-validation?
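K-fold cross-validation overcomes the variance of a single train/test split by averaging over many splits. Here's a minimal sketch with `cross_val_score` (my own example in the spirit of the video, not its exact code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)

# 10-fold cross-validation: each observation is used for testing exactly once
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')

# The mean of the 10 scores is a more reliable performance estimate
# than any single train/test split
print(scores.mean())
```

Repeating this with different values of `n_neighbors` (or with different models) is how cross-validation supports parameter tuning and model selection.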

**Efficiently searching for optimal tuning parameters** (video, notebook, blog post)

- How can K-fold cross-validation be used to search for an optimal tuning parameter?
- How can this process be made more efficient?
- How do you search for multiple tuning parameters at once?
- What do you do with those tuning parameters before making real predictions?
- How can the computational expense of this process be reduced?
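`GridSearchCV` automates that search: it cross-validates every candidate parameter value and keeps the best one. A minimal sketch (illustrative parameter range, not the video's exact code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Candidate values for the n_neighbors tuning parameter
param_grid = {'n_neighbors': list(range(1, 31))}

# 10-fold cross-validation for each of the 30 candidate values
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10,
                    scoring='accuracy')
grid.fit(X, y)

# Before making real predictions, refit on ALL of the data using the
# best parameters (GridSearchCV exposes this as best_estimator_)
print(grid.best_params_)
print(grid.best_score_)
```

When the grid is large, `RandomizedSearchCV` reduces the computational expense by sampling a fixed number of parameter combinations instead of trying them all.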

**Evaluating a classification model** (video, notebook, blog post)

- What is the purpose of model evaluation, and what are some common evaluation procedures?
- How is classification accuracy used, and what are its limitations?
- How does a confusion matrix describe the performance of a classifier?
- What metrics can be computed from a confusion matrix?
- How can you adjust classifier performance by changing the classification threshold?
- What is the purpose of an ROC curve?
- How does Area Under the Curve (AUC) differ from classification accuracy?
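A toy example (labels and probabilities invented for illustration) showing how these metrics relate:

```python
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

# True and predicted class labels for a binary classifier
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Rows are actual classes, columns are predicted classes
confusion = confusion_matrix(y_true, y_pred)
print(confusion)
# [[4 1]
#  [1 4]]

# Accuracy summarizes the diagonal of the confusion matrix
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8

# AUC is computed from predicted PROBABILITIES, not class predictions,
# so it is insensitive to the choice of classification threshold
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9, 0.95]
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.92
```

Changing the classification threshold moves observations between the columns of the confusion matrix (trading sensitivity for specificity) without changing the AUC.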

At the PyCon 2016 conference, I taught a **3-hour tutorial** that builds upon this video series. The recording is embedded below, or you can watch it on YouTube:

Here are the topics I covered:

- Model building in scikit-learn (refresher)
- Representing text as numerical data
- Reading a text-based dataset into pandas
- Vectorizing our dataset
- Building and evaluating a model
- Comparing models
- Examining a model for further insight
- Practicing this workflow on another dataset
- Tuning the vectorizer (discussion)

Visit this GitHub repository to access the **tutorial notebooks** and many other recommended resources. If you want to go even deeper into this material, I teach an online course, Machine Learning with Text in Python.

Detailed intro to #machinelearning with scikit-learn: 9 IPython notebooks & 4 hrs of video! https://t.co/nyQKC8Szm5 pic.twitter.com/Ez86rCPdw3

— Kevin Markham (@justmarkham) October 28, 2015

I recorded that tutorial using the latest version at the time (0.2), but there have since been **two significant updates to dplyr (versions 0.3 and 0.4)**. Because those updates introduced **a ton of new functionality**, I thought it was time to create another tutorial!

This new tutorial covers the most useful new features in 0.3 and 0.4, as well as some advanced functionality from previous versions that I didn't cover last time. (**If you have not watched the previous tutorial, I recommend you do so first** since it covers some dplyr basics that are not covered in this tutorial.)

This new tutorial runs **37 minutes**, but if you only want to watch a particular section, simply click the topic below and it will **skip to that point in the video:**

- Introduction (starts at 0:00)
- Loading dplyr and the nycflights13 dataset (starts at 1:12)
- Choosing columns: `select`, `rename` (starts at 2:28)
- Choosing rows: `filter`, `between`, `slice`, `sample_n`, `top_n`, `distinct` (starts at 5:40)
- Adding new variables: `mutate`, `transmute`, `add_rownames` (starts at 12:38)
- Grouping and counting: `summarise`, `tally`, `count`, `group_size`, `n_groups`, `ungroup` (starts at 15:20)
- Creating data frames: `data_frame` (starts at 23:01)
- Joining (merging) tables: `left_join`, `right_join`, `inner_join`, `full_join`, `semi_join`, `anti_join` (starts at 25:28)
- Viewing more output: `print`, `View` (starts at 31:29)
- Resources (starts at 34:41)

The video is embedded below, or you can view it on YouTube:

You can view the R Markdown document used in the video on RPubs, or you can download the source document from GitHub.

Here are the resources I mention in the video:

- Release announcements for version 0.3 and version 0.4
- dplyr reference manual and vignettes
- Two-table vignette covering joins and set operations
- RStudio's Data Wrangling Cheat Sheet for dplyr and tidyr
- dplyr GitHub repo and list of releases

My previous tutorial is embedded below, or you can view it on YouTube:

If you have any **questions about dplyr**, I'd love to hear them in the comments!

If you'd like to be notified when I release **new videos**, please subscribe to my YouTube channel. I also blog about a wide variety of data science topics, and have an email newsletter if you'd like to hear about that content!

Just released! 37-min tutorial on new features in dplyr 0.3 & 0.4: rename, slice, count, data_frame, joins, much more http://t.co/BoQXEvo81n

— Kevin Markham (@justmarkham) March 9, 2015

Near the end of this 11-week course, we spend a few hours reviewing the material that has been covered throughout the course, with the hope that students will start to construct mental connections between all of the different things they have learned. One of the skills that I want students to be able to take away from this course is the ability to **intelligently choose between supervised learning algorithms** when working on a machine learning problem. Although there is some value in the "brute force" approach (try everything and see what works best), there is a lot more value in being able to **understand the trade-offs you're making** when choosing one algorithm over another.

I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to **compare the algorithms across a dozen different dimensions**. I couldn't find a table like this on the Internet, so I decided to construct one myself! Here's what I came up with:

I wanted to share this table for two reasons: First, I thought it might be useful to others as a **teaching or learning tool**. (You're welcome to open it in Google Sheets and make a copy.) Second, **I want to make it better**, and one way to do that is to ask people more knowledgeable than me to tell me what I got wrong! :)

This table is a product of my own experience and research, but I'm not an expert in any one of these algorithms. **If you have a suggestion for how this table can be improved, I'd love to hear it in the comments!**

- Are any of my evaluations misleading or incorrect? (Of course, some of these dimensions are inherently subjective.)
- Are there any other "important" dimensions for comparison that should be added to this table?
- Are there any other algorithms that you would like me to add to this table? (Currently, it only includes algorithms that were taught in my course.)

I realize that the characteristics and relative performance of each algorithm can vary based upon the particulars of the data (and how well it is tuned), and thus some may argue that attempting to construct an "objective" comparison is an ill-advised task. However, I would argue that there is still value in providing this table as a **set of general guidelines** and as a **starting point for comparing algorithms** for your own supervised learning task.

Happy (machine) learning!

- Choosing a Machine Learning Classifier: Edwin Chen's short and highly readable guide.
- scikit-learn's "Machine Learning Map": Their guide for choosing the "right" estimator for your task.
- Machine Learning Done Wrong: Thoughtful advice on common mistakes to avoid in machine learning, some of which relate to algorithmic selection.
- Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above.
- An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006.
- View all Data School posts on machine learning

P.S. There are other discussions about this post on Kaggle and DataTau.

P.P.S. I teach an online course about Machine Learning with Text in Python.

Comparing 8 common supervised learning algorithms on 13 different dimensions: http://t.co/nqND7EWN3H #machinelearning

— Kevin Markham (@justmarkham) February 27, 2015

Here's why **linear regression** is a great first algorithm to learn:

- It's widely used and well-understood.
- It runs very fast!
- It's easy to use because minimal "tuning" is required.
- It's highly "interpretable", meaning that it's easy to explain to others.
- It's the basis for many other machine learning techniques.

The most **accessible (yet thorough) introduction to linear regression** that I've found is Chapter 3 of An Introduction to Statistical Learning (ISL) by Hastie & Tibshirani. Their examples are crystal clear and the material is presented in a logical fashion, but it covers a lot more detail than I wanted to present in class. As well, **their code is written in R**, and my data science class is taught in Python.

When teaching this material, **I essentially condensed ISL chapter 3 into a single IPython Notebook**, focusing on the points that I consider to be most important and adding a lot of practical advice. As well, **I wrote all of the code in Python**, using both Statsmodels and scikit-learn to implement linear regression.

**Click here to view the IPython Notebook.**

Here is a detailed list of **topics covered** in the Notebook:

- reading data into Python using pandas
- identifying the features, response, and observations
- plotting the relationship between each feature and the response using Matplotlib
- introducing the form of simple linear regression
- estimating linear model coefficients
- interpreting model coefficients
- using the model for prediction
- plotting the "least squares" line
- quantifying confidence in the model
- identifying "significant" coefficients using hypothesis testing and p-values
- assessing how well the model fits the observed data
- extending simple linear regression to include multiple predictors
- comparing feature selection techniques: R-squared, p-values, cross-validation
- creating "dummy variables" (using pandas) to handle categorical predictors
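For the dummy-variable step in particular, pandas provides `get_dummies`. A minimal sketch (the `Area` column is an invented example):

```python
import pandas as pd

# A categorical predictor with three levels
df = pd.DataFrame({'Area': ['rural', 'suburban', 'urban', 'rural']})

# Create 0/1 dummy variables, dropping the first level so it becomes
# the baseline (avoids perfect collinearity among the dummy columns)
dummies = pd.get_dummies(df['Area'], prefix='Area', drop_first=True)
print(dummies)
```

The resulting `Area_suburban` and `Area_urban` columns can be included directly as features in a linear regression model; a row with zeros in both columns represents the baseline level (`rural`).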

**If you would like to go deeper into linear regression**, here are a few resources I would suggest:

- Chapter 3 of An Introduction to Statistical Learning (which can be **downloaded for free!**) extends this lesson to include more advanced topics, such as detecting collinearity, diagnosing model fit, and transforming predictors to fit non-linear relationships.
- This introduction to linear regression is well-written, mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.

**If you liked this Notebook**, here are some other Data School resources that might interest you:

- Quick reference guide to applying and interpreting linear regression
- IPython Notebook demonstrating logistic regression in Python
- 15 hours of expert videos introducing machine learning
- Python or R for data science?
- My 4-hour video series on machine learning in Python

Do you have any questions about linear regression in Python? **Please let me know in the comments below!**

P.S. Want to receive more content like this in your inbox? Subscribe to the Data School newsletter.

New IPython notebook: Intro to linear regression in #python using scikit-learn, statsmodels, pandas, matplotlib http://t.co/T7MP4784jP

— Kevin Markham (@justmarkham) February 20, 2015

In the comments, I received the following question:

I'm part of a team developing a course, with NSF support, in data science. The course will have no prerequisites and will be targeted for non-technical majors, with a goal to show how useful data science can be in their own area. Some of the modules we are developing include, for example, data cleansing, data mining, relational databases and NoSQL data stores. We are considering as tools the statistical environment R and Python and will likely develop two versions of this course. For now, we'd appreciate your sense of the relative merits of those two environments. We are hoping to get a sense of what would be more appropriate for computer and non computer science students, so if you have a sense of what colleagues that you know would prefer, that also would be helpful.

That's an excellent question! It doesn't have a simple answer (in my opinion) because both languages are great for data science, but one might be better than the other depending upon your students and your priorities.

At General Assembly in DC, we currently teach the course entirely in Python, though we used to teach it in both R and Python. I also mentor data science students in R, and I'm a teaching assistant for online courses in both R and Python. I enjoy using both languages, though I have a slight personal preference for Python specifically because of its machine learning capabilities (more details below).

Here are some questions that might help you (as educators or curriculum developers) to assess which language is a better fit for your students:

If your students have some programming experience, Python may be the better choice because its syntax is more similar to other languages, whereas R's syntax is thought to be unintuitive by many programmers. If your students don't have any programming experience, I think both languages have an equivalent learning curve, though many people would argue that Python is easier to learn because its code reads more like regular human language.

In academia, especially in the field of statistics, R is much more widely used than Python. In industry, the data science trend is slowly moving from R towards Python. One contributing factor is that companies using a Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in "productionizing" a data scientist's work.

The line between these two terms is blurry, but machine learning is concerned primarily with predictive accuracy over model interpretability, whereas statistical learning places a greater priority on interpretability and statistical inference. To some extent, R "assumes" that you are performing statistical learning and makes it easy to assess and diagnose your models. scikit-learn, by far the most popular machine learning package for Python, is more concerned with predictive accuracy. (For example, scikit-learn makes it very easy to tune and cross-validate your models and switch between different models, but makes it much harder than R to actually "examine" your models.) Thus, R is probably the better choice if you are teaching statistical learning, though Python also has a nice package for statistical modeling (Statsmodels) that duplicates some of R's functionality.
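For instance, Statsmodels' formula API mirrors R's formula language and summary output. Here's a hedged sketch with made-up numbers (assuming Statsmodels is installed):

```python
import pandas as pd
import statsmodels.formula.api as smf

# A small illustrative dataset (hypothetical advertising-style data)
data = pd.DataFrame({
    'TV':    [230, 44, 17, 151, 180],
    'Sales': [22, 10, 9, 18, 13],
})

# R-style formula interface: response ~ predictor
lm = smf.ols(formula='Sales ~ TV', data=data).fit()

# Coefficients, with p-values and R-squared available via lm.summary(),
# much like R's summary(lm)
print(lm.params)
```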

In R, getting started with your first model is easy: read your data into a data frame, use a built-in model (such as linear regression) along with R's easy-to-read formula language, and then review the model's summary output. In Python, it can be much more of a challenging process to get started simply because there are so many choices to make: How should I read in my data? Which data structure should I store it in? Which machine learning package should I use? What type of objects does that package allow as input? What shape should those objects be in? How do I include categorical variables? How do I access the model's output? (Et cetera.) Because Python is a general purpose programming language whereas R specializes in a smaller subset of statistically-oriented tasks, those tasks tend to be easier to do (at least initially) in R.

However, once you have mastered the basics of machine learning in Python (using scikit-learn), I find that machine learning is actually a lot easier in Python than in R. scikit-learn provides a clean and consistent interface to tons of different models. It provides you with many options for each model, but also chooses sensible defaults. Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly. It is also actively being developed.

In R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be under active development. (caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, but it's nowhere near as elegant a solution as scikit-learn.) In summary, machine learning in R tends to be a more tiresome experience than machine learning in Python once you have moved beyond the basics. As such, Python may be a better choice if students are planning to go deeper into machine learning.

R is not a sexy language. It feels old, and its website looks like it was created around the time the web was invented. Python is the "new kid" on the data science block, and has far more sex appeal. From a marketing perspective, Python may be the better choice simply because it will attract more students.

Installing R is a simple process, and installing RStudio (the de facto IDE for R) is just as easy. Installing new packages or upgrading existing packages from CRAN (R's package management system) is a trivial process within RStudio, and even installing packages hosted on GitHub is a simple process thanks to the devtools package.

By comparison, Python itself may be easy to install, but installing individual Python packages can be much more challenging. In my classroom, we encourage students to use the Anaconda distribution of Python, which includes nearly every Python package we use in the course and has a package management system similar to CRAN. However, Anaconda installation and configuration problems are still common in my classroom, whereas these problems were much more rare when using R and RStudio. As such, R may be the better choice if your students are not computer savvy.

Data cleaning (also known as "data munging") is the process of transforming your raw data into a more meaningful form. I find data cleaning to be easier in Python because of its rich set of data structures, as well as its far superior implementation of regular expressions (which are often necessary for cleaning text).
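As a small illustration of that regex point (a toy example of my own, not from the course):

```python
import re

# Raw text with inconsistently formatted phone numbers
raw = 'Call 555-123-4567 or (555) 987 6543 for details'

# One pattern captures both formats; the groups pull out the digit runs
pattern = re.compile(r'\(?(\d{3})\)?[-\s](\d{3})[-\s](\d{4})')

# Normalize every match to a single canonical format
cleaned = ['-'.join(match) for match in pattern.findall(raw)]
print(cleaned)  # ['555-123-4567', '555-987-6543']
```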

The pandas package in Python is an extremely powerful tool for data exploration, though its power and flexibility can also make it challenging to learn. R's dplyr is more limited in its capabilities than pandas (by design), though I find that its more focused approach makes it easier to figure out how to accomplish a given task. As well, dplyr's syntax is more readable and thus is easier for me to remember. Although it's not a clear differentiator, I would consider R a slightly easier environment for getting started in data exploration due to the ease of learning dplyr.

R's ggplot2 is an excellent package for data visualization. Once you understand its core principles (its "grammar of graphics"), it feels like the most natural way to build your plots, and it becomes easy to produce sophisticated and attractive plots. Matplotlib is the de facto standard for scientific plotting in Python, but I find it tedious both to learn and to use. Alternatives like Seaborn and pandas plotting still require you to know some Matplotlib, and the alternative that I find most promising (ggplot for Python) is still early in development. Therefore, I consider R the better choice for data visualization.

Python's Natural Language Toolkit (NLTK) is a mature, well-documented package for NLP. TextBlob is a simpler alternative, spaCy is a brand new alternative focused on performance, and scikit-learn also provides some supporting functionality for text-based feature extraction. In comparison, I find R's primary NLP framework (the tm package) to be significantly more limited and harder to use. Even if there are additional R packages that can fill in the gaps, there isn't one comprehensive package that you can use to get started, and thus Python is the better choice for teaching NLP.

If you are a data science educator, or even just a data scientist who uses R or Python, **I'd love to hear from you in the comments!** On which points above do you agree or disagree? What are some important factors that I have left out? What language do you teach in the classroom, and why?

I look forward to this conversation!

P.S. Want to hear about new Data School blog posts, video tutorials, and online courses? Subscribe to my newsletter.

Should you teach #python or #rstats for #DataScience? My analysis: http://t.co/uaXFhh35NF

— Kevin Markham (@justmarkham) February 2, 2015

The session ending in December was my third time through the Data Science course, as I had previously taken the course (as a student) and had also been an "Expert in Residence" (teaching assistant) for the course. Since I have already begun teaching a fourth session of the course, **I've been giving a lot of thought to what worked and what didn't during the past three sessions** so that I can make the current course even better for my students. (My course materials can be found on GitHub, both for the third session and the fourth session.)

Below are my findings from the past three sessions, based on explicit and implicit feedback from the students, student performance in each area, and conversations with my excellent co-instructor from the last session, Josiah Davis. Although this post is most useful to instructors specifically teaching General Assembly's Data Science course, **my hope is that every data science educator will find value in some of the lessons below.**

**I'd love to hear your comments** below the post, and I look forward to an engaging dialogue about data science education! If you'd like to connect with me otherwise, check out my About page.

P.S. I wrote a follow-up to this post: Should you teach Python or R for data science?

Lessons learned from teaching an 11-week #datascience course in #python: http://t.co/J4YK1RZpGp @GA @GA_DC

— Kevin Markham (@justmarkham) January 23, 2015

**Things that worked well**

- Teaching Python as the sole language
- Using the Anaconda distribution of Python
- More concepts than math
- More code than slides
- Teaching APIs and web scraping
- Teaching Git in the context of the GitHub workflow
- Using a GitHub repository for homework submissions
- Much more Pandas than Numpy
- Including Natural Language Processing in the curriculum
- Starting class with a motivating example
- Placing a large emphasis on the course project
- Requiring students to review their peers' projects
- No grades
- Providing students with resources to go deeper into the material
- Using videos as teaching tools
- Using a chat application for out-of-class communication
- Recording a video of class

**Changes I'm making**

**Things I'd like to try**

- Flipped classroom
- Pair programming
- Teaching plotting using ggplot or Seaborn instead of Matplotlib
- Teaching visualization at a conceptual level
- Using a course textbook
- Participating in a Kaggle competition as a class
- Including tidy data and reproducibility in the curriculum
- Teaching other types of machine learning

**Changes I'm debating**

Some latitude is given to General Assembly instructors as to how to deliver the curriculum, including the choice of language(s) in which to teach the core content. In the first two sessions in DC, the course was taught in both R and Python. In the most recent session, we taught the course in Python only, and I found that to be a wise decision. Because the vast majority of students who take the Data Science course (at least in DC) are relatively inexperienced at programming, attaining a baseline proficiency in two languages in 11 weeks is unrealistic for most of our students. So when choosing between R and Python, we ultimately chose Python because of its extensibility, but I believe that R could also have been just as good a choice.

Although it's not a strict requirement, we strongly encourage all of our students to use the Anaconda distribution of Python. It includes almost every Python package that we use in the course, and removes most of the headaches (especially with Windows) associated with installing and upgrading packages. It also includes a nice IDE that we use for teaching (Spyder), as well as IPython and the IPython Notebook. Students occasionally run into configuration problems with Anaconda, but by and large it provides an easy path to getting up and running quickly with most of the necessary tools.

Undoubtedly, gaining a deep understanding of every machine learning technique that we teach would require students to have a thorough understanding of probability, statistics, linear algebra, calculus, and more. The vast majority of our students do not have that background, nor do we have the time to teach the in-depth mathematical theory behind each technique. Instead, we have found that students can obtain a practical understanding of how an algorithm works, its strengths and weaknesses, and how to properly apply it by teaching the material primarily at a conceptual level. We do include some of the math behind each algorithm, and we occasionally derive formulas, but that is not the focus of our lectures. We point students toward additional resources if they want to go deeper into the math, but most General Assembly students (and adult learners in general) are focused on learning things that they can immediately apply to their own work.

Because data science is as much about application as it is theory, we spent more time on code walkthroughs and exercises than we did on slides for explaining and demonstrating concepts. Students appreciate seeing high-quality, well-commented code, both for its explanatory value and its reference value. This is especially helpful for a complex topic such as machine learning, for which properly worked examples are a critical teaching tool.

Although the simplest approach to acquiring data (and the most expeditious for teaching purposes) is to download it from the web, we spent an entire class demonstrating how (and why) to gather data using APIs and web scraping. Not only did this content turn out to be very useful for students, since many of them incorporated data acquired this way into their final projects, but it also broadened their view of where data can be found.

I enjoy teaching Git (and even created an introductory video series), but we considered excluding it from the curriculum in order to make room for other "core" data science content. I'm glad that we decided to include it, because version control is an important and marketable data science skill. In combination with GitHub, it allows you to put your project portfolio online (as repositories), as well as contribute to open source projects. We gave it an entire class period (3 hours), which I have found is the minimum instructional time required to take a complete Git novice and provide them with a "functional understanding" of Git, including the ability to clone and fork repositories and contribute on GitHub using pull requests. I've specifically found that teaching Git in the context of the GitHub workflow is the fastest path to student comprehension, though branches are way too complicated for most novices to quickly grasp.

Because we want students to get a lot of practice using Git in a collaborative environment, we set up a GitHub repository for student work, and required that they use pull requests to submit their homework and course project. We found that this was valuable practice for them, though it was occasionally painful for us (as instructors) to deal with their merge conflicts. As well, GitHub gives us an easy way to comment on student code, since you can do inline comments in pull requests (which are automatically emailed to the student).

Pandas is an excellent Python library for data exploration, cleaning, analysis, and visualization. New features are being released all the time, the documentation is well-written and thorough, and there is an excellent book about Pandas by its creator. It sits on the huge shoulders of Numpy, Matplotlib, and other libraries, but makes it significantly easier to work with real-world data. Because Pandas duplicates a lot of Numpy functionality (but is much more convenient than Numpy when working with multiple data types), we only taught the functionality in Numpy that can't easily be done in Pandas. In fact, I've found that with each new release of Pandas, there are more operations you no longer need to do directly in Numpy.
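As a minimal sketch of why Pandas is more convenient than Numpy when working with multiple data types (the toy data and column names here are purely illustrative):

```python
import numpy as np
import pandas as pd

# A Numpy array holds a single dtype, so mixing strings and numbers
# silently converts everything to strings:
arr = np.array([['Alice', 34, 1.75], ['Bob', 25, 1.80]])
print(arr.dtype)  # a string dtype: the numbers are no longer numeric

# A Pandas DataFrame keeps a separate dtype per column, so numeric
# operations still work on the numeric columns:
people = pd.DataFrame({'name': ['Alice', 'Bob'],
                       'age': [34, 25],
                       'height': [1.75, 1.80]})
mean_age = people['age'].mean()
print(mean_age)
```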

Although some might argue that Natural Language Processing (NLP) is not a "core" data science topic, I was glad that we spent a class on NLP. So much of the information out there is stored in textual format, and understanding NLP significantly broadens your scope of what "data" looks like. And inevitably, many students want to use text-based data in their final project, so this class empowers them to do exactly that.

As often as possible, we introduced each topic using a "motivating example", meaning an example of a problem that could be solved by the topic we are about to teach. For example, before presenting an overview of machine learning, we explored the famous iris data set and then wrote a simple algorithm in Python for classifying each flower in the data set. That exercise in "human learning" served as the motivation for exploring machine learning as well as our first classification algorithm, K-Nearest Neighbors.
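A sketch of what such a "human learning" classifier might look like (the petal-length cutoffs below are illustrative assumptions, not necessarily the ones we used in class):

```python
# Classify an iris flower by petal length alone, using cutoffs chosen
# by simply eyeballing the data -- "human learning" rather than
# machine learning.
def classify_iris(petal_length_cm):
    if petal_length_cm < 2.5:
        return 'setosa'
    elif petal_length_cm < 4.8:
        return 'versicolor'
    else:
        return 'virginica'

print(classify_iris(1.4))  # a setosa-sized petal
```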

One of the course requirements is for students to create a data science project from start to finish. (Here are the project requirements and some past student projects.) Students are responsible for coming up with their own project question, and we encourage them to choose a project connected to their personal or professional interests. Because a lot of student learning occurs when they apply classroom knowledge to real-world data, we placed a heavy emphasis on the project, including milestone deadlines throughout the course. We found that on average, the resulting project quality was higher than in past courses.

Near the end of the course, we required each student to review the first draft of two of their peers' course projects. Although most students did not think they were competent enough to provide feedback, we found that it was a valuable supplement to the feedback that we (as instructors) provided, as well as useful practice for students in reading and analyzing the work of others.

This is certainly not an option in most classroom settings, but we decided not to grade any assignments or projects. Instead, we simply recorded whether the assignment was attempted, and used that as part of the rubric for whether to award a "certificate of completion" at the end of the course. The primary motivation for excluding grades is that the educational backgrounds of our students vary widely, and thus it would be unfair to compare the work of a student with significant programming experience with the work of a student lacking that experience. Instead, we focused on moving students forward (individually) by giving them lots of code and project comments, and deemphasized anything that might cause them to compare themselves to one another.

Although we limit the depth with which we teach any particular data science topic (in order to cover a broad range of topics during the course), there are always some students who want to go deeper into each topic. There is a bewildering array of available resources, and so it's easy for students to get lost trying to find material that builds upon in-class learning without being too advanced. So for each lesson, we provided a curated list of excellent resources for going deeper, and many students took advantage of these resources.

There is so much good data science content in video form, and some students learn better from videos than from text. As such, we included videos among the resources we provided to students, and occasionally showed videos during class. Specifically, we found a lot of high-quality videos in Hastie & Tibshirani's Statistical Learning course, and I'm currently reviewing Ng's Machine Learning course and Abu-Mostafa's Learning from Data course for more good video segments. As well, I've created my own videos for teaching purposes.

For most of our out-of-class communication, we used Slack, a web-based chat application. Our primary motivation was to lower the barrier for students to ask us questions, as well as allowing other students to see our answers to those questions. Chatting back-and-forth is a much more natural way (than emailing) to answer the complex questions that arise in the course, and it gives other students a way to chime in with their own thoughts. Slack also has a highly customizable notification system (email and otherwise) and a mobile app, so you can still keep track of the conversations without being logged in all the time. We were pleased to see that students also used Slack to share links with one another, such as interesting articles or local data-related events.

Because the Data Science class is taught in the evenings and nearly all of our students have day jobs, it is almost inevitable that students will miss at least a class or two, usually due to work or travel. We recorded a video of every single class using our laptops and posted it online the next day, which allowed students who missed class to quickly catch up. It was a relatively easy process, and the videos were even utilized by students who attended class but simply wanted to rewatch a particular section. (I'm considering writing a tutorial for how to record your own classroom; please comment below if you're interested!)

Scikit-learn is the most popular machine learning library for Python, and for good reason: it supports a wide variety of algorithms, has a consistent interface for accessing those algorithms, is thoughtfully designed, and has excellent documentation. Statsmodels is great for regression, and supports an R-like "formula interface" that is not available in scikit-learn, but is far more limited in terms of its general machine learning capabilities. We ended up teaching Statsmodels first because linear regression was the first algorithm we covered, and then covered scikit-learn (and emphasized scikit-learn a lot more). But because we taught Statsmodels first and spent a good amount of time on its interface, many students had trouble making the switch to how scikit-learn "thinks" and thus invested most of their project time in Statsmodels. That was unfortunate, because they eventually ran into the limitations of Statsmodels for machine learning. In the course I'm currently teaching, we are introducing scikit-learn first, mostly deemphasizing Statsmodels, and recommending that students invest most of their time in scikit-learn.
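For reference, the way scikit-learn "thinks" is a single consistent pattern: instantiate an estimator, fit it, predict with it. A minimal sketch with made-up data:

```python
from sklearn.linear_model import LinearRegression

# Every scikit-learn estimator follows the same fit/predict pattern,
# so swapping in a different algorithm only changes the import and
# the constructor call.
X = [[1], [2], [3], [4]]  # feature matrix: 4 samples, 1 feature
y = [2, 4, 6, 8]          # response vector

model = LinearRegression()
model.fit(X, y)                       # learn the coefficients
prediction = model.predict([[5]])[0]  # predict for a new observation
print(prediction)
```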

In my view, there are four ways for students to learn coding: watching in-class code walkthroughs, doing in-class exercises, doing homework assignments, and individual project work. We emphasized in-class walkthroughs over exercises because walkthroughs allow us to present high-quality, well-commented code that students can later reference, whereas meaningful in-class exercises can take up a lot of class time (that could otherwise be used for instruction). We also emphasized project work over homework assignments, because we didn't want homework to take away from project time (where we felt a lot of learning would occur). Unfortunately, I think the balance was not quite right: we underestimated the amount of practice students need coding on their own before applying things to their own project. As such, in my current course we are giving out more homework and spending more time on in-class exercises.

Because we didn't feel that we could cover databases and SQL in a "useful" depth without giving up a significant amount of class time, we decided not to include it in the curriculum at all. We also felt that the course was already too heavy in terms of "tools and languages you are required to learn," and thus it seemed acceptable to eliminate SQL from that list. Although that choice did not hinder any students from executing a successful course project, ultimately it ignores the reality that data scientists tend to extract a lot of their data from databases. As such, at least a passing familiarity with database types and SQL queries is important, and thus we are teaching it in my current course.

Although we initially planned to include regularization in the course content, we ultimately ran out of time because we put a higher priority on other course content. However, regularization is an important machine learning technique and is referenced frequently in the literature, and thus some exposure to regularization is useful. We are teaching it in my current course near the end, and presenting it along with some of the more advanced material.

Toy datasets and simplified data science problems are attractive to us as educators, since they allow us to focus the student learning on one small piece at a time. However, removing too much of the real-world complexity of data science can leave our students unprepared for their course project or for real-world data science. I did spend part of a class demonstrating how I work through a real-world data problem from scratch (using the Kaggle Avazu competition as an example), which students said was incredibly helpful, but they also wished it had been shown earlier in the course instead of at the end. In my current course, we are dedicating one class (halfway through the course) to students working a data problem on data they have never seen before, with the hope that it will help them to synthesize a lot of what we have taught up to that point.

(I'd love your feedback if you have tried any of these!)

I worry that by lecturing on a topic and then having students go deeper on that topic after class, we are missing an opportunity to have more engaging in-class discussions. I'm actively looking for good opportunities to try a flipped classroom approach, in which students read about a topic before class, and then we use the in-class time to discuss that topic in more depth and answer their questions. The challenge, of course, is finding the right material to give students before class: it has to be well-explained, at the right difficulty level, and cover the particular points we think are important.

We generally do very little pair programming in the course, even though it is known to be a useful way to improve your programming skills. My hesitation is that there is quite a large diversity in our class in terms of programming ability, and I fear that pair programming would not be useful for either person if there is a large imbalance within a pair.

Although Matplotlib is extremely popular for scientific visualization in Python, I find it vastly more difficult to figure out how to plot what I want to plot (and make it look presentable) in Matplotlib than I do in R's ggplot2. Thankfully, Pandas provides a simplified interface to plotting (that simply calls Matplotlib functions), and so we mostly teach students how to plot using Pandas. However, that approach substantially limits what you can plot, since the majority of Matplotlib functionality is not available via Pandas. As an alternative, I have considered learning (and then teaching) either ggplot for Python or Seaborn, both of which produce more attractive plots than Matplotlib and appear to offer a more intuitive interface.

We do teach students how to implement the most common type of plots in Python, and we explain the circumstances under which those particular plots might be helpful. However, we spend very little time teaching visualization at a conceptual level: the purpose of visualization, how to create an effective visualization, how to choose between different visualizations, etc. That is primarily because it's hard to condense such a rich topic into a single class period. However, I do worry that by teaching the implementation of plotting without a lot of theory, we are hindering students from being really effective in this area.

Generally speaking, lessons in our course are created by instructors assembling pieces of existing lessons, finding resources on the Internet that they think are useful, and creating their own original material. What the course lacks is a unified textbook, which might lend more consistency to how we present the material. If I were to use a book (or two) as a "course textbook", I would probably use An Introduction to Statistical Learning (ISL) and Python for Data Analysis (PDA). ISL covers machine learning in a thorough yet accessible manner, and PDA is great for going in-depth, especially on Pandas. However, it's not clear that it would be useful to require students to read a large portion of either book, nor that they would have time to read them!

I love Kaggle, and my hope when introducing students to Kaggle is that it will give them a focused way to practice their data science skills after the course ends. Ideally, I'd like to participate in a competition as a class since that would provide a single problem that we could all work on together. My main concern is that this would take away from homework time and project time. My lesser concern is that this would encourage students to compare their abilities to one another, which is something we actively try to deemphasize.

Although tidy data and reproducibility are not core data science topics, I do believe that it is worth considering each of those topics for inclusion in the curriculum. Knowing what tidy data looks like and how to tidy your data is a useful data science skill. As well, understanding the importance of reproducibility, and knowing how to make your work reproducible, is also useful in the real world.

Although the course covers both supervised and unsupervised learning, we don't cover other areas of machine learning such as reinforcement learning, active learning, and online learning. Tools for online learning such as Vowpal Wabbit are becoming more widely known, and online learning is an especially useful paradigm for thinking about machine learning, so I've considered teaching that particular topic near the end of the course.

(I'd love to hear your thoughts!)

Like most Python users, I continue to use (and teach) Python 2 because it is still being supported by the community, a lot of teaching resources are written for Python 2, and it's unclear whether the benefits of Python 3 outweigh the downside (no backwards compatibility). However, there will be a future time at which Python 3 will become the norm, and so I wonder when we should begin teaching it. Fortunately, the availability of key libraries is no longer an issue, because every library we teach in the course supports Python 3.

Although the IPython Notebook is very popular these days as an instructional tool, we generally teach Python coding within Spyder, a nice IDE that comes with the Anaconda distribution of Python. The IPython Notebook is certainly an excellent way to present code, narrative, and visualizations in a single document. However, when watching students actually write their code within a Notebook, I've found that the interface seems to encourage sloppy coding practices that make it hard to actually debug code. As well, version control is mostly useless with IPython Notebooks. So while I am hesitant to introduce it as an alternative environment for coding, it may be worth teaching simply because it's an excellent format for presenting a finished project.

Most of the students we teach have minimal programming experience, so getting their Python skills to the point where they can understand the basics of object-oriented programming (OOP) seems like a huge challenge. However, by not providing them with at least an introduction to OOP, they leave the course not being able to effectively read a lot of real-world Python code.

We teach a bare minimum of command line in the course, primarily because most students have minimal experience with the command line and thus it becomes "one more tool" that they need to learn. However, there are certainly times at which the command line is the easiest (or the only) way to accomplish a task, and there is a general expectation that data scientists will be somewhat familiar with the command line, and so I have debated including it more fully in the course.

"Big data" is obviously a hot topic in data science, and the majority of data science job postings do want some big data experience. We didn't spend any time on big data in the last session of the course, primarily because it's such a broad topic, but also because it's not clear what aspect of big data (or what big data tool) is suitable to teach as an introduction to the topic. We could certainly teach the basics of the MapReduce algorithm, but I tend to think that as an application-focused course, time is usually better spent on content that students can immediately apply without having to learn "yet another tool."

An ROC curve is the most commonly used way to **visualize the performance of a binary classifier**, and AUC is (arguably) the best way to **summarize its performance in a single number**. As such, gaining a deep understanding of ROC curves and AUC is beneficial for data scientists, machine learning practitioners, and medical researchers (among others).

The 14-minute video is embedded below, followed by the complete transcript (including graphics). **If you want to skip to a particular section in the video**, simply click one of the time codes listed in the transcript (such as 0:52).

I welcome your feedback and questions in the comments section!

P.S. Want more content like this in your inbox? Subscribe to the Data School newsletter.


(0:00) This video should help you to gain an intuitive understanding of ROC curves and Area Under the Curve, also known as AUC.

An ROC curve is a commonly used way to **visualize the performance of a binary classifier**, meaning a classifier with two possible output classes.

For example, let's pretend you built a classifier to predict whether a research paper will be admitted to a journal, based on a variety of factors. The features might be the length of the paper, the number of authors, the number of papers those authors have previously submitted to the journal, et cetera. The response (or "output variable") would be whether or not the paper was admitted.

(0:52) Let's first take a look at the bottom portion of this diagram, and ignore everything except the blue and red distributions. We'll pretend that **every blue and red pixel represents a paper** for which you want to predict the admission status. This is your validation (or "hold-out") set, so you know the true admission status of each paper. The 250 red pixels are the papers that were actually admitted, and the 250 blue pixels are the papers that were not admitted.

(1:32) Since this is your validation set, you want to judge how well your model is doing by comparing your model's predictions to the true admission statuses of those 500 papers. We'll assume that you used a classification method such as logistic regression that can not only make a **prediction** for each paper, but can also output a **predicted probability** of admission for each paper. These blue and red distributions are one way to visualize how those predicted probabilities compare to the true statuses.

(2:08) Let's examine this plot in detail. The x-axis represents your **predicted probabilities**, and the y-axis represents a **count of observations**, kind of like a histogram. Let's estimate that the height at 0.1 is 10 pixels. This plot tells you that there were 10 papers for which you predicted an admission probability of 0.1, and the true status for all 10 papers was negative (meaning not admitted). There were about 50 papers for which you predicted an admission probability of 0.3, and none of those 50 were admitted. There were about 20 papers for which you predicted a probability of 0.5, and half of those were admitted and the other half were not. There were 50 papers for which you predicted a probability of 0.7, and all of those were admitted. And so on.

(3:16) Based on this plot, you might say that your classifier is doing quite well, since it did a good job of **separating the classes**. To actually make your class predictions, you might set your **"threshold"** at 0.5, and classify everything above 0.5 as admitted and everything below 0.5 as not admitted, which is what most classification methods will do by default. With that threshold, your **accuracy rate** would be above 90%, which is probably very good.

(3:58) Now let's pretend that your classifier didn't do nearly as well and move the blue distribution. You can see that there is a lot more overlap here, and regardless of where you set your threshold, your classification accuracy will be much **lower** than before.

(4:19) Now let's talk about the ROC curve that you see here in the upper left. So, what is an ROC curve? It is a plot of the **True Positive Rate (on the y-axis)** versus the **False Positive Rate (on the x-axis)** for every possible classification threshold. As a reminder, the True Positive Rate answers the question, "When the actual classification is positive (meaning admitted), how often does the classifier predict positive?" The False Positive Rate answers the question, "When the actual classification is negative (meaning not admitted), how often does the classifier incorrectly predict positive?" Both the True Positive Rate and the False Positive Rate **range from 0 to 1**.

(5:15) To see how the ROC curve is actually generated, let's set some example thresholds for classifying a paper as admitted.

A threshold of 0.8 would classify 50 papers as admitted, and 450 papers as not admitted. The True Positive Rate would be the **red pixels to the right of the line divided by all red pixels**, or 50 divided by 250, which is 0.2. The False Positive Rate would be the **blue pixels to the right of the line divided by all blue pixels**, or 0 divided by 250, which is 0. Thus, we would plot a point at 0 on the x-axis, and 0.2 on the y-axis, which is right here.

(6:16) Let's set a different threshold of 0.5. That would classify 360 papers as admitted, and 140 papers as not admitted. The True Positive Rate would be 235 divided by 250, or 0.94. The False Positive Rate would be 125 divided by 250, or 0.5. Thus, we would plot a point at 0.5 on the x-axis, and 0.94 on the y-axis, which is right here.
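The two worked examples above follow directly from the definitions. A minimal sketch of the computation (the toy labels and predicted probabilities below are made up, not the ones from the diagram):

```python
def tpr_fpr(actuals, predicted_probs, threshold):
    """True Positive Rate and False Positive Rate at one threshold."""
    positives = [p for a, p in zip(actuals, predicted_probs) if a == 1]
    negatives = [p for a, p in zip(actuals, predicted_probs) if a == 0]
    # TPR: positives predicted positive, divided by all positives
    tpr = sum(p >= threshold for p in positives) / float(len(positives))
    # FPR: negatives predicted positive, divided by all negatives
    fpr = sum(p >= threshold for p in negatives) / float(len(negatives))
    return tpr, fpr

actuals = [1, 1, 1, 1, 0, 0, 0, 0]                # true classes
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]  # predicted probabilities
print(tpr_fpr(actuals, probs, 0.5))  # (0.75, 0.25)
```

Sweeping the threshold from 0 to 1 and plotting each (FPR, TPR) pair traces out the full ROC curve.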

(7:05) We've plotted two points, but to generate the entire ROC curve, all we have to do is to plot the True Positive Rate versus the False Positive Rate for all possible classification thresholds which range from 0 to 1. That is a huge benefit of using an ROC curve to evaluate a classifier instead of a simpler metric such as misclassification rate, in that **an ROC curve visualizes all possible classification thresholds, whereas misclassification rate only represents your error rate for a single threshold**. Note that you can't actually see the thresholds used to generate the ROC curve anywhere on the curve itself.

Now, let's move the blue distribution back to where it was before. Because the classifier is doing a very good job of separating the blues and the reds, I can set a threshold of 0.6, have a True Positive Rate of 0.8, and still have a False Positive Rate of 0.

(8:24) Therefore, a classifier that does a very **good job separating the classes** will have an ROC curve that hugs the upper left corner of the plot. Conversely, a classifier that does a very **poor job separating the classes** will have an ROC curve that is close to this black diagonal line. That line essentially represents a classifier that does no better than random guessing.

(8:55) Naturally, you might want to use the ROC curve to **quantify the performance of a classifier**, and give a higher score for this classifier than this classifier. That is the purpose of AUC, which stands for **Area Under the Curve**. AUC is literally just the percentage of this box that is under this curve. This classifier has an AUC of around 0.8, a very poor classifier has an AUC of around 0.5, and this classifier has an AUC of close to 1.

(9:45) There are two things I want to mention about this diagram. First, this diagram shows a case where your **classes are perfectly balanced**, which is why the sizes of the blue and red distributions are identical. In most real-world problems, this is not the case. For example, if only 10% of papers were admitted, the blue distribution would be nine times larger than the red distribution. However, that doesn't change how the ROC curve is generated.

A second note about this diagram is that it shows a case where your **predicted probabilities have a very smooth shape**, similar to a normal distribution. That was just for demonstration purposes. The probabilities output by your classifier will not necessarily follow any particular shape.

(10:40) To close, I want to add three other important notes. The first note is that the ROC curve and AUC are **insensitive to whether your predicted probabilities are properly calibrated** to actually represent probabilities of class membership. In other words, the ROC curve and the AUC would be identical even if your predicted probabilities ranged from 0.9 to 1 instead of 0 to 1, as long as the ordering of observations by predicted probability remained the same. All the AUC metric cares about is how well your classifier separated the two classes, and thus **it is said to only be sensitive to rank ordering**. You can think of AUC as representing the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation, and thus it is a **useful metric even for datasets with highly unbalanced classes**.
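That probabilistic interpretation gives a direct (if inefficient) way to compute AUC, and makes the rank-ordering property easy to verify. A sketch with made-up labels and scores:

```python
def auc(actuals, scores):
    """AUC as the probability that a randomly chosen positive is
    ranked above a randomly chosen negative (ties count half)."""
    pos = [s for a, s in zip(actuals, scores) if a == 1]
    neg = [s for a, s in zip(actuals, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

actuals = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(auc(actuals, scores))  # 0.875

# Squeezing the scores into the range 0.9 to 1 preserves their
# ordering, so the AUC is unchanged:
squeezed = [0.9 + s / 10 for s in scores]
print(auc(actuals, squeezed))  # 0.875
```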

(11:52) The second note is that ROC curves can be extended to **classification problems with three or more classes** using what is called a "one versus all" approach. That means if you have three classes, you would create three ROC curves. In the first curve, you would choose the first class as the positive class, and group the other two classes together as the negative class. In the second curve, you would choose the second class as the positive class, and group the other two classes together as the negative class. And so on.
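A sketch of the relabeling step in that "one versus all" approach (the class names here are just illustrative):

```python
def one_vs_all(labels, positive_class):
    # The chosen class becomes 1 (positive); all others become 0,
    # giving a binary problem for which an ROC curve can be drawn.
    return [1 if label == positive_class else 0 for label in labels]

labels = ['setosa', 'versicolor', 'virginica', 'setosa']
print(one_vs_all(labels, 'setosa'))      # [1, 0, 0, 1]
print(one_vs_all(labels, 'versicolor'))  # [0, 1, 0, 0]
```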

(12:30) Finally, you might be wondering how you should set your classification threshold, once you are ready to use it to predict out-of-sample data. That's actually more of a **business decision**, in that you have to decide whether you would rather **minimize your False Positive Rate or maximize your True Positive Rate**. In our journal example, it's not obvious what you should do. But let's say your classifier was being used to predict whether a given credit card transaction might be fraudulent and thus should be reviewed by the credit card holder. The business decision might be to set the threshold very low. That will result in a lot of false positives, but that might be considered acceptable because it would maximize the true positive rate and thus minimize the number of cases in which a real instance of fraud was not flagged for review.

(13:34) In the end, you will always have to choose a classification threshold, but the ROC curve will help you to visually understand the impact of that choice.

Thanks very much to Navan for creating this excellent visualization. Below this video, I've linked to it as well as a very readable paper that provides a much more in-depth treatment of ROC curves. I also welcome your questions in the comments.

I created this guide to linear regression a while ago, after reading Hastie and Tibshirani's excellent An Introduction to Statistical Learning (with Applications in R). Now that I'm a Data Science instructor for General Assembly, I've made a personal commitment to sharing these guides so that my students and others can benefit from them.

Please note that this is not a tutorial, and is not suitable for teaching you linear regression if you are not already familiar with it. Instead, it is only intended to be a **light reference guide to applying linear regression and interpreting the output**, and ignores many nuances of the topic. However, I have listed resources for deepening your understanding (and applying it to R, Python, and other statistical packages) at the bottom of this post.

Your feedback and clarifications are welcome!

- Estimate B0 (intercept) and B1 (slope) based on least squares
- "Residuals" are the discrepancies between the actual and predicted y values
- The sum of the squared residuals for a given model is the "residual sum of squares" (RSS)
- Least squares line minimizes RSS
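To make this concrete, here's a minimal NumPy sketch using made-up data (the numbers are arbitrary, chosen so that y is roughly 2 + 3x):

```python
import numpy as np

# Made-up data: y is roughly 2 + 3*x plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Least squares estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals are the discrepancies between actual and predicted y values
predicted = b0 + b1 * x
residuals = y - predicted

# The least squares line minimizes the residual sum of squares (RSS)
rss = np.sum(residuals ** 2)
```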

- How much would B1 vary under repeated sampling? (Thus, how "accurate" is it?)
- Calculate the "standard error" (SE) of B1; the 95% "confidence interval" is approximately B1 +- 2*SE(B1)
- Interpretation: If you sampled the data 100 times, 95% of those confidence intervals would contain the "true" B1
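Here's a minimal NumPy sketch (using made-up data where the slope is roughly 3) of the standard error of B1 and the approximate 95% confidence interval:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Least squares estimates and RSS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rss = np.sum((y - (b0 + b1 * x)) ** 2)

n = len(x)
residual_variance = rss / (n - 2)   # estimate of the error variance
se_b1 = np.sqrt(residual_variance / np.sum((x - x.mean()) ** 2))

# Approximate 95% confidence interval: B1 +- 2*SE(B1)
ci_lower, ci_upper = b1 - 2 * se_b1, b1 + 2 * se_b1
```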

- Null hypothesis: x and y are not related (thus B1=0)
- Alternative hypothesis: there is some relationship between x and y (thus B1 != 0)
- "t-statistic" = B1 / SE(B1) = number of standard deviations that B1 is from zero
- A larger absolute t-statistic (greater than about 2) is stronger evidence that there is a relationship

- "p-value" is the probability of observing a t-statistic at least this large, assuming the null hypothesis is true (i.e., assuming there is no relationship)
- Lower p-value (less than 0.05) is stronger evidence of a relationship
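SciPy's `linregress` reports the slope, its standard error, and the p-value for the null hypothesis that the slope is zero, so the t-statistic falls out directly (made-up data again):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

result = stats.linregress(x, y)

# t-statistic = B1 / SE(B1); the p-value tests the null hypothesis B1 = 0
t_statistic = result.slope / result.stderr
p_value = result.pvalue
```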

- "Residual standard error" (RSE) is computed using RSS
- "Large" RSE indicates a poor fit, though RSE is measured in units of y (so "large" is relative to the scale of y)

- "R-squared" (R^2) is proportion of variability in y that can be explained using x
- Ranges from 0 to 1
- 0.75 means the fitted model reduced the error by 75% relative to the null model
- Higher R^2 indicates a stronger relationship between x and y
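A minimal NumPy sketch (made-up data) showing how RSE and R^2 both fall out of the RSS:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares (error of the null model)

rse = np.sqrt(rss / (len(x) - 2))        # residual standard error, in units of y
r_squared = 1 - rss / tss                # proportion of variability in y explained by x
```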

- Involves more than 1 predictor, thus has more than 1 slope coefficient
- Still estimate B0, B1, B2, etc. by minimizing RSS

- Null hypothesis: B1 = B2 = etc. = 0
- Compute F-statistic: will be close to 1 when null hypothesis is true, and much larger than 1 when null hypothesis is false
- Even if the p-value for an individual coefficient is small, you still need to check F-statistic for the entire model (especially when the number of predictors is large)
- When n (number of observations) is large, F-statistic does not have to be particularly large to reject the null hypothesis
- When n is small, larger F-statistic is required to reject the null hypothesis
- Examine p-value for F-statistic to help you decide whether to reject the null hypothesis
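As a sketch with simulated data (where a real relationship exists, so the F-statistic should be much larger than 1), multiple regression coefficients and the overall F-statistic can be computed in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X_predictors = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_predictors[:, 0] - 3.0 * X_predictors[:, 1] + rng.normal(size=n)

# Fit multiple linear regression by minimizing RSS (least squares)
X = np.column_stack([np.ones(n), X_predictors])   # add intercept column
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ coefs) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for the null hypothesis B1 = B2 = 0
f_statistic = ((tss - rss) / p) / (rss / (n - p - 1))
```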

- Use R^2 and RSE
- "Large" increase in R^2 when adding a variable to the model is evidence that you should keep it in the model
- "Small" increase in R^2 when adding a variable to the model is evidence that you can leave it out

- Create dummy variable(s): one fewer dummy variable than the number of levels
- Example with three levels: intercept coefficient (B0) represents the "baseline" (average response for the first level), B1 represents the difference between the second level and the baseline, and B2 represents the difference between the third level and the baseline
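In pandas, `get_dummies` with `drop_first=True` does exactly this (a hypothetical "color" predictor with three levels):

```python
import pandas as pd

# Hypothetical categorical predictor with three levels
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# drop_first=True creates one fewer dummy than the number of levels;
# the dropped level ("blue", first alphabetically) becomes the baseline
dummies = pd.get_dummies(df["color"], drop_first=True)
```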

- Add interaction terms and examine the p-value for those terms
- Also check whether R^2 for the model with interactions is better than one without
- If you add an interaction term, also include the "main effects" (even if their individual p-values don't justify it)
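A minimal sketch with simulated data: the interaction term is just the product of the two predictors, included alongside the main effects (a full analysis would also examine the term's p-value, e.g. via statsmodels):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated data with a genuine interaction between x1 and x2
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rng.normal(size=n)

# Include the main effects (x1, x2) alongside the interaction term x1*x2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
# coefs[3] estimates the interaction effect
```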

- Plot residuals versus fitted y values (multiple linear regression) or residuals versus x (simple linear regression): pattern indicates non-linearity
- Try using non-linear transformations of predictors in the model: ln(x), sqrt(x), x^2
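A sketch with simulated data where the true relationship is quadratic: adding the non-linear transformation x^2 as a predictor dramatically reduces the RSS compared to a straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=100)
y = 1 + 0.5 * x ** 2 + rng.normal(size=100)   # the true relationship is quadratic

# A straight-line fit leaves a systematic pattern in the residuals;
# adding x^2 as a predictor captures the curvature
X_linear = np.column_stack([np.ones_like(x), x])
X_quadratic = np.column_stack([np.ones_like(x), x, x ** 2])

coefs_linear, _, _, _ = np.linalg.lstsq(X_linear, y, rcond=None)
coefs_quadratic, _, _, _ = np.linalg.lstsq(X_quadratic, y, rcond=None)

rss_linear = np.sum((y - X_linear @ coefs_linear) ** 2)
rss_quadratic = np.sum((y - X_quadratic @ coefs_quadratic) ** 2)
```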

- Indicated by funnel shape in residual plot
- Try transforming y using a concave function: ln(y), sqrt(y)
- Or try using weighted least squares
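A sketch of weighted least squares with simulated data whose noise grows with x (the funnel shape): weighting each observation by the inverse of its assumed variance, implemented by rescaling rows and running ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, size=n)
# Noise standard deviation grows with x -> funnel-shaped residual plot
y = 2 * x + rng.normal(scale=x)

X = np.column_stack([np.ones(n), x])

# Weighted least squares: weight each observation by 1/variance
# (here assumed proportional to 1/x^2), via row rescaling + ordinary least squares
weights = 1 / x ** 2
sqrt_w = np.sqrt(weights)
coefs_wls, _, _, _ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
# coefs_wls[1] estimates the slope (true value is 2)
```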

- Plot studentized residuals: greater than 3 is an outlier
- Try removing the observation from the dataset
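A sketch of (internally) studentized residuals with simulated data and one injected outlier: each residual is divided by its estimated standard deviation, which involves the RSE and the observation's leverage:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
y[0] += 8.0                       # inject one clear outlier

X = np.column_stack([np.ones(n), x])
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coefs

# Studentized residual = residual / (RSE * sqrt(1 - leverage));
# values greater than 3 in absolute value flag potential outliers
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
studentized = residuals / (rse * np.sqrt(1 - leverage))
```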

- Simple linear regression: look for observations for which x is outside the normal range
- Multiple linear regression: compute leverage statistics - values well above the average of (p+1)/n indicate high leverage (the maximum possible value is 1)
- Try removing the observation from the dataset
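A minimal NumPy sketch with simulated data: leverage statistics are the diagonal of the "hat" matrix H = X (X'X)^-1 X', and an observation with x far outside the normal range gets a much higher leverage than the rest:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 10.0    # one observation far outside the normal x range

# Leverage statistics are the diagonal of the hat matrix H = X (X'X)^-1 X';
# they always sum to the number of model parameters
hat_matrix = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat_matrix)
```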

- Exists whenever there is a correlation between two or more predictors
- Detect pairs of highly correlated variables by examining the correlation matrix for high absolute values
- Detect multicollinearity (three or more correlated variables) by computing the variance inflation factor (VIF) for each predictor
- Minimum VIF is 1
- VIF greater than 5 or 10 indicates problematic amount of collinearity

- Try removing one of the correlated predictors from the model, or combining them into a single predictor
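A sketch of the VIF computation with simulated data (x2 is nearly a copy of x1, x3 is independent): the VIF for each predictor is 1/(1 - R^2) from regressing that predictor on all the others:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1 -> collinear
x3 = rng.normal(size=n)              # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress predictor j on all the other predictors (plus an intercept);
    # VIF = 1 / (1 - R^2) of that regression, which equals tss/rss
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coefs, _, _, _ = np.linalg.lstsq(others, target, rcond=None)
    rss = np.sum((target - others @ coefs) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return tss / rss
```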

- This guide is largely adapted from Chapter 3 of An Introduction to Statistical Learning, a book that I highly recommend to any newcomers to statistical learning/machine learning (and which is available as a free PDF download). There are also 15 hours of videos associated with the book, as well as a wealth of R code included in the book.
- I created a substantial IPython Notebook introducing linear regression in Python.
- Dr. Robert Nau (Duke University) has a highly readable and practical guide to linear regression, split across a dozen medium-length posts.
- The Yhat Blog has a concise guide to fitting linear models in R and interpreting R's output.
- The DataRobot Blog has a similar guide for using statsmodels in Python, with one post on simple linear regression and another on multiple linear regression.
- UCLA's Institute for Digital Research and Education has a set of guides to assist you in interpreting the regression output from Stata, SAS, SPSS, and Mplus.

**If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover-to-cover** to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors' website.

If you decide to attempt the exercises at the end of each chapter, there is a GitHub repository of solutions provided by students you can use to check your work.

**As a supplement to the textbook, you may also want to watch the excellent course lecture videos** (linked below), in which Dr. Hastie and Dr. Tibshirani discuss much of the material. In case you want to browse the lecture content, I've also linked to the PDF slides used in the videos.

P.S. Want to learn how to do **machine learning in Python**? I have a 4-hour introductory video series (with IPython notebooks), and I teach an online course!

In-depth intro to #machinelearning in 15 hours of video by the authors of "Elements of Statistical Learning": http://t.co/Id0ea1UTV0

— Kevin Markham (@justmarkham) September 3, 2014

- Opening Remarks and Examples (18:18)
- Supervised and Unsupervised Learning (12:12)

- Statistical Learning and Regression (11:41)
- Curse of Dimensionality and Parametric Models (11:40)
- Assessing Model Accuracy and Bias-Variance Trade-off (10:04)
- Classification Problems and K-Nearest Neighbors (15:37)
- Lab: Introduction to R (14:12)

- Simple Linear Regression and Confidence Intervals (13:01)
- Hypothesis Testing (8:24)
- Multiple Linear Regression and Interpreting Regression Coefficients (15:38)
- Model Selection and Qualitative Predictors (14:51)
- Interactions and Nonlinearity (14:16)
- Lab: Linear Regression (22:10)

- Introduction to Classification (10:25)
- Logistic Regression and Maximum Likelihood (9:07)
- Multivariate Logistic Regression and Confounding (9:53)
- Case-Control Sampling and Multiclass Logistic Regression (7:28)
- Linear Discriminant Analysis and Bayes Theorem (7:12)
- Univariate Linear Discriminant Analysis (7:37)
- Multivariate Linear Discriminant Analysis and ROC Curves (17:42)
- Quadratic Discriminant Analysis and Naive Bayes (10:07)
- Lab: Logistic Regression (10:14)
- Lab: Linear Discriminant Analysis (8:22)
- Lab: K-Nearest Neighbors (5:01)

- Estimating Prediction Error and Validation Set Approach (14:01)
- K-fold Cross-Validation (13:33)
- Cross-Validation: The Right and Wrong Ways (10:07)
- The Bootstrap (11:29)
- More on the Bootstrap (14:35)
- Lab: Cross-Validation (11:21)
- Lab: The Bootstrap (7:40)

- Linear Model Selection and Best Subset Selection (13:44)
- Forward Stepwise Selection (12:26)
- Backward Stepwise Selection (5:26)
- Estimating Test Error Using Mallow's Cp, AIC, BIC, Adjusted R-squared (14:06)
- Estimating Test Error Using Cross-Validation (8:43)
- Shrinkage Methods and Ridge Regression (12:37)
- The Lasso (15:21)
- Tuning Parameter Selection for Ridge Regression and Lasso (5:27)
- Dimension Reduction (4:45)
- Principal Components Regression and Partial Least Squares (15:48)
- Lab: Best Subset Selection (10:36)
- Lab: Forward Stepwise Selection and Model Selection Using Validation Set (10:32)
- Lab: Model Selection Using Cross-Validation (5:32)
- Lab: Ridge Regression and Lasso (16:34)

- Polynomial Regression and Step Functions (14:59)
- Piecewise Polynomials and Splines (13:13)
- Smoothing Splines (10:10)
- Local Regression and Generalized Additive Models (10:45)
- Lab: Polynomials (21:11)
- Lab: Splines and Generalized Additive Models (12:15)

- Decision Trees (14:37)
- Pruning a Decision Tree (11:45)
- Classification Trees and Comparison with Linear Models (11:00)
- Bootstrap Aggregation (Bagging) and Random Forests (13:45)
- Boosting and Variable Importance (12:03)
- Lab: Decision Trees (10:13)
- Lab: Random Forests and Boosting (15:35)

- Maximal Margin Classifier (11:35)
- Support Vector Classifier (8:04)
- Kernels and Support Vector Machines (15:04)
- Example and Comparison with Logistic Regression (14:47)
- Lab: Support Vector Machine for Classification (10:13)
- Lab: Nonlinear Support Vector Machine (7:54)

- Unsupervised Learning and Principal Components Analysis (12:37)
- Exploring Principal Components Analysis and Proportion of Variance Explained (17:39)
- K-means Clustering (17:17)
- Hierarchical Clustering (14:45)
- Breast Cancer Example of Hierarchical Clustering (9:24)
- Lab: Principal Components Analysis (6:28)
- Lab: K-means Clustering (6:31)
- Lab: Hierarchical Clustering (6:33)

- Interview with John Chambers (10:20)
- Interview with Bradley Efron (12:08)
- Interview with Jerome Friedman (10:29)
- Interviews with statistics graduate students (7:44)