January 17, 2015

Lessons learned from teaching an 11-week data science course

Last month, I finished teaching General Assembly's 11-week Data Science course in Washington, DC. It's a substantial introductory data science course that covers the entire "data science pipeline": getting and cleaning data, data exploration and analysis, machine learning, visualization, and communicating results. The course includes 66 hours of classroom instruction (twice per week for three hours), as well as each student completing a course project of their choosing. It is one of General Assembly's most challenging courses, and is primarily taken by adults looking to advance their careers.

The session ending in December was my third time through the Data Science course, as I had previously taken the course (as a student) and had also been an "Expert in Residence" (teaching assistant) for the course. Since I have already begun teaching a fourth session of the course, I've been giving a lot of thought to what worked and what didn't during the past three sessions so that I can make the current course even better for my students. (My course materials can be found on GitHub, both for the third session and the fourth session.)

Below are my findings from the past three sessions, based on explicit and implicit feedback from the students, student performance in each area, and conversations with my excellent co-instructor from the last session, Josiah Davis. Although this post is most useful to instructors specifically teaching General Assembly's Data Science course, my hope is that every data science educator will find value in some of the lessons below.

I'd love to hear your comments below the post, and I look forward to an engaging dialogue about data science education!

P.S. I wrote a follow-up to this post: Should you teach Python or R for data science?

Things that worked well
Changes I'm making
Things I'd like to try
Changes I'm debating

Things that worked well

Teaching Python as the sole language

Some latitude is given to General Assembly instructors as to how to deliver the curriculum, including the choice of language(s) in which to teach the core content. In the first two sessions in DC, the course was taught in both R and Python. In the most recent session, we taught the course in Python only, and I found that to be a wise decision. Because the vast majority of students who take the Data Science course (at least in DC) are relatively inexperienced at programming, reaching a baseline proficiency in two languages in 11 weeks is unrealistic for most of our students. When choosing between R and Python, we ultimately chose Python because of its extensibility, but I believe that R could have been just as good a choice.

Using the Anaconda distribution of Python

Although it's not a strict requirement, we strongly encourage all of our students to use the Anaconda distribution of Python. It includes almost every Python package that we use in the course, and removes most of the headaches (especially with Windows) associated with installing and upgrading packages. It also includes a nice IDE that we use for teaching (Spyder), as well as IPython and the IPython Notebook. Students occasionally run into configuration problems with Anaconda, but by and large it provides an easy path to getting up and running quickly with most of the necessary tools.

More concepts than math

Undoubtedly, gaining a deep understanding of every machine learning technique that we teach would require students to have a thorough understanding of probability, statistics, linear algebra, calculus, and more. The vast majority of our students do not have that background, nor do we have the time to teach the in-depth mathematical theory behind each technique. Instead, we have found that when we teach the material primarily at a conceptual level, students can obtain a practical understanding of how an algorithm works, its strengths and weaknesses, and how to properly apply it. We do include some of the math behind each algorithm, and we occasionally derive formulas, but that is not the focus of our lectures. We point students toward additional resources if they want to go deeper into the math, but most General Assembly students (and adult learners in general) are focused on learning things that they can immediately apply to their own work.

Teaching APIs and web scraping

Although the simplest approach to acquiring data (and the most expeditious for teaching purposes) is to download it from the web, we spent an entire class demonstrating how (and why) to gather data using APIs and web scraping. Not only did this content turn out to be very useful for students, since many of them incorporated data acquired this way into their final projects, but it also broadened their view of where data can be found.

Teaching Git in the context of the GitHub workflow

I enjoy teaching Git (and even created an introductory video series), but we considered excluding it from the curriculum in order to make room for other "core" data science content. I'm glad that we decided to include it, because version control is an important and marketable data science skill. In combination with GitHub, it allows you to put your project portfolio online (as repositories), as well as contribute to open source projects. We gave it an entire class period (3 hours), which I have found is the minimum instructional time required to take a complete Git novice and provide them with a "functional understanding" of Git, including the ability to clone and fork repositories and contribute on GitHub using pull requests. I've specifically found that teaching Git in the context of the GitHub workflow is the fastest path to student comprehension, though branches are way too complicated for most novices to quickly grasp.

Using a GitHub repository for homework submissions

Because we want students to get a lot of practice using Git in a collaborative environment, we set up a GitHub repository for student work, and required that they use pull requests to submit their homework and course project. We found that this was valuable practice for them, though it was occasionally painful for us (as instructors) to deal with their merge conflicts. As well, GitHub gives us an easy way to comment on student code, since you can do inline comments in pull requests (which are automatically emailed to the student).

Much more Pandas than Numpy

Pandas is an excellent Python library for data exploration, cleaning, analysis, and visualization. New features are being released all the time, the documentation is well-written and thorough, and there is an excellent book about Pandas by its creator. It sits on the huge shoulders of Numpy, Matplotlib, and other libraries, but makes it significantly easier to work with real-world data. Because Pandas duplicates a lot of Numpy functionality (but is much more convenient than Numpy when working with multiple data types), we only taught the functionality in Numpy that can't easily be done in Pandas. In fact, I've found that with each new release of Pandas, there are more operations you no longer need to do directly in Numpy.

Including Natural Language Processing in the curriculum

Although some might argue that Natural Language Processing (NLP) is not a "core" data science topic, I was glad that we spent a class on NLP. So much of the information out there is stored in textual format, and understanding NLP significantly broadens your scope of what "data" looks like. And inevitably, many students want to use text-based data in their final project, so this class empowers them to do exactly that.

Starting class with a motivating example

As often as possible, we introduced each topic using a "motivating example", meaning an example of a problem that could be solved by the topic we are about to teach. For example, before presenting an overview of machine learning, we explored the famous iris data set and then wrote a simple algorithm in Python for classifying each flower in the data set. That exercise in "human learning" served as the motivation for exploring machine learning as well as our first classification algorithm, K-Nearest Neighbors.
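To make the contrast concrete, here is a minimal sketch of what such an exercise might look like, assuming Python 3.8+ and only the standard library. The measurements below are made-up toy values (not rows from the real iris data set), and the function names and thresholds are hypothetical. It shows a hand-written rule next to a tiny K-Nearest Neighbors classifier that reaches the same answer by looking at labeled examples.

```python
from collections import Counter
from math import dist

# Toy (petal length, petal width) measurements with made-up values;
# these are NOT rows from the real iris data set.
train = [
    ((1.4, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"),
    ((4.7, 1.4), "versicolor"),
    ((5.9, 2.1), "virginica"),
    ((6.1, 2.3), "virginica"),
]


def rule_based(petal_length, petal_width):
    """'Human learning': thresholds picked by eyeballing the data."""
    if petal_length < 2.5:
        return "setosa"
    if petal_length < 5.0:
        return "versicolor"
    return "virginica"


def knn_predict(point, k=3):
    """Return the majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda row: dist(row[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


print(rule_based(4.6, 1.4))
print(knn_predict((4.6, 1.4)))
```

The rule has to be rewritten by hand whenever the data changes, whereas the K-Nearest Neighbors function only needs new labeled examples, which is the point the motivating example is meant to make.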

Placing a large emphasis on the course project

One of the course requirements is for students to create a data science project from start to finish. (Here are the project requirements.) Students are responsible for coming up with their own project question, and we encourage them to choose a project connected to their personal or professional interests. Because a lot of student learning occurs when they apply classroom knowledge to real-world data, we placed a heavy emphasis on the project, including milestone deadlines throughout the course. We found that on average, the resulting project quality was higher than in past courses.

Requiring students to review their peers' projects

Near the end of the course, we required each student to review the first draft of two of their peers' course projects. Although most students did not think they were competent enough to provide feedback, we found that it was a valuable supplement to the feedback that we (as instructors) provided, as well as useful practice for students in reading and analyzing the work of others.

No grades

This is certainly not an option in most classroom settings, but we decided not to grade any assignments or projects. Instead, we simply recorded whether the assignment was attempted, and used that as part of the rubric for whether to award a "certificate of completion" at the end of the course. The primary motivation for excluding grades is that the educational backgrounds of our students vary widely, and thus it would be unfair to compare the work of a student with significant programming experience with the work of a student lacking that experience. Instead, we focused on moving students forward (individually) by giving them lots of code and project comments, and deemphasized anything that might cause them to compare themselves to one another.

Providing students with resources to go deeper into the material

Although we limit the depth with which we teach any particular data science topic (in order to cover a broad range of topics during the course), there are always some students who want to go deeper into each topic. There is a bewildering array of available resources, and so it's easy for students to get lost trying to find material that builds upon in-class learning without being too advanced. So for each lesson, we provided a curated list of excellent resources for going deeper, and many students took advantage of these resources.

Using videos as teaching tools

There is so much good data science content in video form, and some students learn better from videos than from text. As such, we included videos among the resources we provided to students, and occasionally showed videos during class. Specifically, we found a lot of high-quality videos in Hastie & Tibshirani's Statistical Learning course, and I'm currently reviewing Ng's Machine Learning course and Abu-Mostafa's Learning from Data course for more good video segments. As well, I've created my own videos for teaching purposes.

Using a chat application for out-of-class communication

For most of our out-of-class communication, we used Slack, a web-based chat application. Our primary motivation was to lower the barrier for students to ask us questions, as well as allowing other students to see our answers to those questions. Chatting back-and-forth is a much more natural way (than emailing) to answer the complex questions that arise in the course, and it gives other students a way to chime in with their own thoughts. Slack also has a highly customizable notification system (email and otherwise) and a mobile app, so you can still keep track of the conversations without being logged in all the time. We were pleased to see that students also used Slack to share links with one another, such as interesting articles or local data-related events.

Recording a video of class

Because the Data Science class is taught in the evenings and nearly all of our students have day jobs, it is almost inevitable that students will miss at least a class or two, usually due to work or travel. We recorded a video of every single class using our laptops and posted it online the next day, which allowed students who missed class to quickly catch up. It was a relatively easy process, and the videos were even utilized by students who attended class but simply wanted to rewatch a particular section. (I'm considering writing a tutorial for how to record your own classroom; please comment below if you're interested!)

Changes I'm making

Teaching more scikit-learn and less Statsmodels

Scikit-learn is the most popular machine learning library for Python, and for good reason: it supports a wide variety of algorithms, has a consistent interface for accessing those algorithms, is thoughtfully designed, and has excellent documentation. Statsmodels is great for regression, and supports an R-like "formula interface" that is not available in scikit-learn, but is far more limited in terms of its general machine learning capabilities. We ended up teaching Statsmodels first because linear regression was the first algorithm we covered, and then covered scikit-learn (and emphasized scikit-learn a lot more). But because we taught Statsmodels first and spent a good amount of time on its interface, many students had trouble making the switch to how scikit-learn "thinks" and thus invested most of their project time in Statsmodels. That was unfortunate, because they eventually ran into the limitations of Statsmodels for machine learning. In the course I'm currently teaching, we are introducing scikit-learn first and mostly deemphasizing Statsmodels, and recommending to students that they should invest most of their time in scikit-learn.
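As a rough illustration of that "consistent interface", the sketch below fits two different classifiers with exactly the same calls. It assumes a reasonably recent scikit-learn release (one that provides `sklearn.model_selection`), and the choice of models, `random_state`, and `max_iter` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris features and labels and hold out part of the data for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Two unrelated algorithms, driven by the same fit/score calls
models = [KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)]
scores = {}
for model in models:
    model.fit(X_train, y_train)  # learn from the training data
    scores[type(model).__name__] = model.score(X_test, y_test)  # accuracy on held-out data

print(scores)
```

Swapping in another estimator usually means changing only the line that constructs the model, which is much of what makes the library approachable for novices.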

More in-class coding and more homework assignments

In my view, there are four ways for students to learn coding: watching in-class code walkthroughs, doing in-class exercises, doing homework assignments, and individual project work. We emphasized in-class walkthroughs over exercises because walkthroughs allow us to present high-quality, well-commented code that students can later reference, whereas meaningful in-class exercises can take up a lot of class time (that could otherwise be used for instruction). We also emphasized project work over homework assignments, because we didn't want homework to take away from project time (where we felt a lot of learning would occur). Unfortunately, I think the balance was not quite right: we underestimated the amount of practice students need coding on their own before applying things to their own project. As such, in my current course we are giving out more homework and spending more time on in-class exercises.

Including databases in the curriculum

Because we didn't feel that we could cover databases and SQL in a "useful" depth without giving up a significant amount of class time, we decided not to include it in the curriculum at all. We also felt that the course was already too heavy in terms of "tools and languages you are required to learn," and thus it seemed acceptable to eliminate SQL from that list. Although that choice did not hinder any students from executing a successful course project, ultimately it ignores the reality that data scientists tend to extract a lot of their data from databases. As such, at least a passing familiarity with database types and SQL queries is important, and thus we are teaching it in my current course.

Including regularization in the curriculum

Although we initially planned to include regularization in the course content, we ultimately ran out of time because we put a higher priority on other course content. However, regularization is an important machine learning technique and is referenced frequently in the literature, and thus some exposure to regularization is useful. We are teaching it in my current course near the end, and presenting it along with some of the more advanced material.
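For readers who have not seen regularization before, here is a minimal sketch of one common form, an L2 (ridge) penalty, using only the Python 3 standard library. The data values and the function name `ridge_slope` are made up for illustration, and the model is deliberately simplified to a single slope with no intercept.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w that minimizes sum((y - w*x)**2) + alpha * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sxy / (sxx + alpha)
    return sxy / (sxx + alpha)


# Made-up data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the fitted slope toward zero
for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, round(ridge_slope(xs, ys, alpha), 3))
```

With `alpha` set to zero this is ordinary least squares; increasing `alpha` shrinks the coefficient, trading a little bias for lower variance.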

Spending more time on real-world data problems

Toy datasets and simplified data science problems are attractive to us as educators, since they allow us to focus the student learning on one small piece at a time. However, removing too much of the real-world complexity of data science can leave our students unprepared for their course project or for real-world data science. I did spend part of a class demonstrating how I work a real-world data problem from scratch (using the Kaggle Avazu competition as an example), which students said was incredibly helpful, but they also wished it had been shown earlier in the course instead of at the end. In my current course, we are dedicating one class (halfway through the course) to students working a data problem on data they have never seen before, with the hope that it will help them to synthesize a lot of what we have taught up to that point.

Things I'd like to try

(I'd love your feedback if you have tried any of these!)

Flipped classroom

I worry that by lecturing on a topic and then having students go deeper on that topic after class, we are missing an opportunity to have more engaging in-class discussions. I'm actively looking for good opportunities to try a flipped classroom approach, in which students read about a topic before class, and then we use the in-class time to discuss that topic in more depth and answer their questions. The challenge, of course, is finding the right material to give students before class: it has to be well-explained, at the right difficulty level, and cover the particular points we think are important.

Pair programming

We generally do very little pair programming in the course, even though it is known to be a useful way to improve your programming skills. My hesitance is that there is quite a large diversity in our class in terms of programming ability, and I fear that pair programming would not be useful for either person if there is a large imbalance within the pairs.

Teaching plotting using ggplot or Seaborn instead of Matplotlib

Although Matplotlib is extremely popular for scientific visualization in Python, I find it vastly more difficult to figure out how to plot what I want to plot (and make it look presentable) in Matplotlib than I do in R's ggplot2. Thankfully, Pandas provides a simplified interface to plotting (that simply calls Matplotlib functions), and so we mostly teach students how to plot using Pandas. However, that approach substantially limits what you can plot, since the majority of Matplotlib functionality is not available via Pandas. As an alternative, I have considered learning (and then teaching) either ggplot for Python or Seaborn, both of which produce more attractive plots than Matplotlib and appear to offer a more intuitive interface.

Teaching visualization at a conceptual level

We do teach students how to implement the most common types of plots in Python, and we explain the circumstances under which those particular plots might be helpful. However, we spend very little time teaching visualization at a conceptual level: the purpose of visualization, how to create an effective visualization, how to choose between different visualizations, etc. That is primarily because it's hard to condense such a rich topic into a single class period. However, I do worry that by teaching the implementation of plotting without a lot of theory, we are hindering students from being really effective in this area.

Using a course textbook

Generally speaking, lessons in our course are created by instructors assembling pieces of existing lessons, finding resources on the Internet that they think are useful, and creating their own original material. What the course lacks is a unified textbook, which may provide more consistency to how we present the material. If I were to use a book (or two) as a "course textbook", I would probably use An Introduction to Statistical Learning (ISL) and Python for Data Analysis (PDA). ISL covers machine learning in a thorough yet accessible manner, and PDA is great for going in-depth, especially on Pandas. However, it's not clear to me that it's useful to require students to read a large portion of either book, nor is it clear that students would have time to read them!

Participating in a Kaggle competition as a class

I love Kaggle, and my hope when introducing students to Kaggle is that it will give them a focused way to practice their data science skills after the course ends. Ideally, I'd like to participate in a competition as a class since that would provide a single problem that we could all work on together. My main concern is that this would take away from homework time and project time. My lesser concern is that this would encourage students to compare their abilities to one another, which is something we actively try to deemphasize.

Including tidy data and reproducibility in the curriculum

Although tidy data and reproducibility are not core data science topics, I do believe that it is worth considering each of those topics for inclusion in the curriculum. Knowing what tidy data looks like and how to tidy your data is a useful data science skill. As well, understanding the importance of reproducibility, and knowing how to make your work reproducible, is also useful in the real world.
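As a small illustration of what "tidying" can mean in practice, the sketch below reshapes a wide table (one column per year) into a long one where each row is a single observation and each column is a single variable. It assumes Pandas is installed; the table contents and the column names `city`, `year`, and `sales` are made up.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is spread across column headers
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2013": [10, 20],
    "2014": [11, 21],
})

# Melt the year columns into (year, sales) pairs, one observation per row
tidy = pd.melt(wide, id_vars="city", var_name="year", value_name="sales")
print(tidy)
```

Once the data is in this shape, grouping, filtering, and plotting by year no longer require special handling of column names.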

Teaching other types of machine learning

Although the course covers both supervised and unsupervised learning, we don't cover other areas of machine learning such as reinforcement learning, active learning, and online learning. Tools for online learning such as Vowpal Wabbit are becoming more widely known, and online learning is an especially useful paradigm for thinking about machine learning, so I've considered teaching that particular topic near the end of the course.
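To give a flavor of the online learning paradigm, here is a minimal sketch in plain Python 3 (it does not use Vowpal Wabbit). The model sees one example at a time and updates its parameters immediately, without ever holding the full data set in memory. The data stream, the learning rate, and the function names are all made up for illustration.

```python
def sgd_update(w, b, x, y, lr=0.05):
    """One online gradient step for squared error on a single (x, y) example."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error


def stream(n):
    """Hypothetical data stream following y = 2x + 1, with x cycling over a few values."""
    for i in range(n):
        x = (i % 5) / 2.0
        yield x, 2.0 * x + 1.0


# Start from zero and update the model as each example arrives
w, b = 0.0, 0.0
for x, y in stream(5000):
    w, b = sgd_update(w, b, x, y)

print(round(w, 2), round(b, 2))
```

Because each update touches only the current example, the same loop works whether the stream has five thousand rows or far more than would fit in memory.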

Changes I'm debating

(I'd love to hear your thoughts!)

Switching to Python 3

Like most Python users, I continue to use (and teach) Python 2 because it is still being supported by the community, a lot of teaching resources are written for Python 2, and it's unclear whether the benefits of Python 3 outweigh the downside (no backwards compatibility). However, there will be a future time at which Python 3 will become the norm, and so I wonder when we should begin teaching it. Fortunately, the availability of key libraries is no longer an issue, because every library we teach in the course supports Python 3.

Encouraging use of the IPython Notebook

Although the IPython Notebook is very popular these days as an instructional tool, we generally teach Python coding within Spyder, a nice IDE that comes with the Anaconda distribution of Python. The IPython Notebook is certainly an excellent way to present code, narrative, and visualizations in a single document. However, when watching students actually write their code within a Notebook, I've found that the interface seems to encourage sloppy coding practices that make it hard to actually debug code. As well, version control is mostly useless with IPython Notebooks, since they are stored as JSON with embedded output and produce diffs that are hard to read. So while I am hesitant to introduce it as an alternative environment for coding, it may be worth teaching simply because it's an excellent format for presenting a finished project.

Introducing object-oriented programming

Most of the students we teach have minimal programming experience, so getting their Python skills to the point where they can understand the basics of object-oriented programming (OOP) seems like a huge challenge. However, if we don't provide them with at least an introduction to OOP, they leave the course unable to effectively read a lot of real-world Python code.

Teaching more command line tools

We teach a bare minimum of command line in the course, primarily because most students have minimal experience with the command line and thus it becomes "one more tool" that they need to learn. However, there are certainly times at which the command line is the easiest (or the only) way to accomplish a task, and there is a general expectation that data scientists will be somewhat familiar with the command line, and so I have debated including it more fully in the course.

Including "big data" techniques in the curriculum

"Big data" is obviously a hot topic in data science, and the majority of data science job postings do want some big data experience. We didn't spend any time on big data in the last session of the course, primarily because it's such a broad topic, but also because it's not clear what aspect of big data (or what big data tool) is suitable to teach as an introduction to the topic. We could certainly teach the basics of the MapReduce algorithm, but I tend to think that as an application-focused course, time is usually better spent on content that students can immediately apply without having to learn "yet another tool."