Machine Learning with Text in Python

Is this Data School course right for you?

Are you trying to master machine learning in Python, but tired of wasting your time on courses that don't move you towards your goal? Do you recognize the enormous value of text-based data, but don't know how to apply the right machine learning and Natural Language Processing techniques to extract that value?

In this Data School course, you'll gain hands-on experience using machine learning and Natural Language Processing to solve text-based data science problems. By the end of the course, you'll be able to confidently apply these techniques to your own data science problems.

Ryan Cranfill (Product Data Analyst): "The course was a perfect introduction to machine learning with text, and I was able to apply topics covered during the first week to my work. Kevin does a great job of breaking down complex topics and providing a practical, real-world context for them."

Supervised learning diagram

How is this course different from other online courses?

Most data science courses suffer from a host of problems: They're poorly taught, lack the necessary depth, and include unexplained or broken code. They don't teach you how to apply what you're learning, and when you do apply it, there's no way to know how well you're doing.

But in this course, we'll go deep into machine learning with text, focusing on application from day one. We'll spend most of our time writing Python code, and you'll understand how every single line relates to the problem we're solving. You'll practice what you're learning through carefully crafted lessons and assignments.

At the end of this course, you'll leave with valuable machine learning experience, high-quality code that you can reuse to solve future text-based problems, a student community you can continue to turn to with questions, and a wealth of curated resources to help you deepen your understanding of each course topic.

Harvey Summers (Information Security Team Leader): "Kevin Markham's Data School courses are remarkably good. Kevin has the ability to simplify complicated concepts and explain not only how to code, but also the 'why' behind it. I have been very happy with the cost and quality of the courses and have learned more in 4 weeks than the 18 months of coursework I've taken from a leading university."

scikit-learn algorithm cheat sheet

Course Description

In this self-paced online course, you'll learn how to build effective machine learning models using text-based data to solve your own data science problems. The course includes:

Course Outline

Each module includes 2 to 4 hours of instructional videos, 1 lesson notebook, 1 to 2 homework assignments, and 15 to 20 supplementary resources.

Module 1: Working with Text Data in scikit-learn

By the end of this module, you'll be able to confidently perform the basic workflow for machine learning with text: creating a dataset, extracting features from unstructured text, building and evaluating models, and inspecting models for further insight. You'll also gain an understanding of Unicode, enabling you to troubleshoot encoding-based errors.
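
To give a flavor of that basic workflow, here is a minimal sketch. It is an illustration rather than an excerpt from the course notebooks: the tiny spam/ham dataset is invented, and the choice of CountVectorizer with a Multinomial Naive Bayes model is an assumption made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy dataset: 1 = spam, 0 = ham
train_texts = [
    "win a free prize now",
    "free cash prize, call now",
    "are we still meeting for lunch",
    "see you at lunch tomorrow",
]
train_labels = [1, 1, 0, 0]

# Extract features: learn the vocabulary and build a document-term matrix
vect = CountVectorizer()
X_train = vect.fit_transform(train_texts)

# Build a model on the token counts
model = MultinomialNB()
model.fit(X_train, train_labels)

# New text must be transformed with the SAME fitted vectorizer (transform, not fit_transform)
new_texts = ["free prize inside", "lunch tomorrow?"]
X_new = vect.transform(new_texts)
predictions = model.predict(X_new)

# Inspect the model: which tokens did the vectorizer learn?
vocabulary = sorted(vect.vocabulary_)
```

Words in the new text that never appeared in the training data (such as "inside") are simply ignored by the fitted vectorizer.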

Module 2: Applying Natural Language Processing Techniques to Machine Learning

By the end of this module, you'll be able to apply a handful of Natural Language Processing techniques to machine learning problems in order to improve the effectiveness of your models. You'll also learn how to perform sentiment analysis and build a simple document summarization tool for your own corpus of text.

Module 3: Parsing Text Data Using Regular Expressions

By the end of this module, you'll be able to extract text features from messy data sources using regular expressions. You'll learn the basic rules and syntax that can be applied across programming languages, and you'll master the most important Python functions and options for working with regular expressions.
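
As a small illustration of the kind of extraction this module covers, the sketch below uses the built-in re module. The messy input string and the patterns are hypothetical and chosen only for the example.

```python
import re

# Hypothetical messy text, invented for this illustration
messy = "Order #A-1042 shipped 2016-05-31; refund of $19.99 issued. ORDER #b-77 pending."

# findall with one capture group returns only the captured part of each match
order_ids = re.findall(r"order #([a-z]-\d+)", messy, flags=re.IGNORECASE)

# search returns the first match; named groups make the parts easy to access
date_match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", messy)
year = date_match.group("year")

# sub replaces every match; here dollar amounts are masked
masked = re.sub(r"\$\d+\.\d{2}", "$X", messy)

# compile is convenient when the same pattern is reused
price_pattern = re.compile(r"\$(\d+\.\d{2})")
prices = [float(p) for p in price_pattern.findall(messy)]
```

Raw strings (the r prefix) keep the backslashes in each pattern from being interpreted by Python before the regular expression engine sees them.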

Module 4: Workflow for a Text-Based Data Science Problem

By the end of this module, you'll be able to create an end-to-end workflow for solving a text-based data science problem using scikit-learn and pandas. You'll gain experience with data exploration, feature engineering, proper model evaluation, model tuning, and generating predictions for new observations.
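
A rough sketch of such a workflow is shown below. The review data, the column names, and the choice of logistic regression are assumptions made for illustration, not the course's own project. The point is that the vectorizer sits inside a pipeline, so during cross-validation it is fit only on the training portion of each split.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; a real project would load this from a file
reviews = pd.DataFrame({
    "text": [
        "great food and friendly staff",
        "terrible service and cold food",
        "really great experience",
        "cold, slow, and terrible",
        "friendly staff, great place",
        "slow service, never again",
    ],
    "label": [1, 0, 1, 0, 1, 0],  # 1 = positive, 0 = negative
})

# Data exploration: check the class balance before modeling
class_counts = reviews["label"].value_counts()

# Feature extraction and model in one pipeline, so each
# cross-validation split learns its vocabulary from training data only
pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# Model evaluation: one accuracy score per fold
scores = cross_val_score(pipe, reviews["text"], reviews["label"],
                         cv=3, scoring="accuracy")

# Fit on all of the data, then predict for new observations
pipe.fit(reviews["text"], reviews["label"])
new_predictions = pipe.predict(pd.Series(["great food", "terrible service"]))
```

With only six rows the scores themselves are meaningless; the structure is what carries over to a real dataset.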

Module 5: Advanced Machine Learning Techniques

By the end of this module, you'll be able to apply advanced machine learning techniques to improve the accuracy of your models and the efficiency of your workflow. You'll learn how to build and tune a multi-step, multi-layer machine learning pipeline, as well as how to ensemble and stack your models.

Pipeline versus FeatureUnion
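
In scikit-learn, a Pipeline chains steps one after another, while a FeatureUnion runs several transformers side by side and concatenates their outputs. The sketch below combines the two and tunes the result with a grid search. The toy data, step names, and parameter grid are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy data: 1 = positive, 0 = negative
texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "really great experience",
    "cold, slow, and terrible",
    "friendly staff, great place",
    "slow service, never again",
]
labels = [1, 0, 1, 0, 1, 0]

# FeatureUnion: word counts and character n-gram counts, side by side
features = FeatureUnion([
    ("words", CountVectorizer()),
    ("chars", CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))),
])

# Pipeline: the combined features feed into the classifier
pipe = Pipeline([
    ("features", features),
    ("clf", MultinomialNB()),
])

# Nested parameters are addressed as step__substep__parameter
param_grid = {
    "features__words__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# Tune the whole pipeline at once with cross-validated grid search
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(texts, labels)

best_params = grid.best_params_
predictions = grid.predict(["what a great meal"])
```

Because the grid search wraps the entire pipeline, the feature extraction settings and the model settings are tuned together.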


Is this a beginner course?

No. This is an intermediate course, with specific prerequisites:

How do I know whether I'm ready for the course?

Review the content from my scikit-learn video series and my pandas video series. If you are comfortable with most of the content, you are ready for the course. If you are unsure whether you meet the course requirements, please email me!

What types of people have taken this course?

Here are the job titles of some of my past students:

Why should I learn how to work with text?

Most knowledge created by humans is raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from.

NLP overview

What will I be able to do by the end of this course?

Charles Franzen (Assistant Education Manager): "Kevin's courses are focused and coherent. Tools learned each week build upon and complement one another, and the classes culminate with a larger-scale project that shows how what you've learned can come together. I learned a great deal, and highly recommend Data School to those looking to explore machine learning tools in greater depth."

Which version of Python do I need for the course?

Both Python 2 and Python 3 are perfectly acceptable.

What libraries will we be using?

The majority of the content will use scikit-learn, though we will also use pandas to help us load, prepare, and visualize data. We will use the built-in re module for regular expressions. We will also make limited use of NumPy, SciPy, Matplotlib, Seaborn, and TextBlob.

Cuisine similarity

Is the course material up-to-date?

In the past year, I've spent hundreds of hours building and refining this course with the feedback of my students, and continue to improve the course on an ongoing basis. All of the code is up-to-date with the latest version of every library, and is tested in both Python 2 and 3.

Chandler McCann (Manager of Data Analytics): "Kevin is an extremely knowledgeable teacher who cares about his students and puts a ton of preparation into his courses and materials. Kevin's approach to teaching data science is logical, well-structured and accessible. I highly recommend this course."

Kathleen Perez-Lopez (Senior Data Scientist): "In Machine Learning with Text, Kevin Markham does a superb job walking you through this topic at an intermediate level. The classes were totally engrossing. His homework assignment for each class steps you through a process so that you don't get stuck at any stage. He points to a tremendous amount of carefully curated supporting material. I'm looking forward to his next offering."

Who is the instructor?

The course instructor is Kevin Markham (me!): Founder of the Data School blog and YouTube channel, former Data Science Expert Mentor for Springboard, and former Lead Data Science Instructor for General Assembly in Washington, DC. I have more than 400 hours of classroom experience teaching data science in Python, and more than 1,000 hours of experience creating data science educational materials, mentoring data science students, and training other data science instructors.

Can I see a sample of the course content?

At PyCon 2016, I presented a 3-hour tutorial based on a portion of this course. Below is a recording that should give you a good idea of my teaching style.

Tsering Paljor: "Hands down the best machine learning presentation I've seen thus far."

How much does the course cost?

Enrollment in the course costs $295.

How much time will the course take to complete?

Past students have said that they spent anywhere from 40 to 60 hours working through the course.

How long will I have access to the course?

You will have lifetime access to all course materials, including the Slack team for collaboration with other students.

What happens when I enroll in the course?

Shortly after enrolling in the course, you will be given access to the Slack team and all course materials. You can work through the course at your own pace.

Jeff Weakley (Creative Director): "Stop reading this and sign up now! Kevin isn't just a programmer/data scientist, he's a great teacher. If I had paid a lot more for this class, it would still have been worth it. After taking a lot of other online courses, I feel like I'm finally getting valuable skills, tools and info I can use and financially benefit from."

What will happen at the end of the course?

You'll leave the course with valuable machine learning experience, high-quality code that you can reuse to solve future text-based problems, a student community you can continue to turn to with questions, and a wealth of curated resources to help you deepen your understanding of each course topic.

Will I receive a certificate of completion?

Yes, you will receive an official certificate of completion from Data School after completing the course.

Certificate of completion

Have students been happy with the course?

In an anonymous post-course survey, students were asked to rate the course on a scale from 1 (poor) to 5 (excellent). On average, students have rated the course 4.78 for "quality of content", 4.85 for "quality of instruction", and 4.73 for "overall value provided by the course". In addition, 100% of students reported that the course had "helped them to make progress towards their personal or professional goals."

Can I talk with one of your past students?

Of course. Just let me know the type of person you'd like to speak with (or a specific person listed on this page), and I'll do my best to put you in touch with them.

Asif Mehedi (Research Associate): "Having known Kevin from his YouTube videos on scikit-learn and more recently pandas, I've long admired his ability to explain difficult concepts in clear language. This new course was no exception. I now feel prepared to use machine learning in my text analysis projects."

What if I'm not happy with the course?

I offer a "Love it or leave it" guarantee: If you don't love the course, cancel within two weeks of purchase and I'll give you a full refund, no questions asked.

What if I need more than two weeks to evaluate the course?

If you need more time to assess whether the course is a good fit for you, just let me know and I'd be happy to give you as much time as you need.

What if I need help with the course?

Although the "Standard Course" (described on this page) includes access to a Slack team for collaboration with other students, it does not include one-on-one help from the instructional team. If you would benefit from personalized assistance, that will be available during the next session of the "Master Course" (Spring/Summer 2017). If you purchase the Standard Course now, you are eligible to upgrade to the Master Course in the future simply by paying the difference in price!

I have more questions...

Please email me. I'm happy to answer all of your questions!

Amit Dingare (Director of Data Science): "One of the best, if not the best course I have taken."