October 30, 2018

How to create useful features for Machine Learning

Recently, a member of Data School Insiders asked the following question in our private forum:

I'm new to Machine Learning. When I want to create a predictive model, what are the techniques I should use to do "feature engineering"?

Great question! Let's start with the basics:

What is feature engineering?

Feature engineering is the process of creating features (also called "attributes") that don't already exist in the dataset. This means that if your dataset already contains enough "useful" features, you don't necessarily need to engineer additional features.

But what is a "useful" feature? It's a feature that your Machine Learning model can learn from in order to more accurately predict the value of your target variable. In other words, it's a feature that helps your model to make better predictions!

Datetime example

Let's pretend you have a dataset with a "datetime" column:

If your goal is to predict the temperature, you might use the datetime column to engineer an integer hour feature (0-23), since the hour of the day is a useful predictor of the temperature.
If your goal is to predict the number of cars on the road, you might use the datetime column to engineer boolean is_weekend and is_holiday features (True/False), since those are useful predictors of traffic.

Thus when creating features, you have to take the target variable ("temperature" or "number of cars") into account, because different features will be useful for different targets.
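
Here is a minimal sketch of the datetime example above, assuming pandas. The four rows and the one-entry holiday list are made up for illustration; a real holiday list depends on your country or region.

```python
import pandas as pd

# Hypothetical dataset with a "datetime" column (values invented for illustration)
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-10-26 08:30:00",  # Friday
        "2018-10-27 14:00:00",  # Saturday
        "2018-10-28 23:15:00",  # Sunday
        "2018-12-25 09:00:00",  # Tuesday, Christmas Day
    ])
})

# Assumed holiday list; replace with the holidays that apply to your data
holidays = pd.to_datetime(["2018-12-25"])

# Integer hour of the day (0-23), a candidate feature for predicting temperature
df["hour"] = df["datetime"].dt.hour

# Boolean features, candidates for predicting traffic
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
df["is_weekend"] = df["datetime"].dt.dayofweek >= 5
# normalize() resets the time to midnight so only the date is compared
df["is_holiday"] = df["datetime"].dt.normalize().isin(holidays)

print(df)
```

Which of these columns you keep depends on the target: hour for temperature, is_weekend and is_holiday for the number of cars.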

Considering model type

Ideally, you should also take into account the type of Machine Learning model you're using:

If you're using a linear model (such as linear regression), the hour feature might not be useful for predicting temperature since there's a non-linear relationship between hour (0-23) and temperature. Instead, you might create an is_night boolean feature that represents the hours 0-4 and 20-23.
If you're using a non-linear model (such as decision trees), the hour feature might work well for predicting temperature since decision trees can learn non-linear relationships between a feature and the target.
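
To make the comparison concrete, here is a rough sketch assuming scikit-learn and a synthetic temperature curve. The cosine formula, the max_depth=3 setting, and the variable names are my own assumptions for illustration, and the scores are computed on the training data, so treat them as a rough comparison rather than a proper evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): temperature follows a
# smooth daily cycle, coldest at hour 0 and warmest at hour 12
df = pd.DataFrame({"hour": np.arange(24)})
df["temperature"] = 15 - 8 * np.cos(2 * np.pi * df["hour"] / 24)

# is_night covers the hours 0-4 and 20-23, as described above
df["is_night"] = (df["hour"] <= 4) | (df["hour"] >= 20)

y = df["temperature"]

def training_r2(model, columns):
    """Fit the model on the given columns and return R^2 on the same data."""
    X = df[columns]
    return model.fit(X, y).score(X, y)

r2_linear_hour = training_r2(LinearRegression(), ["hour"])
r2_linear_night = training_r2(LinearRegression(), ["is_night"])
r2_tree_hour = training_r2(DecisionTreeRegressor(max_depth=3, random_state=0), ["hour"])

print(f"Linear regression on hour:     {r2_linear_hour:.2f}")
print(f"Linear regression on is_night: {r2_linear_night:.2f}")
print(f"Decision tree on hour:         {r2_tree_hour:.2f}")
```

On this synthetic data, the linear model explains almost none of the variation when given the raw hour, does much better with is_night, and the decision tree does well with the raw hour. Real temperature data will be noisier, so you should compare models on held-out data instead.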

Art versus science?

There were some insightful comments in the forum about the process of feature engineering:

Feature engineering is more of an art than a formal process. You need to know the domain you're working with to come up with reasonable features. If you're trying to predict credit card fraud, for example, study the most common fraud schemes, the main vulnerabilities in the credit card system, which behaviors are considered suspicious in an online transaction, etc. Then, look for data to represent those features, and test which combinations of features lead to the best results.

Also, keep in mind that feature engineering uses different methods on different data types: categorical, numerical, spatial, text, audio, images, missing data, etc.
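
As a small illustration of that last point, here is a sketch assuming pandas that handles two of those cases: a categorical column and a numerical column with missing data. The column names, the values, and the choice of median imputation are assumptions for illustration, not a recommendation from the forum.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # categorical
    "income": [52000.0, np.nan, 61000.0],   # numerical, with a missing value
})

# Missing data: record where the value was missing before filling it in,
# since the fact that it was missing may itself be a useful feature
df["income_missing"] = df["income"].isna()
# Assumed strategy: fill with the median of the observed values
df["income"] = df["income"].fillna(df["income"].median())

# Categorical data: one-hot encode into one column per category
df = pd.get_dummies(df, columns=["color"])

print(df)
```

Text, audio, images, and spatial data each call for their own techniques, which are beyond this short example.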

Questions?

Do you have questions about feature engineering or resources to suggest? Let me know in the comments below!

P.S. Are you new to Machine Learning? Check out my free 4-hour course, Introduction to Machine Learning with scikit-learn.
