Example of linear regression and regularization in R

When getting started in machine learning, it's often helpful to see a worked example of a real-world problem from start to finish. But it can be hard to find an example with the "right" level of complexity for a novice. Here's what I look for:

  • uses real-world data, not artificially simple data
  • demonstrates multiple models on the same data and compares them using a reasonable evaluation metric
  • explains the thinking of the modeler at each step in the process
  • includes readable, commented code

My linear regression example

In my Data Science class, we were assigned to perform linear regression on a dataset based on Kaggle's Job Salary Prediction competition. I posted my solution on RPubs and thought it might be helpful as a regression example for other machine learning novices. Here's what my solution entails:

  • reading in the data from a CSV file
  • visualizing the data using the ggplot2 package
  • exploring the data using the table() and tapply() functions
  • creating text-based features using regular expressions
  • building linear models with different features, and comparing their performance using RMSE on a validation set
  • building regularized models using ridge regression and lasso (from the glmnet package)
  • selecting features using a forward stepwise approach (from the leaps package)
  • choosing the best model, training it on the full training set, and predicting on the test set
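
To give a flavor of a few of the steps above, here is a minimal sketch. It is not the code from the RPubs solution: the data is simulated, and every column name (Title, ContractTime, Salary, IsSenior, IsEngineer) is a hypothetical stand-in chosen for illustration. It shows regex-based features, a train/validation split, and an RMSE comparison of two linear models, plus ridge and lasso fits that run only if the glmnet package is installed.

```r
# Simulated stand-in for a job-salary dataset (all names here are hypothetical).
set.seed(42)
n <- 500
jobs <- data.frame(
  Title = sample(c("Senior Software Engineer", "Junior Analyst",
                   "Care Assistant", "Senior Project Manager",
                   "Graduate Engineer"), n, replace = TRUE),
  ContractTime = factor(sample(c("permanent", "contract"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Text-based features created with regular expressions.
jobs$IsSenior   <- grepl("senior", jobs$Title, ignore.case = TRUE)
jobs$IsEngineer <- grepl("engineer", jobs$Title, ignore.case = TRUE)

# Simulated response, built so that the text features carry real signal.
jobs$Salary <- 25000 + 15000 * jobs$IsSenior + 8000 * jobs$IsEngineer +
  5000 * (jobs$ContractTime == "contract") + rnorm(n, sd = 4000)

# Quick exploration with table() and tapply().
print(table(jobs$ContractTime))
print(tapply(jobs$Salary, jobs$ContractTime, mean))

# Hold out 30% of the rows as a validation set.
train_idx <- sample(n, size = round(0.7 * n))
train <- jobs[train_idx, ]
valid <- jobs[-train_idx, ]

# Root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Two linear models with different feature sets.
form_base <- Salary ~ ContractTime
form_text <- Salary ~ ContractTime + IsSenior + IsEngineer
fit_base <- lm(form_base, data = train)
fit_text <- lm(form_text, data = train)

rmse_base <- rmse(valid$Salary, predict(fit_base, newdata = valid))
rmse_text <- rmse(valid$Salary, predict(fit_text, newdata = valid))
print(c(baseline = rmse_base, with_text_features = rmse_text))

# Ridge (alpha = 0) and lasso (alpha = 1), only if glmnet is available.
if (requireNamespace("glmnet", quietly = TRUE)) {
  # glmnet wants a numeric matrix; drop the intercept column.
  x_train <- model.matrix(form_text, data = train)[, -1]
  x_valid <- model.matrix(form_text, data = valid)[, -1]

  ridge <- glmnet::cv.glmnet(x_train, train$Salary, alpha = 0)
  lasso <- glmnet::cv.glmnet(x_train, train$Salary, alpha = 1)

  pred_ridge <- as.numeric(predict(ridge, newx = x_valid, s = "lambda.min"))
  pred_lasso <- as.numeric(predict(lasso, newx = x_valid, s = "lambda.min"))
  print(c(ridge = rmse(valid$Salary, pred_ridge),
          lasso = rmse(valid$Salary, pred_lasso)))
}
```

Because the simulated salaries depend on the two text features, the second model should show a lower validation RMSE than the baseline. On real data the gap could be much smaller, or even go the other way, which is exactly why the comparison is made on held-out rows.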

Please check it out, and let me know what you think! You can also run the code yourself if you download the data files into your working directory in R.

I'm happy to answer your questions! I admit that I didn't include nearly enough explanation for someone unfamiliar with these techniques, but I hope you find it useful in any case.

Publishing your own document to RPubs

If you've never used RPubs, it's an easy (and free) way to publish "R Markdown" documents directly from RStudio. It allows you to weave together your code, output (including plots), and explanation (written in standard Markdown) into a single document. Here's how to get started with R Markdown, and how to publish to RPubs.
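
If you'd like to see what such a document looks like before opening RStudio, here is a small sketch that writes a minimal R Markdown file from plain R. The file contents, the title, and the use of the built-in cars dataset are assumptions made for illustration. The rendering step runs only if the rmarkdown package and pandoc are available, and it only builds a local HTML file; publishing to RPubs itself is done from RStudio.

```r
# Build a minimal R Markdown document as a character vector.
fence <- strrep("`", 3)  # three backticks, the chunk delimiter
rmd_lines <- c(
  "---",
  "title: \"A minimal R Markdown example\"",  # hypothetical title
  "output: html_document",
  "---",
  "",
  "Explanation goes here, written in standard Markdown.",
  "",
  paste0(fence, "{r}"),
  "# Code in a chunk is run, and its output (including plots) is woven in.",
  "summary(cars)",
  "plot(cars)",
  fence
)

# Write it to a temporary file so nothing in the working directory is touched.
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML locally, but only if rmarkdown and pandoc are both present.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, output_dir = tempdir(), quiet = TRUE)
  print(html_path)
}
```

The resulting HTML interleaves the Markdown text, the code, and the output in one page, which is the same structure you get when you knit a document in RStudio.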