Tidying "messy data" in R
I watched Hadley Wickham's excellent talk on tidy data and tidy tools, and decided to use this as an opportunity to learn about a few of his R packages. (In case you're unfamiliar with Hadley, he is well-known for his contributions to the R ecosystem, most notably ggplot2; he is also the Chief Scientist for RStudio.)
The principles of tidy data are simple: every variable (or "feature") is a column, every observation is a row, and each dataset (that is, each table) holds one type of "observational unit." Tidy datasets, he argues, are easier to model, visualize, and aggregate.
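To make the first two principles concrete, here is a minimal sketch using melt() from reshape2. The data frame, its column names, and its values are made up for illustration and do not come from the talk. In the "messy" version, the variable "treatment" is hidden in the column names; melting moves it into a column of its own so that each row is a single observation.

```r
library(reshape2)

# Hypothetical "messy" data: one column per treatment, so the
# variable "treatment" lives in the column names instead of a column.
messy <- data.frame(
  name = c("Ann", "Bob", "Cat"),
  treatment_a = c(10, 12, 9),
  treatment_b = c(14, 11, 13),
  stringsAsFactors = FALSE
)

# melt() stacks the measured columns into rows, keeping "name" as
# the identifier. Each row is now one (name, treatment) observation.
tidy <- melt(
  messy,
  id.vars = "name",
  variable.name = "treatment",
  value.name = "result"
)

print(tidy)
```

The tidy version has three columns (name, treatment, result) and one row per person per treatment, which is the shape that modeling and plotting functions generally expect.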
Here are the packages covered in the talk:
- reshape2: for restructuring and aggregating data
- plyr (pronounced "plier"): for manipulating and transforming data
- stringr: for string operations and regular expressions
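As a small taste of stringr, here is a sketch that splits a combined code into two variables, which is a common step when column names or values pack several variables together. The codes below are hypothetical, chosen only to show the functions.

```r
library(stringr)

# Hypothetical codes combining sex (first character) and an age
# range (the remaining characters).
codes <- c("m014", "f1524", "m65")

# str_sub() extracts substrings by position.
sex <- str_sub(codes, 1, 1)   # first character only
age <- str_sub(codes, 2)      # second character through the end

# str_detect() tests each string against a regular expression.
is_male <- str_detect(codes, "^m")

print(data.frame(code = codes, sex = sex, age = age, is_male = is_male))
```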
While watching the talk, I ran the code from the slides on the actual datasets and annotated it with comments. If you're interested in doing the same, you can find the commented code and data files in my GitHub repo. (Note that my code also contains the "Billboard" example, which is described in his tidy data paper and classroom slides but not shown in the video.)
Although I mostly used the code verbatim as presented in the talk, I decided to use the dplyr package instead of plyr. I chose dplyr because it's the next iteration of plyr, focused on data frames (that's where the "d" in "dplyr" comes from), and because it has a nicer syntax and runs faster than plyr. (Here's more from Hadley on why you should use dplyr.)
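For a sense of the dplyr syntax, here is a minimal sketch of a grouped summary. The data frame is made up for illustration and is not one of the datasets from the talk. The plyr call in the comment is what I believe the rough equivalent looks like, shown only for comparison.

```r
library(dplyr)

# Hypothetical tidy data: one row per person per treatment.
scores <- data.frame(
  name = c("Ann", "Bob", "Cat", "Ann", "Bob", "Cat"),
  treatment = c("a", "a", "a", "b", "b", "b"),
  result = c(10, 12, 9, 14, 11, 13),
  stringsAsFactors = FALSE
)

# Group by treatment, then compute one summary row per group.
# The pipe (%>%) passes each result on to the next step.
by_treatment <- scores %>%
  group_by(treatment) %>%
  summarise(mean_result = mean(result), n = n())

# Rough plyr equivalent, for comparison only (not run here):
# ddply(scores, "treatment", summarise, mean_result = mean(result))

print(by_treatment)
```

Because the input is already tidy, the summary is a short chain of verbs that reads top to bottom, with no reshaping needed first.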
If you're interested in learning more about dplyr, there are some excellent vignettes on the CRAN package page, especially the "Introduction to dplyr."