I’ve been learning a bit of statistical computing with R lately on the side from Chris Paciorek’s Berkeley course. I just got introduced to knitr and it’s damned sweet! It’s an R package which takes a LaTeX file with embedded R, and produces a pure LaTeX file (similar to how Rails renders an .html.erb file into an .html file), where the resulting LaTeX file has the output of the R code. It makes it super easy to embed statistical calculations, graphs, and all the good stuff R gives you right into your TeX files. It lets you put math in your math, so you can math while you math.
I’ve got a little project which:
- Runs a Python script which will use Selenium to scrape a web page for 2012 NFL passing statistics.
- “Knits” a TeX file with embedded R that cleans the raw scraped data, produces a histogram of touchdown passes for teams, and displays the teams with the fewest and the most touchdowns.
- Compiles the resulting TeX file and opens the resulting PDF.
- Cleans up any temporary work files.
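To make those four steps concrete, here’s a rough sketch of what a driver for them could look like in Python. This is not the actual factory script. The file names `scrape.py` and `tds2012.tex`, the helper names, and the list of temp files are all assumptions for illustration; `knitr::knit`, `Rscript`, and `pdflatex` are the real tools.

```python
import os
import subprocess
import sys


def build_commands(stem="tds2012", scraper="scrape.py"):
    """Return the pipeline commands, in order, as argument lists.

    `stem` and `scraper` are hypothetical names, not taken from the project.
    """
    knit_in = f"{stem}.tex"        # LaTeX with embedded R chunks
    knit_out = f"{stem}-out.tex"   # pure LaTeX produced by knitr
    knit_expr = f"knitr::knit('{knit_in}', output='{knit_out}')"
    return [
        ["python", scraper],            # 1. scrape the stats page
        ["Rscript", "-e", knit_expr],   # 2. knit the embedded R
        ["pdflatex", knit_out],         # 3. compile the knitted TeX
    ]


def temp_files(stem="tds2012"):
    """Work files to delete afterwards (an assumed list, adjust to taste)."""
    out = f"{stem}-out"
    return [f"{out}.tex", f"{out}.aux", f"{out}.log"]


def run_pipeline(stem="tds2012"):
    """Run every step, open the PDF, then clean up."""
    for cmd in build_commands(stem):
        subprocess.run(cmd, check=True)  # stop at the first failing step
    # Opening a file is platform-specific: `open` on macOS, `xdg-open` on Linux.
    opener = "open" if sys.platform == "darwin" else "xdg-open"
    subprocess.run([opener, f"{stem}-out.pdf"], check=True)
    for path in temp_files(stem):
        if os.path.exists(path):
            os.remove(path)
```

Calling `run_pipeline()` would run the whole thing end to end. Keeping the command list in its own function means you can print it and eyeball it before anything actually executes.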
Here’s what the pre-“knitted” LaTeX looks like with the embedded R:
\documentclass{article}
\usepackage{graphicx}
%% begin.rcode setup, include=FALSE
% opts_chunk$set(fig.path='figure/latex-', cache.path='cache/latex-')
%% end.rcode
\begin{document}
After scraping data for all passing TDs in 2012, we get the following histogram for number of TD passes by team.
%% begin.rcode cache=TRUE
% scrape <- read.csv('scrape.csv')
% raw_data <- scrape[scrape[,"X"]!="",]
% tds_for_passers <- transform(raw_data[c("Tm","TD")], TD = as.numeric(as.character(TD)))
% tds_for_teams <- aggregate(tds_for_passers$TD, by=list(Team=tds_for_passers$Tm), FUN=sum)
% hist(tds_for_teams$x)
%% end.rcode
The teams with the greatest and least TDs:
%% begin.rcode
% low_high <- c(which.min(tds_for_teams$x), which.max(tds_for_teams$x))
% tds_for_teams[low_high,"Team"]
%% end.rcode
\end{document}
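If R isn’t your thing, the aggregation in those chunks (sum the `TD` column per `Tm`, then pick the lowest and highest totals) can be sketched in plain Python. This is only a loose analogue, not a translation: it skips rows whose `TD` value isn’t a whole number instead of filtering on the `X` column like the R code does, the function names are made up, and the sample rows are invented placeholders rather than real 2012 stats.

```python
from collections import defaultdict


def td_totals_by_team(rows):
    """Sum TD per team, skipping rows whose TD field is not a whole number."""
    totals = defaultdict(int)
    for row in rows:
        td = row["TD"].strip()
        if not td.isdigit():       # e.g. a stray header row in the scrape
            continue
        totals[row["Tm"]] += int(td)
    return dict(sorted(totals.items()))   # sorted by team name


def fewest_and_most(totals):
    """Return (team with fewest TDs, team with most TDs)."""
    teams = list(totals)
    return min(teams, key=totals.get), max(teams, key=totals.get)


# Invented sample rows, shaped like csv.DictReader output.
rows = [
    {"Tm": "AAA", "TD": "20"},
    {"Tm": "BBB", "TD": "31"},
    {"Tm": "AAA", "TD": "2"},
    {"Tm": "Tm", "TD": "TD"},   # junk row, skipped
    {"Tm": "CCC", "TD": "12"},
]
totals = td_totals_by_team(rows)
print(totals)                   # {'AAA': 22, 'BBB': 31, 'CCC': 12}
print(fewest_and_most(totals))  # ('CCC', 'BBB')
```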
You can comment out the line in the factory script that deletes the tds2012-out.tex file if you want to see what it looks like post-knit. The resulting TeX file basically contains a ton of new command definitions but the meat of it is what it does with your R code. It formats and displays the R code itself, and then it displays the output of the R code. Wherever the output is a graph, you’ll see \includegraphics[...]{...}. knitr will do the R computation, render the graphics, create a figure/ subdirectory (per the fig.path option set above) and store them there for the \includegraphics to reference. Whenever the output is simply text or mathematical expressions, you’ll see the R output translated to pure LaTeX markup.
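If you do keep the knitted file around, a quick way to see which graphics it references is to scan it for `\includegraphics`. Here’s a small sketch of that. The `figure_paths` helper is my own invention, and the sample string is hand-written to show the shape, not copied from real knitr output.

```python
import re

# Matches \includegraphics, an optional [options] part, and captures {path}.
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")


def figure_paths(tex_source):
    """Return every path passed to \\includegraphics in a TeX source string."""
    return INCLUDEGRAPHICS.findall(tex_source)


# Hand-written stand-in for a knitted file; the figure name is hypothetical.
sample = r"""
\begin{document}
Some text.
\includegraphics[width=\maxwidth]{figure/latex-example-1}
\end{document}
"""

print(figure_paths(sample))  # ['figure/latex-example-1']
```

To point it at a real file, read the file with `open("tds2012-out.tex").read()` and pass the text in.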
Pretty cool stuff!