Kevin McMahon

Crunching some NFL Stats with F#

To explore functional programming, I’ve decided to return to a familiar problem domain, football stats. I used this domain a couple of years ago when I was in the process of making the transition from the Unix-based OS/Java world to the Microsoft/C# world. I am the type of person that learns better by doing than studying, so I’m going to try and jump in and cobble something together to start the learning process. I’ve watched the PDC presentation by Luca Bolognese, and I’ve read through the first couple chapters of Don Syme’s Expert F#, so consider me armed with an F# Interactive window and dangerous.

The first stat that I plan to look at is the QB Score Stat as outlined by Berri, et al. in Wages of Wins. The stat is much easier to calculate than the traditional QB Rating used by the NFL, and if you read the link or book, you’ll see that it correlates much better to wins and points than QB rating. For our purposes, I’ll outline the formula here, but I do recommend checking out the links for more info.

QB Score = Total Yards - (3 * Plays) - (30 * Turnovers)

I got the 2008 QB stats from Yahoo, dumped them into Excel, and then saved them as CSV. This data munging can be done programmatically fairly quickly with HtmlAgilityPack and LINQ to XML, but I’ll save that for another post. I’ve provided a copy of the stats in CSV here (Update 8/3/2020: link is dead).

So, to get started, here is what we have to do to calculate the raw QB score and the QB score per play for all the NFL QBs:

  1. Read in the CSV file
  2. Grab the relevant stats for our calculation
  3. Calculate the QB score per play for each QB
  4. Return the QB name and the score.

I’ll tackle this step by step and we can verify our results via the F# Interactive window.

To ingest the file, we can leverage the .NET System.IO library. The call pattern to read it into memory is identical to what you would see in C# or VB and is pretty straightforward.

open System.IO
let filePath = @"D:\code\data\QB_Stats_2008.csv" // verbatim string, so backslashes are never read as escapes
let stream = new FileStream(filePath, FileMode.Open)
let reader = new StreamReader(stream)
let csv = reader.ReadToEnd()

Here is the output of the F# Interactive window:

val filePath : string
val stream : System.IO.FileStream
val reader : System.IO.StreamReader
val csv : string

As we can see from the output, ‘csv’ is a string that holds the contents of the QB stats file. Since we know that the file is a CSV file, we can break it down into its individual elements like so:

let stats =
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.map (fun values ->
        string values.[0],                 // qb name
        System.Int32.Parse(values.[5]),    // att
        System.Int32.Parse(values.[7]),    // pass yds
        System.Int32.Parse(values.[11]),   // int
        System.Int32.Parse(values.[12]),   // rushes
        System.Int32.Parse(values.[13]),   // rush yds
        System.Int32.Parse(values.[17]),   // sacks
        System.Int32.Parse(values.[20]))   // fumbles lost

Since ‘csv’ is a string, we can use the Split method to chunk it into individual lines, using the ‘\n’ character as our split token. Once the string is split into lines, the pipeline operator (|>) passes them along for further processing. Sequences in F# can be thought of as IEnumerables from C# and come with some nice baked-in functions to help with processing. The first line of our QB stats CSV file is a header row that labels the columns, so we need to skip it before processing the real data, and one of those baked-in functions (Seq.skip) does exactly that.

The first Seq.map call tokenizes each line, splitting it into its individual comma-delimited values. Once a line has been tokenized, the individual values can be read. The second Seq.map call collects the values we care about from each line into a tuple of 8 values. The comments specify which stat each value holds.

Here is the output of the F# Interactive window after step 2:

val stats : seq<string * int * int * int * int * int * int * int>
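
To see the same split, skip, and map pipeline in isolation, here is a minimal sketch that runs against an in-memory string instead of the real file. The three-column layout and the QB names are made up for illustration; the actual Yahoo export has many more columns. Note that a sequence is lazy, so nothing is parsed (and no parse error can surface) until something enumerates it, which is why the sketch ends with Seq.toList.

```fsharp
// Hypothetical three-column sample; not the real stats file.
let sampleCsv = "Name,Att,Yds\nQB One,30,250\nQB Two,25,180"

let parsed =
    sampleCsv.Split([|'\n'|])
    |> Seq.skip 1                                  // drop the header row
    |> Seq.map (fun line -> line.Split([|','|]))   // one string array per row
    |> Seq.map (fun values ->
        values.[0],                        // name
        System.Int32.Parse(values.[1]),    // attempts
        System.Int32.Parse(values.[2]))    // yards
    |> Seq.toList                          // force evaluation of the lazy sequence

// parsed is [("QB One", 30, 250); ("QB Two", 25, 180)]
```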

After step two we have a sequence of tuples that have only the stats and information that we care about. The next step now becomes calculating the QB score. The calculation of the score requires three sub-steps, so let us revise the outline we laid out earlier to include them.

  1. Read in the CSV file
  2. Grab the relevant stats for our calculation
  3. Calculate the QB score per play for each QB
    1. Create the formula function
    2. Compute the components of the formula
    3. Create the desired output
  4. Return the QB name and the score

Let’s tackle the first sub-step and codify the formula now and see what we’ll need to provide from the data we just acquired.

let qbcalc (plays,yards,turnovers) = yards - 3 * plays - 30 * turnovers

This line of code creates a function called qbcalc that takes in a tuple composed of the plays, yards, and turnovers components of the formula.

If we run the qbcalc function through the interactive window we get:

val qbcalc : int * int * int -> int

The result of calling this function is the raw QB score. The arithmetic operations in F# are similar to most languages, so the formula is a straightforward expression without any surprises. Since we know plays, yards, and turnovers are all integer values, we could further constrain the types of values that the tuple is composed of, but F#’s type inference already does this for us, so it is not needed. When the compiler analyzed this code, it was able to ascertain from the operations and the integer literals used that the plays, yards, and turnovers values were of type int and automatically created the int constraints.
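
As a quick worked example of the formula, here is a sketch that feeds qbcalc an invented season line (the numbers are not from the 2008 data). It also includes an explicitly annotated variant, named qbcalcAnnotated here purely for illustration, to show what the inferred signature corresponds to.

```fsharp
let qbcalc (plays, yards, turnovers) = yards - 3 * plays - 30 * turnovers

// Made-up season line: 500 plays, 4000 total yards, 15 turnovers.
let raw = qbcalc (500, 4000, 15)   // 4000 - 1500 - 450 = 2050

// Hypothetical annotated version; its signature matches the inferred one,
// int * int * int -> int.
let qbcalcAnnotated (plays: int, yards: int, turnovers: int) : int =
    yards - 3 * plays - 30 * turnovers
```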

The next step is to compute the individual values of plays, yards, and turnovers. Before we start, I just want to note that I am sure there is a slicker, more concise way to do this, but this is my first go at this, so pardon the mess.

let names = stats |> Seq.map(fun(name,_,_,_,_,_,_,_) -> name)

Here we start to perform operations on the stats sequence we captured from the CSV file. The basic structure of what I am doing here is grabbing the specific values of the components I am looking to either aggregate (names) or calculate (plays, yards, and turnovers) from the sequence and mapping them to a new sequence. Here is an example of how to create the plays sequence.

let plays = stats |> Seq.map(fun(_,att,_,_,rush,_,sacks,_) -> att+rush+sacks)

Here the stats sequence is pushed through the pipeline operator (|>), which allows you to chain functions in sequence. This works because, as pointed out in Expert F#, the pipeline operator is just function application in reverse. It can be expressed like so:

let (|>) x f = f x

So in our case when we have the following:

stats |> Seq.map (fun(_,att,_,_,rush,_,sacks,_) -> att+rush+sacks)

Piping the stats sequence into Seq.map applies the function we’ve defined in the parentheses to each element in the stats sequence and returns a new sequence with the results. The function’s parameter is a pattern that matches the 8-value tuples that make up the stats sequence. Since each calculation needs only a few of those values, the ones we don’t care about can be ignored with the wildcard ‘_’, and the ones we do care about can be given meaningful names. On the right-hand side of the -> (which separates a lambda’s parameters from its body), we simply add the values together. Again, the results are collected in a new sequence that is returned from the Seq.map call.
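
To make the "function application in reverse" point concrete, here is a small sketch that writes the same computation both ways. The three-value rows are hypothetical and much narrower than the real 8-value tuples.

```fsharp
// Hypothetical rows: (name, attempts, rushes)
let rows = [ ("QB One", 30, 5); ("QB Two", 25, 2) ]

// Piped form, in the style used in this post.
let piped =
    rows
    |> Seq.map (fun (_, att, rush) -> att + rush)
    |> Seq.toList

// The same calls written as ordinary, inside-out function application.
let applied = Seq.toList (Seq.map (fun (_, att, rush) -> att + rush) rows)

// Both are [35; 27]
```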

After all the individual components of the QB score formula have been computed, we’re left with several separate sequences that need to be recombined into something we can pass to the qbcalc function. The calculation function is defined as taking a tuple composed of plays, yards, and turnovers values, so we need another function that Seq provides, called zip (along with its three-sequence variant, zip3).

Here is the code that crunches the individual components.

let getStats =
    // loadQBStats wraps the file-reading and parsing steps (see the complete listing)
    let stats = loadQBStats
    let names = stats |> Seq.map(fun(name,_,_,_,_,_,_,_) -> name)
    let plays = stats |> Seq.map(fun(_,att,_,_,rush,_,sacks,_) -> att+rush+sacks)
    let yards = stats |> Seq.map(fun(_,_,passyd,_,_,rushyd,_,_) -> passyd + rushyd)
    let turnovers = stats |> Seq.map( fun(_,_,_,int,_,_,_,fum) -> int+fum)
    Seq.zip3 plays yards turnovers |> Seq.zip names
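
The shape that comes out of the two zips is easier to see on tiny inputs. This is only a sketch; the names and numbers are invented, and plain lists stand in for the sequences.

```fsharp
let names = [ "QB One"; "QB Two" ]
let plays = [ 35; 27 ]
let yards = [ 260; 175 ]
let turnovers = [ 1; 2 ]

let zipped =
    Seq.zip3 plays yards turnovers   // seq of (plays, yards, turnovers)
    |> Seq.zip names                 // seq of (name, (plays, yards, turnovers))
    |> Seq.toList

// zipped is [("QB One", (35, 260, 1)); ("QB Two", (27, 175, 2))]
```

Each element is a pair whose second item is itself a three-value tuple, which is exactly the argument shape qbcalc expects.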

The final step is to apply the qbcalc function to each plays, yards, and turnovers tuple and then zip the resulting sequence with the sequence of names, which completes our task. The values were balled up into tuples in previous steps, so a lot of what is left is unpacking what we need for the actual calculation and then reassembling the output. The unpacking is done with the fst and snd functions, which return the first and second elements of a two-element tuple, respectively. The last line of the doCalc function divides the raw QB score by the number of plays, completing the calculation, and uses the backward pipe operator (<|) to pass the resulting sequence along for zipping with the names. The zipped sequence gets returned, and at last, we’ve calculated the QB score per play for the 2008 season. The last thing to note is that, to keep the precision of the final result, the int values being divided need to be converted to decimals. If the integers aren’t converted, integer division discards the fractional part of each result, and we lose precision on the calculation.

let doCalc =
    let stats =
        getStats
    let names =
        stats |> Seq.map fst
    let rawScore =
        stats
        |> Seq.map snd
        |> Seq.map qbcalc
    let plays =
        stats
        |> Seq.map snd
        |> Seq.map (fun (plays,_,_) -> plays)
    let components =
        Seq.zip rawScore plays
    Seq.zip names <| Seq.map(fun(x:int, y:int) -> System.Convert.ToDecimal(x)/ System.Convert.ToDecimal(y)) components
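
The precision point can be checked with a couple of one-liners. This sketch uses invented numbers and also shows the built-in decimal conversion function as an alternative to System.Convert.ToDecimal; which of the two to use is a matter of taste.

```fsharp
let rawScore = 205
let plays = 50

let intResult = rawScore / plays   // integer division drops the fraction: 4
let decResult = System.Convert.ToDecimal(rawScore) / System.Convert.ToDecimal(plays)   // 4.1M
let viaCast = decimal rawScore / decimal plays   // same value via the decimal function

// Integer division truncates toward zero, which matters for negative QB scores.
let negative = -7 / 2   // -3

// fst and snd pick apart a two-element tuple.
let name = fst ("QB One", (35, 260, 1))        // "QB One"
let (p, _, _) = snd ("QB One", (35, 260, 1))   // p is 35
```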

Below is the complete source listing of my first crack at doing something useful with F#. There are a couple of things (the packing and repacking of the tuples, the CSV parsing) that scream “optimize me.” In my next F# post, I’ll refactor this code to slim it down and package it up so I can display these results graphically via C#.

open System.IO

let loadQBStats =
    let filePath = @"D:\code\ProFootballDB\Data\QB_Stats_2008.csv"
    let stream = new FileStream(filePath, FileMode.Open)
    let reader = new StreamReader(stream)
    let csv = reader.ReadToEnd()

    let stats =
        csv.Split([|'\n'|])
        |> Seq.skip 1
        |> Seq.map(fun line -> line.Split([|','|]))
        |> Seq.map(fun values ->
            string values.[0],                 // qb name
            System.Int32.Parse(values.[5]),    // att
            System.Int32.Parse(values.[7]),    // pass yds
            System.Int32.Parse(values.[11]),   // int
            System.Int32.Parse(values.[12]),   // rushes
            System.Int32.Parse(values.[13]),   // rush yds
            System.Int32.Parse(values.[17]),   // sacks
            System.Int32.Parse(values.[20]))   // fumbles lost
    stats

let qbcalc (plays,yards,turnovers) = yards - 3 * plays - 30 * turnovers

let getStats =
    let stats = loadQBStats
    let names = stats |> Seq.map(fun(name,_,_,_,_,_,_,_) -> name)
    let plays = stats |> Seq.map(fun(_,att,_,_,rush,_,sacks,_) -> att+rush+sacks)
    let yards = stats |> Seq.map(fun(_,_,passyd,_,_,rushyd,_,_) -> passyd + rushyd)
    let turnovers = stats |> Seq.map( fun(_,_,_,int,_,_,_,fum) -> int+fum)
    Seq.zip3 plays yards turnovers |> Seq.zip names

let doCalc =
    let stats =
        getStats
    let names =
        stats |> Seq.map fst
    let rawScore =
        stats
        |> Seq.map snd
        |> Seq.map qbcalc
    let plays =
        stats
        |> Seq.map snd
        |> Seq.map (fun (plays,_,_) -> plays)
    let components =
        Seq.zip rawScore plays
    Seq.zip names <| Seq.map(fun(x:int, y:int) -> System.Convert.ToDecimal(x)/ System.Convert.ToDecimal(y)) components
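
One possible way to cut down on the tuple packing and repacking is to compute everything in a single Seq.map over the 8-value tuples. The sketch below is only an illustration of that idea, not the refactoring promised for the follow-up post; scorePerPlay and sampleStats are hypothetical names, and the numbers are invented.

```fsharp
// Same 8-value tuple shape as the stats sequence; the data is made up.
let sampleStats =
    [ ("QB One", 400, 3000, 10, 40, 150, 20, 4)
      ("QB Two", 300, 2000, 12, 30, 100, 25, 6) ]

let scorePerPlay stats =
    stats
    |> Seq.map (fun (name, att, passYds, ints, rushes, rushYds, sacks, fumbles) ->
        let plays = att + rushes + sacks
        let yards = passYds + rushYds
        let turnovers = ints + fumbles
        let raw = yards - 3 * plays - 30 * turnovers
        // Convert before dividing so the fraction is kept.
        name, decimal raw / decimal plays)

let results = scorePerPlay sampleStats |> Seq.toList
```

Like the original, this would fail with a divide-by-zero error for a row with no plays, so real data may need a guard for that case.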

Useful links: