A guest post by W. Casan Scott, Baylor University.
As different as ecology and the NFL sound, they share quite similar problems. The environment is an infinitely complex system with many known and unknown variables. The NFL is a perpetually changing landscape with a revolving door of players and schemes. Predicting an athlete’s performance pre-draft is complicated by a number of contributing variables, including combine results, college production, intangibles, and how well that player fits a certain NFL scheme. Perhaps the techniques that ecologists use to discern confounding trends in nature are suited to challenges like the NFL draft. This article introduces an eco-statistical tool, Principal Component Analysis (PCA), and its potential utility for advanced NFL analytics.
My Ph.D. research area is aquatic eco-toxicology, where I primarily model chemical exposure hazards to fish. So essentially, I use the best available data and methods to quantify how much danger a fish may be in, in a given habitat. Chemical exposures occur in infinitely complex mixtures across many different environments, and distinguishing trends from such dynamic situations is difficult.
Prospective draftees are similar (in theory) in that each is a unique combination of college team, inherent athleticism, history, intangibles, and even the current landscape of the NFL. The myriad variables present in the environment and the NFL, both static and changing, make it difficult to separate the noise from actual, observable trends.
In environmental science, we sometimes use non-traditional methods to help us visualize what previously could not be observed. Likewise, Advanced NFL Analytics tries to answer questions that traditional methods cannot. The goal of this article is to educate others on the utility of eco-statistical tools, namely Principal Component Analysis (PCA), in assessing NFL draft prospects.
The purpose of Principal Component Analysis (PCA) is to represent a data set containing many variables with a much smaller number of composite variables, or principal components. Think of the QB Rating (QBR). It is a composite variable in that it incorporates a number of other variables (Completion %, Yards, TDs, etc.). PCA differs in that it places no prior bias on which variables go into a principal component. It simply picks out the strongest co-variation among the variables, that is, the combinations of variables that explain the most variance between the sample units (i.e., players).
PCA can be useful in dissecting what separates players in the NFL or college. By performing a PCA, I can get a sense of how similar each player or prospect is to the others. When I do this with historical data, I may see similarities between Pro Bowlers…or draft busts. By deconstructing the PCA, I can see which measurements are highly correlated with the principal components (the composite variables). Maybe 40 yard dash time and vertical leap are highly correlated with a component and bench press is not (which is generally true). The PC itself can be used as a predictor, or it can tell us which measurements are statistically most interesting when creating a metric.
I use R Statistical Software to perform my PCA and highly recommend R for statistical work in general. I will outline an example of how I use PCA to assess the draft potential of defensive ends. The objective of this particular study was to use PCA to explore the maximum potential of NFL prospects, and to explain how to use this tool in NFL analytics.
I collected combine data from http://nflcombineresults.com/ and college and professional statistics from http://www.sports-reference.com/. The NFL data will represent our dependent variables (Y axis), or what we hope to eventually predict. The NCAA statistics and combine results will be our independent variables, which we use to predict NFL success. I was able to gather enough quality data for 82 defensive ends. I organized the data in Microsoft Excel, saved it as a CSV file, and imported the file into R using code like the import example at the end of this post.
I then experimented with plotting different variables against each other. For example, I plotted bench press reps and 40 yard dash times versus career NFL sacks per game for all 82 defensive ends (Figures 1 and 2). You can see that 40 yard dash time has a slight inverse relationship with career sacks per game (Figure 2), whereas bench press has virtually no relationship (Figure 1). This is not surprising to most people knowledgeable about the NFL draft. Further, the 40 yard dash data is wedge shaped (Figure 3): as 40 yard dash times get faster, the upper limit of career sacks per game increases. Virtually no one who ran a 4.8 or slower recorded more than half a sack per game. This suggests that the 40 yard dash may cap the potential, or ceiling, of a pass rusher, but that other variables clearly limit or dictate whether that potential is ever realized.
Using only one variable at a time to predict a prospect’s NFL success does not work well, which is why we usually need a model that incorporates a number of these statistics and measurements. This is where ordination techniques, namely PCA, can be beneficial. Principal components show which individual measurements explain the most variability between prospects. Maybe a player’s tackles for loss in college separate players with “high motors” from those without. Weight-normalized vertical leap may indicate how much innate explosive power a prospect has. By running a PCA, we can find out objectively what explains differences between players rather than placing pre-emptive bias on measurements we have been conditioned to value (40 yard dash, height, sacks). The PCA that I will run incorporates the NCAA statistics and combine measurements listed below into synthetic principal components.
Recall that a Principal Component is a synthetic variable, much like the QBR. It differs from QBR in that it looks for what variables explain the most variance between players. So Principal Component 1, shown below, is the composite of the variables that are statistically most important. Each original NCAA statistic or combine result has a specific loading, or correlation with the Principal Component, which essentially weights its importance. Here is what the loadings look like for Principal Component 1 (PC1):
PC1 includes all of these statistics and measurements, but at varying degrees of importance. Whether the correlation is negative or positive is irrelevant at the moment; we are only concerned with the magnitude. Variables shaded in grey have a correlation of ±0.20 or greater with PC1 and are really the only relevant measurements. Those not shaded are randomly occurring big plays with no significant explanation and non-normalized combine numbers. Notice that all the weight-normalized combine numbers except for bench press are relatively strongly correlated. Likewise, tackles, sacks, and game experience are the most correlated NCAA stats. Logically, this makes sense. Athleticism seems to matter only within the context of the size of the athlete. Also, fluke plays in college (returns and TDs) don’t seem to matter much for a defensive end. So PC1 seems to be composed of measurements and stats that logically seem important for predicting a defensive end’s success in the NFL. But does it actually predict anything? To test PC1 as a predictor of NFL pass rushing success, I plotted PC1 versus NFL career sacks per game for these defensive ends (Figure 4).
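For readers who want to build a loadings table like the one discussed above, here is a minimal R sketch. It runs on simulated data, so the column names (forty, vertical, broad_jump, bench_reps) and every value are hypothetical stand-ins, not the 82 defensive ends from this study. Note that prcomp() stores the component weights in $rotation, while the correlation between each variable and the PC1 scores has to be computed separately. Applying the ±0.20 cutoff to the correlations is an assumption about which of the two is meant.

```r
# Simulated stand-in data: column names and values are hypothetical,
# not the author's 82 defensive ends.
set.seed(1)
n <- 82
speed <- rnorm(n)  # hidden "athleticism" factor shared by three columns
combine <- data.frame(
  forty      = 4.8 - 0.15 * speed + rnorm(n, sd = 0.05),
  vertical   = 33 + 3 * speed + rnorm(n, sd = 1.5),
  broad_jump = 115 + 5 * speed + rnorm(n, sd = 3),
  bench_reps = round(24 + rnorm(n, sd = 4))
)

# Center and scale so no measurement dominates because of its units
pca <- prcomp(combine, center = TRUE, scale. = TRUE)

# prcomp() calls the component weights "rotation"; column 1 is PC1
pc1_weights <- pca$rotation[, 1]

# Correlation of each original variable with the PC1 scores
pc1_cor <- cor(combine, pca$x[, 1])[, 1]

# PC1 scores are the weighted sum of the standardized variables
manual_pc1 <- as.vector(scale(combine) %*% pc1_weights)

print(round(cbind(weight = pc1_weights, correlation = pc1_cor), 2))

# Variables at or beyond a +/-0.20 cutoff (sign ignored)
print(names(pc1_cor)[abs(pc1_cor) >= 0.20])
```

The manual_pc1 line is there to show that a principal component really is a composite variable: each player's PC1 score is just the weights applied to that player's standardized measurements.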
Using PC1 as a predictor does a slightly better job of sharpening the wedge shape of the data. Notice how, as PC1 increases, so does the upper limit of career NFL sacks per game. Some names that appear along the upper limit include Ryan Kerrigan, Greg Hardy, Chandler Jones, JJ Watt, Aldon Smith, and Mario Williams. These players realized the potential predicted by PC1, while players like Adam Carriker did not.
So let’s think about this year’s draft and one of its more discussed players, Jadeveon Clowney. In Figure 4, you can see Clowney’s name highlighted in red. His PC1 score landed him somewhere between Ziggy Ansah and Robert Quinn…not bad. But move down the figure and you will see names like Jarius Wynn and Stanley McClover, whom you wouldn’t really want to spend a number 1 pick on. This again highlights the objective of my study. I feel PCA did a pretty good job of objectively gauging a prospect’s max potential, but it still does not help with the multitude of other factors that can limit a prospect.
Although PC1 is far from being able to predict anything confidently, the purpose of this analysis is to better inform us of prospect potential. What are NFL GMs gambling on? Was Mario Williams a once-in-a-generation prospect? Will Adam Carriker be a household name or just a big, strong, productive guy with a motor? The utility of PCA and other ordination techniques is that they can help us make sense of data sets that previously seemed extremely messy and noisy. Gambling the franchise on a player’s potential is part of what makes the NFL draft so exciting. However, I think by using tools like PCA, GMs could lower the stakes a bit.
Here it is important to note that PC1 only accounts for approximately 20% of all the variability between these prospects. That leaves 80% of the variance between the prospects still to be explained. This is where I believe the Advanced Football Analytics community can help. I think there is huge potential in using advanced stats within the framework of these ordination techniques. Career sacks per game is by no means a complete metric of NFL performance. Merely weight-normalizing combine numbers probably isn’t the best way to quantify combine performance. But perhaps utilizing metrics developed through this community, or approaches similar to the one by Chase Stuart, in conjunction with ordination techniques, may better separate the noise from the trends. I hope this helps some of you add a new tool to your belt. I am a fan of this site and the fascinating work by the Advanced Football Analytics community. Below I will leave a link to R’s website along with some of my code to get you started. If you need any help getting started with PCA feel free to email me at casan_scott@baylor.edu or casanscott@gmail.com.
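The share of variance carried by each component (about 20% for PC1 in this study) can be read directly from a prcomp() fit. The sketch below is a minimal illustration on simulated, uncorrelated data, so the printed proportions are not the study's numbers and the matrix X with its var1 to var10 columns is purely hypothetical.

```r
# Simulated stand-in for a players-by-measurements table (hypothetical values)
set.seed(2)
X <- matrix(rnorm(82 * 10), nrow = 82,
            dimnames = list(NULL, paste0("var", 1:10)))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation;
# dividing by the total gives the proportion of variance explained
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Running total: how much the first k components explain together
print(round(cumsum(var_explained), 3))

# summary() reports the same proportions in its importance table
print(summary(pca)$importance["Proportion of Variance", 1:3])
```

Looking at the cumulative total is one way to judge whether PC2 or PC3 deserve the same scrutiny as PC1.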
In order to run a PCA in R, I use the following example of code:
http://www.r-project.org/
Import a CSV into R
Ends <- read.table("Defensive End Metrics.csv", header = TRUE, sep = ",")
To run a PCA
prcomp.Ends <- prcomp(~ ., data = Ends[, *], scale. = TRUE)
This plots PC1 versus PC2, in order to give you a spatial understanding of which players are similar and which are not.
plot(prcomp.Ends$x[,1], Ends[,*])
This plots PC1 from the analysis against whatever response variable you choose (i.e., column x from your data frame Ends). I chose to plot PC1 vs. career NFL sacks per game.
*You must specify the rows and columns you want to analyze within the brackets.
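The steps above can be strung together into one runnable sketch. Everything in the data here is made up for illustration: the player labels, the column names (forty, vertical, bench_reps, ncaa_sacks, nfl_sacks_pg), and the values are hypothetical, and the CSV is written to a temporary file so the example runs without the original data. The PC1-versus-PC2 plot is drawn from the first two columns of the scores matrix, which is one reasonable way to do it and not necessarily the call used for the original figures.

```r
# Hypothetical stand-in for the author's CSV: made-up players and values
set.seed(3)
n <- 30
demo <- data.frame(
  player       = paste0("Player", seq_len(n)),
  forty        = round(rnorm(n, 4.8, 0.12), 2),
  vertical     = round(rnorm(n, 33, 3), 1),
  bench_reps   = round(rnorm(n, 24, 4)),
  ncaa_sacks   = round(runif(n, 5, 30), 1),
  nfl_sacks_pg = round(runif(n, 0, 0.8), 2)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Import a CSV into R
Ends <- read.csv(csv_path, stringsAsFactors = FALSE)

# Run the PCA on the predictor columns only (named here instead of Ends[,*])
predictors <- c("forty", "vertical", "bench_reps", "ncaa_sacks")
prcomp.Ends <- prcomp(Ends[, predictors], center = TRUE, scale. = TRUE)

# PC1 versus PC2: players that plot close together have similar profiles
plot(prcomp.Ends$x[, 1], prcomp.Ends$x[, 2], xlab = "PC1", ylab = "PC2")
text(prcomp.Ends$x[, 1], prcomp.Ends$x[, 2],
     labels = Ends$player, pos = 3, cex = 0.6)

# PC1 versus the chosen response variable
plot(prcomp.Ends$x[, 1], Ends$nfl_sacks_pg,
     xlab = "PC1", ylab = "NFL sacks per game")
```

Selecting columns by name rather than by position keeps the response variable and the player labels out of the PCA by construction.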
Would adding some kind of SoS adjustment to the college stats have any effect?
College stats probably need to be normalized by playing time. A player who is not on the field often enough will produce less. Ideally, you should use WP on a play-by-play basis. A pass rush that makes the QB get rid of the ball quickly and ends in an incomplete pass is probably as useful as a sack for no loss, although it may not show up in the stats.
Thank you much for your feedback!
Strength of schedule seems like it definitely could help. That's the beauty of PCA...it decides what is important!
I am currently working on a predictive model that normalizes per game. I certainly agree that improving the value metric for ends would help as well!
Thanks again for your feedback and I look forward to writing more
There are some interesting stats proposed by the writers at rotoviz that could be explored. One of the more interesting is normalizing college performance to age. The younger a given player delivers outstanding performance, the better.
The better predictive variable should be something like pressures (hits/hurries/etc.). I'd like to think that a disruptive player's sack total would smooth out over a large sample size, but maybe that's not always the case.
Another problem being that for ends, sometimes pressure is only half the battle, as teams still run about 50% of the time. A good end doesn't just play the pass, then again, they are often used situationally, which compounds the problem.
good luck!
Really cool read!
In the NFL draft, since you're dealing with multiple options at any given pick, I think these are best for weeding out players who have a relatively high potential to underperform in the context of their draft ratings. For example, in Figure 3, you make note of the marker at 4.8. If you're drafting a 4-3 DE, why wouldn't you significantly downgrade every player that runs over a 4.8? I'd be curious to see all the significant components charted out individually, and I'd be interested to see which players most often hit that minimum cutoff (4.8).
Switching positions here, but I was much higher on Shazier entering the draft than most. One of the primary reasons being that I can't remember a single LB with his combination of athleticism and college production failing in the NFL.
On figure 4, you note where Clowney lands, but I don't see any other prospects from this class. I'd be interested in seeing how PC1 rates players like Jackson Jeffcoat, Kareem Martin, and Kony Ealy. I believe Ealy is overrated, and I'd guess that PC1 here would agree. I'd guess that it'd also agree that Jeffcoat and Martin are underrated. Anyway, like I said, this is great. Thanks.