Advanced Football Analytics (formerly Advanced NFL Stats): NFL Prospect Evaluation using Quantile Regression

Casan Scott continues his guest series on evaluating NFL prospects through Principal Component Analysis. By day, Casan is a PhD candidate researching aquatic eco-toxicology at Baylor University.

Extraordinary amounts of data go into evaluating an NFL prospect. The NFL combine, pro days, college statistics, game tape breakdown, and even personality tests can all play a role in predicting a player’s future in the NFL. Jadeveon Clowney is arguably the most discussed prospect in the 2014 NFL draft not named Johnny Manziel. He is certainly an elite prospect and potentially the best in this year’s draft, but he doesn’t appear to be a “once-in-a-decade” type of physical specimen based exclusively on historical combine performances. From the research I’ve done, only Mario Williams and JJ Watt can make such a claim. Super-talents like Clowney have traditionally been gambled on in the NFL draft with little idea of what future production can statistically be expected. All prospects have a “ceiling” and a “floor”, which represent the highest and lowest potential that a prospect could realize, respectively. But what does this “potential” mean, and does it hold any importance for actually predicting a prospect’s success in the NFL? In this article I will show how Quantile Regression, a technique used by quantitative ecologists, can clarify what Clowney’s proverbial “ceiling” and “floor” may be in the NFL.

Athletes are a collection of numerous measured and unmeasured descriptor variables. Figure 1 shows a single predictor (40-yard dash time) vs. a prospect’s career NFL sacks + tackles for loss (TFL) per game.

But this relationship is unclear. Figure 2 shows us how only 2 additional descriptive variables (Vertical Leap and Shuttle run time) have blurred this correlation.

Principal Components can be used as predictors or to inform what measurements are statistically most interesting when creating a metric. In a previous article for Advanced Football Analytics, I introduced Principal Component Analysis (PCA) as a tool for NFL draft player evaluation. That article focused on a group of 82 defensive end prospects over the past decade. I found that Principal Component 1 (PC1) accounted for about 20% of the variance. The contribution of each individual parameter to PC1 (its eigenvector coefficient) is given in Table 1. Table 2 lists the “significant” parameters of PC1.

Of all these measurements, the following appeared to be the only significant measures in PC1:

Table 1: Eigenvectors to PC1. Data from: http://www.advancedfootballanalytics.com/2014/04/draft-prospect-evaluation-using.html

Table 2: Significant measures selected from PC1 in Table 1.

To build a predictive model, I performed a multiple regression for the group of 82 defensive ends using all these measures (Table 2) except games played and assisted tackles. To still account for games played, I normalized solo tackles, total tackles, tackles for loss, forced fumbles, and sacks to per-game rates by dividing each by games played. I did not include assisted tackles because I felt that using both solo and total tackles already accounted for them.
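The per-game normalization described above can be sketched in a few lines of Python. This is only an illustration: the stat names and the career totals below are made up for the example and are not taken from the data set used in the article.

```python
# Hypothetical college career totals for one player (illustrative values only).
career_totals = {
    "games_played": 36,
    "solo_tackles": 90,
    "total_tackles": 126,
    "tackles_for_loss": 45,
    "forced_fumbles": 9,
    "sacks": 27,
}

def per_game(totals):
    """Divide every counting stat by games played to get per-game rates."""
    games = totals["games_played"]
    if games <= 0:
        raise ValueError("games_played must be positive")
    return {
        f"{stat}_per_game": value / games
        for stat, value in totals.items()
        if stat != "games_played"
    }

rates = per_game(career_totals)
for stat, rate in rates.items():
    print(f"{stat}: {rate:.3f}")
```

Working with rates rather than totals keeps a player with a long college career from looking more productive than an equally effective player who appeared in fewer games.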

I performed the multiple regression using these 11 measures as predictors of a player’s “Career NFL sacks + tackles for loss per game”. Admittedly, there are better ways to quantify a defensive end’s value in the NFL, and the advanced football analytics community can certainly help here. The multiple regression returned the following correlation: R = 0.704 and R² = 0.496.

I used the following regression equation from the statistical output to predict “Career NFL sacks + tackles for loss per game” for each of our 82 defensive ends:

“Career NFL sacks + tackles for loss per game” =

-3.294 + (0.0640 * 40-Yard Dash) + (0.00664 * Vertical Leap) - (0.000392 * Broad Jump) - (0.255 * Shuttle) + (0.439 * 3-Cone Drill) + (0.551 * NCAA tackles for loss per game) - (0.579 * NCAA sacks per game) + (0.320 * NCAA solo tackles per game) - (0.138 * NCAA total tackles per game) - (0.00327 * Body Weight) + (0.00181 * NCAA forced fumbles per game)
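For readers who want to experiment with the equation, here is a minimal Python sketch that evaluates it. The coefficients are copied from the equation above; the dictionary keys, the function name, and the sample prospect are hypothetical. The article does not state the measurement units, so the sample inputs (seconds, inches, pounds) are assumptions, and the absolute output for made-up inputs should not be read as a realistic forecast.

```python
# Intercept and coefficients copied from the regression equation in the article.
INTERCEPT = -3.294
COEFFICIENTS = {
    "forty_yard_dash": 0.0640,
    "vertical_leap": 0.00664,
    "broad_jump": -0.000392,
    "shuttle": -0.255,
    "three_cone": 0.439,
    "ncaa_tfl_per_game": 0.551,
    "ncaa_sacks_per_game": -0.579,
    "ncaa_solo_tackles_per_game": 0.320,
    "ncaa_total_tackles_per_game": -0.138,
    "body_weight": -0.00327,
    "ncaa_forced_fumbles_per_game": 0.00181,
}

def predict_sacks_tfl_per_game(prospect):
    """Evaluate the linear equation for one prospect given as a dict of measures."""
    missing = set(COEFFICIENTS) - set(prospect)
    if missing:
        raise KeyError(f"missing measures: {sorted(missing)}")
    return INTERCEPT + sum(coef * prospect[name] for name, coef in COEFFICIENTS.items())

# Hypothetical prospect; the units (seconds, inches, pounds) are assumed.
prospect = {
    "forty_yard_dash": 4.6, "vertical_leap": 35.0, "broad_jump": 120.0,
    "shuttle": 4.3, "three_cone": 7.0, "ncaa_tfl_per_game": 1.2,
    "ncaa_sacks_per_game": 0.7, "ncaa_solo_tackles_per_game": 3.0,
    "ncaa_total_tackles_per_game": 4.5, "body_weight": 265.0,
    "ncaa_forced_fumbles_per_game": 0.2,
}

baseline = predict_sacks_tfl_per_game(prospect)
more_tfl = dict(prospect, ncaa_tfl_per_game=prospect["ncaa_tfl_per_game"] + 0.5)
change = predict_sacks_tfl_per_game(more_tfl) - baseline
print(f"change in prediction for +0.5 NCAA TFL per game: {change:+.4f}")
```

Because the model is linear, the change in the prediction depends only on the coefficient and the size of the change in that one measure, whatever the other inputs are.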

I plotted predicted vs. actual recorded NFL “sacks + tackles for loss per game” for each of our 82 defensive ends (Figure 3).

In Figure 3 we can see there is a general trend, but we can already see that the correlation (r²) is going to be pretty weak. A simple linear regression fits a single line through the center of the data: ordinary least squares estimates the average response, which sits close to the median quantile (50th percentile) when the scatter is roughly symmetric. This is why a line is drawn directly through the center of a regression. Yet, by looking only at this central response, models may be overlooking very meaningful relationships between those predictor stats/numbers and a prospect’s NFL success. However, when I perform a Quantile Regression using the 25th, 50th, 75th, and 95th quantiles, the picture becomes a bit clearer.

Quantile regression is a technique to estimate the quantiles of a response variable distribution in a linear model. Quantiles are essentially percentiles, so data at the 0.5 quantile are equal to the 50th percentile. Figure 4 shows 4 different hypothetical example data sets illustrating how limiting factors can control responses. Figure 4.A shows a direct relationship where only the measured predictor, NCAA Sacks per Game, limits how many NFL sacks per game a player has. Figure 4.B shows what the data looks like when an additional limiting factor is present, but not measured. This additional factor could be a player’s body weight or height. Figure 4.C shows more than one limiting factor for a number of players (represented by the data points). In Figure 4.D, we see many unmeasured limiting factors for many of the players, resulting in a wedge-shaped distribution.
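To make the idea of a quantile as something a model can fit more concrete, here is a small Python sketch of the standard “check” (or pinball) loss that quantile regression minimizes. The article does not say which software was used, so this is only an illustration, and the production numbers are hypothetical. For a single group of players with no predictor, the value that minimizes the total check loss is simply the sample quantile; quantile regression replaces that single value with a line.

```python
def pinball_loss(residual, tau):
    """Check loss: weight tau on under-predictions, (1 - tau) on over-predictions."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

def best_constant(values, tau):
    """Return the observed value that minimizes total pinball loss for quantile tau."""
    return min(values, key=lambda c: sum(pinball_loss(v - c, tau) for v in values))

# Hypothetical NFL sacks+TFL per game for 11 players (illustrative values only).
production = [0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.85, 1.10]

for tau in (0.25, 0.50, 0.75, 0.95):
    print(f"quantile {tau:.2f}: {best_constant(production, tau):.2f}")
```

With tau = 0.5 the loss treats over- and under-predictions equally and the minimizer is the median; a higher tau penalizes under-predictions more heavily, which pushes the fitted value toward the “ceiling”.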

The center of the distribution (the average response, which is close to the 50th percentile, or 0.5 quantile) is what linear regressions traditionally use to establish the relationship between the predictor and the response (solid lines in Figure 4). However, an NFL GM may be interested in the “ceiling” (95th or 75th) or the “floor” (25th or lower) of a prospect that they are gambling millions on. In Figure 5, we see just this. Players like Whitney Mercilus, Ziggy Ansah, Robert Quinn, and JJ Watt fell along the 95th quantile of observed NFL production, meaning they undoubtedly realized their potential. Mario Williams, Brian Robison, and Anthony Spencer fell along the median quantile or 50th percentile, meaning they neither exceeded nor disappointed, statistically speaking. Jadeveon Clowney’s name is highlighted in red in Figure 5. Clowney’s predicted production (0.74) falls just short of Anthony Spencer’s predicted NFL numbers (0.83). At a predicted value of 0.74, there is essentially a range of 0.75 observed NFL sacks+TFL per game between the 25th and 75th quantiles. That seems like a lot of uncertainty to gamble on. But notice how when this model predicts that a player will record at least 0.6 sacks+TFL per game over his career, there is good reason to believe that he will not completely bust (“Bust Threshold” in Figure 5). There is a tremendous number of “busts” among players predicted to record below this threshold of 0.6 sacks+TFL per game.

Quantile regression can isolate differences in the linear relationship at different percentiles of the population. Figure 6 shows the change in Y-intercept (A) and slope (B) across a gradient of percentiles. This is yet another utility of Quantile Regression: it does not only look at the average response, but also shows those who over- and underachieve.
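The way intercept and slope shift across percentiles can be reproduced on a toy example. The Python sketch below is not the code behind Figure 6; it is a brute-force illustration on a hypothetical, idealized wedge-shaped data set. It relies on a known property of quantile regression with one predictor and an intercept: at least one optimal line passes through two data points, so for a small data set it is enough to try every pair of points and keep the line with the lowest total check loss. Real analyses would normally use a dedicated quantile regression routine instead.

```python
import itertools

def pinball_loss(residual, tau):
    """Check loss: weight tau on under-predictions, (1 - tau) on over-predictions."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

def fit_quantile_line(xs, ys, tau):
    """Return (intercept, slope) minimizing total pinball loss by exhaustive search."""
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in itertools.combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression fit
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(pinball_loss(y - (intercept + slope * x), tau) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with different x values")
    return best[1], best[2]

# Hypothetical idealized wedge: 11 players at each of 6 predicted values.
# Player "level" 0 stays at zero production; level 10 sits on the upper edge.
predicted = [0.2 * i for i in range(1, 7)]
xs, ys = [], []
for x in predicted:
    for level in range(11):
        xs.append(x)
        ys.append(level * (0.02 + 0.1 * x))

fits = {tau: fit_quantile_line(xs, ys, tau) for tau in (0.25, 0.50, 0.75, 0.95)}
for tau, (intercept, slope) in fits.items():
    print(f"quantile {tau:.2f}: intercept {intercept:.2f}, slope {slope:.2f}")
```

In this made-up wedge both the intercept and the slope grow with the quantile: the 25th-percentile line is nearly flat, while the 95th-percentile line climbs steeply with the predicted value. A single line through the center would hide that spread.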

Quantile Regression helps detect trends in draft prospects that may have been previously dismissed as statistically indistinguishable. Jadeveon Clowney could realize his “ceiling” as an all-pro performer like Robert Quinn, or fall to the “floor” like Vernon Gholston. The usefulness of the scouting combine has been questioned for years. However, this study shows that for defensive ends, the combine along with a player’s production in college DOES matter. In messy data sets, Quantile Regression addresses the large amount of variability caused by all the things that we don’t quantify. Can we measure motivation, dedication, work ethic, or focus? No, but with techniques like PCA and Quantile Regression we may be able to better account for those variables that we simply cannot attach a measurement to.

Feel free to contact me at Casan_Scott@Baylor.edu or casanscott@gmail.com with any comments, questions, or advice. I’d love to share any methods, coding, etc. with anyone interested.

3 Responses to “NFL Prospect Evaluation using Quantile Regression”

Anonymous says: Friday, June 27, 2014; Truly impressive work. Tho I'm pretty sure you're inadvertently double counting sacks as tackles for loss. NFLGSIS now has tackles for loss broken out versus run and pass. And a player's TFLp are almost mirror image to his sack haul (within +/- 1). Dunno why the discrepancy. Mining gamebooks doesn't provide absolute answers. But it does appear as tho they do count almost all sacks as TFL. It would be helpful if they included clear definitions for their stats, but they don't. Would be a lot more work, but thru NFLGSIS you might be able to just add sacks and TFLr, thus excluding TFLp. Easier yet, tho less accurate, just use TFL and ditch sacks.

Again, great work.

Anonymous says: Monday, June 30, 2014; Funny that the super-raw Ziggy Ansah came close to hitting his ceiling his rookie year. While, if you plug in his pro-day numbers, out-of-the-box-ready Jarvis Jones would land somewhere to the left of the lower left corner of the box in Fig5. Obviously Vegas doesn't use anything resembling your model, as they set Jones' O/A for rookie sacks at 8.5 (Zig at 4.5, Mingo 3, btw). Maybe they should.

Anonymous says: Monday, August 18, 2014; I enjoyed the article! However, why did you focus only on PC1; did you do a scree plot, or a parallel analysis, for example?
