Using Probabilistic Distributions to Quantify NFL Combine Performance

Casan Scott continues his guest series on evaluating NFL prospects through Principal Component Analysis. By day, Casan is a PhD candidate researching aquatic eco-toxicology at Baylor University.

Many consider Jadeveon Clowney a “once-in-a-decade” or even “once-in-a-generation” pass-rushing talent. Once the top-rated high school recruit in the country, Clowney retained that distinction through three years in college football’s most dominant conference. Teams have traditionally gambled on super-talents like Clowney in the NFL draft with little statistical sense of what future production to actually expect. For all the concerns over his work ethic, dedication, and professionalism, Clowney’s athleticism and potential have never been called into question. But is his athleticism actually that rare? And is his talent worth gambling millions of dollars and the first overall pick on? This article aims to quantify exactly how rare Jadeveon Clowney’s athleticism is in a historical sense.

Jadeveon Clowney set the NFL draft world on fire at this year’s combine, delivering one of the most talked-about performances in recent memory, driven primarily by his blistering 40-yard dash time of 4.53 seconds. Over the years, however, I recall players like Vernon Gholston, Mario Williams, and even Ziggy Ansah displaying mind-boggling athleticism in drills. But if a player displays unprecedented athleticism at the combine every year, who is really impressive enough to be deemed “once-in-a-decade”?

Probability ranking allows me to identify the probability of encountering an athlete’s measurables. For instance, I probability-ranked NFL combine 40-yard dash times for 341 defensive ends from 1999-2014 (Table 1 shows the top 50). In this case, Jadeveon Clowney’s 40 time of 4.53 had a probability rank of 99.12, meaning his speed is in the 99th percentile of all DEs over this span.
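The calculation behind a probability rank is straightforward: for a lower-is-better drill like the 40-yard dash, a player's rank is the share of historical performers he beat. Here's a minimal sketch; the pool of times below is made up for illustration (only Clowney's 4.53 comes from the article, and the real data set has 341 DEs):

```python
from bisect import bisect_right

def probability_rank(time, historical):
    """Percentile rank for a lower-is-better drill: the percentage of
    historical performances strictly slower (larger) than `time`."""
    times = sorted(historical)
    slower = len(times) - bisect_right(times, time)
    return 100.0 * slower / len(times)

# Hypothetical pool of DE 40-yard dash times (not the article's actual data)
pool = [4.53, 4.60, 4.64, 4.70, 4.75, 4.78, 4.81, 4.85, 4.90, 5.02]
print(probability_rank(4.53, pool))  # fastest in this tiny pool -> 90.0
```

With the full 341-player pool, the same function would place a 4.53 in the 99th percentile, as the article reports.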

NFL Prospect Evaluation using Quantile Regression


Extraordinary amounts of data go into evaluating an NFL prospect. The NFL combine, pro days, college statistics, game-tape breakdown, and even personality tests can all play a role in predicting a player’s future in the NFL. Jadeveon Clowney is arguably the most discussed prospect in the 2014 NFL draft not named Johnny Manziel. He is certainly an elite prospect, and potentially the best in this year’s draft, but he doesn’t appear to be a “once-in-a-decade” physical specimen based exclusively on historical combine performances. From the research I’ve done, only Mario Williams and J.J. Watt can make such a claim. Teams have traditionally gambled on super-talents like Clowney in the NFL draft with little statistical sense of what future production to actually expect. All prospects have a “ceiling” and a “floor,” representing the maximum and minimum potential a prospect could realize, respectively. But what does this “potential” mean, and does it hold any importance for actually predicting a prospect’s success in the NFL? In this article I will show how quantile regression, a technique used by quantitative ecologists, can clarify what Clowney’s proverbial “ceiling” and “floor” may be in the NFL.

Athletes are a collection of numerous measured and unmeasured descriptor variables. Figure 1 shows a single predictor (40-yard dash time) plotted against a prospect’s career NFL sacks plus tackles for loss (TFL) per game.
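To make the "ceiling"/"floor" idea concrete: where ordinary regression fits the conditional mean of production at each 40 time, quantile regression fits conditional quantiles, say the 90th percentile (a ceiling) and the 10th (a floor). Here's a rough stdlib-only sketch of the concept using binned quantiles on fabricated data; a real analysis would fit a proper quantile regression, e.g. with statsmodels' QuantReg:

```python
import random
import statistics

random.seed(1)

# Fabricated data: faster 40 times loosely associated with more production
data = [(t, max(0.0, (5.0 - t) * 1.5 + random.gauss(0, 0.2)))
        for t in (random.uniform(4.4, 5.2) for _ in range(500))]

# Bin by 40 time and take the 10th/90th percentile of production per bin --
# a crude stand-in for the conditional quantiles a quantile regression fits.
for lo in (4.4, 4.6, 4.8, 5.0):
    prod = [p for t, p in data if lo <= t < lo + 0.2]
    q = statistics.quantiles(prod, n=10)  # deciles
    floor, ceiling = q[0], q[-1]          # 10th and 90th percentiles
    print(f"40 time {lo:.1f}-{lo + 0.2:.1f}: "
          f"floor={floor:.2f}, ceiling={ceiling:.2f} sacks+TFL/game")
```

The gap between the two quantile lines is exactly the "range of potential" the article goes on to discuss: two prospects with the same 40 time can have very different realized outcomes.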

Podcast Episode 22 - Brian Burke

Brian Burke returns to the show to recap the 2014 NFL draft. He describes the Bayesian Draft Analysis tool he created and discusses the value of trades made by teams during the draft. Brian and Dave then discuss their favorite new addition to the league, John Urschel, and make a pitch to get him to contribute to the site. Brian also previews his new project, WOPR, and explains how it'll help generate data for some previously unanswerable questions.

This episode of Advanced Football Analytics is brought to you by Harry's. Harry's delivers high-quality shave products straight to your door at a fraction of the price of shaving competitors. Go to Harrys.com and use the offer code "AFA" at checkout to save $5 off your first purchase.

Subscribe on iTunes and Stitcher

The AFA Draft Pick of the Year

Was the next Virgil Carter drafted yesterday? Penn State guard John Urschel was taken with a compensatory pick in the 5th round by Baltimore. John stands out because he has an unusual plan for his time after his playing days are over. He says he's very interested in "sports analytics. Data analysis for football."

If he does, he'll analyze circles around the rest of us. While playing for PSU, John earned his degree in math in just three years, then added a master's degree in math, and is currently working on a second master's in math education. He's published research with titles like "Instabilities of the Sun-Jupiter-Asteroid Three Body Problem," "A Space-Time Multigrid Method for the Numerical Valuation of Barrier Options," and "Spectral Bisection of Graphs and Connectedness," in which he proved the Urschel-Zikatanov Generalized Bisection Theorem. Man, I wish I had a theorem named after me.

To us, his most interesting research might be this article he wrote for ESPN The Magazine. He looked at "1) how best to predict a lineman's draft position, 2) that prospect's success in terms of NFL starts, and 3) whether a fringe prospect will be selected." Sounds like it would have made a good guest post here.

The Bayesian draft model estimated the most likely spot for Urschel was pick 167, not far off from his actual selection at 175. The model gave a 43% chance he would still be available at 175. So almost spot on. Interestingly, Urschel's own selection may have been the result of some sharp analytics. Baltimore is known to have "a proprietary formula—a “special sauce,” assistant GM Eric DeCosta calls it—that factors in potential compensatory picks to the free agency cost-benefit analysis."

Urschel could make a killer impact on the world of football analytics if he chose to. However successful his pro career turns out, he'll carry the credibility of a pro-caliber player. Coaches will take what he has to say much more seriously than what an ex-Navy pilot writes on a website.

So, congratulations, John! I'll be rooting for you on the field and off. Play like a Raven!


Project WOPR is Coming

With the Bayesian draft tool completed, I can now turn my attention to Project WOPR. Fans of mid-'80s Matthew Broderick movies may have already figured out what the WOPR is.

I'll give another clue:
Its purpose is to answer the unanswerable questions of football strategy.

But for now, it's taking up my entire basement and has driven my electricity bill through the roof. The liquid-nitrogen cooled 32-core processors aren't cheap either.

Live Updates Tonight

I'll be updating the Bayesian draft model live tonight. I was triple-booked for this evening, and I thought I wouldn't be able to make it happen. But now I'm only double-booked, so in the immortal words of Bill O'Reilly, "F--- IT. WE'LL DO IT LIVE!"

As players are chosen, the probabilities will obviously start changing rapidly. The fact that a player is off the board, and that no one else can fill that slot, is information (known with certainty) that can be fed back through the model. The effects will cascade through the rest of the available picks.
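Mechanically, that feedback step can be as simple as conditioning each remaining player's distribution on the picks already made: the drafted player is removed, the spent pick number is zeroed out for everyone else, and each distribution is renormalized. A hypothetical sketch (the model's actual update is surely richer than this):

```python
def condition_on_pick(distributions, taken_player, pick_no):
    """distributions: {player: {pick #: probability}}.
    Remove the drafted player, then renormalize every other player's
    distribution over the picks that remain possible."""
    distributions.pop(taken_player, None)
    for player, dist in distributions.items():
        dist.pop(pick_no, None)          # that slot can no longer happen
        total = sum(dist.values())
        if total > 0:
            for k in dist:
                dist[k] /= total         # renormalize to sum to 1

# Toy two-player example with made-up probabilities
dists = {
    "A": {1: 0.6, 2: 0.3, 3: 0.1},
    "B": {1: 0.5, 2: 0.25, 3: 0.25},
}
condition_on_pick(dists, "A", 1)
print(dists["B"])  # {2: 0.5, 3: 0.5}
```

Since player A consumed pick 1, the probability mass B had on pick 1 gets redistributed to his remaining picks, which is exactly the cascade described above.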

Unfortunately, the interface won't update automatically for users. You'll need to click refresh or hit F5 after each pick. There will be at least two or three minutes of lag for the updates to work through the system, so be patient.


New Feature on the Draft Model

In my last job I worked with a team of software developers. The interfaces they designed didn't make much sense to me: at heart, each was a giant expanding tree of classes, objects, and properties. Huh? Lots of tiny plus and minus marks everywhere to expand and collapse the accordion. Left-click to view something; right-click to modify it. If you've ever had to deal with the Windows registry, it was like that. Steve Jobs would not have been thrilled.

When I learned a little about object oriented programming, it all made sense. The software engineers were designing the interface for their own convenience, not for ease of use. It made sense from an efficiency standpoint...a programming efficiency standpoint. But from the perspective of the user, it wasn't so efficient. The least used feature was just as accessible as the most common feature, and all of them were hidden until you expanded the right portion of the tree.

Yesterday I realized I was doing the same thing with the draft model. From my point of view, it's easiest to think in terms of players and their probability to be selected at each pick number, because that's how the software that runs the model works. It goes down the list of prospects, player-by-player, looking at the probability he'll be selected pick#-by-pick#.

For the players and their agents, and for fans of particular players, this is ideal. They want to know where and when they'll go. But most users are probably thinking of things from a team's perspective. Whether the user is a team personnel guy or a fan of a team, he'd rather see things from the perspective of a pick number. Right now, a Vikings fan (or exec) would have to click through a dozen or so of the top players to see who's likely to be available at pick #8. And if they were wondering who'd be available if they traded up or down, that's another few dozen clicks. Scroll, click. Scroll, click...
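Generating the team-centric view doesn't require a new model; it's just a pivot of the same player-by-pick probabilities the software already computes. A hypothetical sketch of that inversion:

```python
def by_pick(player_probs):
    """Invert {player: {pick #: P(taken at that pick)}} into
    {pick #: [(player, probability), ...]}, sorted most-likely first."""
    picks = {}
    for player, dist in player_probs.items():
        for pick_no, p in dist.items():
            picks.setdefault(pick_no, []).append((player, p))
    for pick_no in picks:
        picks[pick_no].sort(key=lambda pair: -pair[1])
    return picks

# Made-up numbers, just to show the pivot
probs = {"Clowney": {1: 0.7, 2: 0.3},
         "Manziel": {1: 0.1, 2: 0.4, 8: 0.5}}
view = by_pick(probs)
print(view[1])  # [('Clowney', 0.7), ('Manziel', 0.1)]
```

Now the Vikings fan asks one question of `view[8]` instead of clicking through every top player.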

Podcast Episode 21 - Cade Massey

Cade Massey, Professor of the Practice at the Wharton School of Business, joins the show to discuss his research on the NFL draft. Professor Massey is the co-author of "The Loser's Curse: Decision Making & Market Efficiency in the National Football League Draft", a paper analyzing the market for draft pick trades. He and his co-author, Richard Thaler, discovered that teams picking at the top of the draft actually sacrifice a great deal of what he calls "surplus value" by not trading down for additional selections.

Dave and Cade look at the reasons why teams employ less than optimal strategies, including risk aversion, adherence to norms established by "The Chart" and other psychological factors. Professor Massey defends his paper against critiques, and discusses why he believes the draft is such a compelling spectator event.

Subscribe on iTunes and Stitcher

Bayesian Draft Analysis Tool

This tool is intended to help decision-makers better assess the NFL draft market. Specifically, it estimates the probability that each prospect will be available at each pick number. The estimates come from a Bayesian inference model built on consensus player rankings and on projections from individual experts with a history of accuracy.

For details on how the model works, please refer to these write-ups:

 - A full description of the purpose and capabilities of the model
 - A discussion of the theoretical basis of Bayesian inference as applied to draft modeling
 - More details on the specific methodology

If you want to jump straight to the results, here they are. But I recommend reading a little further for a brief description of what you'll find.


The interface consists of a list of prospects and two primary charts. Selecting a prospect displays the probabilities of when he'll likely be taken. You can filter the selection list by overall ranking or position.

The top chart plots the probabilities the selected prospect will be taken at each pick #. I think this chart is pretty cool because it illustrates the Bayesian inference process. You can actually see the model 'learn' as it refines its estimates with the addition of each new projection. Where there is a firm consensus among experts, the probability distribution is tall and narrow, indicating high confidence. When there is disagreement, the distribution is low and wide, indicating low confidence.
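The 'learning' the chart shows is ordinary sequential Bayesian updating: each expert projection multiplies into the posterior, and agreement between experts sharpens it. A toy illustration over discrete pick numbers (none of these distributions reflect the model's real priors or likelihoods):

```python
def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

def update(prior, likelihood):
    """One Bayesian step: multiply prior by likelihood pick-by-pick,
    then renormalize so the posterior sums to 1."""
    return normalize({k: prior[k] * likelihood.get(k, 0.0) for k in prior})

picks = range(1, 11)
posterior = normalize({k: 1.0 for k in picks})  # start from a flat prior

# Two hypothetical experts who both project the player around pick 5;
# the likelihood decays with distance from each projection.
for proj in (5, 5):
    likelihood = {k: 1.0 / (1 + abs(k - proj)) for k in picks}
    posterior = update(posterior, likelihood)

best = max(posterior, key=posterior.get)
print(best)  # 5
```

Because both experts agree, the second update makes the distribution taller and narrower around pick 5; had they disagreed, the posterior would stay low and wide, just as the chart shows.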

The lower chart is the bottom line. It's the take-away. It depicts the cumulative probability that the selected prospect will remain available at each pick #. For example, currently there's an 82% chance safety Ha Ha Clinton-Dix is available at the #8 pick but only a 26% chance he's available at #14. A team with an eye on a specific player could use this information in deciding whether to trade up or down, and in understanding how far it would need to trade.
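The relationship between the two charts is mechanical: if the top chart gives the probability the player is taken at each pick, then his availability at pick n is one minus the cumulative probability he was taken earlier. A sketch with made-up numbers:

```python
def availability(taken_at):
    """taken_at: {pick #: P(selected at that pick)}.
    Returns {pick #: P(still available when that pick comes up)},
    i.e. 1 minus the cumulative probability of being taken earlier."""
    avail, taken_so_far = {}, 0.0
    for pick_no in sorted(taken_at):
        avail[pick_no] = 1.0 - taken_so_far
        taken_so_far += taken_at[pick_no]
    return avail

# Hypothetical per-pick selection probabilities for one prospect
taken = {4: 0.10, 6: 0.25, 8: 0.30, 11: 0.20, 14: 0.15}
print(availability(taken))
```

This is why the lower chart only ever decreases: every earlier pick can only remove availability, never add it.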



Hovering your cursor over one of the bars on the chart provides some additional context, including which team has that pick and that team's primary needs (according to nfl.com).

The box in the upper right gives you the player's vitals - school, position, height, weight. The expert projections used as inputs to the model are also listed. Currently those include Kiper (ESPN), McShay (Scouts, Inc.), Pat Kirwan (CBS Sports), Daniel Jeremiah (former team scout, NFL Network), and Bucky Brooks (NFL Network). Experts were selected for their reputation, historical accuracy, and independence--that is, they don't all parrot the same projections. Not every prospect has a projection from each expert.

Link to the tool.

Bayesian Draft Model: More Methodology

Boomer, when you think about a guy like Thomas Bayes you think high motor, long arms, quick off the snap. Huge upside in any 3-4 scheme. Gets leverage on those tricky probability theorems right off the block. Game 1 starter for 90% of the teams out there. Writes proofs all the way through the end of the whistle. Definitely like him in the late first, early second round...

The new Bayesian draft model is nearly ready for prime time. Before I launch the full tool publicly, I need to finish describing how it works. Previously, I described its purpose and general approach. And my most recent post described the theoretical underpinnings of Bayesian inference as applied to draft projections. This post will provide more detail on the model's empirical basis.

To review, the purpose of the model is to provide support for decisions. Teams considering trades need the best estimates possible about the likelihood of specific player availability at each pick number. Knowing player availability also plays an important role in deciding which positions to focus on in each round. Plus, it's fun for fans who follow the draft to see which prospects will likely be available to their teams. Hopefully, this tool sits at the intersection of "things helpful to teams" and "things interesting to fans."

Since I went over the math in the previous post, I'll dig right into how the probability distributions that comprise the 'priors' and 'likelihoods' were derived.

I collected three sets of data from the last four drafts: best-player rankings, expert draft projections (mock drafts), and actual draft selections. In a nutshell, to produce the prior distribution, I compared how close each player's consensus best-player ranking was to his actual selection. And to produce the likelihood distributions, I compared how close each player's actual selection was to the experts' mock projections.
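In code, that prior boils down to an empirical distribution over the offset between consensus rank and actual pick. A hypothetical sketch with fabricated offsets (the real model presumably smooths these counts and renormalizes the mass truncated below pick 1):

```python
from collections import Counter

# Fabricated (actual pick - consensus rank) offsets from past drafts
offsets = [0, 1, -1, 2, 0, 3, -2, 1, 0, 5, -1, 2, 1, 0, -3, 4]

counts = Counter(offsets)
n = len(offsets)
prior = {off: c / n for off, c in sorted(counts.items())}

def prior_over_picks(consensus_rank):
    """Prior P(player taken at each pick #), centered on his consensus
    rank. Offsets that land before pick 1 are simply dropped here; a
    real model would renormalize that truncated mass."""
    return {consensus_rank + off: p for off, p in prior.items()
            if consensus_rank + off >= 1}

print(prior_over_picks(3))
```

The likelihood distributions are built the same way, except the offsets are measured against each expert's mock projection instead of the consensus ranking, which is what lets the model weight a historically accurate expert's opinion more sharply.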

Theoretical Explanation of the Bayesian Draft Model

I recently introduced a model for estimating the probabilities of when prospects will be taken in the draft. This post will provide an overview of the principles that underpin it. A future post will go over some of the deeper details of how the inputs for the model were derived.

First, some terminology. P(A) means the "probability of event A," as in the probability it rains in Seattle tomorrow. Event A is 'it rains in Seattle tomorrow'. Likewise, we can define P(B) as the probability that it rains in Seattle today.

P(A|B) means "the probability of event A given event B occurs," as in the probability that it rains in Seattle tomorrow given that it rained there today. This is known as a conditional probability.

The probability it rains in Seattle today and tomorrow can be calculated by P(A|B) * P(B), which should be fairly intuitive. I hope I haven't lost anyone.

It's also intuitive that "raining in Seattle today and tomorrow" is equivalent to "raining in Seattle tomorrow and today." There's no difference at all between those two things, and so there's no difference in their probabilities.

We can write out that equivalence, like this: