Kevin Meers is the Co-President of the Harvard Sports Analysis Collective. He is a senior majoring in economics with a statistics minor, and has spent the past two years or so as an analytics intern in the NFL. He is currently writing his thesis on game theory in the NFL, and probably puts too much thought into how the perfect fantasy football league would be structured.
The coach’s challenge is an important yet poorly understood part of the NFL. We know challenges are an asset, but past that, we do not have a good understanding of what makes a good challenge or if coaches are actually skilled at challenging plays. This post takes a step towards better understanding those questions by examining the value of the possible game states that stem from challenged plays.
To value challenges, we must understand how challenges change the game’s current state. When a play is challenged, the current game state must transition into one of two new game states: one where the challenged play is reversed, the other where it is upheld. These potential game states are the key to valuing challenges.
Let’s look at a concrete example from last season. With two minutes and two seconds left in the fourth quarter in their week ten matchup, Atlanta had first and goal on New Orleans’ ten-yard line. Matt Ryan completed a pass to Harry Douglas, who was ruled down at the Saints’ one-yard line… only Douglas appeared to fumble as he went to the ground, with the Saints recovering the ball for a potential touchback. When New Orleans challenged the ruling on the field, the game could have transitioned into one of two possible game states: Atlanta’s ball with second and goal on the one, or New Orleans’ ball with first and ten on their own 20-yard line. If the Saints lost the challenge, they would have a Win Probability (WP) of 0.28, but if they won, their WP would jump to 0.88. This potential WP added, which I refer to as “leverage,” is key to valuing challenges. Mathematically, I define leverage as:
Leverage = WP(challenge succeeds) − WP(challenge fails)
Given this definition, the leverage for the Saints’ challenge last year was 0.6, making it the most leveraged challenge of the 2012 season.
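As a quick check of the arithmetic, the Saints example can be worked through in a few lines of Python. This is only a sketch: the function and variable names are hypothetical, and the two WP values are the ones quoted above.

```python
def leverage(wp_if_reversed: float, wp_if_upheld: float) -> float:
    """WP the challenging team gains if the call is reversed instead of upheld."""
    return wp_if_reversed - wp_if_upheld


# Saints' challenge: WP of 0.88 if the call is reversed, 0.28 if it stands.
saints_leverage = leverage(wp_if_reversed=0.88, wp_if_upheld=0.28)
print(round(saints_leverage, 2))
```

The result is rounded only for display, since subtracting floating-point numbers can leave a tiny error in the last digits.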
By examining the leverage of each historical challenge, we can begin to quantify the value of challenges. Because turnovers and scoring plays were not automatically reviewed in seasons before 2012, coaches used to challenge touchdowns and turnovers, which are very high leverage plays. The distribution of leverage in those seasons would therefore look much different from the distribution last season and so far this season. For this reason, I’ve restricted this analysis to the 2012 season.
Leverage
This histogram shows the distribution of leverage on the 140 plays that coaches challenged during the 2012 season. The average leverage was 0.07 WP (seven percentage points of win probability). Over 90% of challenges occurred on plays whose leverage was less than or equal to 0.15 WP. Past 0.15 WP, the distribution becomes very sparse: these plays had huge swings in WP, but they were also very rare events.
The most interesting takeaway from this graphic is how many challenges have almost zero leverage. Over 15% of challenges last season had no leverage. In other words, the team challenging the play should be indifferent between the challenge succeeding and failing. On these plays, the coach spent a challenge and risked losing a timeout for no potential benefit, which would be highly illogical. With such an odd result, I went back to look at these play descriptions, and found three notable patterns.
First, about half of these challenges occurred in game states where the score differential was over 14, and it is hard for non-scoring plays to significantly change win probability in those cases. Of the remaining zero-leverage plays, about half were challenges that would bring the offense from “close to their opponent’s end zone” to “a bit closer to their opponent’s end zone,” which doesn’t really help the team’s WP very much (or at all).
Neither of these reasons makes the decisions to challenge those plays better, but the last reason I found might let some decisions off the hook. Many of these zero leverage plays involved keeping a highly efficient offense on the field (or getting it off the field, depending on who challenged the play) by challenging the result of a third down play. For example, in the infamous matchup between the Packers and the Seahawks last season, Green Bay challenged a third down measurement to keep Aaron Rodgers and Co. in the game. In cases like this one, the WP model may not fully account for the specific in-game strengths and weaknesses of the teams involved (since it is based on an average team). Therefore, in some circumstances, challenges might appear to have no leverage only because of how our WP estimates are built.
Does Leverage Affect a Challenge’s Success?
We can also look at the distributions of successful and failed challenges by the challenge’s leverage. If the distributions are meaningfully different, it could tell us something about which challenges are more likely to succeed than others.
This plot shows that very low leverage challenges succeeded more often than higher leverage challenges. In fact, after leverage reaches about 0.15 WP, challenges became much more likely to fail, and it seems at least possible that officials are less likely to overturn plays that are hugely influential on the outcome of the game. The results of a logistic regression lend some support to this theory, with a negative coefficient on leverage with p = 0.07. However, this result is entirely driven by a handful of failed high leverage challenges from one year of data, so I’m hesitant to declare that this relationship definitively exists.
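A simpler check than the logistic regression is to compare success rates on either side of the 0.15 WP mark. The Python sketch below does only that. The function name is hypothetical, and the sample records are invented purely for illustration; they are not the 2012 challenge data.

```python
def success_rates(challenges, threshold=0.15):
    """Return (low_rate, high_rate): the share of successful challenges
    with leverage <= threshold and with leverage > threshold.
    A rate is None when its group is empty."""
    low = [won for lev, won in challenges if lev <= threshold]
    high = [won for lev, won in challenges if lev > threshold]

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(low), rate(high)


# Invented (leverage, challenge_succeeded) pairs, NOT real data.
sample = [
    (0.00, True), (0.02, True), (0.05, False), (0.10, True),
    (0.20, False), (0.35, False), (0.60, True),
]
low_rate, high_rate = success_rates(sample)
print(low_rate, high_rate)
```

With only a handful of challenges above the threshold in a single season, a gap between the two rates could easily be noise, which is the same caveat that applies to the regression result.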
Using Leverage to Value Timeouts
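We can also use our findings on leverage to put a reasonable bound on the value of the average timeout. For a rational coach to challenge a play, he must think that:
E(Challenge) ≥ E(Timeout) ≥ 0
where E(Challenge), the expected value of a challenge, is just the challenge’s leverage multiplied by the probability of success:
E(Challenge) = Leverage × P(Success)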
Leverage and probability of success can obviously change from challenge to challenge, and the value of a team’s last timeout may be significantly different from the value of its first timeout. However, given an average leverage and average success rate, coaches (if they are behaving rationally) value their timeouts, on average, at 0.03 WP at most.
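The bound can be illustrated with a short Python sketch, assuming the expected value of a challenge is its leverage multiplied by its probability of success. The names below are hypothetical. The average success rate is not stated in this post, so the sketch backs it out from the two figures that are (0.07 average leverage and the 0.03 bound); since both are rounded, the implied rate is only approximate.

```python
def expected_challenge_value(leverage: float, p_success: float) -> float:
    """Expected WP gained from a challenge: leverage times chance of reversal."""
    return leverage * p_success


def break_even_probability(leverage: float, timeout_value: float) -> float:
    """Smallest success probability at which a challenge is worth risking
    a timeout valued at timeout_value (in WP)."""
    return timeout_value / leverage


AVG_LEVERAGE = 0.07   # average leverage quoted in the post
TIMEOUT_BOUND = 0.03  # upper bound on timeout value quoted in the post

# Success rate implied by the two rounded figures above (an inference,
# not a number reported in the post).
implied_success_rate = TIMEOUT_BOUND / AVG_LEVERAGE
print(round(implied_success_rate, 2))

# For a 0.6-leverage play like the Saints' challenge, even a small chance
# of reversal clears the bar if a timeout is worth at most 0.03 WP.
print(round(break_even_probability(0.6, TIMEOUT_BOUND), 2))
```

The second function shows why high leverage plays can justify long-shot challenges: the required probability of success falls in proportion as leverage rises.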
Conclusions
This post has covered a lot of ground: establishing what leverage means in the context of challenges, what its distribution looks like (at least in the 2012 season), how leverage affects the probability of a successful challenge, and putting a reasonable upper bound on the value of a timeout. While we've made progress on our original questions, this work can both be improved and expanded upon. As more and more challenges happen, we’ll get a better and better idea of what the distribution of leverage actually looks like, which will improve all of this analysis. Logical expansions from here include examining which kinds of challenges tend to have the highest (or lowest) leverage, whether certain teams or coaches are better at challenging high leverage plays, and any others you can think of. We also don’t know whether these are the plays coaches should be challenging, which is an interesting question for another time.
Related to your last statement, I was thinking one of the difficulties in your analysis has to be which plays that were eligible to be challenged were not. As they were not challenged, we can't assess whether a challenge would have been successful. Examining them would be a vast undertaking, and the following issues would surely arise:
-Was the coach simply out of challenges, though he would like to challenge the play?
-Was the benefit of a successful challenge not worth the cost of using a challenge opportunity (e.g. a 10-yard gain on first down in the first quarter that the official spots just shy of the first down marker)?
-Was the likelihood of overturning the call on the field too low to justify the benefit of a successful challenge?
-Would the play have been challenged but the offense ran the next play before the challenge flag could get thrown?
From a bad-reasoning standpoint, could you re-run this analysis changing one key factor: assume all games are tied, or that no team is down by more than a touchdown (maybe 10 points)?
I only ask because part of coaching is also appeasing your players and fans. So while your analysis shows the (real) value of a challenge, I would be curious whether it could be shifted to the value as perceived by fans (or players), assuming fans always believe there is a chance to win. This would show whether coaches tend to challenge plays that would have a high win probability value in a close game even though they don't in the actual game. If that hypothesis is true, then (I think) similar types of plays are challenged no matter the game state, potentially showing that coaches don't adjust, or that they go through the motions because it is expected (good coaching or not).
Another way to look at the value: instead of raw WP change, you could use the percentage change in WP. With a 0.5% WP, a 0.1-point change (from 0.5% to 0.6%) is a 20% gain, while with a 50% WP the same 0.1-point change is only a 0.2% gain. Do coaches' challenges do a better job of maximizing percentage change in WP than raw WP?
"This plot shows that very low leverage challenges succeeded more often than higher leverage challenges. In fact, after leverage reaches about 0.15 WP, challenges became much more likely to fail, and it seems at least possible that officials are less likely to overturn plays that are hugely influential on the outcome of the game."
Couldn't this just be selection bias? A coach is more likely to challenge an "obvious" bad call even if it is low leverage. And a coach is more likely to challenge just an "iffy" call as long as the payoff is big enough from a WP standpoint.
> .... it seems at least possible that officials are
> less likely to overturn plays that are hugely
> influential on the outcome of the game.
I think coaches are also more likely to make long-shot challenges in high leverage situations.
> ...the expected value of a challenge, which
> is just the challenge’s leverage multiplied the
> probability of success:
That's an implicit assumption that the value of a failed challenge is zero, but the challenge will still - for example - stop the clock. (Under specific circumstances it can even stop the clock for a longer time than a time out would.)
"coaches (if they are behaving rationally) value their timeouts, on average, at 0.03 WP at most."
That is quite a statement. It is based on the equation,
"E(Challenge) ≥ E(Timeout) ≥ 0"
But perhaps consider, strangely enough, that coaches value timeouts more than challenges. Challenges cannot be made in the last 2 minutes, where they often have the most impact. And coaches cannot challenge at just any point in a game (or at least cannot challenge effectively), whereas a timeout can be called at any time.
Maybe this proves that coaches value timeouts more than .03 WP.
There's some selection bias in the success or failure of high/low leverage situations since it's much more likely a coach needs to be sure of a low leverage challenge than a high one. If you binned the leverage to get a success rate, what does the leverage*p(success) look like? I'd expect a flattish line, though there's probably not enough data above 0.2 leverage.
I think we should expect to see a lot more failed challenges in the high-leverage range than we do. The higher the leverage, the lower the probability of reversal required for the challenge to have positive expected value.
The unsuccessful challenges do appear to be more densely distributed in the higher leverage range, but not much. It should (ideally) be proportional, where double the leverage means the challenge should be half as successful.
I am not surprised to see a higher failure rate among high-leverage challenges. My anecdotal observation is that coaches sometimes treat their challenge as a Hail Mary attempt. I have seen plenty of times when the replay seems to CLEARLY confirm the original call, and yet a coach challenges the call because A) a reversal would lead to a huge swing in leverage, and B) the coach is in a desperate situation and seemingly has little choice but to make such a specious challenge.
Interesting topic for consideration. I agree that there is certainly more about this that can be fleshed out and that there is huge room for coaches to improve here and we just don't understand the nuances of that as well as for fourth downs and passing v running.
Is it possible that low leverage challenges in games that have a 14+ lead (you note that this accounts for half of the really low leverage red flags) are not irrational? In these games, challenges don't need anywhere near the expected payoff they would need in closer games to be justified, because timeouts are certainly less valuable in blowout games than they are in close ones. As the data set becomes more robust it might be advisable to ignore those games or separate them into bins. You might see interesting results and be able to determine the value of a timeout in games of different score spreads too, if indeed your hypothesis about coaches rationally judging challenges and timeouts in terms of WPA holds (it seems to make sense to me).
I'm also inclined to agree with those who are suspicious of the "officials are not likely to overrule high leverage plays" claim. If coaches are rational then they see that the "pot odds" favor them in high leverage situations even when the chance of success is low, as others have stated. Additionally, there is such intensive scrutiny on officials by both the public (via the media) and the league itself (internal grading) that I can't imagine this being a major factor. They have lots of camera angles and it seems to me that they get it right a really high percentage of the time. Most errors are the officials applying a badly written rule correctly (think tuck rule or Calvin Johnson "catch" in Chicago).
I was surprised that there wasn't an even greater concentration of failed challenges in this "high payoff long-shot" category.
I think your analysis of the expected value of the average timeout is incorrect.
If I understand correctly, you argue that because the coach chose to challenge rather than take a timeout, you can conclude
E(challenge) >= E(timeout)
But you've only established this in situations *when the coach chose to challenge*. There's no reason to think those situations are good times to call a timeout. If timeouts are actually taken in better situations, then the average value of a timeout *when timeouts are actually taken* will be higher.
I like the Hail Mary challenge explanation way more than my "Officials are biased" explanation - makes much more sense from a decision theory perspective.
>I like the Hail Mary challenge explanation way more than my "Officials are biased" explanation - makes much more sense from a decision theory perspective.
Agreed. Near the end of a game, when leverage is likely to be the highest, coaches are more willing to throw a challenge just in case, even if they know they have a 1% chance of success.
Another thing I just realized - if the challenge is successful, the team gets their timeout back. If the team was planning on calling a timeout anyway, why NOT throw the flag? It's a free timeout in that case. Wait for a play with even a small chance of being overturned, even if it's not high leverage, and throw the flag. Challenges go away once you reach the 2 minute warning as well, so you may as well use them before then.
And the final point was that challenges have decreased since the league began automatically reviewing all scoring plays and turnovers. Those are the highest leverage plays you can have, so the only plays coaches really need to challenge anymore are catch/no catch on a 1st down conversion, and the spot of the ball.
The number of challenges has decreased sharply, from 249 in 2010, to 209 in 2011, to just 157 in 2012, for what that's worth.
"And the final point was that challenges have decreased since the league began automatically reviewing all scoring plays and turnovers. Those are the highest leverage plays you can have, so the only plays coaches really need to challenge anymore are catch/no catch on a 1st down conversion, and the spot of the ball."
This is partially, but not completely true. Almost-scores and almost-turnovers are not automatically reviewed, but have just as much leverage as scores and turnovers. Those still constitute a significant portion of challenges.