The Problem with Power Rankings

Power rankings, everybody's got 'em. ESPN, Sports Illustrated, and every sports page of every major paper in the country. There's not much football going on during the week, so fans need something to chew on.

There's one big difference between other rankings and the kind you'll find here. The rankings here are tuned to be predictive rather than explanatory. There's a lot of randomness in football game outcomes, and the model here is designed to ignore the noise and focus on the signal. This approach can sometimes produce curious results, as teams with relatively poor records are often ranked well ahead of teams with better records.

At the time this was written, a 2-2 team was ranked #1, while the two 5-0 teams were ranked #5 and #8. There was even a 4-1 team ranked #15, behind a 2-3 team ranked #13. Certainly, there must be something wrong here, at least if we're to believe the dozens of comments in the weekly rankings posts.

Maybe not. Consider a world in which all NFL teams were perfectly evenly matched. Every game would be little different from the flip of a coin. In this kind of world, there would likely still be at least one 5-0 team and one 0-5 team, despite every team being exactly equal in strength. In this hypothetical world, a power ranking based on team record would be an exercise in self-delusion.
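A quick back-of-the-envelope check bears this out. Treating each of a team's five games as an independent coin flip, and (as a simplification, since in a real schedule one team's win is another team's loss) treating the 32 teams as independent of each other, the chance that at least one team starts 5-0 is about 64%:

```python
# Chance a given coin-flip team goes 5-0: (1/2)^5 = 1/32.
p_one = 0.5 ** 5

# Chance at least one of 32 teams does it, assuming independence
# across teams (a simplification; real schedules are correlated).
p_any = 1 - (1 - p_one) ** 32
print(round(p_any, 2))  # ~0.64
```

By symmetry, an 0-5 team is equally likely, so a league full of coin flips would still usually hand us an "elite" undefeated team and a "hopeless" winless one.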

The real NFL isn't too far off from the National Coin Flip League. Certainly there are better and worse teams, but there is also randomness. So although team records are not complete illusions, they are partially random and often misleading, especially early in the season.

Most power rankings of one kind or another are heavily influenced by team record, and it's easy to understand why. Even the sharpest and most diligent sports pundits, who have the luxury of watching a wall of simultaneous games on Sunday, can only take in a portion of the action. And they remember even less; their brains aren't TiVos, after all. And the previous weeks' action is an even fainter memory. And what they do remember is likely biased by the information they have at their fingertips, most prominently game scores. So it's not surprising that most power rankings closely match team records.

The illusion of power rankings goes deep. Even if power rankings are not necessarily good predictors of future performance, they can still be sound predictors of final team records and playoff qualification. The reason is the wins-in-hand effect. Season-ending records and playoff qualification are a function of two factors: 1) to-date wins in hand, and 2) future win expectancy. Power rankings that are overly influenced by to-date wins in hand will appear to be far more accurate by the end of the season than they really are. Even objective quantitative rankings that over-fit to the noise of past performance will intuitively seem to be a much better gauge of team strength than they truly are. If you're grading someone's rankings according to how well they match final records, you're letting them cheat. They've glimpsed half the answer key from the teacher's desk.

This illusion also tricks us into thinking the NFL is more predictable than it really is. It feeds our hindsight bias because everyone's power rankings tend to end up matching the playoff seedings, largely due to the wins-in-hand effect. And worse, everyone's power rankings tend to match each other's power rankings, creating a mirage of predictability and certainty.

The model at this site focuses exclusively on factor #2: future win expectancy. You already know factor #1 (to-date team wins), so the last thing you need from us is yet another set of rankings based heavily on win-loss records.


21 Responses to “The Problem with Power Rankings”

  1. Anonymous says:

    Not that this is necessarily feasible or a can of worms you want to open, but would adding a QB's career INT rate (for veteran QBs) as an input to the model help its accuracy or not? It seems that because INTs are so rare (sample so small), your model has to regress YTD INT rates heavily to league average -- and correctly so, if you are limited to current-season inputs. But I would think that long-term INTs rates are fairly predictive on a player level, and so would possibly account for some of the observed prediction error.

    The other way to ask this is: do you think your model tends to underrate teams with long-term low-interception QBs, and overrate teams with long-term high-interception QBs, because the model is not allowed to know info. from past seasons and thus has to be very conservative in projecting abnormal INT rates?
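    For what it's worth, the regression the commenter describes can be sketched as simple shrinkage toward the league average. The league rate and prior weight below are purely illustrative assumptions, not the site's actual parameters:

```python
def shrunk_int_rate(ints, attempts, league_rate=0.028, prior_attempts=600):
    """Pull a small-sample INT rate toward the league rate.

    Equivalent to adding `prior_attempts` of league-average throws to the
    observed sample; both parameters are illustrative, not the model's.
    """
    return (ints + league_rate * prior_attempts) / (attempts + prior_attempts)

# 5 picks in 150 attempts is a scary 3.3% observed rate, but the small
# sample shrinks it most of the way back toward the league average.
print(round(shrunk_int_rate(5, 150), 4))  # ~0.029
```

    Feeding a veteran QB's career attempts into `prior_attempts` instead of a league-wide constant is exactly the kind of personalized prior the commenter is asking about.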

  2. Anonymous says:

    I'd be interested to know the answer to that question as well.

  3. Brian Burke says:

    There's definitely something to that. I've looked at year-to-year int correlation. It's very low, but so are intra-season int rate correlations.

    Here's more from Chase Stuart.

  4. Jim Glass says:

    Chase Stuart's point about QB picks being very random year-to-year back through all 16-game seasons remains true this year -- Brady had only 4 all last year, already has 8 this year including 4 in one game -- and was true back through the 14- and 12-game-season eras as well.

    Back when there were a lot more interceptions thrown than today, Bart Starr won his one MVP award in 1966 on the back of the miracle achievement of throwing only three picks for the year -- the all-time low number.

    The next season he started by throwing 9 in his first two games, and for the full year threw more picks than any other QB in the league. Same QB, same teammates on offense, same system, same coach (same end result winning the Super Bowl). Go figure. There's a lot of randomness in this game.

  5. Anonymous says:

    I sense this is a preemptive justification for Dallas being atop the week 7 power rankings! And, I would agree.

  6. Tom says:

    I agree entirely with your sentiment, Brian, but I have to admit that the model, relative to others, is not doing well this year. This season has been very predictable so far, but the model here has struggled to capture that predictability.

  7. Anonymous says:

    That's simply not true. Brian's model is within a couple games of Vegas favorites.

  8. Tom says:

    In comparison to the system average at thepredictiontracker it is not doing all that well. It may be within three or four, but that is a large difference over a few weeks. I deliberately picked the system average because it doesn't suffer from the luck bias that individual systems can; there are those that have done better.
    My point was more that, whilst it is easy to claim that these statistics are intrinsically better predictors, they do not in practice outperform models that use the more common statistics in intelligent ways. Whether this means that you can capture as much with points and yardage as you can with rate stats, or that the current use of these rate stats is inadequate, I cannot say. But I can say that at present this model, whilst interesting and potentially powerful, is not separating itself from the crowd.

  9. Ted says:

    You've got to be kidding. Based on 3 weeks of games?

  10. Anonymous says:

    "This season has been very predictable so far, but the model here has struggled to capture that predictability."

    You just don't understand how things work, do you?

  11. Tom says:

    Actually, I understand extremely well how these things work. You are thinking, 'oh, he thinks picking games is the key to a good model.' No, I don't. I know that the combination of model confidence and model calibration is what makes a model good, and I have looked back at the numbers, my own and those of others, and found no significant difference in quality between Brian's numbers and those of other quality prognosticators.
    As for three weeks of games being a small sample: it is, I agree, so we shall see how that picks up. However, those who have been predicting games from week one have been predicting at a higher average rate than Brian's model, including those early weeks, and that should be food for thought, because there is clearly data being thrown out, and I hate to see so much potential go to waste.
    I have spent many a comment defending the model, but I think it is time for an overhaul, because it has more potential than is being realised.

  12. Anonymous says:

    Maybe this article is better suited for the NY Times or Washington Post, but as someone commented last week, most readers of this site are smart enough to understand that the team with the best record doesn't always win. Or maybe traffic coming from those sites is dumbing down the readership here. It's not much of a stretch to believe that your rankings are more predictive than most subjective power rankings, but I don't know that you've provided much evidence that it's any more predictive than dozens of other objective statistical models.

  13. Anonymous says:

    "I have looked back at the numbers, my own, and those of others, and found that there is no significant difference in quality between Brian's numbers and those of other quality prognosticators."

    How exactly did you test the quality, and what were the actual results?

  14. Tom says:

    I tested the quality in terms of calibration, confidence, and absolute error. I don't have the results to hand, as I'm not at my PC, but they were as suggested.
    I also found that when Brian's model tended to disagree strongly with more traditional models, it was more often to his model's disadvantage that it did so. His model was also overconfident late in the season, though from what I understand he has rectified that somewhat now.

  15. Jacob Stevens says:

    Great article.

  16. David says:

    Power Rankings aren't meant to be predictive. They are meant to indicate who is having the best season at a particular moment.

    The outcomes of plays in the last couple minutes of a tied game don't have much predictive power, but they can have a massive effect on who is having a good season (Niners) and who is having a lousy season (Cowboys), which is what the power rankings capture.

  17. David says:

    Obviously the standings essentially capture the same thing I'm describing, with slight mental adjustments for strength of schedule and reputation, which is why they are pretty dumb. But I'd think of the power rankings more as a proxy for RPI or BCS style standings than for predictive accuracy.

  18. AES says:

    As a very primitive test, there have been 8 games so far where Brian's model has predicted an outcome that is different from the consensus prediction of the computer models tracked at thepredictiontracker. Brian's model is 2-6 in those 8 games.

    Week 4: Min at KC (W), NYG at Ari (L), Atl at Sea (L), NE at Oak (L)
    Week 5: Cin at Jac (L), SD at Den (L)
    Week 6: Car at Atl (L), Buf at NYG (W)

  19. Anonymous says:

    Being 2-6 (or worse) over 8 games would happen 14% of the time if we were just flipping coins. The sample size is too small to say anything about whether Brian's method is better/worse than the consensus.
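    The 14% figure is a straightforward binomial tail, assuming each disagreement with the consensus is a fair coin flip:

```python
from math import comb

# P(2 or fewer wins in 8 fair coin flips) = [C(8,0) + C(8,1) + C(8,2)] / 2^8
p = sum(comb(8, k) for k in range(3)) / 2 ** 8
print(round(p, 3))  # 0.145, i.e. roughly 14%
```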

  20. AES says:

    Agreed about the sample size. I was just curious so I compiled all the data I could find. I'll probably continue to track this. Should I bother to post the results here when the sample size is larger?

  21. RaH says:

    ' Tuesday, October 18, 2011
    Tom said...

    I agree entirely with your sentiment, Brian, but I have to admit that the model, relative to others, is not doing well this year. This season has been very predictable so far, but the model here has struggled to capture that predictability.'
