How the Model Works--A Detailed Example Part 1

One of the most common requests I get is to write up a complete sample game probability calculation. In this article, I'll explain how the model works and do a full detailed example using the upcoming Super Bowl between the Steelers and Cardinals.

When I originally constructed this model, the goal wasn’t to predict game outcomes but to identify how important the various phases of the game were compared to the others. In order to do that, I had to choose stats that were independent of the others, or at least as independent as possible.

There were several options, such as points scored and allowed, total yards, or first downs. But if I’m trying to measure the true strength of a team’s offensive passing game, passing touchdowns may not tell us much. A team may have a great defense that gives them good field position on most drives, or it might have a spectacular running back that can carry the offense into the red zone frequently. So points or touchdowns won’t work.

The other obvious option is total yards. But losing teams can accumulate lots of total passing yards late in a game's “trash time.” Or a team can generate lots of pass yards simply because they pass more often. That really doesn’t tell us how good a team is at passing. Total rushing yards presents a similar problem. A team with a great passing game can build a huge lead through three quarters, and then run out the clock in the 4th quarter accumulating a lot of rushing yards.

First downs made or allowed tell us a lot about how good an offense or defense is, but they don’t tell us anything about the relative contributions of a team’s running and passing games.

So, the best choice is going to be efficiency stats. Net yards per pass attempt and yards per rush tell us how good a team truly is in those facets of the game. They are also largely independent of one another: not completely, but about as independent as possible.

Turnovers are also obviously critical. But total turnovers can be misleading just like total yards. Teams that pass infrequently may have few interceptions, but it may only be because they simply have fewer opportunities. So I also use interceptions per attempt, and fumbles per play.

So the model starts with team efficiency stats. But I don’t use all of them. For example, I throw out defensive fumble rate because although it helps explain past wins or losses, it doesn’t predict future games. A team’s defensive fumble rate is wildly inconsistent throughout a season, which suggests it’s very random or mostly due to an opponent’s ability to protect the ball. Forced fumbles and defensive interceptions show the same tendency. In the end, the model is based on:

  • Offensive net passing yds per att
  • Offensive rushing yds per att
  • Offensive interceptions per att
  • Offensive fumbles per play
  • Defensive net passing yds per att
  • Defensive rushing yds per att
  • Team penalty yds per play
  • Home field advantage

The model is a regression model, specifically a multivariate non-linear (logistic) regression. I know that sounds very technical, but the general idea behind regression is pretty intuitive. If you plotted a graph of a group of students’ SAT scores vs. their GPA, you’d see a rough diagonal line.

We can draw a line that estimates the relationship between SAT scores and GPA, and that line can be mathematically described with a slope and intercept. Here, we could say GPA = 1.5 + 2 * (test score).

Regression is what puts that line where it is. It draws a line that minimizes the error between the estimated GPA and the actual GPA of each case.
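
As a rough illustration of how regression "puts the line where it is," the sketch below fits a least-squares line to a handful of points. The data values and the function name `fit_line` are invented for illustration; they are not taken from any real SAT or GPA data.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (x, y) pairs, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The closed-form slope and intercept used here are the values that minimize the sum of squared differences between the estimated and actual y for each case.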

We can do the same thing with net passing efficiency and season wins. We can estimate season wins as Wins = -6.5 + 2.4*(off pass eff). Take the Cardinals this year. Their 7.1 net passing yds per attempt produces an estimate of 10.7 wins. They actually won 9, so it’s not a perfect system. We need to add more information, and that’s what multivariate regression can do.

Multivariate regression works the same way but is based on more than one predictor variable. Using both offensive and defensive pass efficiency as predictors, we get:

Wins = 9.6 + 2.3*(off pass eff) – 2.6*(def pass eff)

For the Cardinals, whose defensive pass efficiency was 6.5 yds per att in 2008, we get an estimate of 9.4 wins.
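
The two linear estimates above can be reproduced with a few lines of arithmetic. This sketch plugs in the rounded coefficients and stats exactly as printed; the function names are made up for illustration. With these rounded inputs the results come out near 10.5 and 9.0 wins rather than the 10.7 and 9.4 quoted, presumably because the quoted figures were computed from unrounded coefficients (that explanation is an assumption, not something stated in the article).

```python
def wins_one_var(off_pass_eff):
    """Season wins estimated from offensive pass efficiency alone."""
    return -6.5 + 2.4 * off_pass_eff


def wins_two_var(off_pass_eff, def_pass_eff):
    """Season wins estimated from offensive and defensive pass efficiency."""
    return 9.6 + 2.3 * off_pass_eff - 2.6 * def_pass_eff


# 2008 Cardinals figures as quoted in the text.
ari_off_pass = 7.1
ari_def_pass = 6.5

print(round(wins_one_var(ari_off_pass), 1))
print(round(wins_two_var(ari_off_pass, ari_def_pass), 1))
```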

Adding the rest of the efficiency stats to the regression, we can improve the estimates even further. Unfortunately, linear regression, like we just used, can sometimes give us bad results. A team with the best stats imaginable would still only win 16 games in a season, but a linear regression might tell us they should win 21. Additionally, linear regression can estimate things like the total season wins, but it can’t estimate the chances of one team beating another. That’s where non-linear regression comes in.

Non-linear regression, like the logistic regression I use, is best used for dichotomous outcomes such as win or lose. A logistic regression model can estimate the probabilities of one outcome or the other based on input variables. It does this by using a logarithmic transformation, which is a fancy way to say taking the log of everything before doing all the computations. After computing the model and its output just as you would with linear regression, you “undo” the logarithm by taking the natural exponent of the result. Technically, logistic regression produces the “log of the odds ratio.” The odds ratio is the familiar “3 to 1” odds used at the race track, which can be translated into a probability of 0.75 (to 0.25).

Logistic regression would be useful if, instead of predicting GPA, you wanted to predict a student’s probability of graduation. Graduation is a yes-or-no dichotomous outcome, and winning an NFL game is no different. We can use the efficiency stats that we already know contribute to winning to estimate the chances one team beats another.

As an example, let’s compute the probability each opponent will win the upcoming Super Bowl based on offensive rushing efficiency alone. Based on the regular season game outcomes from 2002-2007, the regression output tells us that the intercept is zero and the coefficient of rushing efficiency is 0.25. The model can be written:

Log(odds ratio) = 0 + 0.25*(ARI off run eff) – 0.25*(PIT off run eff)
= 0.25*(3.46) – 0.25*(3.67)
= -0.052

The odds ratio would be e^(-0.052) = 0.95. In other words, based on offensive running alone, the odds Arizona wins would be 0.95 to 1. In probability terms, this is 0.49, giving Pittsburgh the slightest edge. Another way of saying this is, holding all other factors equal, Pittsburgh’s advantage in rushing efficiency gives them just a 51% chance of winning.

[Note: You can translate odds ratios into probabilities by using prob = odds/(1+odds).]
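
The rushing-only example can be checked with a short sketch: compute the log of the odds, exponentiate to get the odds, then convert to a probability. The helper name `odds_to_prob` is hypothetical; the numbers are the ones quoted above.

```python
import math


def odds_to_prob(odds):
    """Convert odds (e.g. 3 for "3 to 1") into a probability."""
    return odds / (1.0 + odds)


coef_run = 0.25                  # rushing efficiency coefficient from the text
ari_run, pit_run = 3.46, 3.67    # offensive rushing yards per attempt

log_odds = coef_run * ari_run - coef_run * pit_run
odds = math.exp(log_odds)        # "undo" the logarithm
p_ari = odds_to_prob(odds)

print(f"log odds = {log_odds:.3f}, odds = {odds:.2f}, P(ARI) = {p_ari:.2f}")
```

The same helper confirms the race track example: `odds_to_prob(3.0)` gives 0.75.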

Now we can do the same thing, but with the full list of predictor variables. The independent “input” variables are the efficiency stats for each team, and the dependent variable is the dichotomous outcome of each game—either 1 for a win or 0 for a loss. My handy regression software tells us that the model coefficients come out as:

Coefficient    Value
Constant       -0.36
Home Field      0.72
O Pass          0.46
O Run           0.25
O Int         -19.4
O Fum         -19.4
D Pass         -0.62
D Run          -0.25
Pen Rate       -1.53

The “logit,” or the change in the log of the odds ratio, can be written as:

Logit = const + home field + Team A logit - Team B logit

or

Logit = -0.36 + 0.72 + 0.46*(team A off pass eff) + 0.25*(team A off run eff) +...
- 0.46*(team B off pass eff) - 0.25*(team B off run eff) - …

We have the constant, the home field advantage adjustment, and the sum of the products of each team’s coefficients and stats. The equation will eventually tell us Team A’s odds of winning, so we add its component logit and we subtract Team B’s. If Team A is the home team, we add the home field adjustment (0.72 * 1). If not, we can leave it out (0.72 * 0).
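
The general equation can be sketched as a small function. The dictionary keys, function names, and the example team stats below are hypothetical; only the coefficient values come from the table above. The final step uses 1/(1 + e^(-logit)), which is algebraically the same as odds/(1 + odds).

```python
import math

# Coefficients as listed in the article's table.
COEF = {
    "off_pass": 0.46, "off_run": 0.25, "off_int": -19.4, "off_fum": -19.4,
    "def_pass": -0.62, "def_run": -0.25, "pen_rate": -1.53,
}
CONST = -0.36
HOME_FIELD = 0.72


def team_logit(stats):
    """Sum of coefficient * stat over the seven efficiency stats."""
    return sum(COEF[name] * stats[name] for name in COEF)


def win_probability(team_a, team_b, a_home):
    """Probability Team A wins; a_home is 1 if A is at home, 0 if away."""
    logit = CONST + HOME_FIELD * a_home + team_logit(team_a) - team_logit(team_b)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical stats for an arbitrary team, used for both sides.
generic = {"off_pass": 6.0, "off_run": 4.0, "off_int": 0.03, "off_fum": 0.025,
           "def_pass": 6.0, "def_run": 4.0, "pen_rate": 0.40}

p_home = win_probability(generic, generic, a_home=1)
p_away = win_probability(generic, generic, a_home=0)
print(f"home: {p_home:.3f}, away: {p_away:.3f}")
```

When two identical teams meet, the team logits cancel and only the constant and home field terms remain, so the home and away probabilities are mirror images that sum to 1.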

Now let’s look at Arizona and Pittsburgh in terms of their probability of winning Super Bowl XLIII. I’ll compute both teams’ logit component, combine them in the overall logit equation, then convert it to probabilities. To keep things simple, I’m going to only use team stats from the regular season for this example.

Arizona’s logit component would be:

Logit(ARI) = 0.46*7.1 + 0.25*3.5 – 19.4*0.024 – 19.4*0.028 – 0.62*6.5 – 0.25*4.0 – 1.53*0.39
= -2.45

Pittsburgh’s logit component would be:

Logit(PIT) = 0.46*6.0 + 0.25*3.7 – 19.4*0.030 – 19.4*0.026 – 0.62*4.3 – 0.25*3.3 – 1.53*0.41
= -1.51

Because the Super Bowl is at a neutral site, I’ll only add half of the home field adjustment when I combine the full equation.

Logit = -0.36 + 0.72/2 - 2.45 + 1.51
= -0.93

Therefore the odds ratio is e^(-0.93) = 0.39. That makes the probability of Arizona beating Pittsburgh at a neutral site equal to 0.39/(1+0.39) = 0.28. Pittsburgh’s corresponding probability would be 0.72.

(Notice how the constant and the home field adjustment cancel out to zero for a neutral site.)
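
The full Super Bowl calculation can be reproduced as a sketch using the rounded stats printed above. The dictionary keys and function name are hypothetical. With these rounded inputs the team logits come out near -2.49 and -1.52 rather than -2.45 and -1.51, presumably because the article's figures used unrounded stats (an assumption); the resulting probability is still close to 0.28.

```python
import math

# Coefficients from the article's table.
COEF = {
    "off_pass": 0.46, "off_run": 0.25, "off_int": -19.4, "off_fum": -19.4,
    "def_pass": -0.62, "def_run": -0.25, "pen_rate": -1.53,
}
CONST = -0.36
HOME_FIELD = 0.72

# Regular-season stats as quoted in the two logit equations above.
ARI = {"off_pass": 7.1, "off_run": 3.5, "off_int": 0.024, "off_fum": 0.028,
       "def_pass": 6.5, "def_run": 4.0, "pen_rate": 0.39}
PIT = {"off_pass": 6.0, "off_run": 3.7, "off_int": 0.030, "off_fum": 0.026,
       "def_pass": 4.3, "def_run": 3.3, "pen_rate": 0.41}


def team_logit(stats):
    """Sum of coefficient * stat over the seven efficiency stats."""
    return sum(COEF[name] * stats[name] for name in COEF)


ari_logit = team_logit(ARI)
pit_logit = team_logit(PIT)

# Neutral site: apply half of the home field adjustment.
logit = CONST + HOME_FIELD * 0.5 + ari_logit - pit_logit
odds = math.exp(logit)
p_ari = odds / (1.0 + odds)

print(f"ARI logit = {ari_logit:.2f}, PIT logit = {pit_logit:.2f}")
print(f"P(ARI) = {p_ari:.2f}, P(PIT) = {1.0 - p_ari:.2f}")
```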

In part 2 of this article, I'll explain how I factor in opponent adjustments and how I calculate a team's generic win probability (GWP)--the probability a team would win against a league-average opponent at a neutral site.

50 Responses to “How the Model Works--A Detailed Example Part 1”

  1. coldbikemessenger says:

    Instead of fumbles per play, have you considered fumbles per (plays minus incomplete passes)? There is probably a wider variance in fumbles than between teams. That may be slightly more accurate. I have not tried this, just throwing it out there.

  2. Brian Burke says:

    Man, you're quick. Yes, in fact that's exactly what I do. I didn't want to get bogged down on that point yet.

    There are advantages and disadvantages to both methods. Like you mention, incomplete passes can't be fumbled. But then again, there is something to be said for throwing a ball away to avoid a sack--and a chance to fumble.

  3. buzz says:

    This is a great post and I am glad that you really went into detail on the numbers that you came up with. I think that is one of the biggest things that is missing from a few of these websites that are doing this type of thing and why I think yours is one of the best around. A couple of things that I was wondering about. Would it make sense to include sack rate as a stat or is that included in the net passing amount? Secondly, as much as I agree that per play amounts are important it is also very important to be consistent with those rates. I know FO has often given penalties to boom or bust players. More precisely, against this year's Colts defense a team could pretty much get 5-6 yards per play whenever they wanted to. 5-6 yards per passing play would probably be considered pretty decent in your rate stats, but when you can get it almost all of the time it isn't nearly as good, because part of the reason why 5-6 is good for the rest of the league is that there is an expectation that there are a bunch of zeros and/or INTs in there. They were pretty much worst in the league at getting off the field on 3rd downs. On the flip side, the offense picked up yards at a slower pace than what they could have, but they had to be more conservative because if they gave up any type of turnover the defense was probably not going to be able to force a punt and get back off the field. They were basically the opposite of their defense: they could pick up a smaller amount of yards almost every play, well, their passing game could. I'm not sure how that consistency is measured but I would have to think that it would be a key factor in winning games.

  4. Anonymous says:

    Things get real interesting for the big game. Of the 12 teams that made the playoffs, Arizona has the best pass offense, and Pittsburgh has the best pass defense. I am a bit confused and seek clarification. If I understand the correlation coefficient correctly, defending the pass (net passing yards/attempt) is more important than net offense passing yards/attempt. DP = -.62 vs. OP = .46. Is this correct? Also, have I mistaken that in a previous article the offense efficiency is more important than the defense efficiency?

    What does the constant mean?

  5. Brian Burke says:

    Buzz-Yes, sacks are the "net" part of net passing efficiency. Regarding consistency, I've often wondered (aloud) if median stats would be better than average stats. I like to keep things as simple as possible, and I think median would be a good way to do that. Unfortunately, median stats would be hard to come by, so the next best thing is averages.

    Anon-You are partially correct. The reason why the def passing coefficient is larger than the offensive coefficient is because I left out defensive interception rate from the model. Defensive interceptions and defensive pass efficiency correlate significantly. So a lot of the variance attributable to def ints is captured by the def pass eff variable. In other words, offensive passing is a combination of the off pass eff coefficient and the off int rate coefficient, while pass defense only has one coefficient.

    Also, the coefficients are non-standardized, so we can't directly compare them to say how important each variable really is. For more, check out the "Why Teams Win" article. The link is in the top articles section on the right.

  6. buzz says:

    not to linger on the sack rate but in some analysis that I did last summer number of sacks had close to the highest correlation with wins as any stat (of course this could partially be because teams that are losing are passing more at the end of the game). along with its effect of approximately 2pts that you came up with would make me think that it is worth more yards than the -5 or so yards that it reduces from passing yards average. I know that the fumble aspect is included in Ofumrate but it might be a good way of using more a median play that you talk about and isolating it since it is so large (just like an INT) instead of averaging it in with those other two stats. I think its overall effect is probably mostly included just maybe not to its full importance of that one play.

  7. Tom G says:

    Would using some formula based on completion % and yards per attempt (perhaps the product of the two, perhaps something else) give a good idea of median v. mean?

    Continuing with what Buzz said, just look at Indianapolis pass defense compared to Giants pass defense.

  8. bytebodger says:

    I'm a big fan of using medians rather than averages because, in so many areas of study, averages can be skewed by anomalous results. However, in doing my own regression analysis on the NBA, I actually built a model that used median values and I was shocked to find that it was significantly LESS accurate than the model which relied on averages. I'm not saying that this would necessarily be the same with your NFL analysis, but it is erroneous to blindly believe that median values are always superior to averages.

    At first, this finding very much surprised me. After giving it some more thought, I think it actually makes sense. Median values are so meaningful because they mitigate the tendency of extreme outliers to skew the results. But in sports analysis, you probably WANT those outliers fully weighted in the data set. I would argue that you want those outliers in there because the outliers hold deeper meaning to the overall performance of the team/player.

    Consider this example:

    A running back routinely churns out low-yardage runs of 2, 3, or 4 yards, but once or twice per game he cranks out a 40-80 yard touchdown run. If he gets 20+ carries per game, the median value of his runs would be very near 3 yards per attempt and this would look like a very mediocre (or even sub-par) back. But we know that a RB who routinely breaks one or two TD runs per game is probably very, very good, even if you have to settle for many short gains.

  9. Brian Burke says:

    good point

  10. buzz says:

    bytebodger, I would agree that in football those big plays are huge and you wouldn't necessarily want to use median plays; I think you would want to weight those big plays as big as they are. That is part of the reason I think sacks are possibly underrated. I also think Tom's point on completion % might make a lot of sense. If you are going to tell me that the Colts pass defense is considered as good as the Giants' and better than league average, which their 5.9 dpass rating says, I would have to laugh. Yes, they were good at sacks, which helps their number, but a team could complete a first down on them any time they wanted. It doesn't seem like that pass defense is really part of what is helping them win games (hence their 29th league rank in forcing punts per drive).

  11. JMM says:

    I have wondered if there is a role for a "difference between average and median" variable to capture the skew in the data and differentiate between the 4 yards and a cloud of dust guy and the 9x2 + 22 guy.

  12. jarhead says:

    A few questions and a few comments....

    I am not familiar with logistic regressions so I have some dumb questions about the regression...

    What platform do you use? What is the R squared for your regression? Did the program run adjusted R squared? Is there a standard error in the estimate?

    On the independent variables....do they (alone or in total) cause a team to win or are they the result of a team winning? or maybe a little of both? We can produce a model which mimics yours using only the various teams' won-loss percentages and home team advantage.

    On the fumble rate and not counting incomplete passes...there are many fumbles on the transfer from center to quarterback, so not counting the successful transfer on an incomplete pass skews the fumble rate.

    Also on the fumble rate, do you use fumbles or fumbles lost?

    On medians vs averages, based on my experience, the averages perform better than the medians in regressions. And the year-to-date averages perform better than "x" week moving averages. I have done some testing on averages with more weight given to more recent games and less to games further back in time, and the approach shows the promise of a small improvement over a simple average. Of course my experiences with averages, medians and moving averages don't have much statistical significance.

  13. Brian Burke says:

    jarhead-Lots of good questions there.

    Logistic regression doesn't provide an r-squared or an overall SE like linear regression. For goodness-of-fit estimates, you can use % of cases correctly predicted as a rough measure. There are also a few abstract measures that can tell you whether a change improves a model, but they don't tell you much about absolute fit.

    But one thing I did was run a linear regression against total season team wins using the same independent variables. The r-squared and significance levels can be found in this article.

    I use Gretl--it's open source. I used to use SPSS.

    Regarding the independent variables, I believe they cause winning instead of the reverse. That's the primary reason I'm a big believer in rate/efficiency stats like yards per rush. I'm sure there is some tiny bit of causation back-flow, but I think efficiency stats are the best possible of all the alternatives. In my mind, total rushing yards is mostly caused by winning, where yds per rush measures how well a team's rushing ability contributes to their chances of winning.

    Regarding fumbles, I use total fumbles, not fumbles lost. Fumble recovery is extremely random. It's literally a 50/50 proposition of who will recover once the ball comes out, and there is absolutely no consistency at the team level of being able to recover fumbles more or less often than other teams. So when explaining past wins, fumbles lost is the better variable. But for predicting future wins, total fumbles is a better measure of a team's likelihood of losing the ball. It's free of the randomness in the recovery.

    There may be a small bias by excluding incomplete passes. But my feeling (without much to back it up) is that the bias would be greater if I included incomplete passes. Teams with low completion% would appear to have very good ball protection when they really just stink at passing. Also, a lot of fumbled snaps are just fallen on and counted as "aborted plays."

    Even if medians proved slightly better than averages as predictors, they're much, much harder to get and compute. I did some earlier research on boom/bust/consistency stuff using a financial statistic called the Sharpe ratio.

  14. Tom G says:

    "A running back routinely churns out low-yardage runs of 2, 3, or 4 yards, but once or twice per game he cranks out a 40-80 yard touchdown run"

    It isn't a matter of which is better, median or mean, it is that both are important. To me, Barry Sanders is the best non-QB ever to play the game. The man single-handedly made the Lions a playoff team. That has to be one of the greatest accomplishments in football history. But were the Lions really 28% more efficient with Sanders than when the Steelers handed off to Jerome Bettis?

    I would say no

    In general, I would agree that going beyond the yards / play average is not worth the effort, especially for rushing. With passing, when there are so many more zeros and so many more big plays, I think completion % along with yards / attempt is definitely better than either one alone.

  15. Anonymous says:

    IMO, net yards per attempt is perfect for passing, but yards per carry is terrible for RB's.

  16. Anonymous says:

    It seems that you use 8 independent variables. Does it mean that for a game you have 2 lines of data? One for the home team (Home Field = 1) and the other one for the away team (Home Field = 0). I am talking about data that "feeds" the regression to get the coefficients.

  17. Brian Burke says:

    Right. There's actually 15 independent variables for each case. A dummy (1 or 0) home field variable, and 7 stat variables for each team.

  18. Anonymous says:

    I assume the dependent variable is "Team A won the game".

    Do you have
    1 / Team A Stat / Team B Stat / Team A Won
    0 / Team A Stat / Team B Stat / Team A Won
    Or
    1 / Team A Stat / Team B Stat / Team A Won
    0 / Team B Stat / Team A Stat / Team A Won

  19. Brian Burke says:

    It's like this:
    dep var = Team A won
    indep vars =

    Team A home
    Team A stats
    Team B stats

  20. Anonymous says:

    So for each game do you have one or two lines? Sorry for asking simple question.

  21. Brian Burke says:

    Each game gets 2 cases in the final specification. One where Team A is the home team, and one where Team B is the home team.

    That's what allows me to get a coefficient for home field advantage. The other option is to specify a model like:

    Home team stats
    Visitor team stats

    The resulting coefficients for the home team would be higher than those for the visiting team, capturing the variance from the home advantage.
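
A minimal sketch of the two-cases-per-game layout described in this reply, as read from the surrounding comments: each game yields one case from each team's perspective, with a home dummy, "Team A" stats, "Team B" stats, and a 1/0 outcome. The field names, function name, and stat values are hypothetical, and ties are not handled.

```python
def game_to_cases(home_stats, away_stats, home_won):
    """Turn one game into two regression cases, one per team perspective."""
    return [
        # Case 1: the home team plays the role of Team A.
        {"a_home": 1, "a_stats": home_stats, "b_stats": away_stats,
         "a_won": 1 if home_won else 0},
        # Case 2: the visiting team plays the role of Team A.
        {"a_home": 0, "a_stats": away_stats, "b_stats": home_stats,
         "a_won": 0 if home_won else 1},
    ]


# Hypothetical season-long stats (only two of the seven stats shown).
home = {"off_pass": 6.4, "off_run": 4.1}
away = {"off_pass": 5.9, "off_run": 4.4}

cases = game_to_cases(home, away, home_won=True)
for case in cases:
    print(case)
```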

  22. Mr.Ceraldi says:

    Hey Brian;
    To understand you correctly:
    When you did the non-linear regression, did
    you have to input each stat for EACH game along with the win/loss outcome, i.e. each efficiency stat and the corresponding outcome (win/loss as 1/0)?
    Is this different from linear regression, which can be quickly calculated on season stats?
    Dan

  23. Brian Burke says:

    Yes, and yes.

    For the linear season win total regression: each case is a team-year; the dependent variable is season wins (0-16); and the input variables are team efficiency stats.

    For the game-by-game non-linear probability model: each case is a game; the dependent variable is Team A won or lost (1,0); and the input variables are team efficiency stats.

  24. Anonymous says:

    Thanks... WOW, that's a lot of work calculating each game!!!

  25. Anonymous says:

    Do you have a specialized program/script to scrape data from NFL.com into your database, or can you recommend one? I wish to do some work in another sport (NHL), and their data is in HTML, etc.
    thanks

  26. Mr.Ceraldi says:

    Hi Brian;
    I am using logistic regression (following your model with another sport).
    When I input the data with off. efficiency stats THEN def. efficiency stats together for each team
    as independent variables I do not get any results.
    (My stats consultant, who is not aware of your work, suggests it is because of the "perfect predictor problem" and as a result there is no convergence.)
    When I take out the def. efficiency stats for each team (which mirror each other) it works fine.
    Do you think this is just a fault of the specific logistic regression program I am using? Did you run into the same problem? If so, how did you get around it? I am sure you stated that you placed all efficiency stats for each team on one line followed by the outcome.
    thanks

  27. Brian Burke says:

    It sounds like you are using each team's individual game efficiency stats as the predictor for the game outcome. What I do is use each team's season-long efficiency stats as "predictors."

  28. Mr.Ceraldi says:

    Oh! Thanks! By "each team's season-long efficiency stats" I take it you mean the cumulative total to date, prior to the game
    in question? If so, how do you include game 1
    when you have no total to date?

  29. Anonymous says:

    Brian;
    So to clarify - with your non-linear game model
    If you were recording last weeks Sd Denver game
    there would be 2 cases
    case
    1) Team A (SD) 1(Home team), O PASS(-cumulative for the season to date), O RUN(-cumulative for the season to date),O INT...,O FUM...,D PASS...,D RUN...,D INT (added back)...,PEN RATE...,(DEN STATS) O PASS, ...etc, 0(Dependent outcome variable indicating loss for SD)

    case
    2) Team B (DEN)0(dummy home variable), O PASS(-cumulative for the season to date), O RUN(-cumulative for the season to date),O INT...,O FUM...,D PASS...,D RUN...,D INT (added back)...,PEN RATE...,(SD STATS) O PASS, ...etc, 1(dependent outcome variable indicating win for DEN)


    * 17 independent variables
    * 2 cases per game one with each team as the home team
    * all efficiency stats are cumulative to date
    ? along the lines of Mr. Ceraldi
    when recording the games/cases in the first week of the season what do you do?
    thanks
    Great work!

  30. Brian Burke says:

    All correct except for 1 thing. When 'training' the model, I use year-end stats. If you only used year-to-date stats, you'd get wildly inconsistent results from week to week.

    The result is good solid coefficients with accurate relative weights, except that they rely on a full 16 games of data. This would cause severe overconfidence in the model, particularly in the early weeks.

    During the season, when I use the model to estimate game probabilities, I regress each team stat toward the league mean to counteract the overconfidence. The degree of regression is based on how self-consistent each stat tends to be during a season.
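
One simple way to picture regressing a stat toward the league mean is linear shrinkage, sketched below. This is an illustrative assumption, not the formula actually used in the model, which the reply does not give. The function name, the `weight` parameter, and all numbers are hypothetical; a `weight` near 1 would suit a stat that is highly self-consistent during a season, and a `weight` near 0 a very noisy one.

```python
def regress_to_mean(team_stat, league_mean, weight):
    """Shrink a team stat toward the league mean.

    weight = 1.0 keeps the raw stat; weight = 0.0 replaces it with the mean.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return league_mean + weight * (team_stat - league_mean)


# Hypothetical example: early-season pass efficiency of 7.1 against an
# assumed league mean of 6.0, trusting only half of the deviation.
regressed = regress_to_mean(7.1, 6.0, 0.5)
print(round(regressed, 2))
```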

  31. Anonymous says:

    two follow up questions:

    1) So you get your coefficients for your non-linear model by taking, for example, the stats at the end of 2008 and using these same static rates for each team and running them back through every game in the season (kind of a retrofit), but in the format I outlined above?
    2) Does your in-season regression take place all the way throughout the season?

  32. Mr.Ceraldi says:

    Brian;
    I assume the degree of regression
    you apply to each stat is quite complicated?
    Can you give some info on how you calculate the
    amount of "self-consistency"? Is it game-to-game variance?
    thanks

  33. Anonymous says:

    Brian, you wrote that "there are 15 independent variables for each case" (both teams' stats).
    HOWEVER, doesn't this give you 15 coefficients:
    the 7 you listed and the seven coefficients for opposing team stats? Forgive my ignorance, but why
    include the opposing team stats in each case
    if you don't use them?

  34. Brian Burke says:

    I use both opponents' 7 efficiency stats plus home field as predictors for a total of 15. The coefficients for the second opponent are simply the negatives of those for the first opponent.

  35. Mr.Ceraldi says:

    Yes Brian, I was wondering as well:
    what do you do with the opposing team coefficients?

  36. Anonymous says:

    1. Brian, you mentioned to jarhead that with logistic regression you can use "% of cases predicted"
    as a rough estimate. I was wondering what your
    % was with your five-year model?

    sorry if I missed this

  37. Brian Burke says:

    If I recall it was around 74%.

  38. Anonymous says:

    Brian;
    I have read all your articles around your game model. Did you ever consider using a points/pass attempt (adjusted with sacks) efficiency stat?
    This comes from Bud Goode, who is the forefather of the foundational yds/adjusted pass stat you use. There is a strong correlation and I believe he has tested it for predictability (though I'm not positive). The only thing that bothers me about your model is that it doesn't seem to account for the skill of scoring (you may argue/believe it doesn't exist). However, intuitively, it seems on some level that it does. Any ideas? Or are you dead set against including any scoring in the predictor variables?
    SNOWMAN

  39. jgrenci says:

    Brian, I read your post and all the comments, and I could have missed it, but your original regression to get the coefficients, was it linear or non-linear? for example, O run was .25. how does that translate with y=.25x (forget the other variables for a sec). what is y and what is x?

    I understand how you finished the GWPs with logistic regression, but I am trying to follow what you did before that.
    thanks

  40. Brian Burke says:

    It's a non-linear logit model.

  41. John G says:

    Hey Brian, I guess I am asking, what does y represent for y=.25x? is it the point spread advantage? is it seasonal wins?

    thanks

  42. Brian Burke says:

    In a logit model, y is the natural logarithm of the odds ratio of winning for the visiting team.

    So, you'd add up the constant plus the product of the coefficients * team stats. That would be the y. Then you calculate e^y. That's the odds ratio, like you'd hear at a horse race--1:3, 2:5, or whatever.

    To get a probability from the odds ratio, you just need to go through a little algebra described in the article above.

  43. Mr.Ceraldi says:

    John
    the .25 is one of the coefficients generated by use of a non-linear multivariate regression (where the dependent outcome variable is either 0 (loss) or 1 (win) and the independent variables are the season eff. stats, regressed to the season mean).

    .25 can be translated into odds to win by
    multiplying it by the off. rush stat for the team
    example Mia 4.8*.25 = 1.2
    no 4.6 * .25 = 1.15
    Mia - No = 1.2-1.15 = .05

    this difference(advantage) can be translated into odds by
    using e (2.71 approx)
    e^.05 = 1.05 this is called odds ratio
    then you use the formula
    prob = odds/(1+odds)
    = 1.05/(1+1.05) = 51.2%
    So based on Off. rushing alone! and at a neutral
    field Miami is a slight fav. over No.

    so y=.25x (the linear equation model) is irrelevant with logistic regression.
    Dan
    (full explanation found at Brian's page
    "What makes Teams Win part 1")

  44. John says:

    Mr. Ceraldi, can I get your email address? mine is zonkerjohn@yahoo.com. thanks

  45. Anonymous says:

    You may want to change

    Logit = -0.36 + 0.72 + 0.46*(team A off pass eff) + 0.25*(team A off run eff) +...
    - 0.46*(team B off pass eff) – 0.25*(team B off pass eff) - …

    to

    Logit = -0.36 + 0.72 + 0.46*(team A off pass eff) + 0.25*(team A off run eff) +...
    - 0.46*(team B off pass eff) – 0.25*(team B off run eff) - …

    ie change the last instance of pass to run

    KenyonLV

  46. Anonymous says:

    Ok, I did a logistic regression using season averages on all games from 2002 to 2008, and wanted to point out to others that you should remove the two tie games; otherwise your A and B Team coefficients won't match up and your constant won't be exactly half of AHome.

    KenyonLV

  47. Anonymous says:

    Brian;
    I am using your non-linear log. prob. model for another sport (NHL). I'm using Gretl (thanks for the tip, it's a great program!). I've run into an interesting snag. One of my efficiency stats clearly has a strong positive correlation (.65 to wins). When I run it by itself (or with the home field variable) it works fine and delivers a strong positive coefficient, as it should. However, when I run it with my other 10 efficiency stats it always comes out as a "negative" coefficient. I'm stumped. I have carefully checked the data and I can't figure this out. I know it's a strong positive indicator. It is happening with one other of my positive efficiency stats as well. I did a linear regression to season wins (similar to you) prior and it works fine.
    Any ideas? I am using the 'binary' option for non-linear in Gretl. I have two cases for every game (one with Team A as home, the other with Team B) and a 0/1 outcome as my dependent var.
    thanks
    Dan

  48. Joe G says:

    I think there is a minor typo in the example..

    Halfway through, when you calculate the winning % using just rushing stats - you undo the log to get the .95, but AZs chance is not ".95 to 1" as written - which would make them a favorite - it's .95/(1+.95)=.49, which you got.

  49. Brian Burke says:

    Thanks!

  50. Unknown says:

    Brian - where are you getting the data from? For example, I got the data for the Arizona Cardinals from 2002-2008 off profootballreference.com and I cannot get your coefficient or intercept to match the formula for season win totals with just pass efficiency. The data I am using is below:

    year wins pe
    2002 5 4.7
    2003 4 5.1
    2004 6 5
    2005 5 6.2
    2006 5 6.3
    2007 8 6.6
    2008 9 7.1
