A couple months ago I posed an apparent paradox. Aaron Rodgers' new $21M/yr contract was either a solid bargain or a disastrous ripoff depending on how we analyze the data. By only flipping the x and y axes of a scatterplot, we can come to completely opposite conclusions about the value of a QB relative to what we'd expect for a given salary or for a given level of performance. Much of this post is derived from the many insightful comments in the original. Please take the time to read them, especially those from Peter, X, Phil and Steve.
By regressing salary on performance (adjusted salary cap hit on the vertical (y) axis and Expected Points Added per Game (EPA/G) on the horizontal (x) axis), Rodgers' deal is insanely expensive by conventional standards. But by regressing performance on salary, his new contract is a bargain.
Which one is correct? That depends on several considerations. First, there are generally two types of analyses. The one I do most often is normative analysis--what should a team do? The second type is descriptive analysis--what do teams actually do? The right analytic tool can depend on which question we are trying to answer.
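The reason that we saw two different results by swapping the axes is that Ordinary Least Squares (OLS) regression chooses a best-fit line by minimizing the square of the errors between the estimate and the actual data of the y variable only. Errors in the x direction are ignored. As a result, the fitted slope equals the correlation (r) times the ratio of the standard deviations of y and x, so whenever the correlation is less than perfect the line is pulled toward the horizontal. In other words, OLS naturally produces a shallow slope with respect to the x axis. When we swap axes, the line is pulled toward the other axis instead, so OLS is not symmetrical. The two fits coincide only when the correlation is perfect.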
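To make the asymmetry concrete, here is a minimal sketch in Python. It uses made-up synthetic numbers, not the actual QB sample, and the names `performance`, `salary` and `ols_slope` are purely illustrative. It fits OLS both ways and expresses both slopes on the same axes (salary vertical, performance horizontal).

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the OLS line predicting y from x (minimizes squared vertical errors)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Synthetic data (an assumption for illustration, NOT the real QB sample).
rng = random.Random(0)
performance = [rng.gauss(0.0, 5.0) for _ in range(200)]
salary = [8.0 + 0.6 * p + rng.gauss(0.0, 4.0) for p in performance]

# Regress salary on performance: the slope is already in salary-per-performance units.
slope_salary_on_perf = ols_slope(performance, salary)

# Regress performance on salary, then invert the slope so the line
# can be drawn on the same axes as the first fit.
slope_perf_on_salary = ols_slope(salary, performance)
slope_inverted = 1.0 / slope_perf_on_salary

print(f"salary regressed on performance:            {slope_salary_on_perf:.3f}")
print(f"performance regressed on salary, same axes: {slope_inverted:.3f}")
```

The second line always comes out steeper than the first when the correlation is positive but imperfect. The product of the two raw slopes is r-squared, so the gap between the lines widens as the correlation weakens.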
If we choose an error-minimization function other than OLS, we get different estimates. One of the simplest is Least Absolute Deviation (LAD), which is similar to OLS except that it minimizes the absolute value of the error rather than the square of the error. Another method, mentioned in a comment to the original post by Peter, is called Reduced Major Axis (RMA) regression, which treats the x and y variables symmetrically rather than regressing one on the other: its slope is the ratio of the standard deviation of y to the standard deviation of x, carrying the sign of the correlation. RMA is useful when there is error in both the x and y variables.
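For readers who want to try these alternatives, here is a rough sketch of both, again on synthetic data rather than the real QB sample. The function names are hypothetical. The LAD fit uses a simple brute-force search that relies on the fact that some optimal LAD line passes through two of the data points. That is fine for a few dozen points but is not how a statistics package would do it.

```python
import itertools
import random
import statistics


def rma_slope(x, y):
    """Reduced major axis slope: ratio of standard deviations, signed by the covariance."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * statistics.stdev(y) / statistics.stdev(x)


def lad_line(x, y):
    """Brute-force LAD fit: try the line through every pair of points and
    keep the one with the smallest total absolute vertical error."""
    best = None
    for (x1, y1), (x2, y2) in itertools.combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = intercept + slope * x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(b - (intercept + slope * a)) for a, b in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    return best[1], best[2]


# Synthetic data (an assumption for illustration, NOT the real QB sample).
rng = random.Random(0)
performance = [rng.gauss(0.0, 5.0) for _ in range(60)]
salary = [8.0 + 0.6 * p + rng.gauss(0.0, 4.0) for p in performance]

lad_slope, lad_intercept = lad_line(performance, salary)
print(f"RMA slope: {rma_slope(performance, salary):.3f}")
print(f"LAD slope: {lad_slope:.3f}, intercept: {lad_intercept:.3f}")
```

Unlike OLS, the RMA slope is symmetric: swapping the two variables gives exactly the reciprocal slope, so the line does not move when the axes are flipped.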
The chart below illustrates how the results of each method compare. For now, performance (in terms of EPA/G) is on the x axis while salary (adjusted cap hit) is on the y axis. The black line is the OLS estimate where salary is the regressed variable--Rodgers is a ripoff (Brian 2's perspective in the original post). The purple line is the OLS estimate where performance is the regressed variable--Rodgers is a bargain (Brian 1's perspective). The green line is the LAD estimate, and the red line is the RMA estimate. Note that Rodgers' new contract would theoretically put him below the letter 'e' in 'Game' in the title.
Causation
We're typically taught that x and y axis choices should be based on cause and effect. The x, or independent variable, "causes" the y, or dependent, variable. So the choice for which variable goes on the x axis, and which should be regressed on the y axis should be simple. Does salary cause performance, or does performance cause salary? I think the answer is neither.
I think it really is a matter of perspective... For example, from the player's perspective, if he reliably performs around 11 EPA/G (independent/cause), how much money can he expect in return on the FA market (dependent/effect)? But from the team's perspective, if they buy $21M worth of QB on the FA market (independent/cause), how much performance can they expect (dependent/effect)?
You might say (as I think someone above did), arbitrarily paying a person a lot of money does not "cause" him to play well at QB, as the Jets proved with Mark Sanchez (zing!). Case in point--if you paid me $20M to be an NFL QB, I'd average -100 EPA/G.
But I've left an important systematic linkage out of the discussion: The Market. Paying someone $20M to play QB doesn't cause someone to be skilled, but purchasing a $20M asset in a competitively priced market provides a systematic linkage from pay to performance. It's not unlike buying a race car. All other things being equal, paying $100k for a car rather than $50k for a car in a competitive market means I should expect a faster car. Money does "cause" performance, indirectly via a competitive market process.
From what I understand, and according to several of the comments in the original post, the choice of which variable should be regressed onto the other should be based on something other than our conception of causation.
Uncertainty and Error
In a bivariate regression, the regressed variable should be the one subject to statistical variance. In other words, the y variable should be the one with a component of random error, while the x variable is the one we know with certainty.
In this case, we know salary with absolute certainty. We know Peyton Manning has been paid an average of $17M per year. It's not as if we're not quite sure of his salary but we have some idea with some error built in. Tragically, Mark Sanchez has been paid $8M per season since his extension. That's an exact amount with no uncertainty. Now, one could say that amount was in error, because Sanchez never came close to living up to his contract. But that's not the kind of error we're talking about. Statistical error is not a mistake. It's the difference between what we would expect based on a model and what is actually observed.
For pure OLS regressions to be unbiased, the y variable should have a normal distribution. In fact, the least squares method is not an arbitrary choice. It is directly derived from the formula for the Gaussian (normal) distribution.
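The link between least squares and the Gaussian curve can be sketched as follows. This is the standard textbook derivation, not anything specific to this data, and the symbols a, b and sigma are my own notation. Note that it places the normality assumption on the error term, which is the distinction several commenters raise below.

```latex
y_i = a + b x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently}

L(a, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
          \exp\!\left( -\frac{(y_i - a - b x_i)^2}{2\sigma^2} \right)

\log L(a, b) = -\frac{n}{2} \log(2\pi\sigma^2)
               - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a - b x_i)^2
```

The first term does not depend on a or b, so maximizing the likelihood is the same as minimizing the sum of squared errors, which is exactly what OLS does.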
Here are the distributions of the two variables. The first is salary.
You can see that it's not normal, at least for the range of our sample. It's more like a power-law distribution where there are lots of players with relatively low pay, and fewer players as salary increases. This is a near universal salary distribution found in almost any context and every type of job. But I suspect it's not really the power law at play--It's probably the extreme right tail of a normal distribution of all athletes in the general population who could conceivably be QBs. After all, if scouts and coaches are doing their jobs, that's where NFL QBs will be found.
In contrast, the histogram of EPA/G appears very normal (bell-curved), which suggests there is a random error component at play. It's not that the normality of distributions should decide which variable gets regressed. Rather, the distribution betrays an uncertainty in the variable. In this case, the uncertainty surrounds the "true" value of a QB. EPA/G is a good stat, but it's only a sample of a player's "true" ability. There are many other factors beyond true ability that determine a QB's EPA/G, including teammates, coaches, opponents, sample error. Ideally, a QB's pay is in exchange for his true talent, but that can never be known. It can only be estimated. EPA/G is really just a crude approximation of a player's underlying ability.
[As an aside, one might wonder how the right tail of a normal distribution can produce a complete normal distribution in performance. Shouldn't EPA/G's distribution also look like the right tail of a bell curve? No, because when two right tails compete (offenses comprised of right-tail talent vs defenses comprised of right-tail talent), the outcome will be normal.]
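A quick simulation can illustrate the aside above. It is only a sketch under strong assumptions: talent is standard normal, only players above an arbitrary cutoff make the league, and the outcome of a matchup is simply offensive talent minus defensive talent. The cutoff and sample size are hypothetical. Strictly speaking it shows that the outcomes are symmetric and bell-shaped, not that they are exactly normal.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / sd ** 3


def right_tail_draw(rng, cutoff):
    """Draw from a standard normal, keeping only values above the cutoff."""
    while True:
        value = rng.gauss(0.0, 1.0)
        if value > cutoff:
            return value


rng = random.Random(1)
CUTOFF = 1.5     # hypothetical talent threshold for making the league
N_GAMES = 20000  # hypothetical number of simulated matchups

offense = [right_tail_draw(rng, CUTOFF) for _ in range(N_GAMES)]
defense = [right_tail_draw(rng, CUTOFF) for _ in range(N_GAMES)]

# Outcome of each matchup: offensive talent minus defensive talent.
outcome = [o - d for o, d in zip(offense, defense)]

tail_skew = skewness(offense)
outcome_skew = skewness(outcome)
print(f"skewness of right-tail talent: {tail_skew:.2f}")
print(f"skewness of matchup outcomes:  {outcome_skew:.2f}")
```

The talent pool itself is strongly right-skewed, while the matchup outcomes have skewness near zero because the two tails cancel each other out.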
So we know a player's salary with absolute certainty, and we can only estimate his true talent. EPA/G, the stat I chose to best approximate talent, is clouded by sample error and unaddressed external factors, like surrounding team talent.
Ultimately, what I've learned from this exercise is that the selection of x and y variables in a regression doesn't have to do with cause and effect, or independent vs dependent. It's about which variable you know without (statistical) error, and which variable contains uncertainty.
That's exactly what I always thought. Thanks for confirming it for me Brian.
"Does salary cause performance, or does performance cause salary? I think the answer is neither."
Then perhaps fitting a straight line to them doesn't really mean anything. In fact, at first glance, the data looks like a cloud not a line. The fact that we get such varying fits should lead one to conclude that these fits do not have great meaning.
I'd point out that we really don't do performance versus salary. It is expected future performance vs future salary. The future salary is known, it is the contract. The expected future performance is just a guess, which is often wrong (current results not indicative of future performance).
How do other measurements of performance (touchdowns, yards, postseason success, for instance) look vs salary? These are directly considered when estimating what salary a player gets, where EPA/G probably is not.
Anonymous-You completely misunderstand the post. I see a diagonally oriented "cloud" with a very significant correlation. Please read the original post linked above. Besides, if it were a line, we wouldn't need a regression, so that makes no sense.
If teams were actually any good at digesting touchdowns, yards, etc into a single number, they'd be real close to EPA. And in aggregate there would definitely be a very solid connection.
Here we are, comparing what teams *should* do and what they *do* do. That's right, I said do do.
Brian, are you sure the Y variable needs to be normal? I thought only the error term needed to be normal.
So Rodgers is a bargain, right? :)
Phil Birnbaum is correct. It is the error term of the Y variable that must be normally distributed, not the Y variable itself.
Back to the original-original question of the contract's value: could one compile total EPA/G for all of the different possible combinations of teams based on position and salary cap constraints, and compare the average EPA/G of those teams to the average EPA/G of all of the different possible combinations with Rodgers?
This would show how restrictive Rodgers' contract is to the remaining roster positions. Also it would account for the actual contract conditions in the NFL.
Brian, you ignored Anon's second point, which is a very important one. Based on your original post (which I did read...and the comments) the question is "is AR's contract a bargain or not?" and the measure of "bargain" must be against the market value.
As Anon pointed out, salary is based on projected future performance based on the performance up to that point. If you are trying to tease out the mean performance valuation NFL GMs have in their heads when making QB contracts and thus whether AR's contract is in or out of line with that expectation, then you can't regress career averages against each other. Rather, you need to work with the information the GMs had at the time: performance up to the date the contract was signed. This performance estimate may or may not correspond to EPA (to the extent that it does, the GM is doing a good job maximizing the things that contribute to winning) and is probably strongly weighted to recent performance (leading to the concept of the contract-year). Future performance does somewhat correlate to past performance (there is such a thing as a good QB), but the salary paid in a given year has absolutely zero direct logical relation to the performance in that year, and average performance only weakly so (if the time period over which the average was taken includes a new contract and thus a chance for performance information in the first part of the interval to feed forward to the latter part).
The rest of your statistical methodology discussion is fine, but it really just serves to highlight the importance of getting your question right in the first place before hauling out the statistical toolbox to answer it.
Will-Just trying to establish a baseline for the market. The analysis is for current year salary for current year performance. In the long run, and in aggregate among all the FA QBs, the relationship will hold.
Regarding normality of the y-variable...It was always my understanding that OLS regression assumes normality. That's why we minimize the square of the error. The choice of squaring the error is not arbitrary and is derived directly from the equation that describes the Gaussian curve. The t-tests for significance and the goodness-of-fit calculations (r and r-squared, for example) assume normality. I'm not the authority here, so perhaps I'm mistaken.
Also, it is true that for unbiased regression estimators, the distribution of the errors must be normal.
No normality assumptions are required for linear regression estimates to be unbiased. Actually, you don't need a lot of assumptions for linear regression estimates to be unbiased (though one of them is that the true underlying relation is linear, which you usually don't actually quite believe when you're running linear regressions).
The standard errors, now, those are a whole other ball of wax. Those require assumptions like normality (of at least the residual and maybe the X variables, and note that if the residual and the X-variables are joint-normal and the true relation is linear, then the y-variable is normal) and that the variance of the residual doesn't depend on the X-variable and that the cows are spherical (I should probably look this up in one of my econometrics books, but that's how I remember it).
Let me add something else: OLS is unbiased with non-normal data, but it may not be efficient - i.e., a weighted least squares procedure may produce estimates that are closer to the truth with less data. I'm not sure I ever learned when OLS is guaranteed to be efficient, but it seems likely to me that normality would be a consideration (in particular, I suspect that a weighted least squares procedure would generally be more efficient for highly-skewed data and I suspect that it would not be more efficient for normal data).
In academic finance, at least, if we have data that is highly skewed but becomes normalish when logged (firm size, for example), we log it. That might be for efficiency or it might be because we expect the assumptions behind our standard errors to hold up better for logged data than for the unlogged version. So I'm not saying that normality isn't a good feature for the data to have, I'm just saying it isn't required for OLS to be consistent and unbiased ("consistent" just means that if you get an arbitrarily large data set, the errors in your estimates become arbitrarily small).
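The claim in the comment above, that OLS stays unbiased even when the errors are not normal, is easy to check with a small simulation. This is only a sketch: the true slope, intercept, and error distribution are made-up values chosen for illustration, and the errors are deliberately skewed (an exponential shifted to have mean zero).

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the OLS line predicting y from x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2)
TRUE_SLOPE = 0.6      # hypothetical "true" relationship
TRUE_INTERCEPT = 8.0  # hypothetical
x = [rng.uniform(0.0, 10.0) for _ in range(50)]  # held fixed across replications

slopes = []
for _ in range(2000):
    # Skewed, mean-zero errors: exponential with mean 1, shifted down by 1.
    y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = statistics.fmean(slopes)
print(f"true slope {TRUE_SLOPE}, average OLS slope over replications {mean_slope:.3f}")
```

Any single fit misses the true slope, but the average over many replications lands on it, which is what unbiased means. The simulation says nothing about efficiency or about whether the usual standard errors are trustworthy.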
Brian, I don't think you'll ever acquire enough data. Performance norms, yes, but salary norms, no -- because QBs only negotiate 2 or 3 prime contracts in their careers.