Summary: Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to.
A few months ago I wrote a blog post on using causal graphs to understand missingness and how to deal with it, which concluded on a rather positive note:
“While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic…”
Well, today I write about another statistical issue with a basically optimistic message. So I decided this could be like the second installment in a “some things are not fucked” series, in which I look at a few issues/methods that some people have claimed are fucked, but which are in fact not fucked.1
In this installment we consider logistic regression — or, more generally, Binomial regression (including logit and probit regression).2
Wait, who said logistic regression was fucked?
The two most widely-read papers sounding the alarm bells seem to be Allison (1999) and Mood (2010). The alleged problems are stated most starkly by Mood (2010, pp. 67-68):
- “It is problematic to interpret [logistic regression coefficients] as substantive effects, because they also reflect unobserved heterogeneity.
- It is problematic to compare [logistic regression coefficients] across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.
- It is problematic to compare [logistic regression coefficients] across samples, across groups within samples, or over time—even when we use models with the same independent variables—because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.”
These are pretty serious allegations.3 These concerns have convinced some people to abandon logistic regression in favor of the so-called linear probability model, which just means using classical regression directly on the binary outcome, although usually with heteroskedasticity-robust standard errors. To the extent that’s a bad idea—which, to be fair, is a matter of debate, but is probably ill-advised as a default method at the very least—it’s important that we set the record straight.
The allegations refer to “unobserved heterogeneity.” What exactly is this unobserved heterogeneity and where does it come from? There are basically two underlying lines of argument here that we must address. Both arguments lead to a similar conclusion—and previous sources have sometimes been a bit unclear by drawing from both arguments more or less interchangeably in order to reach this conclusion—but they rely on fundamentally distinct premises, so we must clearly distinguish these arguments and address them separately. The counterarguments I describe below are very much in the same spirit as those of Kuha and Mills (2017).
First argument: Heteroskedasticity in the latent outcome
A standard way of motivating the probit model for binary outcomes (e.g., from Wikipedia) is the following. We have an unobserved/latent outcome variable $Y^*$ that is normally distributed, conditional on the predictor $X$. Specifically, the model for the $i$th observation is
$$Y_i^* = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \sim \text{Normal}(0, \sigma^2).$$
The latent variable $Y^*$ is subjected to a thresholding process, so that the discrete outcome we actually observe is
$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > \tau \\ 0 & \text{if } Y_i^* \le \tau. \end{cases}$$
This leads the probability of $Y = 1$ given $X$ to take the form of a Normal CDF, with mean and standard deviation a function of the coefficients $\beta_0$ and $\beta_1$, the residual standard deviation $\sigma$, and the deterministic threshold $\tau$. So the probit model is basically motivated as a way of estimating $\beta_1$ from this latent regression of $Y^*$ on $X$, although on a different scale. This latent variable interpretation is illustrated in the plot below, from Thissen & Orlando (2001). These authors are technically discussing the normal ogive model from item response theory, which looks pretty much like probit regression for our purposes.
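Spelling out that Normal CDF may help. The following is a sketch that assumes the latent model has intercept $\beta_0$, slope $\beta_1$, residual standard deviation $\sigma$, and threshold $\tau$ (the symbol names are assumed for illustration), with $\Phi$ denoting the standard Normal CDF:

```latex
\begin{aligned}
\Pr(Y_i = 1 \mid X_i = x)
  &= \Pr(Y_i^* > \tau \mid X_i = x) \\
  &= \Pr(\varepsilon_i > \tau - \beta_0 - \beta_1 x) \\
  &= 1 - \Phi\!\left(\frac{\tau - \beta_0 - \beta_1 x}{\sigma}\right) \\
  &= \Phi\!\left(\frac{\beta_0 - \tau}{\sigma} + \frac{\beta_1}{\sigma}\,x\right).
\end{aligned}
```

Under these assumptions, the binary data only carry information about the combinations $(\beta_0 - \tau)/\sigma$ and $\beta_1/\sigma$, which is the sense in which the latent slope is recovered "on a different scale."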
We can interpret logistic regression in pretty much exactly the same way. The only difference is that now the unobserved continuous $Y^*$ follows not a normal distribution, but a similarly bell-shaped logistic distribution given $X$. A theoretical argument for why $Y^*$ might follow a logistic distribution rather than a normal distribution is not so clear, but since the resulting logistic curve looks essentially the same as the normal CDF for practical purposes (after some rescaling), it won’t tend to matter much in practice which model you use. The point is that both models have a fairly straightforward interpretation involving a continuous latent variable $Y^*$ and an unobserved, deterministic threshold $\tau$.
So the first argument for logistic regression being fucked due to “unobserved heterogeneity” comes from asking: What if the residual variance $\sigma^2$ is not constant, as assumed in the model above, but instead is different at different values of $X$? Well, it turns out that, as far as our observable binary outcome $Y$ is concerned, this heteroskedasticity can be totally indistinguishable from the typically assumed situation where $\sigma^2$ is constant and the mean of $Y^*$ is increasing (or decreasing) with $X$. This is illustrated in Figure 2 below.
Suppose we are comparing the proportions of positive responses (i.e., $Y = 1$) between two groups of observations, Group 1 and Group 2. To give this at least a little bit of context, maybe Group 2 are human subjects of some social intervention that attempts to increase participation in local elections, Group 1 is a control group, the observed binary $Y$ is whether the person voted in a recent election, and the latent continuous $Y^*$ is some underlying “propensity to vote.” Now we observe a voting rate of 10% in the control group (Group 1) and 25% in the experimental group (Group 2). In terms of our latent variable model, we’d typically assume this puts us in Scenario A from Figure 2: The intervention increased people’s “propensity to vote” ($Y^*$) on average, which pushed more people over the threshold $\tau$ in Group 2 than in Group 1, which led to a greater proportion of voters in Group 2.
The problem is that these observed voting proportions can be explained equally well by assuming that the intervention had 0 effect on the mean propensity to vote, but instead just led to greater variance in the propensities to vote. As illustrated in Scenario B of Figure 2, this could just as well have led the voting proportion to increase from 10% in the control group to 25% in the experimental group, and it’s a drastically different (and probably less appealing) conceptual interpretation of the results.
Another possibility that would fit the data equally well (but isn’t discussed as often) is that the intervention had no effect at all on the distribution of $Y^*$, but instead just lowered the threshold for Group 2, so that even people with a lower “propensity to vote” were able to drag themselves to the polls. This is illustrated in Scenario C of Figure 2.
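To see that the three scenarios really can produce identical voting rates, here is a minimal numerical sketch. It assumes a Normal latent propensity with mean 0 and standard deviation 1 in the control group; those values, and the names `share_above`, `mean_a`, `sd_b` and `threshold_c`, are illustrative assumptions rather than anything taken from Figure 2.

```python
from statistics import NormalDist

std = NormalDist(0.0, 1.0)

# Control group (assumed): latent propensity ~ Normal(0, 1),
# with the threshold placed so that exactly 10% of people vote.
threshold = std.inv_cdf(0.90)

def share_above(mean, sd, cut):
    """Proportion of a Normal(mean, sd) latent distribution lying above `cut`."""
    return 1.0 - NormalDist(mean, sd).cdf(cut)

z75 = std.inv_cdf(0.75)  # standard-normal quantile that leaves 25% above it

# Scenario A: shift the mean, keep the sd and the threshold.
mean_a = threshold - z75
# Scenario B: keep the mean and the threshold, inflate the sd.
sd_b = threshold / z75
# Scenario C: keep the whole distribution, lower the threshold.
threshold_c = z75

control = share_above(0.0, 1.0, threshold)
scenario_a = share_above(mean_a, 1.0, threshold)
scenario_b = share_above(0.0, sd_b, threshold)
scenario_c = share_above(0.0, 1.0, threshold_c)

print(f"control={control:.3f}  A={scenario_a:.3f}  B={scenario_b:.3f}  C={scenario_c:.3f}")
```

All three treated-group proportions come out at 25%, so a single binary outcome cannot tell a mean shift, a variance change, and a threshold change apart.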
So you can probably see how this supports the three allegations cited earlier. When we observe a non-zero estimate for a logistic regression coefficient, we can’t be sure this actually reflects a shift in the mean of the underlying continuous latent variable (e.g., increased propensity to vote), because it also reflects latent heteroskedasticity, and we can’t tell these two explanations apart. And because the degree of heteroskedasticity could easily differ between models, between samples, or over time, even comparing logistic regression coefficients to one another is problematic…if shifts in the underlying mean are what we care about.
Are shifts in an underlying mean what we care about?
There’s the crux of the matter. This entire first line of argument presupposes that all of the following are true for the case at hand:
1. It makes any conceptual sense to think of the observed binary $Y$ as arising from a latent continuous $Y^*$ and a deterministic threshold.
2. We actually care how much of the observed effect on $Y$ is due to mean shifts in $Y^*$ vs. changes in the variance of $Y^*$.
3. We’ve observed only a single binary indicator of $Y^*$, so that Scenarios A and B from Figure 2 are empirically indistinguishable.
In my experience, usually at least one of these is false. For example, if the observed binary $Y$ indicates survival of patients in a medical trial, what exactly would an underlying $Y^*$ represent? It could make sense for the patients who survived (maybe it represents their general health or something), but surely all patients with $Y = 0$ are equally dead! Returning to the voting example, we can probably grant that #1 is true: it probably does make conceptual sense to think about an underlying, continuous “propensity to vote.” But #2 is probably false: I couldn’t care less if the social intervention increased voting by increasing propensity to vote, spreading out the distribution of voting propensities, or just altering the threshold that turns the propensity into voting behavior… I just want people to vote!
Finally, when #1 and #2 are true, so that the investigator is primarily interested not in the observed $Y$ but rather in some underlying latent $Y^*$, in my experience the investigator will usually have taken care to collect data on multiple binary indicators of $Y^*$; in other words, #3 will be false. For example, if I were interested in studying an abstract $Y^*$ like “political engagement,” I would certainly view voting as a binary indicator of that, but I would also try to use data on things like whether that person donated money to political campaigns, whether they attended any political conventions, and so on. And when there are multiple binary indicators of $Y^*$, it then becomes possible to empirically distinguish Scenario A from Scenario B in Figure 2, using, for example, statistical methods from item response theory.
These counterarguments are not to say that this first line of argument is invalid or irrelevant. The premises do lead to the conclusion, and there are certainly situations where those premises are true. If you find yourself in one of those situations, where #1-#3 are all true, then you do need to heed the warnings of Allison (1999) and Mood (2010). The point of these counterarguments is to say that, far more often than not, at least one of the premises listed above will be false. And in those cases, logistic regression is not fucked.
Okay, great. But we’re not out of the woods yet. As I mentioned earlier, there’s a second line of argument that leads us to essentially the same conclusions, but that makes no reference whatsoever to a continuous latent $Y^*$.
Second argument: Omitted non-confounders in logistic regression
To frame the second argument, first think back to the classical regression model with two predictors, a focal predictor $X$ and some covariate $Z$ (Model A):
$$Y_i = \beta_0^A + \beta_1^A X_i + \beta_2^A Z_i + \varepsilon_i^A.$$
Now suppose we haven’t observed the covariate $Z$, so that the regression model we actually estimate is (Model B)
$$Y_i = \beta_0^B + \beta_1^B X_i + \varepsilon_i^B.$$
In the special case where $Z$ is uncorrelated with $X$, we know that $\beta_1^A = \beta_1^B$, so that our estimate of the slope for $X$ will, on average, be the same either way. The technical name for this property is collapsibility: classical regression coefficients are said to be collapsible over uncorrelated covariates.
It turns out that logistic regression coefficients do not have this collapsibility property. If a covariate that’s correlated with the binary outcome is omitted from the logistic regression equation, then the slopes for the remaining observed predictors will be affected, even if the omitted covariate is uncorrelated with the observed predictors. Specifically, in the case of omitting an uncorrelated covariate, the observed slopes will be driven toward 0 to some extent.
This is all illustrated below in Figure 3, where the covariate is shown as a binary variable (color = red vs. blue) for the sake of simplicity.
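The attenuation can be checked with a few lines of arithmetic. The sketch below assumes a binary covariate `z` that is 0 or 1 with equal probability and independent of `x`, and compares `x = 0` with `x = 1`; the coefficient values and the helper `marginal_log_odds_ratio` are hypothetical and are not taken from Figure 3.

```python
import math

def expit(v):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def logit(p):
    """Logit: maps a probability to log-odds."""
    return math.log(p / (1.0 - p))

def marginal_log_odds_ratio(b0, b1, b2):
    """Log odds ratio for x=1 vs. x=0 after averaging over an unobserved z.

    Assumed conditional model: logit Pr(Y=1 | x, z) = b0 + b1*x + b2*z,
    where z is 0 or 1 with probability 1/2 each, independently of x.
    """
    def marginal_prob(x):
        return 0.5 * expit(b0 + b1 * x) + 0.5 * expit(b0 + b1 * x + b2)
    return logit(marginal_prob(1.0)) - logit(marginal_prob(0.0))

# Illustrative coefficients (hypothetical; not taken from Figure 3).
b0, b1, b2 = -2.0, 2.0, 4.0
conditional = b1  # the slope within z=0 and within z=1 alike
marginal = marginal_log_odds_ratio(b0, b1, b2)
no_covariate_effect = marginal_log_odds_ratio(b0, b1, 0.0)

print(f"conditional slope (within each z group): {conditional:.3f}")
print(f"marginal slope (z ignored):              {marginal:.3f}")
print(f"marginal slope when b2 = 0:              {no_covariate_effect:.3f}")
```

With these made-up values the marginal slope is roughly half the conditional one even though `z` is unrelated to `x`, and the two agree once the covariate has no effect on the outcome (`b2 = 0`).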
So now we can lay out the second line of argument against logistic regression. Actually, the most impactful way to communicate the argument is not to list out the premises, but instead to use a sort of statistical intuition pump. Consider the data in the right-hand panel of Figure 3. The slope (logistic regression coefficient) of $Y$ on $X$ is, let’s say, some common value $\beta_{\text{within}}$ for both the red group and the blue group. But suppose the color grouping factor is not observed, so that we can only fit the simple/unconditional logistic regression that ignores the color groups. Because of the non-collapsibility of logistic regression coefficients, the slope from this regression (shown in black in Figure 3) is shallower, say, $\beta_{\text{overall}}$, which is closer to 0. But if the slope is $\beta_{\text{within}}$ among both the red and the blue points, and if every point is either red or blue, then who exactly does this $\beta_{\text{overall}}$ slope apply to? What is the substantive interpretation of this slope?
For virtually every logistic regression model that we estimate in the real world, there will be some uncorrelated covariates that are statistically associated with the binary outcome, but that we couldn’t observe to include in the model. In other words, there’s always unobserved heterogeneity in our data on covariates we couldn’t measure. But then—the argument goes—how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?
These are rhetorical questions. The implication is that no meaningful interpretation is possible—or, as Mood (2010, p. 67) puts it, “it is problematic to interpret [logistic regression coefficients] as substantive effects.” I beg to differ. As I argue next, we can interpret logistic regression coefficients perfectly well even in the face of non-collapsibility.
Logistic regression coefficients are about conditional probabilities
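Specifically, we can write logistic regression models directly analogous to models A (with the covariate $Z$) and B (without it) from above as:
$$g\big(\Pr(Y_i = 1 \mid X_i = x, Z_i = z)\big) = \beta_0^A + \beta_1^A x + \beta_2^A z,$$
$$g\big(\Pr(Y_i = 1 \mid X_i = x)\big) = \beta_0^B + \beta_1^B x,$$
where $g(\cdot)$ is the logit link function. As the left-hand sides of these regression equations make clear, $\beta_1^A$ tells us about differences in the probability of $Y$ as $X$ increases conditional on the covariate $Z$ being fixed at some value $z$, while $\beta_1^B$ tells us about differences in the probability of $Y$ as $X$ increases marginal over $Z$. There is no reason to expect these two things to coincide in general unless $\Pr(Y \mid X, Z) = \Pr(Y \mid X)$, which we know from probability theory is only true when $Y$ and $Z$ are conditionally independent given $X$; in terms of our model, when $\beta_2^A = 0$.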
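One way to see why the conditional and marginal slopes differ is to write the marginal probability as an average of conditional ones. This is a sketch that assumes the conditional model holds with coefficients written $\beta_0^A$, $\beta_1^A$, $\beta_2^A$ and that $Z$ is independent of $X$ (the notation is assumed for illustration):

```latex
\begin{aligned}
\Pr(Y = 1 \mid X = x)
  &= \sum_z \Pr(Y = 1 \mid X = x, Z = z)\,\Pr(Z = z) \\
  &= \sum_z g^{-1}\!\left(\beta_0^A + \beta_1^A x + \beta_2^A z\right)\Pr(Z = z).
\end{aligned}
```

Because $g^{-1}$ is nonlinear, applying $g$ to this average does not in general give back a line in $x$ with slope $\beta_1^A$. It does when $\beta_2^A = 0$, since every term in the sum is then identical.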
So now let’s return to the red vs. blue example of Figure 3. We supposed, for illustration’s sake, a common slope within each color group (call it $\beta_{\text{within}}$) and a shallower slope overall (call it $\beta_{\text{overall}}$), ignoring the red vs. blue grouping. Then the first rhetorical question from before asked, “who exactly does this $\beta_{\text{overall}}$ slope apply to?” The answer is that it applies to a population in which we know the $X$ values but we don’t know the $Z$ values, that is, we don’t know the color of any of the data points. There’s an intuition that if the slope is $\beta_{\text{within}}$ among both the red and blue points, then for any new point whose color we don’t know, we ought to guess that the slope that applies to them is also $\beta_{\text{within}}$. But that presupposes that we were able to estimate slopes among both the red and blue groups, which would imply that we did observe the colors of at least some of the points. On the contrary, let me repeat: the $\beta_{\text{overall}}$ slope applies to a population in which we know the $X$ values but we don’t know any of the $Z$ values. Put more formally, the $\beta_{\text{overall}}$ slope refers to changes in $\Pr(Y \mid X)$; there is an intuition that these probabilities ought to equal $\Pr(Y \mid X, Z)$, but these are not the same because the latter still require conditioning on $Z$.
The second rhetorical question from above asked, “how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?” The answer is that we interpret them conditional on all and only the covariates that were included in the model. Again, conceptually speaking, the coefficients refer to a population in which we know the values of the covariates represented in the model and nothing more. There’s no problem with comparing these coefficients between samples or over time as long as these coefficients refer to the same population, that is, populations where the same sets of covariates are observed.
As for comparing coefficients between models with different covariates? Here we must agree with Mood and Allison that, in most cases, these comparisons are probably not informative. But this is not because of “unobserved heterogeneity.” It’s because these coefficients refer to different populations of units. In terms of models A and B from above, the slope for $X$ in Model A, $\beta_1^A$, and the slope for $X$ in Model B, $\beta_1^B$, represent completely different conceptual quantities and it’s a mistake to view estimates of $\beta_1^B$ as somehow being deficient estimates of $\beta_1^A$. As a more general rule, parameters from different models usually mean different things; compare them at your peril. In the logistic regression case, there may be situations where it makes sense to compare estimates of $\beta_1^A$ with estimates of $\beta_1^B$, but not because one thinks they ought to be estimating the same quantity.
Footnotes and References
1 Or which, at least, are not fucked for the given reasons, although they could still be fucked for unrelated reasons.
2 This stuff is also true for some survival analysis models, notably Cox regression.
3 At least, I think they are… a definition of “substantive effects” is never given (are they like causal effects?), but presumably they’re something we want in an interpretation.
Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186-208.
Kuha, J., & Mills, C. (2017). On group comparisons with logistic regression models. Sociological Methods & Research, 0049124117747306.
Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.
Pang, M., Kaufman, J. S., & Platt, R. W. (2013). Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Statistical Methods in Medical Research, 25(5), 1925-1937.
Rohwer, G. (2012). Estimating effects with logit models. NEPS Working Paper 10, German National Educational Panel Study, University of Bamberg.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Hi Jake – Happy to see that you followed up our prior correspondence with this post – this is really great. Your writing on the topic is the most cogent I’ve seen, and I love the general back-to-basics approach of re-examining whether some of our “simplest” tools are maybe not so simple after all.
However, and I don’t mean to be flippant, I’d like to give a “glass half empty” recap of the way I understand the post, and then you can correct me (/ cheer me up):
You basically agree that all of the criticisms are logically correct and suggest only that the premises are limited in applicability. So, we’re still fucked, just not completely! Specifically, the people who *aren’t* fucked are those that (1) don’t care about why the outcome variable is changing OR (2) pursue a strategy that could be understood to compound the problem – i.e., using many measures and many logistic regressions (IRT) – to somehow escape the collapsibility & unobserved heterogeneity problems.
I feel like (1) is highly subjective (I think many people do run multiple logistic regression to understand what elements are responsible for change in an outcome) and (2) is at least not obviously true (are you aware of any quantitative demonstration that IRT solves this problem, and under what conditions?). So, I still feel a little bit fucked.
Two perhaps more positive/constructive comments: (A) I am a bit surprised to not see a suggestion to use a linear probability model as a sensitivity analysis for binomial regression, given these issues; is there a reason you do not mention that possibility? (B) would you suggest recasting binomial outcomes as theoretically meaningful to allow the use of other models (e.g., time to event rather than event occurrence, enabling use of Cox etc) without these problems, or are these models also possibly fucked?
Again, fantastic stuff here.
Hey Chris, thanks for the kind words about the post. It was fun to learn more about this stuff while writing it.
So there are two separate arguments I’m responding to in this post, and correct me if I’m wrong, but it looks like in your comment you’re only referring to the first argument (about latent heteroskedasticity as I’ve called it). So I won’t mention the collapsibility argument either here.
Yes, I find argument #1 to be valid but to have limited applicability. I mean “limited” in two separate senses here. First, it’s limited compared to the apparently universal applicability suggested by Mood & Allison. I hope you at least agree with this. Second, it’s limited in that it likely only applies in a minority of cases. I agree that this is a much more subjective assessment, and one that almost certainly depends on the research in question and so forth. But, well, it’s my assessment, and I’ve offered counter-arguments to try to back this assessment up (by questioning the premises that I labeled 1-3).
Now let me respond a little more specifically. You wrote:
“the people who *aren’t* fucked are those that (1) don’t care about why the outcome variable is changing…”
Well, no… two responses here.
First, I don’t think it’s fair to characterize people who don’t care about distinguishing Scenarios A and B in my Figure 2 as “not caring WHY the outcome variable is changing.” One can be indifferent about distinguishing Scenarios A and B and yet still care about, e.g., explaining the causal path that links X to Y, for example by studying mediators and moderators of the relationship. In social sciences, explanations at this causal level are usually what people mean when they talk about “why” an outcome variable is changing.
Second, you left out another important condition, which is that argument #1 only goes through if the latent variable interpretation is correct! That is, if the thresholding process that it’s based on is the true data generating process. If in fact there is no thresholding mechanism working behind the scenes for the case at hand, then argument #1 is irrelevant. Of course, we can’t directly verify whether the thresholding process is true; we can only judge whether that would make conceptual sense and is a priori likely. And in a lot of cases, the idea of an underlying latent variable and deterministic threshold is a bit of a stretch; in such cases, basically the outcome either happened or it didn’t, and it’s not clear what the interpretation of a continuous outcome would even be (as in the survival example I gave). That said, I do accept that there are plenty of cases where such an interpretation would make good sense.
So, again, for us to accept the conclusion of argument #1, both of the following (as well as the multiple indicators thing, addressed below) have to be true: the latent variable + threshold interpretation is quite plausible, and if it were true then we’d only be interested in one of the scenarios from Figure 2 — usually Scenario A.
About the multiple indicators issue, you wrote:
“people who *aren’t* fucked are those that […] (2) pursue a strategy that could be understood to compound the problem – i.e., using many measures and many logistic regressions (IRT) – to somehow escape the collapsibility & unobserved heterogeneity problems. […] (2) is at least not obviously true.”
I think I can convince you pretty easily that it’s true. Basically you need to have multiple thresholds, which you can get either by having multiple binary outcomes (as in IRT) or a single ordinal response with >2 categories (as in the unequal-variance signal detection model). Then the latent variance parameter (or, more precisely, a variance ratio) is identifiable.
To see this more concretely, consider Figure 2 again. Now imagine that, in addition to the threshold already drawn, there were a second threshold that sat right exactly in the middle of the Group 1 distribution, so that half the mass lies on either side of the threshold. How would this new threshold carve up the Group 2 distribution? The key is that it would do so differently in Scenario A vs. Scenario B. In Scenario B, clearly this new threshold would carve up Group 2 in the same way it did Group 1, that is, with half the probability mass on either side of the threshold. But in Scenario A, this threshold would carve up Group 2 such that about 25% of the mass would be below the new threshold and 75% of the mass above the new threshold. So if there are multiple thresholds, then Scenarios A and B entail different observed probabilities across the 4 response combinations for Group 2, which makes it possible to statistically distinguish the two scenarios.
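The two-threshold argument can be checked numerically. This is a rough sketch that assumes a Normal latent variable with mean 0 and standard deviation 1 in Group 1 and the 10% vs. 25% rates from the voting example; the variable names are made up for illustration.

```python
from statistics import NormalDist

std = NormalDist(0.0, 1.0)
first_cut = std.inv_cdf(0.90)   # original threshold: 10% of Group 1 lies above it
second_cut = 0.0                # new threshold at the middle of the Group 1 distribution
z75 = std.inv_cdf(0.75)

# Group 2 under each scenario, built so that 25% lies above the original threshold.
group2_a = NormalDist(first_cut - z75, 1.0)   # Scenario A: mean shift only
group2_b = NormalDist(0.0, first_cut / z75)   # Scenario B: variance increase only

above_first_a = 1.0 - group2_a.cdf(first_cut)
above_first_b = 1.0 - group2_b.cdf(first_cut)
above_second_a = 1.0 - group2_a.cdf(second_cut)
above_second_b = 1.0 - group2_b.cdf(second_cut)

print(f"above original threshold: A={above_first_a:.3f}  B={above_first_b:.3f}")
print(f"above second threshold:   A={above_second_a:.3f}  B={above_second_b:.3f}")
```

Both scenarios agree at the original threshold, but at the second one Scenario B leaves exactly half of Group 2 above it while Scenario A leaves roughly 73% above it under these assumptions, in line with the "about 75%" above.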
As for potentially “compounding the problem,” I guess I’m not sure what you mean. And to answer your question, no, I haven’t read a paper that used IRT methodology to distinguish Scenarios A and B — which probably makes sense because I haven’t read that many papers that use IRT methodology period. But I think it’s pretty easy to see how this would work, as I hope I’ve convinced you above.
“(A) I am a bit surprised to not see a suggestion to use a linear probability model as a sensitivity analysis for binomial regression, given these issues; is there a reason you do not mention that possibility?”
No particular reason. The idea of doing both analyses and checking to see whether they give similar answers seems fine to me. In cases where they do, arguably you don’t need to worry too much about the issues in this blog post. In cases where they don’t agree, there’s a question of which model to trust more. In those cases I’d argue that logistic regression generally has a more statistically sound basis, considering (a) the problems the linear probability model (LPM) has with predicting probabilities outside of the [0, 1] range (and the resulting inconsistency of its parameter estimates) and (b) the implausibility of a truly linear response function near the 0/1 boundaries. Both of these issues would especially be a problem when predicting a substantially unbalanced binary response.
Certainly if you’re not in one of the situations implicated by argument #1, then I can’t see a good reason to trust the LPM results more than the logistic regression results. But even if you are in one of the situations implicated in argument #1, it’s not at all clear how the LPM can save you. In those situations, you believe in and care about a latent continuous variable, but the LPM contains no statistical mechanism for telling you anything about that latent variable. So I can’t say I really see how the LPM would help un-fuck you there.
“(B) would you suggest recasting binomial outcomes as theoretically meaningful to allow the use of other models (e.g., time to event rather than event occurrence, enabling use of Cox etc) without these problems, or are these models also possibly fucked?”
Hopefully I’m interpreting you correctly here — but yes, if you have other relevant data about the response, like time-to-event data, then by all means you should probably use it! With that said, certain time-to-event models (but not all) also exhibit the non-collapsibility of argument #2, notably Cox regression as mentioned in Footnote 2.