Criteria for a Confounding Factor

We can summarize thus far with the observation that for a variable to be a confounder, it must have three necessary (but not sufficient or defining) characteristics, which we will discuss in detail. We will then point out some limitations of these characteristics in defining and identifying confounding.

A confounding factor must be an extraneous risk factor for the disease.

As mentioned earlier, a potential confounding factor need not be an actual cause of the disease, but if it is not, it must be a surrogate for an actual cause of the disease other than exposure. This condition implies that the association between the potential confounder and the disease must occur within levels of the study exposure. In particular, a potentially confounding factor must be a risk factor within the reference level of the exposure under study. The data may serve as a guide to the relation between the potential confounder and the disease, but it is the actual relation between the potentially confounding factor and disease, not the apparent relation observed in the data, that determines whether confounding can occur. In large studies, which are subject to less sampling error, we expect the data to reflect more closely the underlying relation, but in small studies the data are a less reliable guide, and one must consider other, external evidence (“prior knowledge”) regarding the relation of the factor to the disease.

The following example illustrates the role that prior knowledge can play in evaluating confounding. Suppose that in a cohort study of airborne glass fibers and lung cancer, the data show more smoking and more cancers among the heavily exposed but no relation between smoking and lung cancer within exposure levels. The latter absence of a relation does not mean that an effect of smoking was not confounded (mixed) with the estimated effect of glass fibers: It may be that some or all of the excess cancers in the heavily exposed were produced solely by smoking, and that the lack of a smoking-cancer association in the study cohort was produced by an unmeasured confounder of that association in this cohort, or by random error.

As a converse example, suppose that we conduct a cohort study of sunlight exposure and melanoma. Our best current information indicates that, after controlling for age and geographic area of residence, there is no relation between Social Security number and melanoma occurrence. Thus, we would not consider Social Security number a confounder, regardless of its association with melanoma in the reference exposure cohort, because we think it is not a risk factor for melanoma in this cohort, given age and geographic area (i.e., we think Social Security numbers do not affect melanoma rates and are not markers for some melanoma risk factor other than age and area). Even if control of Social Security number would change the effect estimate, the resulting estimate of effect would be less accurate than one that ignores Social Security number, given our prior information about the lack of real confounding by Social Security number.

Nevertheless, because external information is usually limited, investigators often rely on their data to infer the relation of potential confounders to the disease. This reliance can be rationalized if one has good reason to suspect that the external information is not very relevant to one’s own study. For example, a cause of disease in one population will be causally unrelated to disease in another population that lacks complementary component causes. A discordance between the data and external information about a suspected or known risk factor may therefore signal an inadequacy in the detail of information about interacting factors rather than an error in the data. Such an explanation may be less credible for variables such as age, sex, and smoking, whose joint relation to disease is often thought to be fairly stable across populations. In a parallel fashion, external information about the absence of an effect for a possible risk factor may be considered inadequate if the external information is based on studies that had a considerable bias toward the null.
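As a rough illustration of the first criterion, the sketch below computes the risk ratio for a potential confounder within the reference (unexposed) level of the study exposure only. The counts and the function names are hypothetical and chosen for arithmetic convenience. As emphasized above, a ratio observed in the data is only a guide to the actual relation, so a value near 1 in a small study would not by itself rule out confounding.

```python
def risk(cases, total):
    """Proportion of subjects who developed the disease."""
    if total <= 0:
        raise ValueError("total must be positive")
    return cases / total


def risk_ratio_within_reference(cases_with, n_with, cases_without, n_without):
    """Risk ratio for the potential confounder among the unexposed only.

    cases_with / n_with: unexposed subjects who have the factor.
    cases_without / n_without: unexposed subjects who lack the factor.
    """
    baseline = risk(cases_without, n_without)
    if baseline == 0:
        raise ValueError("risk ratio undefined when baseline risk is zero")
    return risk(cases_with, n_with) / baseline


# Hypothetical unexposed cohort (illustrative numbers, not study data):
# 30 cases among 1000 subjects with the factor, 10 among 1000 without it.
rr_reference = risk_ratio_within_reference(30, 1000, 10, 1000)
print(f"Risk ratio among the unexposed: {rr_reference:.2f}")
```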

A confounding factor must be associated with the exposure under study in the source population (the population at risk from which the cases are derived).

To produce confounding, the association between a potential confounding factor and the exposure must be in the source population of the study cases. In a cohort study, the source population corresponds to the study cohort and so this proviso implies only that the association between a confounding factor and the exposure exists among subjects that compose the cohort. Thus, in cohort studies, the exposure-confounder association can be determined from the study data alone and does not even theoretically depend on prior knowledge if no measurement error is present.

When the exposure under study has been randomly assigned, it is sometimes mistakenly thought that confounding cannot occur because randomization guarantees exposure will be independent of (unassociated with) other factors. Unfortunately, this independence guarantee is only on average across repetitions of the randomization procedure. In almost any given single randomization (allocation), including those in actual studies, there will be random associations of the exposure with extraneous risk factors. As a consequence, confounding can and does occur in randomized trials. Although this random confounding tends to be small in large randomized trials, it will often be large within small trials and within small subgroups of large trials. Furthermore, heavy nonadherence or noncompliance (failure to follow the assigned treatment protocol) or drop-out can result in considerable nonrandom confounding, even in large randomized trials.
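The point about single randomizations can be illustrated with a small simulation. The sketch below is a minimal example under assumed values: a binary extraneous risk factor with 30% prevalence, equal allocation to two arms, and 500 repeated randomizations per trial size. None of these numbers come from the text. It reports the average absolute difference in risk-factor prevalence between the arms, which should be noticeably larger for the small trial than for the large one.

```python
import random


def mean_abs_imbalance(n_subjects, prevalence=0.3, n_trials=500, seed=0):
    """Average absolute between-arm difference in risk-factor prevalence
    over repeated random allocations of n_subjects (half to each arm)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    half = n_subjects // 2
    total = 0.0
    for _ in range(n_trials):
        # 1 = subject carries the extraneous risk factor, 0 = does not
        factor = [1 if rng.random() < prevalence else 0
                  for _ in range(n_subjects)]
        rng.shuffle(factor)  # random allocation: first half is the treated arm
        treated = factor[:half]
        control = factor[half:2 * half]
        total += abs(sum(treated) / half - sum(control) / half)
    return total / n_trials


small_trial = mean_abs_imbalance(20)
large_trial = mean_abs_imbalance(2000)
print(f"n = 20:   mean absolute imbalance = {small_trial:.3f}")
print(f"n = 2000: mean absolute imbalance = {large_trial:.3f}")
```

The imbalance never vanishes in any single allocation. It only averages out across repetitions, which is the sense in which randomization guarantees independence.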

In a case-control study, the association of exposure and the potential confounder must be present in the source population that gave rise to the cases. If the control series is large and there is no selection bias or measurement error, the controls will provide a reasonable estimate of the association between the potential confounding variable and the exposure in the source population, and this association can be checked with the study data. In general, however, the controls may not adequately estimate the degree of association between the potential confounder and the exposure in the source population that produced the study cases. If information is available on this population association, it can be used to adjust findings from the control series. Unfortunately, reliable external information about the associations among risk factors in the source population is seldom available. Thus, in case-control studies, concerns about the control group will have to be considered in estimating the association between the exposure and the potentially confounding factor, for example, via bias analysis.

Consider a nested case-control study of occupational exposure to airborne glass fibers and the occurrence of lung cancer that randomly sampled cases and controls from cases and persons at risk in an occupational cohort. Suppose that we knew the association of exposure and smoking in the full cohort, as we might if this information were recorded for the entire cohort. We could then use the discrepancy between the true association and the exposure-smoking association observed in the controls as a measure of the extent to which random sampling had failed to produce representative controls. Regardless of the size of this discrepancy, if there were no association between smoking and exposure in the source cohort, smoking would not be a true confounder (even if it appeared to be one in the case-control data), and the unadjusted estimate would be the best available estimate. More generally, we could use any information on the entire cohort to make adjustments to the case-control estimate, in a fashion analogous to two-stage studies.
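To make the comparison in the nested case-control example concrete, the sketch below computes the exposure-smoking odds ratio in a full cohort and in a sample of controls, then takes their ratio as one possible measure of the discrepancy. All counts are hypothetical. Measuring association with the odds ratio, and discrepancy with a ratio of odds ratios, are assumptions made for illustration, since the text does not prescribe a particular measure.

```python
def odds_ratio(a, b, c, d):
    """Exposure-smoking odds ratio for a 2x2 table.

    a = exposed smokers,   b = exposed nonsmokers,
    c = unexposed smokers, d = unexposed nonsmokers.
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical full cohort: smoking is unassociated with exposure
# (40% smokers among both the exposed and the unexposed).
cohort_or = odds_ratio(400, 600, 800, 1200)

# Hypothetical controls sampled from that cohort; by chance the sample
# shows more smoking among the exposed than among the unexposed.
control_or = odds_ratio(25, 25, 35, 65)

# Ratio of the two odds ratios: 1.0 would mean the controls reproduce
# the cohort association exactly.
discrepancy = control_or / cohort_or
print(f"Cohort OR = {cohort_or:.2f}, control OR = {control_or:.2f}, "
      f"ratio = {discrepancy:.2f}")
```

In this illustration the cohort odds ratio is 1, so smoking would not be a true confounder even though the control series suggests an association.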


Unfinished, keep updating …