## Linear Regression

The Regression Equation

When analyzing data, it is essential to first construct a graph of the data. A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called bivariate quantitative data.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. In general, an error, e, is the signed vertical distance from the line to a data point. To decide which of two lines fits the data better, we compute the sum of the squared errors for each. The least-squares criterion states that, among all lines, the line that best fits a set of data points is the one having the smallest possible sum of squared errors.
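The least-squares criterion can be made concrete with a short sketch. The data set and the competing intercept/slope pairs below are made up for illustration; the first pair is the least-squares line for these data.

```python
# Compare the sum of squared errors (SSE) for two candidate lines
# through a small made-up data set.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

def sse(b0, b1):
    # signed vertical error e = y - (b0 + b1*x), squared and summed
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

line_a = sse(1.3, 0.9)   # the least-squares line for these data
line_b = sse(1.0, 1.0)   # an arbitrary competing line
print(line_a, line_b)    # the least-squares line has the smaller SSE
```

By the least-squares criterion, no choice of intercept and slope can produce a smaller SSE than the first line.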

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note that although we have not used Syy in Formula 14.1, we will use it later.

For a linear regression y = b0 + b1x, y is the dependent variable and x is the independent variable. However, in the context of regression analysis, we usually call y the response variable and x the predictor variable or explanatory variable (because it is used to predict or explain the values of the response variable).
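The slope and intercept that satisfy the least-squares criterion follow from the summation notation used with Formula 14.1: b1 = Sxy / Sxx and b0 = ȳ − b1x̄. A minimal sketch on a made-up data set:

```python
# Least-squares slope and intercept: b1 = Sxy / Sxx, b0 = ybar - b1*xbar.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx          # slope
b0 = ybar - b1 * xbar   # y-intercept
print(b0, b1)           # regression equation: yhat = b0 + b1*x
```
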

Extrapolation

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside the range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

Outliers and Influential Observations

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

A Warning on the Use of Linear Regression

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of b0 and b1 to obtain a regression line for these data points. The result, however, will yield an inappropriate fit by a line when in fact a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

The Coefficient of Determination

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, SST. Thus, SST = Σ(yi − ȳ)². If we divide SST by n − 1, we get the sample variance of the observed values of the response variable. So, SST really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, the one corresponding to the data point (xi, yi). The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, yi − ȳ. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, ŷi − ȳ, and the remaining unexplained deviation, yi − ŷi. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is Σ(ŷi − ȳ)². This measure of variation is called the regression sum of squares, SSR. Thus, SSR = Σ(ŷi − ȳ)².

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, SSR / SST. This quantity is called the coefficient of determination and is denoted r². Thus, r² = SSR / SST. Similarly, the deviation not explained by the regression is yi − ŷi. The amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is Σ(yi − ŷi)². This measure of variation is called the error sum of squares, SSE. Thus, SSE = Σ(yi − ŷi)².
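The three sums of squares and r² can be computed directly from their definitions. The data set and the fitted line yhat = 1.3 + 0.9x below are made up for illustration:

```python
# Sums of squares and the coefficient of determination for the
# least-squares line yhat = 1.3 + 0.9*x on a made-up data set.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
ybar = sum(y) / len(y)
yhat = [1.3 + 0.9 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total variation
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # explained by regression
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
r2 = SSR / SST
print(SST, SSR, SSE, r2)
```
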

In summary, see Definition 14.6.

And the coefficient of determination, r², is the proportion of variation in the observed values of the response variable explained by the regression. The coefficient of determination always lies between 0 and 1. A value of r² near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of r² near 1 suggests that the regression equation is quite useful for making predictions.

Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. Because of the regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: r² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST. This formula shows that, when expressed as a percentage, the coefficient of determination can also be interpreted as the percentage reduction in total squared error obtained by using the regression equation, instead of the mean, ȳ, to predict the observed values of the response variable.
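The regression identity and the equivalent form of r² can be verified numerically on the same made-up data set with fitted line yhat = 1.3 + 0.9x:

```python
# Check the regression identity SST = SSR + SSE and the equivalent
# form r2 = 1 - SSE/SST on a made-up data set.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
ybar = sum(y) / len(y)
yhat = [1.3 + 0.9 * xi for xi in x]
SST = sum((yi - ybar) ** 2 for yi in y)
SSR = sum((yh - ybar) ** 2 for yh in yhat)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
print(abs(SST - (SSR + SSE)))       # ~0: the identity holds
print(SSR / SST, 1 - SSE / SST)     # two equal forms of r2
```
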

Correlation and Causation

Two variables may have a high correlation without being causally related. Rather, we can only infer that the two variables have a strong tendency to increase (or decrease) simultaneously and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called lurking variables, that cause the changes in the two variables under consideration.

The Regression Model; Analysis of Residuals

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line y = β0 + β1x – on which the conditional means of the response variable lie – as the population regression line and to its equation as the population regression equation. Observe that β0 is the y-intercept of the population regression line and β1 is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don't violate any of those assumptions too badly.

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants β0, β1, and σ such that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean β0 + β1x and standard deviation σ.

Because the parameters β0, β1, and σ are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 to estimate β0 and we use b1 to estimate β1. We note that b0 is an unbiased estimator of β0 and that b1 is an unbiased estimator of β1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

The statistic used to obtain a point estimate for the common conditional standard deviation σ is called the standard error of the estimate. The standard error of the estimate can be computed as se = √(SSE / (n − 2)).
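A minimal sketch of the standard error of the estimate, se = √(SSE / (n − 2)), again on the made-up data set with fitted line yhat = 1.3 + 0.9x:

```python
import math

# Standard error of the estimate for a fitted least-squares line.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
yhat = [1.3 + 0.9 * xi for xi in x]
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
se = math.sqrt(SSE / (n - 2))   # divide by n - 2, not n - 1
print(se)
```
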

Analysis of Residuals

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,

Residual: ei = yi − ŷi

We can show that the sum of the residuals is always 0, which, in turn, implies that ē = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by n − 1 instead of n − 2). Thus, the standard error of the estimate is sometimes called the residual standard deviation.
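The zero-sum property of least-squares residuals is easy to confirm numerically (made-up data, fitted line yhat = 1.3 + 0.9x):

```python
# Residuals of a least-squares fit sum to 0 (up to rounding error),
# so their mean is 0 as well.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
resid = [yi - (1.3 + 0.9 * xi) for xi, yi in zip(x, y)]
print(sum(resid))   # ~0
```
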

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis. In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation σ. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, let's consider the three plots in Figure 15.6. In Figure 15.6(a), the residuals are scattered about the x-axis (residual = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6(b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6(c) suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.

Inferences for the Slope of the Population Regression Line

Suppose that the variables x and y satisfy the assumptions for regression inferences. Then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean β0 + β1x and standard deviation σ. Of particular interest is whether the slope, β1, of the population regression line equals 0. If β1 = 0, then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean β0 and standard deviation σ. Because x does not appear in either of those two parameters, it is useless as a predictor of y.

Of note, although x alone may not be useful for predicting y, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is not useful for predicting y, we really mean that the regression equation with x as the only predictor variable is not useful for predicting y. Conversely, although x alone may be useful for predicting y, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is useful for predicting y, we really mean that the regression equation with x as the only predictor variable is useful for predicting y.

We can decide whether x is useful as a (linear) predictor of y – that is, whether the regression equation has utility – by performing the hypothesis test H0: β1 = 0 (x is not useful for predicting y) versus Ha: β1 ≠ 0 (x is useful for predicting y).

We base hypothesis tests for β1 on the statistic b1. From the assumptions for regression inferences, we can show that the sampling distribution of the slope of the regression line is a normal distribution whose mean is the slope, β1, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standardized variable z = (b1 − β1) / (σ / √Sxx)

has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, σ, is unknown. We therefore replace σ with its sample estimate se, the standard error of the estimate. As you might suspect, the resulting variable has a t-distribution.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis H0: π½1 = 0, we can use the variable t as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the regression t-test.
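A minimal sketch of the regression t-test statistic, t = b1 / (se / √Sxx) with df = n − 2, on the made-up data set used throughout:

```python
import math

# Regression t-test statistic for H0: beta1 = 0.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(SSE / (n - 2))
t = b1 / (se / math.sqrt(Sxx))
print(t)   # compare with t critical values for df = n - 2
```

The observed t is then compared with critical values (or a P-value) from the t-table with n − 2 degrees of freedom.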

Confidence Intervals for the Slope of the Population Regression Line

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for β1 is provided by b1. To determine a confidence-interval estimate for β1, we apply Key Fact 15.4 to obtain Procedure 15.2, called the regression t-interval procedure.
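The interval has the form b1 ± t* · se / √Sxx, where t* is the t-table value for df = n − 2. A sketch on the same made-up data; for df = 3 at the 95% level, t* ≈ 3.182:

```python
import math

# 95% regression t-interval for beta1: b1 ± t* · se / sqrt(Sxx).
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
se = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                   for xi, yi in zip(x, y)) / (n - 2))
tstar = 3.182                       # t-table value, df = 3, 95% level
margin = tstar * se / math.sqrt(Sxx)
print(b1 - margin, b1 + margin)
```
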

Estimation and Prediction

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter σ, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace σ by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Recalling that β0 + β1xp is the conditional mean of the response variable corresponding to the value xp of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the conditional mean t-interval procedure.
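The interval has the form ŷp ± t* · se · √(1/n + (xp − x̄)²/Sxx), with df = n − 2. A sketch on the same made-up data, taking xp = 3 and t* ≈ 3.182 (df = 3, 95% level):

```python
import math

# Conditional mean t-interval for the mean response at x = xp.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
se = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                   for xi, yi in zip(x, y)) / (n - 2))
xp = 3.0
yhat_p = b0 + b1 * xp
half = 3.182 * se * math.sqrt(1 / n + (xp - xbar) ** 2 / Sxx)
print(yhat_p - half, yhat_p + half)
```

Note that the half-width grows as xp moves away from x̄, which is one reason extrapolation is risky.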

Prediction Intervals

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. The term confidence is usually reserved for interval estimates of parameters; the term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable yp − ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter σ, it cannot be used as a basis for a prediction-interval formula. So we replace σ by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the predicted value t-interval procedure.
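The prediction interval has the form ŷp ± t* · se · √(1 + 1/n + (xp − x̄)²/Sxx), with df = n − 2. The extra "1 +" under the radical makes prediction intervals wider than the corresponding conditional mean intervals. A sketch on the same made-up data, with xp = 3 and t* ≈ 3.182 (df = 3, 95% level):

```python
import math

# Predicted value t-interval for a single new observation at x = xp.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
se = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                   for xi, yi in zip(x, y)) / (n - 2))
xp = 3.0
yhat_p = b0 + b1 * xp
half = 3.182 * se * math.sqrt(1 + 1 / n + (xp - xbar) ** 2 / Sxx)
print(yhat_p - half, yhat_p + half)
```
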

Inferences in Correlation

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, ρ. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, ρ actually describes the strength of the linear relationship between two variables; r is only an estimate of ρ obtained from sample data.

The population linear correlation coefficient of two variables x and y always lies between −1 and 1. Values of ρ near −1 or 1 indicate a strong linear relationship between the variables, whereas values of ρ near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, ρ. Consequently, we can use r as a basis for performing a hypothesis test for ρ.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis H0: ρ = 0, we use the t-score as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the correlation t-test.
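The correlation t-test statistic is t = r√((n − 2)/(1 − r²)), with df = n − 2. A sketch on the made-up data set used throughout; note that for simple linear regression this t equals the regression t-test statistic for the slope:

```python
import math

# Correlation t-test statistic for H0: rho = 0.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
r = Sxy / math.sqrt(Sxx * Syy)
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(r, t)   # compare t with t critical values for df = n - 2
```
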

## Inferences for Population Standard Deviations

Inferences for One Population Standard Deviation

Suppose that we want to obtain information about a population standard deviation. If the population is small, we can often determine σ exactly by first taking a census and then computing σ from the population data. However, if the population is large, which is usually the case, a census is generally not feasible, and we must use inferential methods to obtain the required information about σ.

Logic Behind

Recall that to perform a hypothesis test with null hypothesis H0: μ = μ0 for the mean, μ, of a normally distributed variable, we do not use the variable x̄ as the test statistic; rather, we use the t-statistic. Similarly, when performing a hypothesis test with null hypothesis H0: σ = σ0 for the standard deviation, σ, of a normally distributed variable, we do not use the variable s as the test statistic; rather, we use a modified version of that variable: χ² = (n − 1)s² / σ0².

This variable has a chi-square distribution with n − 1 degrees of freedom.

In light of Key Fact 11.2, for a hypothesis test with null hypothesis H0: σ = σ0, we can use the variable χ² as the test statistic and obtain the critical value(s) from the χ²-table. We call this hypothesis-testing procedure the one-standard-deviation χ²-test.
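A minimal sketch of the test statistic χ² = (n − 1)s² / σ0². The sample values and the hypothesized σ0 = 2 below are made up for illustration:

```python
import statistics

# One-standard-deviation chi-square test statistic, df = n - 1.
sample = [4.8, 5.6, 5.1, 4.3, 5.9, 5.3]
n = len(sample)
s2 = statistics.variance(sample)   # sample variance (divides by n - 1)
sigma0 = 2.0                       # hypothesized standard deviation
chi2 = (n - 1) * s2 / sigma0 ** 2
print(chi2)   # compare with chi-square critical values, df = n - 1
```
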

Procedure 11.1 gives a step-by-step method for performing a one-standard-deviation χ²-test by using either the critical-value approach or the P-value approach. For the P-value approach, we could use the χ²-table to estimate the P-value, but doing so is awkward and tedious; thus, we recommend using statistical software.

Unlike the z-tests and t-tests for one and two population means, the one-standard-deviation χ²-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed or very nearly so.

Consequently, before applying Procedure 11.1, construct a normal probability plot. If the plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.1. We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for a population standard deviation. If you have doubts about the normality of the variable under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for a population standard deviation.

In addition, using Key Fact 11.2, we can also obtain a confidence-interval procedure for a population standard deviation. We call this procedure the one-standard-deviation χ²-interval procedure and present it as Procedure 11.2. This procedure is also known as the χ²-interval procedure for one population standard deviation. This confidence-interval procedure is often formulated in terms of variance instead of standard deviation. Like the one-standard-deviation χ²-test, this procedure is not at all robust to violations of the normality assumption.

Inferences for Two Population Standard Deviations, Using Independent Samples

We now introduce hypothesis tests and confidence intervals for two population standard deviations. More precisely, we examine inferences to compare the standard deviations of one variable of two different populations. Such inferences are based on a distribution called the F-distribution. In many statistical analyses that involve the F-distribution, we also need to determine F-values having areas 0.005, 0.01, 0.025, and 0.10 to their left. Although such F-values aren't available directly from Table VIII, we can obtain them indirectly from the table by using Key Fact 11.4.

Logic Behind

To perform hypothesis tests and obtain confidence intervals for two population standard deviations, we need Key Fact 11.5, that is, the distribution of the F-statistic for comparing two population standard deviations. By definition, the F-statistic is F = s1² / s2².

In light of Key Fact 11.5, for a hypothesis test with null hypothesis H0: σ1 = σ2 (population standard deviations are equal), we can use the variable F = s1² / s2² as the test statistic and obtain the critical value(s) from the F-table. We call this hypothesis-testing procedure the two-standard-deviations F-test. Procedure 11.3 gives a step-by-step method for performing a two-standard-deviations F-test by using either the critical-value approach or the P-value approach.
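A minimal sketch of the F-statistic on two made-up independent samples; the degrees of freedom are (n1 − 1, n2 − 1):

```python
import statistics

# Two-standard-deviations F-test statistic: F = s1^2 / s2^2.
sample1 = [12.1, 14.3, 13.0, 15.2, 12.8]
sample2 = [11.9, 12.4, 12.1, 12.6, 12.2, 12.0]
F = statistics.variance(sample1) / statistics.variance(sample2)
print(F)   # compare with F critical values for df = (4, 5)
```
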

For the P-value approach, we could use the F-table to estimate the P-value, but doing so is awkward and tedious; thus, we recommend using statistical software.

Unlike the z-tests and t-tests for one and two population means, the two-standard-deviations F-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed, or very nearly so, on each population.

Consequently, before applying Procedure 11.3, construct a normal probability plot of each sample. If either plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.3.

We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for comparing two population standard deviations. If you have doubts about the normality of the variable on the two populations under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for two population standard deviations.

Using Key Fact 11.5, we can also obtain a confidence-interval procedure, Procedure 11.4, for the ratio of two population standard deviations. We call it the two-standard-deviations F-interval procedure. Also it is known as the F-interval procedure for two population standard deviations and the two-sample F-interval procedure. This confidence-interval procedure is often formulated in terms of variances instead of standard deviations.

To interpret confidence intervals for the ratio, σ1 / σ2, of two population standard deviations, considering three cases is helpful.

Case 1: The endpoints of the confidence interval are both greater than 1.

To illustrate, suppose that a 95% confidence interval for π1 / π2 is from 5 to 8. Then we can be 95% confident that π1 / π2 lies somewhere between 5 and 8 or, equivalently, 5π2 < π1 < 8π2. Thus, we can be 95% confident that π1 is somewhere between 5 and 8 times greater than π2.

Case 2: The endpoints of the confidence interval are both less than 1.

To illustrate, suppose that a 95% confidence interval for σ1 / σ2 is from 0.5 to 0.8. Then we can be 95% confident that σ1 / σ2 lies somewhere between 0.5 and 0.8 or, equivalently, 0.5σ2 < σ1 < 0.8σ2. Thus, noting that 1/0.5 = 2 and 1/0.8 = 1.25, we can be 95% confident that σ1 is somewhere between 1.25 and 2 times less than σ2.

Case 3: One endpoint of the confidence interval is less than 1 and the other is greater than 1.

To illustrate, suppose that a 95% confidence interval for σ1 / σ2 is from 0.5 to 8. Then we can be 95% confident that σ1 / σ2 lies somewhere between 0.5 and 8 or, equivalently, 0.5σ2 < σ1 < 8σ2. Thus, we can be 95% confident that σ1 is somewhere between 2 times less than and 8 times greater than σ2.

## Stage, Expression, and Causal Model of Diseases

Natural History of Disease

Stage of Disease

The natural history of disease refers to the progression of a disease in an individual over time. This includes all relevant phenomena from before initiation of the disease (the stage of susceptibility) until its resolution. In the period following exposure to the causal factor, the individual enters a stage of subclinical disease (also called the preclinical phase). For infectious agents, this corresponds to the incubation period during which the agent multiplies within the body but has not yet produced discernible signs or symptoms. For noninfectious diseases, this corresponds to the induction period between a causal action and disease initiation.

The stage of clinical disease begins with a patient's first symptoms and ends with resolution of the disease. Be aware that the onset of symptoms marks the beginning of this stage, not the time of diagnosis. The time-lag between the onset of symptoms and diagnosis of disease can be considerable. Resolution of the disease may come by means of recovery or death. When recovery is incomplete the individual may be left with disability.

Incubation periods of infectious diseases vary considerably. Some infectious diseases are characterized by short incubation periods. Others are characterized by intermediate incubation periods. Still others are characterized by extended incubation periods. Note that even for a given infectious disease, the incubation period may vary considerably. For example, the incubation period for human immunodeficiency virus (HIV) and AIDS ranges from 3 to more than 20 years.

Induction periods for noninfectious diseases also exhibit a range. For example, the induction period for leukemia following exposure to fallout from the atomic bomb blast in Hiroshima ranged from 2 to more than 12 years. Variability in incubation is due to differences in host resistance, pathogenicity of the agent, the exposure dose, and the prevalence and availability of cofactors responsible for disease.

Understanding the natural history of a disease is essential when studying its epidemiology. For example, the epidemiology of HIV/AIDS can be understood only after identifying its multifarious stages. Exposure to HIV is followed by an acute response that may be accompanied by unrecognized flu-like symptoms. During this acute viremic phase, prospective cases do not exhibit detectable antibodies in their serum, yet may still transmit the agent. During a lengthy induction, CD4+ lymphocyte counts decline while the patient is still free from symptoms. The risk of developing AIDS is low during these initial years but increases over time as the immune response is progressively destroyed, after which AIDS may express itself in different forms (e.g., opportunistic infections, encephalitis, Kaposi's sarcoma, dementia, wasting syndrome).

A slightly more sophisticated view of the natural history of disease divides the subclinical stage of disease into an induction period and a latent period. Induction occurs in the interval between a causal action and the point at which the occurrence of the disease becomes inevitable. A latent period follows after the disease becomes inevitable but before clinical signs arise. During this latent phase, various causal factors may promote or retard the progression of the disease. The induction and promotion stages combined are referred to as the empirical induction period. This more sophisticated view better suits the consideration of multifactorial diseases, in which multiple factors must act together to cause disease.

Stage of Prevention

Disease prevention efforts are classified according to the stage of disease at which they occur. Primary prevention is directed toward the stage of susceptibility. The goal of primary prevention is to prevent the disease from occurring in the first place. Examples of primary prevention include needle-exchange programs to prevent the spread of HIV, vaccination programs, and smoking prevention programs.

Secondary prevention is directed toward the subclinical stage of disease, after the individual has been exposed to the causal factor. The goal of secondary prevention is to prevent the disease from emerging or delay its emergence by extending the induction period. It also aims to reduce the severity of the disease once it emerges. Treating asymptomatic HIV-positive patients with antiretroviral agents to delay the onset of AIDS is a form of secondary prevention.

Tertiary prevention is directed toward the clinical stage of disease. The aim of tertiary prevention is to prevent or minimize the progression of the disease or its sequelae. For example, screening and treating diabetics for diabetic retinopathy to avert progression to blindness is a form of tertiary prevention.

Variability in the Expression of Disease

Spectrum of Disease

Diseases often display a broad range of manifestations and severities. This is referred to as the spectrum of disease. Both infectious and noninfectious diseases exhibit such spectra. For infectious diseases, there is a gradient of infection. As an example, HIV infection ranges from inapparent, to mild (e.g., AIDS-related complex), to severe (e.g., wasting syndrome). As an example of a noninfectious disease's spectrum, consider that coronary artery disease ranges from an asymptomatic form (atherosclerosis) to transient myocardial ischemia to myocardial infarctions of various severities.

The Epidemiologic Iceberg

The bulk of a health problem in a population may be hidden from view. This phenomenon, referred to as the "epidemiologic iceberg", applies to infectious, noninfectious, acute, and chronic diseases alike.

Uncovering disease that might otherwise remain "below sea level" through screening and better detection often allows for better control of health problems. Consider that for every completed suicide there are dozens of unsuccessful attempts and a still larger number of people with depressive illness severe enough to make them wish to end their lives. With appropriate treatment, individuals with suicidal tendencies would be less likely to experience suicidal ideation and less likely to attempt suicide. As another example, reported cases of AIDS represent only the tip of the iceberg of HIV infections. With proper antiretroviral therapy, clinical illness may be delayed and transmission averted.

Causal Models

Definition of Cause

A cause of a disease event is an event, condition, or characteristic that preceded the disease and without which the disease event either would not have occurred at all or would not have occurred until some later time. On a population basis, we expect that an increase in the level of a causal factor in a population will be accompanied by an increase in the incidence of disease in that population, ceteris paribus (all other things being equal). We also expect that if the causal factor can be eliminated or diminished, the frequency of disease or its severity will decline.

Component Cause Model (Causal Pies)

Most diseases are caused by the cumulative effect of multiple causal components acting ("interacting") together. Thus, a causal interaction occurs when two or more causal factors act together to bring about an effect. Causal interactions apply to both infectious and noninfectious diseases and explain, for example, why two people exposed to the same cold virus will not necessarily experience the same outcome: one person may develop a cold while the other experiences no ill effects.

Rothman's causal pies help to clarify the contribution of causal components in disease etiology. Figure 2.6 displays two causal mechanisms for a disease. Wedges of each pie represent components of each causal mechanism, corresponding to risk factors we hope to identify. Each pie represents a sufficient causal mechanism, defined as a set of factors that in combination makes disease occurrence inevitable. Each causal component (wedge) plays an essential role in a given causal mechanism (pie); a specific disease may result from a number of different causal mechanisms.

A causal factor is said to be necessary when it is a component of every sufficient mechanism. In other words, a component cause is necessary if the disease cannot occur in its absence. In Figure 2.6, component A is a necessary cause, since it is evident in all possible mechanisms – the disease cannot occur in its absence. Causal components that do not occur in every sufficient mechanism yet are still essential in some cases are said to be contributing component causes. In Figure 2.6, B, C, and D are nonnecessary contributing causal components. Component causes that complete a given causal mechanism (pie) are said to be causal complements. In Figure 2.6, for example, the causal complement of factor A in Mechanism 1 is (B + C); in Mechanism 2, the causal complement of factor A is D. Factors that work together to form a sufficient causal mechanism are said to interact causally.

Causal interactions have direct health relevance. For example, when a person develops an infectious disease, the causal agent must interact with the causal complement known as "susceptibility" to cause the disease. When considering hip fractures in elderly patients, the necessary element of trauma interacts with the causal complement of osteoporosis to cause the hip fracture. In a similar vein, smoking interacts with genetic susceptibility and other environmental factors in causing lung cancer, and dietary excess interacts with lack of exercise, genetic susceptibility, atherosclerosis, and various clotting factors to cause heart attacks. Causal factors rarely act alone.

Causal pies demonstrate that individual risk is an all-or-none phenomenon. In a given individual, a causal mechanism either is or is not completed. This makes it impossible to directly estimate individual risk. In contrast, the notion of average risk is a different matter. Average risk can be estimated directly as the proportion of individuals in a recognizable group who develop a particular condition. For example, if one in ten smokers develops lung cancer over a lifetime, we can say that this population has a lifetime risk for this outcome of one in ten (10%). The effects of a given cause in a population depend on the prevalence of causal complements in that population. The effect of phenylketonuria, for instance, depends not only on the prevalence of the inborn error of metabolism marked by the absence of phenylalanine hydroxylase, but also on the environmental prevalence of foods high in phenylalanine. Similarly, the effects of falls in the elderly depend not only on the opportunity for falling, but also on the prevalence of osteoporosis. The population-wide effects of a pathological factor cannot be predicted without knowledge of the prevalence of its causal complements in the population.

Hogben's example of yellow shank disease in chickens provides a memorable example of how the population effects of a given causal agent cannot be separated from the prevalence of its causal complements. The trait of yellow shank in poultry is a condition expressed only in certain genetic strains of fowl when fed yellow corn. A farmer with a susceptible flock who switches from white corn to yellow corn will perceive the disease to be caused by yellow corn. A farmer who feeds only yellow corn to a flock with multiple strains of chickens, some of which are susceptible to the yellow shank condition, will perceive the condition to be caused by genetics. In fact, the effects of yellow corn cannot be separated from the genetic makeup of the flock, and the effect of the genetic makeup of the flock cannot be separated from the presence of yellow corn in the environment. To ask whether yellow shank disease is environmental or genetic is like asking whether the sound of a faraway drum is caused by the drum or the drummer – one does not act without the other. This is what we mean by causal interaction.

Agent, Host, and Environment

Causal components can be classified as agent, host, or environmental factors. Agents are biological, physical, and chemical factors whose presence, absence, or relative amount (too much or too little) is necessary for disease to occur. Host factors include personal characteristics and behaviors, genetic predispositions, and immunologic and other susceptibility-related factors that influence the likelihood or severity of disease. Host factors can be physiological, anatomical, genetic, behavioral, occupational, or constitutional. Environmental factors are external conditions other than the agent that contribute to the disease process. Environmental factors can be physical, biological, social, economic, or political in nature.

## Basic Concepts in Epidemiology

Risk

Risk, sometimes also referred to as cumulative incidence, is the proportion of persons within a specified population who develop the outcome of interest within a defined time period; all persons under consideration must be free of the outcome of interest at the beginning of that period.

R = New cases / Persons at risk = A/N

where R is the estimated risk; A is the number of new instances of the outcome of interest, often described as new cases; and N is the number of unaffected persons at the beginning of the observation period. It is important to emphasize that at the outset, all persons under consideration must be free of the outcome of interest. The risk of developing the outcome then can range anywhere between 0 and 1. For simplicity, risk often is presented as a percentage by multiplying the proportion by 100.

Example: Vekeman and colleagues were interested in the risk of VTE after total hip or knee arthroplasty and whether the use of anticoagulants to prevent VTEs might induce an unacceptable number of episodes of serious bleeding. Through a large national database, the investigators were able to identify more than 820,000 inpatient hospital stays for adults age 18 years or older who underwent one of these procedures between 2000 and 2008. A total of 8042 VTEs were observed during these hospital stays. The risk of a VTE among total hip or knee replacement admissions, therefore, is:

R = 8042/820,197 = 0.0098 = 0.98%
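The arithmetic above can be sketched in Python. The `risk` helper below is hypothetical (not part of the source text); the figures come from the Vekeman et al. example:

```python
# Risk (cumulative incidence): R = A / N.
# `risk` is a hypothetical helper; figures are from the example above.
def risk(new_cases: int, persons_at_risk: int) -> float:
    """Proportion of initially unaffected persons who develop the outcome."""
    return new_cases / persons_at_risk

r = risk(8042, 820_197)
print(f"R = {r:.4f} = {r:.2%}")  # R = 0.0098 = 0.98%
```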

Prevalence

The proportion of persons within a population who have the condition of interest is referred to as prevalence. Sometimes we designate this proportion further as relating to a specific point in time (point prevalence) or, alternatively, to a particular time period (period prevalence). Prevalence is calculated by dividing the number of affected persons (cases) by the number of persons in the source population.

P = C/N

where P is the prevalence, C is the number of cases, and N is the size of the source population. As with risk, prevalence can range from 0 to 1. We can also express prevalence as a percentage, by multiplying by 100.

Example: Deitelzweig and colleagues were interested in estimating the prevalence of VTE in the United States. For that purpose, they accessed a database that combined commercial insurance claims with those of Medicare beneficiaries for the 5-year period 2002 to 2006. The source population of these databases included 12.7 million persons. Of these persons, 200,007 had VTE, so the 5-year period prevalence was:

P = 200,007/12.7 million = 0.016 = 1.6%

The investigators calculated the 5-year period prevalence separately for DVT, PE, and both DVT and PE. The annual prevalence of VTE was observed to increase progressively over the 5-year study period, with a low of 0.32% in 2002, rising to a high of 0.42% in 2006.
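A minimal Python sketch of the period-prevalence calculation, using the Deitelzweig et al. figures quoted above:

```python
# Prevalence: P = C / N, the 5-year period prevalence from the example.
cases = 200_007
source_population = 12_700_000

p = cases / source_population
print(f"P = {p:.3f} = {p:.1%}")  # P = 0.016 = 1.6%
```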

Incidence Rate

The incidence rate measures the rapidity with which newly diagnosed cases of a disease develop. To estimate the incidence rate, one follows a source population of unaffected persons over time, counts the number of individuals who become newly affected (cases), and expresses that count relative to person-time, a combination of the size of the source population and the time period of observation.

The quantification of person-time may seem a little confusing at first, so let us explore how it is calculated. The goal is to estimate the total amount of disease-free time that subjects in the source population are observed. For example, an individual who is followed for 1 year without developing the condition of interest contributes 1 year of observation. Another person may develop the condition of interest 6 months into the study. Although this individual may be followed for a full year, he or she contributes only a half year of disease-free observation. Disease-free time is then summed over all persons in the source population, yielding the total person-time of observation. Then, we can calculate the incidence rate as:

IR = A/PT

where IR is incidence rate, A is the number of newly diagnosed occurrences of the condition of interest, and PT is the total amount of disease-free observation within the source population.
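To illustrate the person-time bookkeeping, here is a small Python sketch with hypothetical follow-up data mirroring the text's scenario (one subject followed a full disease-free year, one who becomes a case at 6 months, and one more disease-free year):

```python
# Person-time: sum each subject's disease-free follow-up (in years).
# The follow-up data below are hypothetical, for illustration only.
follow_up_years = [1.0, 0.5, 1.0]   # disease-free observation per subject
became_case = [False, True, False]  # whether each subject became a case

person_time = sum(follow_up_years)               # total disease-free time (PT)
incidence_rate = sum(became_case) / person_time  # IR = A / PT
print(person_time, incidence_rate)  # prints 2.5 0.4 (person-years, cases/person-year)
```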

Example: To estimate the incidence rate of VTE in the Canadian province of Quebec, Tagalakis and colleagues accessed health care administrative databases to identify all new cases of DVT or PE between 2000 and 2009. The overall incidence of VTE was found to be:

IR = 91,761 cases/74,297,764 person-years = 0.00124 cases/person-year

To express the incidence rate with fewer decimal places, it is convenient to convert it to 1.24 cases/1000 person-years of observation. An equivalent expression would be 124 cases per 100,000 person-years of observation. In other words, among residents of the province of Quebec during the decade of 2000 to 2009, the overall incidence rate of newly diagnosed VTEs was a little more than one case per 1000 persons followed for 1 year.
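The unit conversion is a simple rescaling; a Python sketch using the Quebec figures:

```python
# IR = A / PT, then rescale per 1000 or per 100,000 person-years.
cases = 91_761
person_years = 74_297_764

ir = cases / person_years
print(f"{ir * 1000:.2f} per 1000 person-years")        # ~1.24
print(f"{ir * 100_000:.0f} per 100,000 person-years")  # ~124
```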

It is important to note that the incidence rate relates to the first occurrence of the disease or condition of interest. VTE is a disorder that can recur, so if all episodes of VTE in a population (both initial and recurrent) are counted, the estimate of the VTE incidence rate will be inflated. To avoid this problem, the investigator must be able to exclude prior diagnoses of VTE when identifying incident cases.

Survival

For diseases, such as VTE, that can have serious impacts on an affected person’s well-being, we may wish to characterize the likelihood of remaining alive, or survival, after a diagnosis. Mathematically, we measure survival as:

S = (A – D)/A

where A is the number of newly diagnosed patients with the condition of interest and D is the number of deaths. Survival is, therefore, a proportion that can range from 0 to 1. We can convert survival to a percentage by multiplying by 100. It is important to recognize that survival is a time-dependent phenomenon; therefore, it is essential to specify a time period for the measurement of survival, such as the 30-day survival or the 1-year survival.

Example: In the study by Tagalakis and colleagues of VTE in Quebec province, patients were followed for survival after their initial diagnosis. Among the 33,447 persons with a PE, there were 5654 deaths within the first 30 days after diagnosis. The 30-day survival, therefore, is calculated as:

S = (33,447 – 5654)/33,447 = 0.83 = 83%
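The same calculation in Python, using the PE figures from the Tagalakis et al. example:

```python
# Survival: S = (A - D) / A, using the PE figures quoted above.
diagnosed = 33_447   # persons with a PE
deaths_30d = 5_654   # deaths within 30 days of diagnosis

s = (diagnosed - deaths_30d) / diagnosed
print(f"30-day survival = {s:.0%}")  # 30-day survival = 83%
```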

Case-Fatality

Another measure of prognosis after a diagnosis is the case-fatality. This metric refers to the proportion (or percentage) of persons with a particular condition who die within a specified period of time. Case-fatality is often incorrectly referred to as a rate or ratio, but it is more accurately described as a risk or probability. It is calculated mathematically as:

CF = Number of deaths / Number of diagnosed persons = D/A

where CF is case-fatality, D is the number of deaths, and A is the number of persons with the condition of interest at the beginning. The case-fatality can range from 0, when no deaths are observed during the specified time frame, to 1, when all affected persons die during the specified time frame.
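As a sketch, the 30-day case-fatality for the PE cohort from the Tagalakis et al. example can be computed from the same two figures used for survival; note that case-fatality and survival over the same period are complementary proportions:

```python
# Case-fatality: CF = D / A, using the PE figures from the survival example.
deaths_30d = 5_654
diagnosed = 33_447

cf = deaths_30d / diagnosed                         # 30-day case-fatality
survival_30d = (diagnosed - deaths_30d) / diagnosed
print(f"CF = {cf:.1%}")                             # CF = 16.9%
assert abs(cf + survival_30d - 1.0) < 1e-12         # CF + S = 1 over the same period
```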