
Analysis of Variance

March 4, 2018 · Medical Statistics

Analysis-of-variance procedures rely on a distribution called the F-distribution, named in honor of Sir Ronald Fisher. A variable is said to have an F-distribution if its distribution has the shape of a special type of right-skewed curve, called an F-curve. There are infinitely many F-distributions, and we identify an F-distribution (and its F-curve) by its number of degrees of freedom, just as we did for t-distributions and chi-square distributions.

An F-distribution, however, has two numbers of degrees of freedom instead of one. Figure 16.1 depicts two different F-curves; one has df = (10, 2), and the other has df = (9, 50). The first number of degrees of freedom for an F-curve is called the degrees of freedom for the numerator, and the second is called the degrees of freedom for the denominator.

Basic properties of F-curves:

  • The total area under an F-curve equals 1.
  • An F-curve starts at 0 on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis as it does so.
  • An F-curve is right skewed.
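As a quick illustration of these properties, here is a minimal sketch (assuming NumPy and SciPy are available) that queries the two F-distributions of Figure 16.1 through scipy.stats.f:

```python
# Minimal sketch: basic properties of the F-curves with df = (10, 2) and df = (9, 50).
import numpy as np
from scipy import stats

for dfn, dfd in [(10, 2), (9, 50)]:
    dist = stats.f(dfn, dfd)                 # F-distribution with df = (dfn, dfd)
    print(f"df = ({dfn}, {dfd}):",
          f"total area = {dist.cdf(np.inf):.1f},",   # property 1: area under the curve is 1
          f"P(F <= 0) = {dist.cdf(0):.1f},",         # the curve starts at 0 on the horizontal axis
          f"P(F > 4) = {dist.sf(4):.4f}")            # a long right tail (right-skewed)
```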

One-Way ANOVA: The Logic

In earlier posts, you learned how to compare two population means, that is, the means of a single variable for two different populations. You studied various methods for making such comparisons, one being the pooled t-procedure.

Analysis of variance (ANOVA) provides methods for comparing several population means, that is, the means of a single variable for several populations. In this section we present the simplest kind of ANOVA, one-way analysis of variance. This type of ANOVA is called one-way analysis of variance because it compares the means of a variable for populations that result from a classification by one other variable, called the factor. The possible values of the factor are referred to as the levels of the factor.

For example, suppose that you want to compare the mean energy consumption by households among the four regions of the United States. The variable under consideration is “energy consumption,” and there are four populations: households in the Northeast, Midwest, South, and West. The four populations result from classifying households in the United States by the factor “region,” whose levels are Northeast, Midwest, South, and West.

One-way analysis of variance is the generalization to more than two populations of the pooled t-procedure (i.e., both procedures give the same results when applied to two populations). As in the pooled t-procedure, we make four assumptions: simple random samples, independent samples, normal populations, and equal population standard deviations. Regarding Assumptions 1 and 2, we note that one-way ANOVA can also be used as a method for comparing several means with a designed experiment. In addition, like the pooled t-procedure, one-way ANOVA is robust to moderate violations of Assumption 3 (normal populations) and is also robust to moderate violations of Assumption 4 (equal standard deviations) provided the sample sizes are roughly equal.

How can the conditions of normal populations and equal standard deviations be checked? Normal probability plots of the sample data are effective in detecting gross violations of normality. Checking equal population standard deviations, however, can be difficult, especially when the sample sizes are small; as a rule of thumb, you can consider that condition met if the ratio of the largest to the smallest sample standard deviation is less than 2. We call that rule of thumb the rule of 2.
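A minimal sketch of the rule of 2, using three small made-up samples (the data below are purely illustrative):

```python
# Minimal sketch: the "rule of 2" for the equal-standard-deviations condition.
import numpy as np

samples = [np.array([15.2, 14.8, 16.1, 15.5]),
           np.array([13.9, 14.4, 15.0, 14.1]),
           np.array([16.3, 15.8, 17.0, 16.5])]

sds = [s.std(ddof=1) for s in samples]        # sample standard deviations
ratio = max(sds) / min(sds)
print(f"largest/smallest sample SD = {ratio:.2f}")
if ratio < 2:
    print("Rule of 2: the equal-standard-deviations condition is plausible.")
else:
    print("Rule of 2: the equal-standard-deviations condition is questionable.")
```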

Another way to assess the normality and equal-standard-deviations assumptions is to perform a residual analysis. In ANOVA, the residual of an observation is the difference between the observation and the mean of the sample containing it. If the normality and equal-standard-deviations assumptions are met, a normal probability plot of (all) the residuals should be roughly linear. Moreover, a plot of the residuals against the sample means should fall roughly in a horizontal band centered and symmetric about the horizontal axis.
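Below is a minimal sketch of such a residual analysis for the same made-up samples, assuming NumPy, SciPy, and matplotlib are available:

```python
# Minimal sketch: ANOVA residual analysis. Each residual is an observation
# minus the mean of the sample that contains it.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

samples = [np.array([15.2, 14.8, 16.1, 15.5]),
           np.array([13.9, 14.4, 15.0, 14.1]),
           np.array([16.3, 15.8, 17.0, 16.5])]

residuals = np.concatenate([s - s.mean() for s in samples])
group_means = np.concatenate([np.full(len(s), s.mean()) for s in samples])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
stats.probplot(residuals, plot=ax1)      # should be roughly linear
ax1.set_title("Normal probability plot of residuals")
ax2.scatter(group_means, residuals)      # should form a horizontal band centered at 0
ax2.axhline(0)
ax2.set_xlabel("sample mean")
ax2.set_ylabel("residual")
plt.show()
```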

The Logic Behind One-Way ANOVA

The reason for the word variance in analysis of variance is that the procedure for comparing the means analyzes the variation in the sample data. To examine how this procedure works, let’s suppose that independent random samples are taken from two populations – say, Populations 1 and 2 – with means 𝜇1 and 𝜇2. Further, let’s suppose that the means of the two samples are xbar1 = 20 and xbar2 = 25. Can we reasonably conclude from these statistics that 𝜇1 ≠ 𝜇2, that is, that the population means are (significantly) different? To answer this question, we must consider the variation within the samples.

The basic idea for performing a one-way analysis of variance to compare the means of several populations:

  • Take independent simple random samples from the populations.
  • Compute the sample means.
  • If the variation among the sample means is large relative to the variation within the samples, conclude that the means of the populations are not all equal (significantly different).

To make this process precise, we need quantitative measures of the variation among the sample means and the variation within the samples. We also need an objective method for deciding whether the variation among the sample means is large relative to the variation within the samples.

Mean Squares and F-Statistic in One-Way ANOVA

As before, when dealing with several populations, we use subscripts on parameters and statistics. Thus, for Population j, we use 𝜇j, xbarj, sj, and nj to denote the population mean, sample mean, sample standard deviation, and sample size, respectively.

We first consider the measure of variation among the sample means. In hypothesis tests for two population means, we measure the variation between the two sample means by calculating their difference, xbar1 – xbar2. When more than two populations are involved, we cannot measure the variation among the sample means simply by taking a difference. However, we can measure that variation by computing the standard deviation or variance of the sample means or by computing any descriptive statistic that measures variation.

In one-way ANOVA, we measure the variation among the sample means by a weighted average of their squared deviations about the mean, xbar, of all the sample data. That measure of variation is called the treatment mean square, MSTR, and is defined as

MSTR = SSTR / (k – 1)

where k denotes the number of populations being sampled and

SSTR = n1(xbar1 – xbar)^2 + n2(xbar2 – xbar)^2 + … + nk(xbark – xbar)^2

The quantity SSTR is called the treatment sum of squares.

We note that MSTR is similar to the sample variance of the sample means. In fact, if all the sample sizes are identical, then MSTR equals that common sample size times the sample variance of the sample means.
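A minimal sketch of the SSTR and MSTR calculations, again with made-up samples:

```python
# Minimal sketch: treatment sum of squares and treatment mean square.
import numpy as np

samples = [np.array([20.1, 19.4, 21.0, 20.3]),
           np.array([24.6, 25.2, 24.9, 25.7]),
           np.array([22.3, 21.8, 23.1, 22.6])]

k = len(samples)
ns = np.array([len(s) for s in samples])          # n1, ..., nk
xbars = np.array([s.mean() for s in samples])     # xbar1, ..., xbark
grand_mean = np.concatenate(samples).mean()       # xbar, the mean of all the sample data

SSTR = np.sum(ns * (xbars - grand_mean) ** 2)     # treatment sum of squares
MSTR = SSTR / (k - 1)                             # treatment mean square
print(f"SSTR = {SSTR:.3f}, MSTR = {MSTR:.3f}")
```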

Next we consider the measure of variation within the samples. This measure is the pooled estimate of the common population variance, 𝜎^2. It is called the error mean square, MSE, and is defined as

MSE = SSE / (n – k)

where n denotes the total number of observations and 

SSE = (n1 – 1)s1^2 + (n2 – 1)s2^2 + … + (nk – 1)sk^2

The quantity SSE is called the error sum of squares. Finally, we consider how to compare the variation among the sample means, MSTR, to the variation within the samples, MSE. To do so, we use the statistic F = MSTR/MSE, which we refer to as the F-statistic. Large values of F indicate that the variation among the sample means is large relative to the variation within the samples and hence that the null hypothesis of equal population means should be rejected.

In summary, the test statistic for a one-way ANOVA is F = MSTR/MSE. When the null hypothesis of equal population means is true, this statistic has an F-distribution with df = (k – 1, n – k), which is the F-curve used to obtain the critical value or P-value of the test.
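The following minimal sketch pulls these pieces together for the made-up samples above, computing MSE, the F-statistic, and its P-value, and cross-checking the result against scipy.stats.f_oneway:

```python
# Minimal sketch: completing the one-way ANOVA.
import numpy as np
from scipy import stats

samples = [np.array([20.1, 19.4, 21.0, 20.3]),
           np.array([24.6, 25.2, 24.9, 25.7]),
           np.array([22.3, 21.8, 23.1, 22.6])]

k = len(samples)
n = sum(len(s) for s in samples)                  # total number of observations
grand_mean = np.concatenate(samples).mean()

SSTR = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
SSE = sum((len(s) - 1) * s.var(ddof=1) for s in samples)
MSTR, MSE = SSTR / (k - 1), SSE / (n - k)

F = MSTR / MSE
p_value = stats.f.sf(F, k - 1, n - k)             # upper-tail area of the F-curve
print(f"F = {F:.3f}, P-value = {p_value:.5f}")
print(stats.f_oneway(*samples))                   # should agree with the values above
```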

Linear Regression

October 16, 2017 · Clinical Trials, Epidemiology, Evidence-Based Medicine, Medical Statistics, Research

The Regression Equation

When analyzing data, it is essential to first construct a graph of the data. A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called bivariate quantitative data.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. In general, an error, e, is the signed vertical distance from the line to a data point. To decide which line fits the data better, we compute the sum of the squared errors. The least-squares criterion says that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note that although we have not used Syy in Formula 14.1, we will use it later.

For a linear regression y = b0 + b1x, y is the dependent variable and x is the independent variable. However, in the context of regression analysis, we usually call y the response variable and x the predictor variable or explanatory variable (because it is used to predict or explain the values of the response variable).
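Formula 14.1 itself is not reproduced in this post, but in the Sxx, Sxy notation it uses, the standard least-squares estimates are b1 = Sxy / Sxx and b0 = ybar – b1·xbar. A minimal sketch with made-up data (assuming NumPy is available):

```python
# Minimal sketch: least-squares slope and intercept from Sxx and Sxy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up response values

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean()) ** 2)          # not needed yet, but used later

b1 = Sxy / Sxx                             # slope of the regression line
b0 = y.mean() - b1 * x.mean()              # y-intercept of the regression line
print(f"yhat = {b0:.3f} + {b1:.3f} x")
```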

Extrapolation

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside the range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

Outliers and Influential Observations

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

A Warning on the Use of Linear Regression

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of b0 and b1 to obtain a regression line for these data points. The result, however, will yield an inappropriate fit by a line when, in fact, a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

The Coefficient of Determination

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, SST. Thus, SST = 𝛴(yi – ybar)^2. If we divide SST by n – 1, we get the sample variance of the observed values of the response variable. So, SST really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, the one corresponding to the data point (xi, yi). The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, yi – ybar. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, yihat – ybar, and the remaining unexplained deviation, yi – yihat. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is 𝛴(yihat – ybar)^2. This measure of variation is called the regression sum of squares, SSR. Thus, SSR = 𝛴(yihat – ybar)^2.

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, SSR / SST. This quantity is called the coefficient of determination and is denoted r^2. Thus, r^2 = SSR / SST. Similarly, the deviation not explained by the regression is yi – yihat, and the amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is 𝛴(yi – yihat)^2. This measure of variation is called the error sum of squares, SSE. Thus, SSE = 𝛴(yi – yihat)^2.

In summary, check Definition 14.6

The coefficient of determination, r^2, is therefore the proportion of variation in the observed values of the response variable that is explained by the regression. The coefficient of determination always lies between 0 and 1. A value of r^2 near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of r^2 near 1 suggests that the regression equation is quite useful for making predictions.

Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. Because of this regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: r^2 = SSR / SST = (SST – SSE) / SST = 1 – SSE / SST. This formula shows that, when expressed as a percentage, the coefficient of determination can also be interpreted as the percentage reduction in the total squared error obtained by using the regression equation instead of the mean, ybar, to predict the observed values of the response variable.
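A minimal sketch of these sums of squares and the regression identity, continuing the made-up data above:

```python
# Minimal sketch: SST, SSR, SSE, the regression identity, and r^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x                         # predicted values

SST = np.sum((y - y.mean()) ** 2)          # total sum of squares
SSR = np.sum((yhat - y.mean()) ** 2)       # regression sum of squares
SSE = np.sum((y - yhat) ** 2)              # error sum of squares

print(f"SST = {SST:.3f}, SSR + SSE = {SSR + SSE:.3f}")             # regression identity
print(f"r^2 = SSR/SST = {SSR / SST:.4f} = 1 - SSE/SST = {1 - SSE / SST:.4f}")
```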

Correlation and Causation

Two variables may have a high correlation without being causally related. From a strong correlation alone, we can only infer that the two variables have a strong tendency to increase (or decrease) simultaneously and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called lurking variables, that cause the changes in the two variables under consideration.


The Regression Model; Analysis of Residuals

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line y = 𝛽0 + 𝛽1x – on which the conditional means of the response variable lie – as the population regression line and to its equation as the population regression equation. Observe that 𝛽0 is the y-intercept of the population regression line and 𝛽1 is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don't violate any of those assumptions too badly.

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants 𝛽0, 𝛽1, and 𝜎 so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎.

Because the parameters 𝛽0, 𝛽1, and 𝜎 are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 to estimate 𝛽0 and we use b1 to estimate 𝛽1. We note that b0 is an unbiased estimator of 𝛽0 and that b1 is an unbiased estimator of 𝛽1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

The statistic used to obtain a point estimate for the common conditional standard deviation 𝜎 is called the standard error of the estimate, se. It can be computed as se = sqrt(SSE / (n – 2)), where SSE is the error sum of squares defined earlier.

Analysis of Residuals

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,

Residual = ei = yi – yihat

We can show that the sum of the residuals is always 0, which, in turn, implies that e(bar) = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by n – 1 instead of n – 2). Thus, the standard error of the estimate is sometimes called the residual standard deviation.

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis. In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation 𝜎. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, let's consider the three plots in Figure 15.6. In Figure 15.6(a), the residuals are scattered about the x-axis (residual = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6(b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6(c) suggests that the conditional standard deviation increases as x increases, indicating that Assumption 2 may be violated.
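A minimal sketch of these diagnostic plots for the made-up data above (assuming SciPy and matplotlib are available):

```python
# Minimal sketch: a residual plot and a normal probability plot of the residuals.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)             # e_i = y_i - yihat_i

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, residuals)                 # residual plot: look for a horizontal band about 0
ax1.axhline(0)
ax1.set_xlabel("x")
ax1.set_ylabel("residual")
stats.probplot(residuals, plot=ax2)       # should be roughly linear
ax2.set_title("Normal probability plot of residuals")
plt.show()
```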


Inferences for the Slope of the Population Regression Line

Suppose that the variables x and y satisfy the assumptions for regression inferences. Then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎. Of particular interest is whether the slope, 𝛽1, of the population regression line equals 0. If 𝛽1 = 0, then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean 𝛽0 and standard deviation 𝜎. Because x does not appear in either of those two parameters, it is useless as a predictor of y.

Of note, although x alone may not be useful for predicting y, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is not useful for predicting y, we really mean that the regression equation with x as the only predictor variable is not useful for predicting y. Conversely, although x alone may be useful for predicting y, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is useful for predicting y, we really mean that the regression equation with x as the only predictor variable is useful for predicting y.

We can decide whether x is useful as a (linear) predictor of y – that is, whether the regression equation has utility – by performing the hypothesis test with null hypothesis H0: 𝛽1 = 0 (x is not useful for predicting y) against the alternative hypothesis Ha: 𝛽1 ≠ 0 (x is useful for predicting y).

We base hypothesis tests for 𝛽1 on the statistic b1. From the assumptions for regression inferences, we can show that the sampling distribution of the slope of the sample regression line is a normal distribution whose mean is the slope, 𝛽1, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standardized variable z = (b1 – 𝛽1) / (𝜎/sqrt(Sxx)) has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, 𝜎, is unknown. We therefore replace 𝜎 with its sample estimate se, the standard error of the estimate. As you might suspect, the resulting variable, t = (b1 – 𝛽1) / (se/sqrt(Sxx)), has a t-distribution with df = n – 2.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis H0: 𝛽1 = 0, we can use the variable t as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the regression t-test.
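A minimal sketch of the regression t-test with the made-up data above; the test statistic is t = b1 / (se/sqrt(Sxx)) with df = n – 2, and the P-value is cross-checked against scipy.stats.linregress:

```python
# Minimal sketch: regression t-test for H0: beta1 = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # standard error of the estimate

t = b1 / (se / np.sqrt(Sxx))              # test statistic with df = n - 2
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.3f}, P-value = {p_value:.5f}")
print(stats.linregress(x, y))             # its pvalue should match
```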

Confidence Intervals for the Slope of the Population Regression Line

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for 𝛽1 is provided by b1. To determine a confidence-interval estimate for 𝛽1, we apply Key Fact 15.4 to obtain Procedure 15.2, called the regression t-interval procedure.
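A minimal sketch of a 95% regression t-interval for 𝛽1, using the standard interval b1 ± t(α/2, df = n – 2) · se/sqrt(Sxx) with the made-up data above:

```python
# Minimal sketch: 95% confidence interval for the slope beta1.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t_crit = stats.t.ppf(0.975, df=n - 2)
margin = t_crit * se / np.sqrt(Sxx)
print(f"95% CI for beta1: ({b1 - margin:.3f}, {b1 + margin:.3f})")
```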

Estimation and Prediction

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) Estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable yphat, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Recalling that 𝛽0 + 𝛽1xp is the conditional mean of the response variable corresponding to the value xp of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the conditional mean t-interval procedure.
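The formula box for this procedure is not reproduced in the post; the sketch below uses the standard conditional mean t-interval, yphat ± t(α/2, df = n – 2) · se · sqrt(1/n + (xp – xbar)^2/Sxx), with the made-up data above and a hypothetical predictor value xp:

```python
# Minimal sketch: 95% conditional mean t-interval at a particular value xp.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
xp = 3.5                                  # hypothetical value of the predictor variable

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

yphat = b0 + b1 * xp
t_crit = stats.t.ppf(0.975, df=n - 2)
margin = t_crit * se * np.sqrt(1 / n + (xp - x.mean()) ** 2 / Sxx)
print(f"95% CI for the conditional mean at x = {xp}: "
      f"({yphat - margin:.3f}, {yphat + margin:.3f})")
```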

Prediction Intervals

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. The term confidence is usually reserved for interval estimates of parameters; the term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable yp – yphat, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a prediction-interval formula. So we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the predicted value t-interval procedure.
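Again, the formula box is not reproduced here; the sketch below uses the standard predicted value t-interval, yphat ± t(α/2, df = n – 2) · se · sqrt(1 + 1/n + (xp – xbar)^2/Sxx), whose extra "1 +" term accounts for the variation of an individual observation about the conditional mean (made-up data, hypothetical xp):

```python
# Minimal sketch: 95% prediction interval at a particular value xp.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
xp = 3.5                                  # hypothetical value of the predictor variable

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

yphat = b0 + b1 * xp
t_crit = stats.t.ppf(0.975, df=n - 2)
margin = t_crit * se * np.sqrt(1 + 1 / n + (xp - x.mean()) ** 2 / Sxx)
print(f"95% prediction interval at x = {xp}: "
      f"({yphat - margin:.3f}, {yphat + margin:.3f})")
```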


Inferences in Correlation

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, 𝜌. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, 𝜌 actually describes the strength of the linear relationship between two variables; r is only an estimate of 𝜌 obtained from sample data.

The population linear correlation coefficient of two variables x and y always lies between -1 and 1. Values of 𝜌 near -1 or 1 indicate a strong linear relationship between the variables, whereas values of 𝜌 near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, 𝜌. Consequently, we can use r as a basis for performing a hypothesis test for 𝜌.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis H0: 𝜌 = 0, we use the t-score as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the correlation t-test.
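A minimal sketch of the correlation t-test with the made-up data above; the test statistic is t = r·sqrt(n – 2)/sqrt(1 – r^2) with df = n – 2, cross-checked against scipy.stats.pearsonr:

```python
# Minimal sketch: correlation t-test for H0: rho = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

r = np.corrcoef(x, y)[0, 1]               # sample linear correlation coefficient
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"r = {r:.4f}, t = {t:.3f}, P-value = {p_value:.5f}")
print(stats.pearsonr(x, y))               # its p-value should match
```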