**The Regression Equation**

When analyzing data, it is essential to first construct a graph of the data. A **scatterplot** is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called **bivariate quantitative data**.

To measure quantitatively how well a line fits the data, we first consider the errors, *e*, made in using the line to predict the *y*-values of the data points. In general, an error, *e*, is the signed vertical distance from the line to a data point. To decide which of two lines fits the data better, we compute the sum of the squared errors for each. The **least-squares criterion** states that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.
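To make the criterion concrete, here is a minimal sketch in Python (with made-up data points and two hypothetical candidate lines) of comparing lines by their sums of squared errors:

```python
# Compare two candidate lines by their sum of squared errors (SSE).
# The data points and the candidate lines are hypothetical illustrations.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

def sum_squared_errors(b0, b1, x, y):
    """Sum of squared vertical distances from the line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sum_squared_errors(0.0, 2.0, x, y))  # SSE for the line y = 2x
print(sum_squared_errors(1.0, 1.5, x, y))  # SSE for the line y = 1 + 1.5x
```

By the least-squares criterion, whichever candidate line yields the smaller printed value fits these data points better.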

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note: although we have not used *S*_{yy} in Formula 14.1, we will use it later.

For a linear regression *y* = *b*_{0} + *b*_{1}*x*, *y* is the dependent variable and *x* is the independent variable. However, in the context of regression analysis, we usually call *y* the **response variable** and *x* the **predictor variable** or **explanatory variable** (because it is used to predict or explain the values of the response variable).
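Formula 14.1 itself is not reproduced here, but a sketch of the standard least-squares computation it describes (assuming the usual definitions *S*_{xx} = 𝛴(*x*_{i} – *x̄*)^{2} and *S*_{xy} = 𝛴(*x*_{i} – *x̄*)(*y*_{i} – *ȳ*)) looks like this in Python:

```python
# Least-squares estimates of the slope (b1) and y-intercept (b0),
# using the standard formulas b1 = Sxy / Sxx and b0 = ybar - b1 * xbar.
# The data are hypothetical.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx         # slope of the sample regression line
b0 = ybar - b1 * xbar  # y-intercept of the sample regression line
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```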

**Extrapolation**

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, making predictions outside that range – a practice called **extrapolation** – may not be reasonable, because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

**Outliers and Influential Observations**

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

**A Warning on the Use of Linear Regression**

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of *b*_{0} and *b*_{1} to obtain a regression line for these data points. The result, however, will yield an inappropriate fit by a line when in fact a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

**The Coefficient of Determination**

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, *SST*. Thus, *SST* = 𝛴(*y*_{i} – *ȳ*)^{2}. If we divide *SST* by *n* – 1, we get the sample variance of the observed values of the response variable. So, *SST* really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, the one corresponding to the data point (*x*_{i}, *y*_{i}). The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, *y*_{i} – *ȳ*. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, *ŷ*_{i} – *ȳ*, and the remaining unexplained deviation, *y*_{i} – *ŷ*_{i}. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is 𝛴(*ŷ*_{i} – *ȳ*)^{2}. This measure of variation is called the regression sum of squares, *SSR*. Thus, *SSR* = 𝛴(*ŷ*_{i} – *ȳ*)^{2}.

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, *SSR*/*SST*. This quantity is called the coefficient of determination and is denoted *r*^{2}. Thus, **r^{2} = SSR/SST**. In a similar manner, we consider the deviation not explained by the regression, *y*_{i} – *ŷ*_{i}. The amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is 𝛴(*y*_{i} – *ŷ*_{i})^{2}. This measure of variation is called the error sum of squares, *SSE*. Thus, *SSE* = 𝛴(*y*_{i} – *ŷ*_{i})^{2}.

In summary, see Definition 14.6.

The coefficient of determination, *r*^{2}, is the proportion of variation in the observed values of the response variable explained by the regression. The coefficient of determination always lies between 0 and 1. A value of *r*^{2} near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of *r*^{2} near 1 suggests that the regression equation is quite useful for making predictions.
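A short Python sketch (continuing the hypothetical data and fitted line from the earlier sketches) of computing the three sums of squares and *r*^{2}:

```python
# Compute SST, SSR, SSE, and the coefficient of determination r^2
# for the least-squares line fitted to hypothetical data.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                     # predicted values
SST = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares

print(f"SST={SST:.4f}  SSR={SSR:.4f}  SSE={SSE:.4f}  r^2={SSR / SST:.4f}")
```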

**Regression Identity**

**The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE**. Because of the regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: *r*^{2} = *SSR*/*SST* = (*SST* – *SSE*)/*SST* = 1 – *SSE*/*SST*. This formula shows that, when expressed as a percentage, we can also interpret the coefficient of determination as the percentage reduction obtained in the total squared error by using the regression equation instead of the mean, *ȳ*, to predict the observed values of the response variable.
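The identity holds because, for the least-squares line, the cross term in the expansion of *SST* vanishes; a sketch of the algebra:

```latex
\begin{align*}
SST &= \sum_i (y_i - \bar{y})^2
     = \sum_i \bigl[(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\bigr]^2 \\
    &= \sum_i (\hat{y}_i - \bar{y})^2
     + 2 \sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
     + \sum_i (y_i - \hat{y}_i)^2 \\
    &= SSR + 0 + SSE.
\end{align*}
% The cross term is zero because the least-squares residuals
% e_i = y_i - \hat{y}_i satisfy  \sum_i e_i = 0  and  \sum_i x_i e_i = 0.
```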

**Correlation and Causation**

Two variables may have a high correlation without being causally related. Rather, from a high correlation alone we can infer only that the two variables have a strong tendency to increase (or decrease) simultaneously and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called **lurking variables**, that cause the changes in the two variables under consideration.

**The Regression Model; Analysis of Residuals**

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line *y* = 𝛽_{0} + 𝛽_{1}*x* – on which the conditional means of the response variable lie – as the **population regression line** and to its equation as the population regression equation. Observe that 𝛽_{0} is the *y*-intercept of the population regression line and 𝛽_{1} is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don't violate any of those assumptions too badly.

**Estimating the Regression Parameters**

Suppose that we are considering two variables, *x* and *y*, for which the assumptions for regression inferences are met. Then there are constants 𝛽_{0}, 𝛽_{1}, and 𝜎 so that, for each value *x* of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽_{0} + 𝛽_{1}*x* and standard deviation 𝜎.

Because the parameters 𝛽_{0}, 𝛽_{1}, and 𝜎 are usually unknown, we must estimate them from sample data. We use the *y*-intercept and slope of a sample regression line as point estimates of the *y*-intercept and slope, respectively, of the population regression line; that is, we use *b*_{0} to estimate 𝛽_{0} and we use *b*_{1} to estimate 𝛽_{1}. We note that *b*_{0} is an unbiased estimator of 𝛽_{0} and that *b*_{1} is an unbiased estimator of 𝛽_{1}.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

The statistic used to obtain a point estimate for the common conditional standard deviation 𝜎 is called the **standard error of the estimate**. It is computed as *s*_{e} = √(*SSE*/(*n* – 2)).

**Analysis of Residuals**

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a **residual**, generically denoted *e*. Thus,

Residual = *e*_{i} = *y*_{i} – *ŷ*_{i}

We can show that the sum of the residuals is always 0, which, in turn, implies that *ē* = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by *n* – 1 instead of *n* – 2). Thus, the standard error of the estimate is sometimes called the **residual standard deviation**.
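A minimal Python sketch (again with the hypothetical data from earlier) of computing the residuals, verifying that they sum to 0, and computing the standard error of the estimate:

```python
import math

# Hypothetical data and the least-squares fit computed as before.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(sum(residuals))          # essentially 0, up to rounding error

SSE = sum(e ** 2 for e in residuals)
se = math.sqrt(SSE / (n - 2))  # standard error of the estimate
print(se)
```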

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of **Assumption 1**, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the *x*-axis. In light of **Assumption 2**, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of **Assumption 3**, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the *x*-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation 𝜎. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a **residual plot**, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, let's consider the three plots in Figure 15.6. In Figure 15.6(a), the residuals are scattered about the *x*-axis (residuals = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6(b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6(c) suggests that the conditional standard deviations increase as *x* increases, indicating that Assumption 2 may be violated.
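A sketch of producing such a residual plot, assuming matplotlib is available (the data and fitted coefficients are the hypothetical ones used throughout):

```python
import matplotlib.pyplot as plt

# Hypothetical data and fitted coefficients carried over from the
# earlier sketches (b0 = 0.14, b1 = 1.96 for these data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = 0.14, 1.96

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, color="gray")  # the horizontal axis the residuals should straddle
plt.xlabel("x (predictor variable)")
plt.ylabel("residual e")
plt.title("Residual plot")
plt.show()
```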

**Inferences for the Slope of the Population Regression Line**

Suppose that the variables *x* and *y* satisfy the assumptions for regression inferences. Then, for each value *x* of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽_{0} + 𝛽_{1}*x* and standard deviation 𝜎. Of particular interest is whether the slope, 𝛽_{1}, of the population regression line equals 0. If 𝛽_{1} = 0, then, for each value *x* of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean 𝛽_{0} and standard deviation 𝜎. Because *x* does not appear in either of those two parameters, it is useless as a predictor of *y*.

Of note, although *x* alone may not be useful for predicting *y*, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that *x* is not useful for predicting *y*, we really mean that the regression equation with *x* as the only predictor variable is not useful for predicting *y*. Conversely, although *x* alone may be useful for predicting *y*, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that *x* is useful for predicting *y*, we really mean that the regression equation with *x* as the only predictor variable is useful for predicting *y*.

We can decide whether *x* is useful as a (linear) predictor of *y* – that is, whether the regression equation has utility – by performing the hypothesis test *H*_{0}: 𝛽_{1} = 0 (*x* is not useful for predicting *y*) versus *H*_{a}: 𝛽_{1} ≠ 0 (*x* is useful for predicting *y*).

We base the hypothesis test for 𝛽_{1} on the statistic *b*_{1}. From the assumptions for regression inferences, we can show that the **sampling distribution of the slope of the regression line** is a normal distribution whose mean is the slope, 𝛽_{1}, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standardized variable *z* = (*b*_{1} – 𝛽_{1})/(𝜎/√*S*_{xx}) has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, 𝜎, is unknown. We therefore replace 𝜎 with its sample estimate *s*_{e}, the standard error of the estimate. As you might suspect, the resulting variable has a *t*-distribution.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis *H*_{0}: 𝛽_{1} = 0, we can use the variable *t* as the test statistic and obtain the critical values or *P*-value from the *t*-table. We call this hypothesis-testing procedure the **regression t-test**.
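A sketch of the regression t-test, assuming (as is standard, and consistent with Key Fact 15.4) the test statistic *t* = *b*_{1}/(*s*_{e}/√*S*_{xx}) with df = *n* – 2, and using scipy for the *P*-value:

```python
import math
from scipy import stats

# Hypothetical data; the fit and standard error are computed as in the
# earlier sketches.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(SSE / (n - 2))                # standard error of the estimate

t0 = b1 / (se / math.sqrt(Sxx))              # regression t-test statistic
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)  # two-tailed P-value
print(t0, p_value)
```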

**Confidence Intervals for the Slope of the Population Regression Line**

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for 𝛽_{1} is provided by *b*_{1}. To determine a confidence-interval estimate for 𝛽_{1}, we apply Key Fact 15.4 to obtain Procedure 15.2, called the **regression t-interval procedure**.
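A short sketch of the interval, reusing the names (*b*_{1}, *s*_{e}, *S*_{xx}, *n*) from the t-test sketch above; the hard-coded numbers are the hypothetical values computed there:

```python
import math
from scipy import stats

# Hypothetical values carried over from the regression t-test sketch.
n, Sxx, b1, se = 5, 10.0, 1.96, 0.1751

t_crit = stats.t.ppf(0.975, df=n - 2)  # t_{alpha/2} for a 95% interval
half_width = t_crit * se / math.sqrt(Sxx)
print(f"95% CI for beta_1: ({b1 - half_width:.3f}, {b1 + half_width:.3f})")
```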

**Estimation and Prediction**

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable *ŷ*_{p}, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace 𝜎 by its estimate *s*_{e}, the standard error of the estimate. The resulting variable has a *t*-distribution.

Recalling that 𝛽_{0} + 𝛽_{1}*x* is the conditional mean of the response variable corresponding to the value *x*_{p} of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the **conditional mean t-interval procedure**.
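A sketch of the interval, assuming Key Fact 15.6 yields the standard form *ŷ*_{p} ± *t*_{𝛼/2} · *s*_{e} · √(1/*n* + (*x*_{p} – *x̄*)^{2}/*S*_{xx}) with df = *n* – 2 (the numerical values are the hypothetical ones from the earlier sketches):

```python
import math
from scipy import stats

# Hypothetical values carried over from the earlier sketches.
n, xbar, Sxx = 5, 3.0, 10.0
b0, b1, se = 0.14, 1.96, 0.1751

xp = 3.5               # particular value of the predictor variable
yhat_p = b0 + b1 * xp  # point estimate of the conditional mean

t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * se * math.sqrt(1 / n + (xp - xbar) ** 2 / Sxx)
print(f"95% CI for the conditional mean at x = {xp}: "
      f"({yhat_p - half_width:.3f}, {yhat_p + half_width:.3f})")
```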

**Prediction Intervals**

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. However, the term confidence is usually reserved for interval estimates of parameters, whereas the term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable *y*_{p} – *ŷ*_{p}, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a prediction-interval formula. So we replace 𝜎 by its estimate *s*_{e}, the standard error of the estimate. The resulting variable has a *t*-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the **predicted value t-interval procedure**.
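A sketch of the prediction interval, assuming Key Fact 15.8 yields the standard form *ŷ*_{p} ± *t*_{𝛼/2} · *s*_{e} · √(1 + 1/*n* + (*x*_{p} – *x̄*)^{2}/*S*_{xx}). Note the extra 1 under the square root: it accounts for the variation of a single observation about the conditional mean and makes a prediction interval wider than the corresponding confidence interval.

```python
import math
from scipy import stats

# Hypothetical values carried over from the conditional-mean sketch.
n, xbar, Sxx = 5, 3.0, 10.0
b0, b1, se = 0.14, 1.96, 0.1751

xp = 3.5
yhat_p = b0 + b1 * xp  # predicted value of the response variable

t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * se * math.sqrt(1 + 1 / n + (xp - xbar) ** 2 / Sxx)
print(f"95% prediction interval at x = {xp}: "
      f"({yhat_p - half_width:.3f}, {yhat_p + half_width:.3f})")
```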

**Inferences in Correlation**

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between the two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the **population linear correlation coefficient**, 𝜌. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, *r*, measures the linear correlation of a sample of pairs. Thus, 𝜌 actually describes the strength of the linear relationship between two variables; *r* is only an estimate of 𝜌 obtained from sample data.

The population linear correlation coefficient of two variables *x* and *y* always lies between -1 and 1. Values of 𝜌 near -1 or 1 indicate a strong linear relationship between the variables, whereas values of 𝜌 near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, *r*, is an estimate of the population linear correlation coefficient, 𝜌. Consequently, we can use *r* as a basis for performing a hypothesis test for 𝜌.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis *H*_{0}: 𝜌 = 0, we use the *t*-score as the test statistic and obtain the critical values or *P*-value from the *t*-table. We call this hypothesis-testing procedure the **correlation t-test**.
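A sketch of the correlation t-test, assuming Key Fact 15.9 gives the standard statistic *t* = *r*√(*n* – 2)/√(1 – *r*^{2}) with df = *n* – 2 (the values of *r* and *n* are hypothetical):

```python
import math
from scipy import stats

# Correlation t-test: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2.
r, n = 0.93, 15  # hypothetical sample linear correlation and sample size

t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)  # two-tailed P-value
print(t0, p_value)
```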