Month: October 2017

Linear Regression

October 16, 2017 Clinical Trials, Epidemiology, Evidence-Based Medicine, Medical Statistics, Research

The Regression Equation

When analyzing data, it is essential to first construct a graph of the data. A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called bivariate quantitative data.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. In general, an error, e, is the signed vertical distance from the line to a data point. To decide which of two lines fits the data better, we compute the sum of the squared errors for each line. The least-squares criterion is that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note that although we have not used Syy in Formula 14.1, we will use it later.

For a linear regression y = b0 + b1x, y is the dependent variable and x is the independent variable. However, in the context of regression analysis, we usually call y the response variable and x the predictor variable or explanatory variable (because it is used to predict or explain the values of the response variable).
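Formula 14.1 is not reproduced in these notes, so the following is only a minimal sketch that assumes the standard least-squares formulas b1 = Sxy / Sxx and b0 = y(bar) – b1·x(bar), where Sxx and Sxy are the usual sums of squared deviations and of cross-products. The function name fit_line and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x.

    Assumes the standard formulas b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(round(b0, 3), round(b1, 3))
```

For these made-up points the fitted line is approximately y = 2.2 + 0.6x.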

Extrapolation

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside the range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

Outliers and Influential Observations

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

A Warning on the Use of Linear Regression

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of b0 and b1 to obtain a regression line for these data points. The result, however, is an inappropriate fit by a line when in fact a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

The Coefficient of Determination

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, SST. Thus, $\mathrm{SST} = \sum (y_i - \bar{y})^2$. If we divide SST by n – 1, we get the sample variance of the observed values of the response variable. So, SST really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, the one corresponding to the data point $(x_i, y_i)$. The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, $y_i - \bar{y}$. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, $\hat{y}_i - \bar{y}$, and the remaining unexplained deviation, $y_i - \hat{y}_i$. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is $\sum (\hat{y}_i - \bar{y})^2$. This measure of variation is called the regression sum of squares, SSR. Thus, $\mathrm{SSR} = \sum (\hat{y}_i - \bar{y})^2$.

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, SSR / SST. This quantity is called the coefficient of determination and is denoted $r^2$. Thus, $r^2 = \mathrm{SSR} / \mathrm{SST}$. Similarly, the deviation not explained by the regression is $y_i - \hat{y}_i$. The amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is $\sum (y_i - \hat{y}_i)^2$. This measure of variation is called the error sum of squares, SSE. Thus, $\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$.

In summary, see Definition 14.6.

The coefficient of determination, $r^2$, is the proportion of variation in the observed values of the response variable explained by the regression. The coefficient of determination always lies between 0 and 1. A value of $r^2$ near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of $r^2$ near 1 suggests that the regression equation is quite useful for making predictions.

Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. Because of the regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: $r^2 = \mathrm{SSR} / \mathrm{SST} = (\mathrm{SST} - \mathrm{SSE}) / \mathrm{SST} = 1 - \mathrm{SSE} / \mathrm{SST}$. This formula shows that, when expressed as a percentage, the coefficient of determination can also be interpreted as the percentage reduction obtained in the total squared error by using the regression equation instead of the mean, $\bar{y}$, to predict the observed values of the response variable.
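As a small numerical check of the regression identity, the sketch below computes SST, SSR, and SSE for a least-squares line and derives the coefficient of determination both ways. The helper name sums_of_squares and the data are hypothetical, and the line is fitted with the standard least-squares formulas, which are assumed here rather than quoted from the text.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]                       # predicted values
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation
    return sst, ssr, sse


# Hypothetical data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
r_squared_alt = 1 - sse / sst   # same value, via the regression identity
print(round(sst, 6), round(ssr, 6), round(sse, 6), round(r_squared, 6))
```

For these made-up points, SST is 6, SSR is 3.6, and SSE is 2.4, so the regression explains 60% of the variation in y.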

Correlation and Causation

Two variables may have a high correlation without being causally related. From a high correlation alone, we can infer only that the two variables have a strong tendency to increase (or decrease) together and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called lurking variables, that cause the changes in the two variables under consideration.


The Regression Model; Analysis of Residuals

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line y = 𝛽0 + 𝛽1x – on which the conditional means of the response variable lie – as the population regression line and to its equation as the population regression equation. Observe that 𝛽0 is the y-intercept of the population regression line and 𝛽1 is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don't violate any of those assumptions too badly.

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants 𝛽0, 𝛽1, and 𝜎 so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎.

Because the parameters 𝛽0, 𝛽1, and 𝜎 are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 to estimate 𝛽0 and we use b1 to estimate 𝛽1. We note that b0 is an unbiased estimator of 𝛽0 and that b1 is an unbiased estimator of 𝛽1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

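The statistic used to obtain a point estimate for the common conditional standard deviation 𝜎 is called the standard error of the estimate. The standard error of the estimate can be computed by $s_e = \sqrt{\mathrm{SSE} / (n - 2)}$, where $\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$ is the error sum of squares.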

Analysis of Residuals

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,

Residual = $e_i = y_i - \hat{y}_i$

We can show that the sum of the residuals is always 0, which, in turn, implies that e(bar) = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by n – 1 instead of n – 2). Thus, the standard error of the estimate is sometimes called the residual standard deviation.
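The relationship between the residuals and the standard error of the estimate can be checked numerically. This is a minimal sketch on made-up data; it assumes the standard definition of the standard error of the estimate, the square root of SSE / (n – 2), and the function name residuals_and_se is hypothetical.

```python
import math
import statistics


def residuals_and_se(xs, ys):
    """Return (residuals, s_e) for the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus predicted value
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    s_e = math.sqrt(sse / (n - 2))   # divides by n - 2
    return residuals, s_e


# Hypothetical data, not taken from the text
residuals, s_e = residuals_and_se([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
sd_residuals = statistics.stdev(residuals)   # divides by n - 1
print(round(sum(residuals), 9), round(s_e, 4), round(sd_residuals, 4))
```

The residuals sum to zero up to floating-point rounding, and the two standard deviations differ only through the divisor (n – 2 versus n – 1), so the gap shrinks as the sample grows.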

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis. In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation 𝜎. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, let's consider the three plots in Figure 15.6. In Figure 15.6 (a), the residuals are scattered about the x-axis (residuals = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6 (b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6 (c) suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.


Inferences for the Slope of the Population Regression Line

Suppose that the variables x and y satisfy the assumptions for regression inferences. Then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎. Of particular interest is whether the slope, 𝛽1, of the population regression line equals 0. If 𝛽1 = 0, then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean 𝛽0 and standard deviation 𝜎. Because x does not appear in either of those two parameters, it is useless as a predictor of y.

Of note, although x alone may not be useful for predicting y, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is not useful for predicting y, we really mean that the regression equation with x as the only predictor variable is not useful for predicting y. Conversely, although x alone may be useful for predicting y, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is useful for predicting y, we really mean that the regression equation with x as the only predictor variable is useful for predicting y.

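We can decide whether x is useful as a (linear) predictor of y – that is, whether the regression equation has utility – by performing the hypothesis test of H0: 𝛽1 = 0 (x is not useful for predicting y) against Ha: 𝛽1 ≠ 0 (x is useful for predicting y).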

We base the hypothesis test for 𝛽1 on the statistic b1. From the assumptions for regression inferences, we can show that the sampling distribution of the slope of the sample regression line is a normal distribution whose mean is the slope, 𝛽1, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standardized variable

$z = \dfrac{b_1 - \beta_1}{\sigma / \sqrt{S_{xx}}}$, where $S_{xx} = \sum (x_i - \bar{x})^2$,

has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, 𝜎, is unknown. We therefore replace 𝜎 with its sample estimate se, the standard error of the estimate. As you might suspect, the resulting variable has a t-distribution.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis H0: 𝛽1 = 0, we can use the variable t as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the regression t-test.
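Key Fact 15.4 is not reproduced in these notes. The sketch below assumes the standard form of the regression t-test statistic, t = b1 / (se / √Sxx) with n – 2 degrees of freedom, and computes it for made-up data. The function name regression_t_statistic is hypothetical, and the critical value or P-value would still have to come from a t-table or statistical software.

```python
import math


def regression_t_statistic(xs, ys):
    """Return (t, df) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))     # standard error of the estimate
    t = b1 / (s_e / math.sqrt(sxx))    # assumed form of the test statistic
    return t, n - 2


# Hypothetical data, not taken from the text
t, df = regression_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(t, 3), df)
```

With these five made-up points the statistic is about 2.12 on 3 degrees of freedom, which is below the two-tailed 5% critical value of 3.182 for df = 3, so this tiny sample would not lead to rejecting H0.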

Confidence Intervals for the Slope of the Population Regression Line

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for 𝛽1 is provided by b1. To determine a confidence-interval estimate for 𝛽1, we apply Key Fact 15.4 to obtain Procedure 15.2, called the regression t-interval procedure.
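Procedure 15.2 is not shown here, so the following sketch assumes the usual form of the interval, b1 ± t(α/2) · se / √Sxx with df = n – 2. The critical value is passed in by the caller (for example, read from a t-table); the function name and the data are illustrative only.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (lower, upper) endpoints of a confidence interval for beta1.

    t_crit is the t-value for the chosen confidence level with df = n - 2,
    supplied by the caller from a t-table or software.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    margin = t_crit * s_e / math.sqrt(sxx)   # assumed margin of error
    return b1 - margin, b1 + margin


# Hypothetical data; 3.182 is the t-table value for 95% confidence, df = 3
lower, upper = slope_confidence_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 3.182)
print(round(lower, 3), round(upper, 3))
```

An interval that contains 0, as this one does, is consistent with the regression t-test failing to reject H0: 𝛽1 = 0 at the matching significance level.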

Estimation and Prediction

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) Estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable yp^, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Recalling that 𝛽0 + 𝛽1xp is the conditional mean of the response variable corresponding to the value xp of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the conditional mean t-interval procedure.

Prediction Intervals

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. The term confidence is usually reserved for interval estimates of parameters. The term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable yp – yp^, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a prediction-interval formula. So we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the predicted value t-interval procedure.
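To make the difference between the two kinds of interval concrete, the sketch below computes both margins of error at a chosen value xp. It assumes the standard formulas, which are not reproduced in these notes: the conditional mean interval uses se · √(1/n + (xp – x(bar))² / Sxx), and the prediction interval adds 1 under the square root. The function name, the data, and the choice of xp are hypothetical.

```python
import math


def regression_intervals(xs, ys, x_p, t_crit):
    """Return (y_hat_p, mean_margin, prediction_margin) at x = x_p.

    t_crit is the t-value for the chosen level with df = n - 2,
    supplied by the caller from a t-table or software.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    y_hat_p = b0 + b1 * x_p
    leverage = 1 / n + (x_p - x_bar) ** 2 / sxx
    mean_margin = t_crit * s_e * math.sqrt(leverage)            # conditional mean
    prediction_margin = t_crit * s_e * math.sqrt(1 + leverage)  # single new value
    return y_hat_p, mean_margin, prediction_margin


# Hypothetical data; 3.182 is the t-table value for 95% confidence, df = 3
y_hat_p, mean_margin, prediction_margin = regression_intervals(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x_p=3, t_crit=3.182
)
print(round(y_hat_p, 3), round(mean_margin, 3), round(prediction_margin, 3))
```

Both intervals are centered at the predicted value, but the prediction interval is always wider because it also accounts for the variation of an individual observation about its conditional mean.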


Inferences in Correlation

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, 𝜌. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, 𝜌 actually describes the strength of the linear relationship between two variables; r is only an estimate of 𝜌 obtained from sample data.

The population linear correlation coefficient of two variables x and y always lies between -1 and 1. Values of 𝜌 near -1 or 1 indicate a strong linear relationship between the variables, whereas values of 𝜌 near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, 𝜌. Consequently, we can use r as a basis for performing a hypothesis test for 𝜌.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis H0: 𝜌 = 0, we use the t-score as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the correlation t-test.
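Key Fact 15.9 is not reproduced here; the sketch below assumes the standard correlation t-test statistic, t = r / √((1 – r²) / (n – 2)) with n – 2 degrees of freedom. The function name and the data are made up for illustration.

```python
import math


def correlation_t_statistic(xs, ys):
    """Return (r, t, df) for testing H0: rho = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)               # sample linear correlation
    t = r / math.sqrt((1 - r ** 2) / (n - 2))    # assumed form of the statistic
    return r, t, n - 2


# Hypothetical data, not taken from the text
r, t, df = correlation_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4), round(t, 3), df)
```

For a single predictor, this t-value coincides with the regression t-test statistic computed from the same data, which is why the two tests lead to the same decision.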

Inferences for Population Standard Deviations

October 5, 2017 Clinical Trials, Epidemiology, Evidence-Based Medicine, Medical Statistics

Inferences for One Population Standard Deviation

Suppose that we want to obtain information about a population standard deviation. If the population is small, we can often determine 𝜎 exactly by first taking a census and then computing 𝜎 from the population data. However, if the population is large, which is usually the case, a census is generally not feasible, and we must use inferential methods to obtain the required information about 𝜎.

Logic Behind

Recall that to perform a hypothesis test with null hypothesis H0: 𝜇 = 𝜇0 for the mean, 𝜇, of a normally distributed variable, we do not use the variable x(bar) as the test statistic; rather, we use the t-score. Similarly, when performing a hypothesis test with null hypothesis H0: 𝜎 = 𝜎0 for the standard deviation, 𝜎, of a normally distributed variable, we do not use the variable s as the test statistic; rather, we use a modified version of that variable:

$\chi^2 = \dfrac{(n - 1) s^2}{\sigma_0^2}$

This variable has a chi-square distribution.

In light of Key Fact 11.2, for a hypothesis test with null hypothesis H0: 𝜎 = 𝜎0, we can use the variable 𝜒2 as the test statistic and obtain the critical value(s) from the 𝜒2-table. We call this hypothesis-testing procedure the one-standard-deviation 𝜒2-test.
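As a rough illustration of the test statistic, the sketch below assumes the standard form (n – 1)s² / 𝜎0², with n – 1 degrees of freedom, and evaluates it for a made-up sample. The function name chi_square_statistic is hypothetical; critical values would still come from a chi-square table or software.

```python
import statistics


def chi_square_statistic(sample, sigma_0):
    """Return (chi2, df) for testing H0: sigma = sigma_0.

    Assumes the standard statistic (n - 1) * s**2 / sigma_0**2.
    """
    n = len(sample)
    s_squared = statistics.variance(sample)   # sample variance, divisor n - 1
    chi2 = (n - 1) * s_squared / sigma_0 ** 2
    return chi2, n - 1


# Hypothetical sample and hypothesized standard deviation
chi2, df = chi_square_statistic([4, 7, 6, 5, 8], sigma_0=2)
print(chi2, df)
```

Here the sample variance is 2.5, so the statistic is 4 × 2.5 / 4 = 2.5 on 4 degrees of freedom.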

Procedure 11.1 gives a step-by-step method for performing a one-standard-deviation 𝜒2-test by using either the critical-value approach or the P-value approach. For the P-value approach, we could use the 𝜒2-table to estimate the P-value, but to do so is awkward and tedious; thus, we recommend using statistical software.

Unlike the z-tests and t-tests for one and two population means, the one-standard-deviation 𝜒2-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed or very nearly so.

Consequently, before applying Procedure 11.1, construct a normal probability plot. If the plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.1. We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for a population standard deviation. If you have doubts about the normality of the variable under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for a population standard deviation.

In addition, using Key Fact 11.2, we can also obtain a confidence-interval procedure for a population standard deviation. We call this procedure the one-standard-deviation 𝜒2-interval procedure and present it as Procedure 11.2. This procedure is also known as the 𝜒2-interval procedure for one population standard deviation. This confidence-interval procedure is often formulated in terms of variance instead of standard deviation. Like the one-standard-deviation 𝜒2-test, this procedure is not at all robust to violations of the normality assumption.


Inferences for Two Population Standard Deviations, Using Independent Samples

We now introduce hypothesis tests and confidence intervals for two population standard deviations. More precisely, we examine inferences to compare the standard deviations of one variable of two different populations. Such inferences are based on a distribution called the F-distribution. In many statistical analyses that involve the F-distribution, we also need to determine F-values having areas 0.005, 0.01, 0.025, and 0.10 to their left. Although such F-values aren't available directly from Table VIII, we can obtain them indirectly from the table by using Key Fact 11.4.

Logic Behind

To perform hypothesis tests and obtain confidence intervals for two population standard deviations, we need Key Fact 11.5, that is, the distribution of the F-statistic for comparing two population standard deviations. By definition, the F-statistic is the ratio of the two sample variances, $F = s_1^2 / s_2^2$.

In light of Key Fact 11.5, for a hypothesis test with null hypothesis H0: 𝜎1 = 𝜎2 (population standard deviations are equal), we can use the variable $F = s_1^2 / s_2^2$ as the test statistic and obtain the critical value(s) from the F-table. We call this hypothesis-testing procedure the two-standard-deviations F-test. Procedure 11.3 gives a step-by-step method for performing a two-standard-deviations F-test by using either the critical-value approach or the P-value approach.

For the P-value approach, we could use the F-table to estimate the P-value, but to do so is awkward and tedious; thus, we recommend using statistical software.
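The F-statistic itself is simple to compute. The following minimal sketch takes the ratio of the two sample variances and returns it together with the numerator and denominator degrees of freedom, which are assumed to be n1 – 1 and n2 – 1. The function name and both samples are made up; the critical values or P-value would still come from an F-table or software.

```python
import statistics


def f_statistic(sample_1, sample_2):
    """Return (F, df_numerator, df_denominator) for testing H0: sigma1 = sigma2."""
    var_1 = statistics.variance(sample_1)   # s1 squared, divisor n1 - 1
    var_2 = statistics.variance(sample_2)   # s2 squared, divisor n2 - 1
    return var_1 / var_2, len(sample_1) - 1, len(sample_2) - 1


# Hypothetical independent samples
f_value, df_num, df_den = f_statistic([4, 7, 6, 5, 8], [5, 6, 6, 7, 6])
print(f_value, df_num, df_den)
```

For these samples the variances are 2.5 and 0.5, so F = 5 with (4, 4) degrees of freedom. A value far from 1 in either direction is evidence against equal standard deviations.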

Unlike the z-tests and t-tests for one and two population means, the two-standard-deviations F-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed, or very nearly so, on each population.

Consequently, before applying Procedure 11.3, construct a normal probability plot of each sample. If either plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.3.

We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for comparing two population standard deviations. If you have doubts about the normality of the variable on the two populations under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for two population standard deviations.

Using Key Fact 11.5, we can also obtain a confidence-interval procedure, Procedure 11.4, for the ratio of two population standard deviations. We call it the two-standard-deviations F-interval procedure. It is also known as the F-interval procedure for two population standard deviations and the two-sample F-interval procedure. This confidence-interval procedure is often formulated in terms of variances instead of standard deviations.

To interpret confidence intervals for the ratio 𝜎1 / 𝜎2 of two population standard deviations, considering three cases is helpful.

Case 1: The endpoints of the confidence interval are both greater than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 5 to 8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 5 and 8 or, equivalently, 5𝜎2 < 𝜎1 < 8𝜎2. Thus, we can be 95% confident that 𝜎1 is somewhere between 5 and 8 times greater than 𝜎2.

Case 2: The endpoints of the confidence interval are both less than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 0.5 to 0.8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 0.5 and 0.8 or, equivalently, 0.5𝜎2 < 𝜎1 < 0.8𝜎2. Thus, noting that 1/0.5 = 2 and 1/0.8 = 1.25, we can be 95% confident that 𝜎1 is somewhere between 1.25 and 2 times less than 𝜎2.

Case 3: One endpoint of the confidence interval is less than 1 and the other is greater than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 0.5 to 8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 0.5 and 8 or, equivalently, 0.5𝜎2 < 𝜎1 < 8𝜎2. Thus, we can be 95% confident that 𝜎1 is somewhere between 2 times less than and 8 times greater than 𝜎2.

Sampling

October 2, 2017 Clinical Trials, Medical Statistics, Research

If the information you need is not already available from a previous study, you might acquire it by conducting a census – that is, by obtaining information for the entire population of interest. However, conducting a census may be time consuming, costly, impractical, or even impossible.

Two methods other than a census for obtaining information are sampling and experimentation. If sampling is appropriate, you must decide how to select the sample; that is, you must choose the method for obtaining a sample from the population. Because the sample will be used to draw conclusions about the entire population, it should be a representative sample – that is, it should reflect as closely as possible the relevant characteristics of the population under consideration.

For instance, using the average weight of a sample of professional football players to make an inference about the average weight of all adult males would be unreasonable. Nor would it be reasonable to estimate the median income of California residents by sampling the incomes of Beverly Hills residents.

Most modern sampling procedures involve the use of probability sampling. In probability sampling, a random device – such as tossing a coin, consulting a table of random numbers, or employing a random-number generator – is used to decide which members of the population will constitute the sample instead of leaving such decisions to human judgment.

PS: Probability sampling is based on the fact that every member of a population has a known, nonzero chance of being selected; in the simplest designs, that chance is the same for every member. For example, if you selected one person at random from a population of 100 people, each person would have a 1-in-100 chance of being chosen. With non-probability sampling, those chances are unknown and generally unequal. For example, a person might have a better chance of being chosen if they live close to the researcher or have access to a computer. Probability sampling gives you the best chance to create a sample that is truly representative of the population.

The use of probability sampling may still yield a nonrepresentative sample. However, probability sampling helps eliminate unintentional selection bias and permits the researcher to control the chance of obtaining a nonrepresentative sample. Furthermore, the use of probability sampling guarantees that the techniques of inferential statistics can be applied.

Simple Random Sampling

The inferential techniques considered most often are intended for use with only one particular sampling procedure: simple random sampling. Simple random sampling is a sampling procedure for which each possible sample of a given size is equally likely to be the one obtained. A simple random sample is a sample obtained by simple random sampling.

There are two types of simple random sampling. One is simple random sampling with replacement (SRSWR), whereby a member of the population can be selected more than once; the other is simple random sampling without replacement (SRS), whereby a member of the population can be selected at most once. Unless we specify otherwise, assume that simple random sampling is done without replacement. Tools for carrying out simple random sampling include random-number tables and random-number generators.
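A random-number generator makes both kinds of simple random sampling easy to try out. The sketch below uses Python's standard random module: random.sample draws without replacement and random.choices draws with replacement. The wrapper function, the numbered population of 100 members, and the seed are assumptions made only for illustration.

```python
import random


def simple_random_sample(population, size, replacement=False, seed=None):
    """Draw a simple random sample, with or without replacement."""
    rng = random.Random(seed)   # seeded generator so a run can be repeated
    if replacement:
        return rng.choices(population, k=size)   # SRSWR: repeats are possible
    return rng.sample(population, size)          # SRS: each member at most once


# Hypothetical population: members numbered 1 to 100
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
srswr = simple_random_sample(population, 10, replacement=True, seed=1)
print(srs)
print(srswr)
```

Without replacement, the ten selected numbers are always distinct; with replacement, the same member may appear more than once.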

Systematic Random Sampling

Simple random sampling is the most natural and easily understood method of probability sampling – it corresponds to our intuitive notion of random selection by lot. However, simple random sampling does have drawbacks. For instance, it may fail to provide sufficient coverage when information about subpopulations is required and may be impractical when the members of the population are widely scattered geographically.

One method that takes less effort to implement than simple random sampling is systematic random sampling. Procedure 1.1 presents a step-by-step method for implementing systematic random sampling.

Systematic random sampling is easier to execute than simple random sampling and usually provides comparable results. The exception is the presence of some kind of cyclical pattern in the listing of the members of the population (e.g., male, female, male, female, …), a phenomenon that is relatively rare.
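Procedure 1.1 is not reproduced in these notes. The sketch below assumes the usual steps: divide the population size by the sample size and round down to get m, pick a random start k between 1 and m, then take members k, k + m, k + 2m, and so on. The function name, the numbered population, and the seed are hypothetical.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Select every m-th member of a listed population after a random start."""
    rng = random.Random(seed)
    m = len(population) // sample_size   # sampling interval, rounded down
    k = rng.randint(1, m)                # random start between 1 and m inclusive
    # Members numbered k, k + m, k + 2m, ... (the list itself is 0-indexed)
    return [population[k - 1 + i * m] for i in range(sample_size)]


# Hypothetical population: members numbered 1 to 100
population = list(range(1, 101))
sample = systematic_sample(population, 10, seed=3)
print(sample)
```

Only one random number is needed, which is why the method is easier to carry out than simple random sampling. The fixed spacing is also why a cyclical pattern in the listing can distort the sample.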

Cluster Sampling

Another sampling method is cluster sampling, which is particularly useful when the members of the population are widely scattered geographically. Procedure 1.2 provides a step-by-step method for implementing cluster sampling.

Many years ago, citizens' groups pressured the city council of Tempe, Arizona, to install bike paths in the city. The council members wanted to be sure that they were supported by a majority of the taxpayers, so they decided to poll the city's homeowners. Their first survey of public opinion was a questionnaire mailed out with the city's 18,000 homeowner water bills. Unfortunately, this method did not work very well. Only 19.4% of the questionnaires were returned, and a large number of those had written comments that indicated they came from avid bicyclists or from people who strongly resented bicyclists. The city council realized that the questionnaire generally had not been returned by the average homeowner.

An employee in the city's planning department had sample survey experience, so the council asked her to do a survey. She was given two assistants to help her interview 300 homeowners and 10 days to complete the project. The planner first considered taking a simple random sample of 300 homes: 100 interviews for herself and for each of her two assistants. However, the city was so spread out that an interviewer of 100 randomly scattered homeowners would have to drive an average of 18 minutes from one interview to the next. Doing so would require approximately 30 hours of driving time for each interviewer and could delay completion of the report. The planner needed a different sampling design.

Although cluster sampling can save time and money, it does have disadvantages. Ideally, each cluster should mirror the entire population. In practice, however, members of a cluster may be more homogeneous than the members of the entire population, which can cause problems.
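Procedure 1.2 is not shown in these notes. The sketch below assumes the usual form of cluster sampling: divide the population into clusters, take a simple random sample of clusters, and use every member of the chosen clusters. The function name, the neighborhood clusters, and the household labels are all made up for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and return (chosen names, all their members)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # SRS of cluster names
    # Every member of each chosen cluster is included in the sample
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample


# Hypothetical clusters: neighborhoods and the households in each
clusters = {
    "north": ["N1", "N2", "N3"],
    "south": ["S1", "S2"],
    "east": ["E1", "E2", "E3", "E4"],
    "west": ["W1", "W2", "W3"],
}
chosen, sample = cluster_sample(clusters, 2, seed=7)
print(chosen, sample)
```

Because whole clusters are taken, the final sample size depends on which clusters are drawn, and interviewers only need to travel to the chosen areas.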

Stratified Sampling

Another sampling method, known as stratified sampling, is often more reliable than cluster sampling. In stratified sampling, the population is first divided into subpopulations, called strata, and then sampling is done from each stratum. Ideally, the members of each stratum should be homogeneous relative to the characteristic under consideration.

In stratified sampling, the strata are often sampled in proportion to their size, which is called proportional allocation. Procedure 1.3 presents a step-by-step method for implementing stratified (random) sampling with proportional allocation.
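Procedure 1.3 is not reproduced here. As a minimal sketch of proportional allocation, the code below gives each stratum a share of the sample equal to its share of the population and then takes a simple random sample within each stratum. The function name and the three strata are hypothetical, and the per-stratum sizes are simply rounded, so in general they may not add up exactly to the requested sample size.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Sample each stratum in proportion to its size (proportional allocation)."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum share of the sample; rounding is an assumption of this sketch
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)   # SRS within the stratum
    return sample


# Hypothetical strata of sizes 60, 30, and 10
strata = {
    "a": [f"a{i}" for i in range(60)],
    "b": [f"b{i}" for i in range(30)],
    "c": [f"c{i}" for i in range(10)],
}
sample = stratified_sample(strata, 10, seed=5)
print({name: len(members) for name, members in sample.items()})
```

With strata of 60, 30, and 10 members and a sample of 10, the allocation is 6, 3, and 1, so every stratum is represented in proportion to its size.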

Multistage Sampling

Most large-scale surveys combine one or more of simple random sampling, systematic random sampling, cluster sampling, and stratified sampling. Such multistage sampling is used frequently by pollsters and government agencies.