## Type I and Type II Error in Statistics

We often use inferential statistics to make decisions or judgments about the value of a parameter, such as a population mean. For example, we might need to decide whether the mean weight, 𝜇, of all bags of pretzels packaged by a particular company differs from the advertised weight of 454 grams, or we might want to determine whether the mean age, 𝜇, of all cars in use has increased from the year 2000 mean of 9.0 years. One of the most commonly used methods for making such decisions or judgments is to perform a hypothesis test. A hypothesis is a statement that something is true. For example, the statement "the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g" is a hypothesis. Typically, a hypothesis test involves two hypotheses: the null hypothesis and the alternative hypothesis (or research hypothesis), which we define as follows. For instance, in the pretzel packaging example, the null hypothesis might be "the mean weight of all bags of pretzels packaged equals the advertised weight of 454 g," and the alternative hypothesis might be "the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g."

The first step in setting up a hypothesis test is to decide on the null hypothesis and the alternative hypothesis. Generally, the null hypothesis for a hypothesis test concerning a population mean, 𝜇, always specifies a single value for that parameter. Hence, we can express the null hypothesis as

H0: 𝜇 = 𝜇0

The choice of the alternative hypothesis depends on and should reflect the purpose of the hypothesis test. Three choices are possible for the alternative hypothesis.

• If the primary concern is deciding whether a population mean, 𝜇, is different from a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 ≠ 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a two-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is less than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 < 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a left-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is greater than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 > 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a right-tailed test.

Note: A hypothesis test is called a one-tailed test if it is either left-tailed or right-tailed. After we have chosen the null and alternative hypotheses, we must decide whether to reject the null hypothesis in favor of the alternative hypothesis. Roughly, the procedure is to take a random sample from the population: if the sample data are consistent with the null hypothesis, we do not reject it; if they are inconsistent with the null hypothesis, we reject it in favor of the alternative hypothesis. In practice, of course, we must have a precise criterion for deciding whether to reject the null hypothesis, which involves a test statistic, that is, a statistic calculated from the data that is used as a basis for deciding whether the null hypothesis should be rejected.
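As an illustration, the decision rule for a one-mean z-test (population standard deviation assumed known) can be sketched with the standard library. The sample figures here (n = 25 bags, sample mean 450 g, 𝜎 = 7.8 g) are hypothetical values chosen for the pretzel example, not data from the source:

```python
import math

def one_mean_ztest(xbar, mu0, sigma, n, tail="two"):
    """z statistic and p-value for a one-mean z-test.

    tail: "two" (Ha: mu != mu0), "left" (Ha: mu < mu0),
          or "right" (Ha: mu > mu0).
    """
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF via the error function.
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    if tail == "two":
        p = 2.0 * (1.0 - phi(abs(z)))
    elif tail == "left":
        p = phi(z)
    else:  # right-tailed
        p = 1.0 - phi(z)
    return z, p

# Hypothetical pretzel data: n = 25 bags, sample mean 450 g, sigma = 7.8 g,
# testing H0: mu = 454 against the two-tailed Ha: mu != 454.
z, p = one_mean_ztest(450, 454, 7.8, 25, tail="two")
```

A small p-value (here well below 0.05) would lead us to reject the null hypothesis in favor of the two-tailed alternative.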

Type I and Type II Errors

In statistics, a type I error is rejecting the null hypothesis when it is in fact true, whereas a type II error is failing to reject the null hypothesis when it is in fact false. The probabilities of both type I and type II errors are useful (and essential) to evaluating the effectiveness of a hypothesis test, which involves analyzing the chances of making an incorrect decision. A type I error occurs if a true null hypothesis is rejected. The probability of that happening, the type I error probability, commonly called the significance level of the hypothesis test, is denoted 𝛼. A type II error occurs if a false null hypothesis is not rejected. The probability of that happening, the type II error probability, is denoted 𝛽. Ideally, both type I and type II errors should have small probabilities; then the chance of making an incorrect decision would be small, regardless of whether the null hypothesis is true or false. We can design a hypothesis test to have any specified significance level. So, for instance, if not rejecting a true null hypothesis is important, we should specify a small value for 𝛼. However, in making our choice for 𝛼, we must keep in mind that, for a fixed sample size, the smaller we make 𝛼, the larger 𝛽 becomes. Consequently, we must always assess the risks involved in committing both types of errors and use that assessment as a method for balancing the type I and type II error probabilities.

The significance level, 𝛼, is the probability of making a type I error, that is, of rejecting a true null hypothesis. Therefore, if the hypothesis test is conducted at a small significance level (e.g., 𝛼 = 0.05), the chance of rejecting a true null hypothesis will be small. Thus, if we do reject the null hypothesis, we can be reasonably confident that the null hypothesis is false. In other words, if we do reject the null hypothesis, we conclude that the data provide sufficient evidence to support the alternative hypothesis.

However, we usually do not know the probability, 𝛽, of making a type II error, that is, of not rejecting a false null hypothesis. Consequently, if we do not reject the null hypothesis, we simply reserve judgment about which hypothesis is true. In other words, if we do not reject the null hypothesis, we conclude only that the data do not provide sufficient evidence to support the alternative hypothesis; we do not conclude that the data provide sufficient evidence to support the null hypothesis. In short, there might be a true difference, but the power of the statistical procedure may not be high enough to detect it.
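A small simulation can make the two error probabilities concrete. The sketch below uses illustrative (hypothetical) values for the population mean and standard deviation: it estimates 𝛼 as the rejection rate when samples are drawn under a true null hypothesis, and 𝛽 as the non-rejection rate when they are drawn under a false one.

```python
import math
import random

random.seed(1)

def reject(sample, mu0, sigma):
    """Two-tailed one-mean z-test decision: True if H0: mu = mu0 is rejected."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return abs(z) > 1.96  # critical value for significance level alpha = 0.05

def rejection_rate(mu_true, mu0=454.0, sigma=8.0, n=30, trials=20_000):
    """Fraction of simulated samples (drawn with mean mu_true) that reject H0."""
    rejections = sum(
        reject([random.gauss(mu_true, sigma) for _ in range(n)], mu0, sigma)
        for _ in range(trials)
    )
    return rejections / trials

# H0 true (mu = 454): the rejection rate estimates alpha, about 0.05.
alpha_hat = rejection_rate(mu_true=454.0)

# H0 false (mu = 450): the non-rejection rate estimates beta.
beta_hat = 1 - rejection_rate(mu_true=450.0)
```

Rerunning with a larger sample size n shrinks 𝛽 while 𝛼 stays at its specified level, which is the usual way to make both error probabilities small at once.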

## Assumptions for Common Statistical Procedures


Observational Studies and Designed Experiments

Besides classifying statistical studies as either descriptive or inferential, we often need to classify them as either observational studies or designed experiments. In an observational study, researchers simply observe characteristics and take measurements, as in a sample survey. In a designed experiment, researchers impose treatments and controls and then observe characteristics and take measurements. Observational studies can reveal only association, whereas designed experiments can help establish causation.

Census, Sampling, and Experimentation

If the information you need is not already available from a previous study, you might acquire it by conducting a census – that is, by obtaining information for the entire population of interest. However, conducting a census may be time consuming, costly, impractical, or even impossible.

Two methods other than a census for obtaining information are sampling and experimentation. If sampling is appropriate, you must decide how to select the sample; that is, you must choose the method for obtaining a sample from the population. Because the sample will be used to draw conclusions about the entire population, it should be a representative sample – that is, it should reflect as closely as possible the relevant characteristics of the population under consideration.

Three basic principles of experimental design are control, randomization, and replication. In a designed experiment, the individuals or items on which the experiment is performed are called experimental units. When the experimental units are humans, the term subject is often used in place of experimental unit. Generally, each experimental condition is called a treatment, of which there may be several.

Sampling

Most modern sampling procedures involve the use of probability sampling. In probability sampling, a random device – such as tossing a coin, consulting a table of random numbers, or employing a random-number generator – is used to decide which members of the population will constitute the sample instead of leaving such decisions to human judgment. The use of probability sampling may still yield a nonrepresentative sample. However, probability sampling helps eliminate unintentional selection bias and permits the researcher to control the chance of obtaining a nonrepresentative sample. Furthermore, the use of probability sampling guarantees that the techniques of inferential statistics can be applied.

Simple Random Sampling

Simple random sampling is a sampling procedure for which each possible sample of a given size is equally likely to be the one obtained. There are two types of simple random sampling. One is simple random sampling with replacement (SRSWR), whereby a member of the population can be selected more than once; the other is simple random sampling without replacement (SRS), whereby a member of the population can be selected at most once.
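In Python, the two types can be sketched with the standard library (the 100-member numbered population is hypothetical):

```python
import random

random.seed(42)
population = list(range(1, 101))  # e.g., 100 numbered population members

# Simple random sampling without replacement (SRS):
# each member can be selected at most once.
srs = random.sample(population, k=10)

# Simple random sampling with replacement (SRSWR):
# a member may be selected more than once.
srswr = random.choices(population, k=10)

assert len(set(srs)) == 10  # without replacement: no duplicates possible
```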

Simple random sampling is the most natural and easily understood method of probability sampling – it corresponds to our intuitive notion of random selection by lot. However, simple random sampling does have drawbacks. For instance, it may fail to provide sufficient coverage when information about subpopulations is required and may be impractical when the members of the population are widely scattered geographically.

Systematic Random Sampling

One method that takes less effort to implement than simple random sampling is systematic random sampling.

Cluster Sampling

Another sampling method is cluster sampling, which is particularly useful when the members of the population are widely scattered geographically.

Stratified Sampling

Another sampling method, known as stratified sampling, is often more reliable than cluster sampling. In stratified sampling, the population is first divided into subpopulations, called strata, and then sampling is done from each stratum. Ideally, the members of each stratum should be homogeneous relative to the characteristic under consideration. In stratified sampling, the strata are often sampled in proportion to their size, which is called proportional allocation.

Basic Study Design

We discuss several major clinical trial designs here. Most trials use the so-called parallel design. That is, the intervention and control groups are followed simultaneously from the time of allocation to one or the other. Exceptions to the simultaneous follow-up are historical control studies. These compare a group of participants on a new intervention with a previous group of participants on standard or control therapy. A modification of the parallel design is the cross-over trial, which uses each participant at least twice, at least once as a member of the control group and at least once as a member of one or more intervention groups. Another modification is a withdrawal study, which starts with all participants on the active intervention and then, usually randomly, assigns a portion to be followed on the active intervention and the remainder to be followed off the intervention. Factorial design trials employ two or more independent assignments to intervention or control.

Randomized Control Trials

Randomized control trials are comparative studies with an intervention group and a control group; the assignment of the participant to a group is determined by the formal procedure of randomization. Randomization, in the simplest case, is a process by which all participants are equally likely to be assigned to either the intervention group or the control group. The features of this technique are discussed in detail below. Not all clinical studies can use randomized controls. Occasionally, the disease under study is so rare that a large enough population cannot be readily obtained. In such an instance, only case-control studies might be possible. Such studies, however, are not clinical trials.

Nonrandomized Concurrent Control Studies

Controls in this type of study are participants treated without the new intervention at approximately the same time as the intervention group is treated. Participants are allocated to one of the two groups, but by definition this is not a random process. An example of a nonrandomized concurrent control study would be a comparison of survival results of patients treated at two institutions, one institution using a new surgical procedure and the other using more traditional medical care. Another example is when patients are offered either of two treatments and the patient selects the one that he or she thinks is preferable. Comparisons between the two groups are then made, adjusting for any observed baseline imbalances.

To some investigators, the nonrandomized concurrent control design has advantages over the randomized control design. Those who object to the idea of ceding to chance the responsibility for selecting a person’s treatment may favor this design. It is also difficult for some investigators to convince potential participants of the need for randomization. They find it easier to offer the intervention to some and the control to others, hoping to match on key characteristics. The major weakness of the nonrandomized concurrent control study is the potential that the intervention group and control group are not strictly comparable. It is difficult to prove comparability because the investigator must assume that she has information on all the important prognostic factors. Selecting a control group by matching on more than a few factors is impractical and the comparability of a variety of other characteristics would still need to be evaluated. In small studies, an investigator is unlikely to find real differences which may exist between groups before the initiation of intervention since there is poor sensitivity statistically to detect such differences (e.g., high 𝛽 and not enough power). Even for large studies that could detect most differences of real clinical importance, the uncertainty about the unknown or unmeasured factors is still of concern.

Historical Controls and Databases

In historical control studies, a new intervention is used in a series of participants and the results are compared to the outcome in a previous series of comparable participants. Historical controls are thus, by this definition, nonrandomized and nonconcurrent. Typically, historical control data can be obtained from two sources. First, control group data may be available in the literature. These data are often undesirable because it is difficult, and perhaps impossible, to establish whether the control and intervention groups are comparable in key characteristics at the onset. Even if such characteristics were measured in the same way, the information may not be published and for all practical purposes it will be lost. Second, data may not have been published but may be available on computer files or in medical charts. Such data on control participants, for example, might be found in a large center which has several ongoing clinical investigations. When one study is finished, the participants in that study may be used as a control group for some future study. Centers which do successive studies, as in cancer research, will usually have a system for storing and retrieving the data from past studies for use at some future time. The advent of electronic medical records may also facilitate access to historical data from multiple sources, although it does not solve the problem of nonstandard and variable assessment or missing information.

Despite the time and cost benefits, as well as the ethical considerations, historical control studies have potential limitations which should be kept in mind. They are particularly vulnerable to bias. An improvement in outcome for a given disease may be attributed to a new intervention when, in fact, the improvement may stem from a change in the patient population or patient management. Shifts in patient population can be subtle and perhaps undetectable. In a Veterans Administration Urological Research Group study of prostate cancer, people were randomized to placebo or estrogen treatment groups over a 7-year period. For those enrolled during the last 2-3 years, no differences were found between the placebo and estrogen groups. However, those assigned to placebo entering in the first 2-3 years had a shorter survival time than those assigned to estrogen entering in the last 2-3 years of the study. The reason for the early apparent difference is probably that the people randomized earlier were older than the later group and thus were at higher risk of death during the period of observation.

Statistical Designs

Once we have chosen the treatments, we must decide how the experimental units are to be assigned to the treatments (or vice versa). In a completely randomized design, all the experimental units are assigned randomly among all the treatments. In a randomized block design, experimental units that are similar in ways expected to affect the response variable are grouped in blocks. Then the random assignment of experimental units to the treatments is made block by block; that is, the experimental units are assigned randomly among all the treatments separately within each block.

Randomized Block Design and Randomized Block ANOVA

In this section we introduce a design that has its basic focus on a single factor, but uses an additional factor (called a blocking factor) to account for the effects of dissimilar groups of experimental units on the value of the response variable. Suppose we are interested in a single factor with k treatments (levels). Sometimes there is so much variation in the values of the response variable within each treatment that use of a completely randomized design will fail to detect differences among the treatment means when such differences exist. This is because it is often not possible to decide whether the variation among the sample means for the different treatments is due to differences among the treatment means or whether it is due to variation within the treatments (i.e., variation in the values of the response variable within each treatment).

If a large portion of the variation within the treatments is due to one extraneous variable, then it is often appropriate to use a randomized block design instead of a completely randomized design. In a randomized block design, the extraneous source of variation is isolated and removed so that it is easier to detect differences among the treatment means when such differences exist. Although a randomized block design is not always appropriate or feasible, it is often a viable alternative to a completely randomized design in the presence of a single extraneous source of variability. In a randomized block design, the experimental units within each block should be randomly assigned among all the treatments. Compared with two-way ANOVA:

• The blocking factor is not a factor of interest to the experimenter; only one factor is of real interest to the experimenter, namely, the treatment factor.
• There is a restriction in the way the randomization is performed in assigning the experimental units to the treatments. The experimental units are not assigned to the treatments completely at random; rather the experimental units within each block are assigned randomly to the treatments so that each treatment occurs once and only once within each block.

It is important to remember that including a blocking factor in our design is meant to account for another source of variation in the values of the response variable and thus reduce the variation that is due to "experimental error." A properly selected blocking factor will make the test for the treatment effect more sensitive by reducing the error sum of squares.

Statistical Procedures

Sample Size for Estimating 𝜇

Sample Size for Estimating p

Sample Size Calculation for Continuous Response Variables

2N = 4(z𝛼 + z𝛽)²𝜎²/𝛿², where 2N = total sample size (N participants / group), 𝜎 = the pooled population standard deviation, and 𝛿 = 𝜇1 – 𝜇2.

Sample Size Calculation for Proportions

2N = 4(z𝛼 + z𝛽)²p̄(1 – p̄)/(pc – pi)², where 2N = total sample size (N participants / group) and p̄ = (pc + pi)/2.

Sample Size Calculation for Survival Functions

2N = 4(z𝛼 + z𝛽)²/[ln(𝜆c/𝜆i)]², where 2N = total sample size (N participants / group) and 𝜆 = population hazard function.
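A sketch of these per-group sample size calculations, assuming the common forms N = 2(z𝛼 + z𝛽)²𝜎²/𝛿² for continuous responses and N = 2(z𝛼 + z𝛽)²p̄(1 – p̄)/(pc – pi)² for proportions, with a two-sided 𝛼 = 0.05, power 0.90, and rounding up to a whole number per group:

```python
import math

def z_quantile(p):
    """Standard normal quantile, found by bisection on the erf-based CDF."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-10:
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def n_per_group_continuous(sigma, delta, alpha=0.05, beta=0.10):
    """N per group for comparing two means with a two-sided test."""
    z_a = z_quantile(1 - alpha / 2)
    z_b = z_quantile(1 - beta)
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

def n_per_group_proportions(pc, pi, alpha=0.05, beta=0.10):
    """N per group for comparing two proportions, using pbar = (pc + pi) / 2."""
    z_a = z_quantile(1 - alpha / 2)
    z_b = z_quantile(1 - beta)
    pbar = (pc + pi) / 2.0
    return math.ceil(2 * (z_a + z_b) ** 2 * pbar * (1 - pbar) / (pc - pi) ** 2)

# Standardized difference delta / sigma = 0.5 at 90% power, alpha = 0.05:
print(n_per_group_continuous(sigma=10, delta=5))   # 85 participants per group
```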

One-Mean z-Interval Procedure
One-Mean t-Interval Procedure
One-Mean z-Test
One-Mean t-Test
Wilcoxon Signed-Rank Test

Note: The following points may be relevant when performing a Wilcoxon signed-rank test:

• If an observation equals 𝜇0 (the value for the mean in the null hypothesis), that observation should be removed and the sample size reduced by 1.
• If two or more absolute differences are tied, each should be assigned the mean of the ranks they would have had if there were no ties.
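The tie-handling rule in the second point can be sketched as follows (the sample of absolute differences is hypothetical):

```python
def mid_ranks(values):
    """Assign ranks, giving tied values the mean of the ranks they would span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Absolute differences 3, 5, 5, 8: the two 5s would occupy ranks 2 and 3,
# so each receives the mean rank 2.5.
print(mid_ranks([3, 5, 5, 8]))  # [1.0, 2.5, 2.5, 4.0]
```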

Pooled t-Test
Pooled t-Interval Procedure
Nonpooled t-Test
Nonpooled t-Interval Procedure
Mann-Whitney Test (Wilcoxon rank-sum test, Mann-Whitney-Wilcoxon test)

Note: When there are ties in the sample data, ranks are assigned in the same way as in the Wilcoxon signed-rank test. Namely, if two or more observations are tied, each is assigned the mean of the ranks they would have had if there had been no ties.

Paired t-Test
Paired t-Interval Procedure
Paired Wilcoxon Signed-Rank Test
Kruskal-Wallis Test
One-Proportion z-Interval Procedure
One-Proportion z-Test
Two-Proportions z-Test
Two-Proportions z-Interval Procedure
Chi-Square Goodness-of-Fit Test
Chi-Square Independence Test
Chi-Square Homogeneity Test
One-Standard-Deviation Chi-Square Test
One-Standard-Deviation Chi-Square Interval Procedure
Two-Standard-Deviations F-Test
Two-Standard-Deviations F-Interval Procedure
Tukey Multiple-Comparison Method
Simple Linear Regression

Coefficient of Determination

The coefficient of determination is a descriptive measure of the utility of the regression equation for making predictions. The coefficient of determination always lies between 0 and 1. A value of r^2 near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of r^2 near 1 suggests that the regression equation is quite useful for making predictions.
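A minimal sketch of computing r^2 as 1 – SSE/SST from the least-squares line (the data points are hypothetical):

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # slope of the regression line
    b0 = ybar - b1 * xbar         # intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error SS
    sst = sum((yi - ybar) ** 2 for yi in y)                        # total SS
    return 1 - sse / sst

# Perfectly linear data: the line explains all variation, so r^2 = 1.
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```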

Correlation Coefficient

• r reflects the slope of the scatterplot
• The magnitude of r indicates the strength of the linear relationship
• The sign of r suggests the type of linear relationship
• The sign of r and the sign of the slope of the regression line are identical

Assumptions Before Linear Regression
Standard Error of the Estimate
Regression t-Test
Regression t-Interval Procedure
Conditional Mean t-Interval Procedure
Two-Way ANOVA
Friedman Test

Meta-Analysis: Which Model Should We Use?

Fixed-effect model

It makes sense to use the fixed-effect model if two conditions are met. First, we believe that all the studies included in the analysis are functionally identical. Second, our goal is to compute the common effect size for the identified population, and not to generalize to other populations. For example, suppose that a pharmaceutical company will use a thousand patients to compare a drug versus placebo. Because the staff can work with only 100 patients at a time, the company will run a series of ten trials with 100 patients in each. The studies are identical in the sense that any variable which can have an impact on the outcome is the same across the ten studies. Specifically, the studies draw patients from a common pool, using the same researchers, dose, measure, and so on.

Random-effects model

By contrast, when the researcher is accumulating data from a series of studies that had been performed by researchers operating independently, it would be unlikely that all the studies were functionally equivalent. Typically, the subjects or interventions in these studies would have differed in ways that would have impacted the results, and therefore we should not assume a common effect size. Therefore, in these cases the random-effects model is more easily justified than the fixed-effect model. Additionally, the goal of this analysis is usually to generalize to a range of scenarios. Therefore, if one did make the argument that all the studies used an identical, narrowly defined population, then it would not be possible to extrapolate from this population to others, and the utility of the analysis would be severely limited.

Heterogeneity

To understand the problem, suppose for a moment that all studies in the analysis shared the same true effect size, so that the (true) heterogeneity is zero. Under this assumption, we would not expect the observed effects to be identical to each other. Rather, because of within-study error, we would expect each to fall within some range of the common effect. Now, assume that the true effect size does vary from one study to the next. In this case, the observed effects vary from one another for two reasons. One is the real heterogeneity in effect size, and the other is the within-study error. If we want to quantify the heterogeneity we need to partition the observed variation into these two components, and then focus on the former.

The mechanism that we use to extract the true between-studies variation from the observed variation is as follows:

• We compute the total amount of study-to-study variation actually observed.
• We estimate how much the observed effects would be expected to vary from each other if the true effect was actually the same in all studies.
• The excess variation (if any) is assumed to reflect real differences in effect size (that is, the heterogeneity).
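One common way to carry out these three steps is the DerSimonian-Laird estimator. The sketch below (with hypothetical effect sizes and within-study variances) computes the observed variation Q, compares it with its expected value k – 1 under homogeneity, and converts the excess into the between-study variance 𝜏²:

```python
def dersimonian_laird_tau2(effects, variances):
    """DerSimonian-Laird estimate of between-study variance (tau^2).

    effects:   observed effect size from each study
    variances: within-study variance of each effect
    """
    w = [1.0 / v for v in variances]          # inverse-variance weights
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Q: total (weighted) study-to-study variation actually observed.
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1                     # expected value of Q if true
                                              # effect is the same in all studies
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Excess variation, rescaled; truncated at zero if Q < df.
    tau2 = max(0.0, (q - df) / c)
    return q, tau2

# Three hypothetical studies: here Q is about 8 against df = 2,
# so the excess is attributed to real heterogeneity (tau^2 about 0.12).
q, tau2 = dersimonian_laird_tau2([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])
```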

Clinical Trials

Randomization

The functions of randomization include:

• Randomization removes the potential for bias in the allocation of participants to the intervention group or to the control group. Such selection bias could easily occur, and cannot necessarily be prevented, in the nonrandomized concurrent or historical control study because the investigator or the participant may influence the choice of intervention. The direction of the allocation bias may go either way and can easily invalidate the comparison. This advantage of randomization assumes that the procedure is performed in a valid manner and that the assignment cannot be predicted.
• Somewhat related to the first function is that randomization tends to produce comparable groups; that is, measured as well as unknown or unmeasured prognostic factors and other characteristics of the participants at the time of randomization will be, on the average, evenly balanced between the intervention and control groups. This does not mean that in any single experiment all such characteristics, sometimes called baseline variables or covariates, will be perfectly balanced between the two groups. However, it does mean that for independent covariates, whatever detected or undetected differences exist between the groups, the overall magnitude and direction of the differences will tend to be equally divided between the two groups. Of course, many covariates are strongly associated; thus, any imbalance in one would tend to produce imbalances in the others.
• The validity of statistical tests of significance is guaranteed. The process of randomization makes it possible to ascribe a probability distribution to the difference in outcome between treatment groups receiving equally effective treatments and thus to assign significance levels to observed differences. The validity of the statistical tests of significance is not dependent on the balance of the prognostic factors between the randomized groups. The chi-square test for two-by-two tables and Student's t-test for comparing two means can be justified on the basis of randomization alone without making further assumptions concerning the distribution of baseline variables. If randomization is not used, further assumptions concerning the comparability of the groups and the appropriateness of the statistical models must be made before the comparisons will be valid. Establishing the validity of these assumptions may be difficult.

In the simplest case, randomization is a process by which each participant has the same chance of being assigned to either intervention or control. An example would be the toss of a coin, in which heads indicates intervention group and tails indicates control group. Even in the more complex randomization strategies, the element of chance underlies the allocation process. Of course, neither trial participant nor investigator should know what the assignment will be before the participant’s decision to enter the study. Otherwise, the benefits of randomization can be lost.

The Randomization Process

Two forms of experimental bias are of concern. The first, selection bias, occurs if the allocation process is predictable. In this case, the decision to enter a participant into a trial may be influenced by the anticipated treatment assignment. If any bias exists as to what treatment particular types of participants should receive, then a selection bias might occur. A second bias, accidental bias, can arise if the randomization procedure does not achieve balance on risk factors or prognostic covariates. Some of the allocation procedures are more vulnerable to accidental bias, especially for small studies. For large studies, however, the chance of accidental bias is negligible.

Fixed Allocation Randomization

Fixed allocation procedures assign the interventions to participants with a prespecified probability, usually equal (e.g., 50% for two arms, 33% for three, or 25% for four), and that allocation probability is not altered as the study progresses. Three methods of randomization belong to the fixed allocation family: simple, blocked, and stratified randomization. The most elementary form of randomization is referred to as simple or complete randomization. One simple method is to toss an unbiased coin each time a participant is eligible to be randomized (for two treatment combinations). Using this procedure, approximately one half of the participants will be in group A and one half in group B. In practice, for small studies, instead of tossing a coin to generate a randomization schedule, simple randomization is usually accomplished with a random digit table, on which the equally likely digits 0 to 9 are arranged by rows and columns. For large studies, a more convenient method for producing a randomization schedule is to use a random-number-producing algorithm, available on most computer systems. Another simple method is to use a uniform random number algorithm to produce random numbers in the interval from 0.0 to 1.0. Using a uniform random number generator, a random number can be produced for each participant. If the random number is between 0 and p, the participant is assigned to group A; otherwise to group B. For equal allocation, the probability cut point, p, is one half (i.e., p = 0.50). If equal allocation between A and B is not desired, then p can be set to the desired proportion in the algorithm and the study will have, on the average, a proportion p of the participants in group A. In addition, this strategy can be adapted easily to more than two groups.
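The uniform-random-number strategy just described can be sketched as follows (the number of participants is arbitrary):

```python
import random

random.seed(7)

def simple_randomization(n_participants, p=0.5):
    """Assign each participant to group A with probability p, else to group B."""
    return ["A" if random.random() < p else "B" for _ in range(n_participants)]

assignments = simple_randomization(1000, p=0.5)

# On average about half the participants land in each group, but simple
# randomization does not guarantee exact balance at any point in enrollment.
print(assignments.count("A"), assignments.count("B"))
```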

Blocked randomization, sometimes called permuted block randomization, avoids serious imbalance in the number of participants assigned to each group, an imbalance which could occur in the simple randomization procedure. More importantly, blocked randomization guarantees that at no time during randomization will the imbalance be large and that at certain points the number of participants in each group will be equal. This protects against temporal trends during enrollment, which is often a concern for larger trials with long enrollment phases. If participants are randomly assigned with equal probability to groups A or B, then for each block of even size (for example, 4, 6, or 8) one half of the participants will be assigned to A and the other half to B. The order in which the interventions are assigned in each block is randomized, and this process is repeated for consecutive blocks of participants until all participants are randomized.
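A minimal sketch of permuted-block randomization for two groups, assuming equal allocation and a hypothetical block size of 4:

```python
import random

random.seed(3)

def blocked_randomization(n_participants, block_size=4):
    """Permuted-block randomization for two groups with equal allocation."""
    assert block_size % 2 == 0
    schedule = []
    while len(schedule) < n_participants:
        # Each block contains equal numbers of A and B assignments...
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        random.shuffle(block)  # ...in a randomized order.
        schedule.extend(block)
    return schedule[:n_participants]

schedule = blocked_randomization(20, block_size=4)

# After every complete block the two groups are exactly balanced, and the
# imbalance never exceeds block_size / 2 at any point during enrollment.
assert schedule[:4].count("A") == 2
```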

Survival Analysis

Censored Data

Many researchers consider survival data analysis to be merely the application of two conventional statistical methods to a special type of problem: parametric if the distribution of survival times is known to be normal and nonparametric if the distribution is unknown. This assumption would be true if the survival times of all the subjects were exact and known; however, some survival times are not. Further, the survival distribution is often skewed, or far from being normal. Thus there is a need for new statistical techniques. One of the most important developments is due to a special feature of survival data in the life sciences: some subjects have not experienced the event of interest by the end of the study or the time of analysis. For example, some patients may still be alive or disease-free at the end of the study period. The exact survival times of these subjects are unknown. These are called censored observations or censored times and can also occur when people are lost to follow-up after a period of study. When there are no censored observations, the set of survival times is complete.

Type I Censoring

Animal studies usually start with a fixed number of animals, to which the treatment or treatments are given. Because of time and/or cost limitations, the researcher often cannot wait for the death of all the animals. One option is to observe for a fixed period of time, say six months, after which the surviving animals are sacrificed. Survival times recorded for the animals that died during the study period are the times from the start of the experiment to their death. These are called exact or uncensored observations. The survival times of the sacrificed animals are not known exactly but are recorded as at least the length of the study period. These are called censored observations. Some animals could be lost or die accidentally. Their survival times, from the start of the experiment to loss or death, are also censored observations. In type I censoring, if there are no accidental losses, all censored observations equal the length of the study period.

Type II Censoring

Another option in animal studies is to wait until a fixed portion of the animals have died, say 80 out of 100, after which the surviving animals are sacrificed. In this case, called type II censoring, if there are no accidental losses, the censored observations equal the largest uncensored observation.

Type III Censoring

In most clinical and epidemiological studies, the period of study is fixed, and patients enter the study at different times during that period. Some may die before the end of the study, and their exact survival times are known; others may withdraw or be lost to follow-up; still others may be alive at the end of the study. For the latter groups, the exact survival times are unknown and the observations are censored. Because patients enter at different times, the censoring times differ from subject to subject; this is called type III censoring.

There are generally three reasons why censoring may occur:

• A person does not experience the event before the study ends;
• A person is lost to follow-up during the study period;
• A person withdraws from the study because of death or some other reason.
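The distinction between exact and censored times can be made concrete with a small sketch. Each subject is recorded as a follow-up time plus an event indicator (1 = event observed, 0 = censored); all subjects, times, and the study length below are hypothetical:

```python
study_length = 180  # days; fixed study period (as in type I censoring)

# Hypothetical subjects: (follow-up time, reason follow-up ended).
raw_records = [
    (42, "died"),
    (180, "alive at study end"),   # censored at the end of the study
    (95, "lost to follow-up"),     # censored before the study ended
    (150, "died"),
    (180, "alive at study end"),
]

# Encode each record as (time, event): event = 1 for an exact
# (uncensored) time, 0 for a censored time.
observations = [(t, 1 if reason == "died" else 0) for t, reason in raw_records]

exact = [t for t, event in observations if event == 1]
censored = [t for t, event in observations if event == 0]
```

This (time, event) encoding is the standard input format for survival methods that must treat censored times as lower bounds rather than exact values.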

## Chi-Square Goodness-of-Fit Test

The statistical-inference procedures discussed in this thread rely on a distribution called the chi-square distribution. A variable has a chi-square distribution if its distribution has the shape of a special type of right-skewed curve, called a chi-square curve. Actually, there are infinitely many chi-square distributions, and we identify the chi-square distribution in question by its number of degrees of freedom, just as we did for t-distributions.

Basic properties of chi-square curves

• The total area under a chi-square curve equals 1.
• A chi-square curve starts at 0 on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis.
• A chi-square curve is right skewed.
• As the number of degrees of freedom becomes larger, chi-square curves look increasingly like normal curves.

Chi-Square Goodness-of-Fit Test

Our first chi-square procedure is called the chi-square goodness-of-fit test. We can use this procedure to perform a hypothesis test about the distribution of a qualitative (categorical) variable or a discrete quantitative variable that has only finitely many possible values. Next, we describe the logic behind the chi-square goodness-of-fit test with an example. The FBI compiles data on crimes and crime rates and publishes the information in Crime in the United States. A violent crime is classified by the FBI as murder, forcible rape, robbery, or aggravated assault. Table 13.1 gives a relative-frequency distribution for (reported) violent crimes in 2010. For instance, in 2010, 29.5% of violent crimes were robberies.

A simple random sample of 500 violent-crime reports from last year yielded the frequency distribution shown in Table 13.2. Suppose that we want to use the data in Tables 13.1 and 13.2 to decide whether last year’s distribution of violent crimes has changed from the 2010 distribution.

Solution

The idea behind the chi-square goodness-of-fit test is to compare the observed frequencies in the second column of Table 13.2 to the frequencies that would be expected – the expected frequencies – if last year’s violent-crime distribution is the same as the 2010 distribution. If the observed and expected frequencies match fairly well (i.e., each observed frequency is roughly equal to its corresponding expected frequency), we do not reject the null hypothesis; otherwise, we reject the null hypothesis.

To formulate a precise procedure for carrying out the hypothesis test, we need to answer two questions: 1) What frequencies should we expect from a random sample of 500 violent-crime reports from last year if last year’s violent-crime distribution is the same as the 2010 distribution? 2) How do we decide whether the observed and expected frequencies match fairly well? The first question is easy to answer, which we illustrate with robberies. If last year’s violent-crime distribution is the same as the 2010 distribution, then, according to Table 13.1, 29.5% of last year’s violent crimes would have been robberies. Therefore, in a random sample of 500 violent-crime reports from last year, we would expect about 29.5% of the 500 to be robberies. In other words, we would expect the number of robberies to be 500 * 0.295, or 147.5.

In general, we compute each expected frequency, denoted E, by using the formula, E = np, where n is the sample size and p is the appropriate relative frequency from the second column of Table 13.1. Using this formula, we calculated the expected frequencies for all four types of violent crime. The results are displayed in the second column of Table 13.3.
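The E = np calculation, and the chi-square statistic built from it, can be sketched as follows. Only the robbery proportion (29.5%) and the sample size of 500 come from the text; the other proportions and all of the observed counts are hypothetical stand-ins for the entries of Tables 13.1 and 13.2:

```python
n = 500  # sample size of violent-crime reports

# Hypothetical 2010 relative-frequency distribution (sums to 1);
# only the robbery proportion is taken from Table 13.1.
relative_freqs = {
    "murder": 0.012,
    "forcible rape": 0.068,
    "robbery": 0.295,
    "aggravated assault": 0.625,
}

# Hypothetical observed frequencies for the sample of 500.
observed = {"murder": 9, "forcible rape": 29, "robbery": 138,
            "aggravated assault": 324}

# Expected frequency for each category: E = n * p.
expected = {crime: n * p for crime, p in relative_freqs.items()}

# Chi-square statistic: sum of the subtotals (O - E)^2 / E.
chi_square = sum((observed[c] - expected[c]) ** 2 / expected[c]
                 for c in relative_freqs)
```

With these (made-up) observed counts, the expected robbery frequency is 500 × 0.295 = 147.5, matching the worked value in the text.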

The second column of Table 13.3 answers the first question. It gives the frequencies that we would expect if last year’s violent-crime distribution is the same as the 2010 distribution. The second question – whether the observed and expected frequencies match fairly well – is harder to answer. We need to calculate a number that measures the goodness of fit.

In Table 13.4, the second column repeats the observed frequencies from the second column of Table 13.2. The third column of Table 13.4 repeats the expected frequencies from the second column of Table 13.3. To measure the goodness of fit of the observed and expected frequencies, we look at the differences, O − E, shown in the fourth column of Table 13.4. Summing these differences to obtain a measure of goodness of fit isn’t very useful because the sum is 0. Instead, we square each difference (shown in the fifth column) and then divide it by the corresponding expected frequency. Doing so gives the values (O − E)^2 / E, called chi-square subtotals, shown in the sixth column. The sum of the chi-square subtotals, 𝛴(O − E)^2 / E = 6.529, is the statistic used to measure the goodness of fit of the observed and expected frequencies. If the null hypothesis is true, the observed and expected frequencies should be roughly equal, resulting in a small value of the test statistic, 𝛴(O − E)^2 / E. As we have seen, that test statistic is 6.529. Can this value be reasonably attributed to sampling error, or is it large enough to suggest that the null hypothesis is false? To answer this question, we need to know the distribution of the test statistic 𝛴(O − E)^2 / E.

## Factorial Designs

In this section we will describe the completely randomized factorial design. This design is commonly used when there are two or more factors of interest. Recall, in particular, the difference between an observational study and a designed experiment. Observational studies involve simply observing characteristics and taking measurements, as in a sample survey. A designed experiment involves imposing treatments on experimental units, controlling extraneous sources of variation that might affect the experiment, and then observing characteristics and taking measurements on the experimental units.

Also recall that in an experiment, the response variable is the characteristic of the experimental outcome that is measured or observed. A factor is a variable whose effect on the response variable is of interest to the experimenter. Generally a factor is a categorical variable whose possible values are referred to as the levels of the factor. In a single-factor experiment, we will assign experimental units to the treatments (or vice versa). Experimental units should be assigned to the treatments in such a way as to eliminate any bias that might be associated with the assignment. This is generally accomplished by randomly assigning the experimental units to the treatments.

In certain medical experiments, called clinical trials, randomization is essential. To compare two or more methods of treating illness, it is important to eliminate any bias that could be introduced by medical personnel assigning patients to the treatments in a nonrandom fashion. For example, a doctor might erroneously assign patients who exhibit less severe symptoms of the illness to a less risky treatment.

PS: Advantages of randomized design over other methods for selecting controls

• First, randomization removes the potential of bias in the allocation of participants to the intervention group or to the control group. Such selection bias could easily occur, and cannot necessarily be prevented, in the non-randomized concurrent or historical control study because the investigator or the participant may influence the choice of intervention. This influence can be conscious or subconscious and can be due to numerous factors, including the prognosis of the participant. The direction of the allocation bias may go either way and can easily invalidate the comparison. This advantage of randomization assumes that the procedure is performed in a valid manner and that the assignment cannot be predicted.
• Second, somewhat related to the first, is that randomization tends to produce comparable groups; that is, measured as well as unknown or unmeasured prognostic factors and other characteristics of the participants at the time of randomization will be, on the average, evenly balanced between the intervention and control groups. This does not mean that in any single experiment all such characteristics, sometimes called baseline variables or covariates, will be perfectly balanced between the two groups. However, it does mean that for independent covariates, whatever the detected or undetected differences that exist between the groups, the overall magnitude and direction of the differences will tend to be equally divided between the two groups. Of course, many covariates are strongly associated; thus, any imbalance in one would tend to produce imbalances in the others.
• Third, the validity of statistical tests of significance is guaranteed. As has been stated, “although groups compared are never perfectly balanced for important covariates in any single experiment, the process of randomization makes it possible to ascribe a probability distribution to the difference in outcome between treatment groups receiving equal effective treatments and thus to assign significance levels to observed differences.” The validity of the statistical tests of significance is not dependent on the balance of prognostic factors between the randomized groups.

Often in clinical trials, double blind studies are used. In this type of study, patients (the experimental units) are randomly assigned to treatments, and neither the doctor nor the patient knows which treatment has been assigned to the patient. This is an effective way to eliminate bias in treatment assignment so that the treatment effects are not confounded (associated) with other non experimental and uncontrolled factors.

Factorial designs involve two or more factors. Consider the experiment in this example. There the researchers studied the effects of two factors (hydrophilic polymer and irrigation regimen) on weight gain (the response variable) of Golden Torch cacti. The two levels of the polymer factor were: used and not used. The irrigation regimen had five levels to indicate the amount of water usage: none, light, medium, heavy, and very heavy. This is an example of a two-factor or two-way factorial design.

In this experiment every level of polymer occurred with every level of irrigation regimen, for a total of 2 * 5 = 10 treatments. Often these 10 treatments are called treatment combinations to indicate that we combine the levels of the various factors together to obtain the actual collection of treatments. Since, in this case, every level of one factor is combined with every level of the other factor, we say that the levels of one factor are crossed with the levels of the other factor. When all the possible treatment combinations obtained by crossing the levels of the factors are included in the experiment, we call the design a complete factorial design, or simply a factorial design.
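Crossing the levels of the two factors to enumerate the 2 × 5 = 10 treatment combinations is exactly a cross product, which can be sketched as:

```python
from itertools import product

# Factor levels from the cactus example.
polymer = ["used", "not used"]
irrigation = ["none", "light", "medium", "heavy", "very heavy"]

# Crossing the factors: every level of one factor paired with
# every level of the other, giving 2 * 5 = 10 treatment combinations.
treatments = list(product(polymer, irrigation))
```

Adding a third factor is just another argument to `product`, which is why the number of combinations multiplies as factors are added.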

It is possible to extend the two-way factorial design to include more factors. For example, in the Golden Torch cacti experiment, the amount of sunlight the cacti receive could have an effect on weight gain. If the amount of sunlight is controlled in the two-way study so that all plants receive the same amount of sunlight, then the amount of sunlight would not be considered a factor in the experiment.

However, since the amount of sunlight a cactus receives might have an effect on its growth, the experimenter might want to introduce this additional factor. Suppose we consider three levels of sunlight: high, medium, and low. The levels of sunlight could be achieved by placing screens of various mesh sizes over the cacti. If amount of sunlight is added as a third factor, there would be 2 * 5 * 3 = 30 different treatment combinations in a complete factorial design.

Possibly we could add even more factors to the experiment to take into account other factors that might affect weight gain of the cacti. Adding more factors will increase the number of treatment combinations for the experiment (unless the level of that factor is 1). In general, the total number of treatment combinations for a complete factorial design is the product of the number of levels of all factors in the experiment.

Obviously, as the number of factors increases, the number of treatment combinations increases. A large number of factors can result in so many treatment combinations that the experiment is unwieldy, too costly, or too time consuming to carry out. Most complete factorial designs involve only two or three factors.

To handle many factors, statisticians have devised experimental designs that use only a fraction of the total number of possible treatment combinations. These designs are called fractional factorial designs and are usually restricted to the case of all factors having two or three levels each. Fractional factorial designs cannot provide as much information as a complete factorial design, but they are very useful when a large number of factors is involved and the number of experimental units is limited by availability, cost, time, or other considerations. Fractional factorial designs are beyond the scope of this thread.

Once the treatment combinations are determined, the experimental units need to be assigned to the treatment combinations. In a completely randomized design, the experimental units are randomly assigned to the treatment combinations. If this random assignment is not done or is not possible, the treatment effects might become confounded with other uncontrolled factors that would make it difficult or impossible to determine whether an effect is due to the treatment or due to the confounding with uncontrolled factors.

Besides the random assignment of experimental units to treatment combinations, it is important that we use randomization in other ways when conducting an experiment. Often experiments are conducted in sequence. One treatment combination is applied to an experimental unit, and then the next treatment combination is applied to the next experimental unit, and so forth. It is essential that the order in which the experiments are conducted be randomized.

For example, consider an experiment in which measurements are made that are sensitive to heat or humidity. If all experiments associated with the first level of a factor are conducted on a hot and humid day, all experiments associated with the second level of the factor are conducted on a cooler, less humid day, and so on, then the factor effect is confounded with the heat/humidity conditions on the days that the experiments are conducted. If the analysis indicates an effect due to the factor, we do not know whether there is actually a factor effect or a heat/humidity effect (or both). Randomization of the order in which the experiments are conducted would help keep the heat/humidity effect from being confounded with any factor effect.

Experimental and Classification Factors

In the description of designing experiments for factorial designs, we emphasized the idea of being able to assign experimental units to treatment combinations. If the experimental units are assigned randomly to the levels of a factor, the factor is called an experimental factor. If all the factors of a factorial design are experimental factors, we consider the study a designed experiment.

In some factorial studies, however, the experimental units cannot be assigned at random to the levels of a factor, as in the case when the levels of the factor are characteristics associated with the experimental units. A factor whose levels are characteristics of the experimental unit is called a classification factor. If all the factors of a factorial design are classification factors, we consider the study an observational study.

Consider, for instance, the household energy consumption study, in which the response variable is household energy consumption and the factor of interest is the region of the United States in which a household is located. A household cannot be randomly assigned to a region of the country. The region of the country is a characteristic of the household and, thus, a classification factor. If we were to add home type as a second factor, the levels of this factor would also be a characteristic of a household, and, hence, home type would also be a classification factor. This two-way factorial design would be considered an observational study, since both of its factors are classification factors.

There are many studies that involve a mixture of experimental and classification factors. For example, in studying the effect of four different medications on relieving headache pain, the age of an individual might play a role in how long it takes before headache pain dissipates. Suppose a researcher decides to consider four age groups: 21 to 35 years old, 36 to 50 years old, 51 to 65 years old, and 66 years and older. Obviously, since age is a characteristic of an individual, age group is a classification factor.

Suppose that the researcher randomly selects 40 individuals from each age group and then randomly assigns 10 individuals in each age group to one of the four medications. Since each person is assigned at random to a medication, the medication factor is an experimental factor. Although one of the factors here is a classification factor and the other is an experimental factor, we would consider this a designed experiment.

Fixed and Random Effect Factors

There is another important way to classify factors that depends on the way the levels of a factor are selected. If the levels of a factor are the only levels of interest to the researcher, then the factor is called a fixed effect factor. For example, in the Golden Torch cacti experiment, both factors (polymer and irrigation regimen) are fixed effect factors because the levels of each factor are the only levels of interest to the experimenter.

If the levels of a factor are selected at random from a collection of possible levels, and if the researcher wants to make inferences to the entire collection of possible levels, the factor is called a random effect factor. For example, consider a study to be done on the effect of different types of advertising on sales of a new sandwich at a national fast-food chain. The marketing group conducting the study feels that the city in which a franchise store is located might have an effect on sales. So they decide to include a city factor in the study, and randomly select eight cities from the collection of cities in which the company’s stores are located. They are not interested in these eight cities alone, but want to make inferences to the entire collection of cities. In this case the city factor is a random effect factor.

## Analysis of Variance

Analysis-of-variance procedures rely on a distribution called the F-distribution, named in honor of Sir Ronald Fisher. A variable is said to have an F-distribution if its distribution has the shape of a special type of right-skewed curve, called an F-curve. There are infinitely many F-distributions, and we identify an F-distribution (and its F-curve) by its number of degrees of freedom, just as we did for t-distributions and chi-square distributions. An F-distribution, however, has two numbers of degrees of freedom instead of one. Figure 16.1 depicts two different F-curves; one has df = (10, 2), and the other has df = (9, 50). The first number of degrees of freedom for an F-curve is called the degrees of freedom for the numerator, and the second is called the degrees of freedom for the denominator.

Basic properties of F-curves:

• The total area under an F-curve equals 1.
• An F-curve starts at 0 on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis as it does so.
• An F-curve is right skewed.

One-Way ANOVA: The Logic

In older threads, you learned how to compare two population means, that is, the means of a single variable for two different populations. You studied various methods for making such comparisons, one being the pooled t-procedure.

Analysis of variance (ANOVA) provides methods for comparing several population means, that is, the means of a single variable for several populations. In this section we present the simplest kind of ANOVA, one-way analysis of variance. This type of ANOVA is called one-way analysis of variance because it compares the means of a variable for populations that result from a classification by one other variable, called the factor. The possible values of the factor are referred to as the levels of the factor.

For example, suppose that you want to compare the mean energy consumption by households among the four regions of the United States. The variable under consideration is “energy consumption,” and there are four populations: households in the Northeast, Midwest, South, and West. The four populations result from classifying households in the United States by the factor “region,” whose levels are Northeast, Midwest, South, and West.

One-way analysis of variance is the generalization to more than two populations of the pooled t-procedure (i.e., both procedures give the same results when applied to two populations). As in the pooled t-procedure, we make the following assumptions. Regarding Assumptions 1 and 2, we note that one-way ANOVA can also be used as a method for comparing several means with a designed experiment. In addition, like the pooled t-procedure, one-way ANOVA is robust to moderate violations of Assumption 3 (normal populations) and is also robust to moderate violations of Assumption 4 (equal standard deviations) provided the sample sizes are roughly equal.

How can the conditions of normal populations and equal standard deviations be checked? Normal probability plots of the sample data are effective in detecting gross violations of normality. Checking equal population standard deviations, however, can be difficult, especially when the sample sizes are small; as a rule of thumb, you can consider that condition met if the ratio of the largest to the smallest sample standard deviation is less than 2. We call that rule of thumb the rule of 2.
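The rule of 2 is easy to check directly. Here is a minimal sketch; the function name and the sample data are made up for illustration:

```python
from statistics import stdev

def rule_of_two(samples):
    """Rule of thumb: the equal-standard-deviations condition is
    considered met if the ratio of the largest to the smallest
    sample standard deviation is less than 2."""
    sds = [stdev(sample) for sample in samples]
    return max(sds) / min(sds) < 2

# Three hypothetical samples, one per population.
groups = [[20, 22, 19, 21], [25, 27, 24, 26], [30, 28, 33, 29]]
ok = rule_of_two(groups)  # True when the spreads are comparable
```

A `False` result would suggest checking the equal-standard-deviations assumption more carefully before applying one-way ANOVA.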

Another way to assess the normality and equal-standard-deviations assumptions is to perform a residual analysis. In ANOVA, the residual of an observation is the difference between the observation and the mean of the sample containing it. If the normality and equal-standard-deviations assumptions are met, a normal probability plot of (all) the residuals should be roughly linear. Moreover, a plot of the residuals against the sample means should fall roughly in a horizontal band centered and symmetric about the horizontal axis.

The Logic Behind One-Way ANOVA

The reason for the word variance in analysis of variance is that the procedure for comparing the means analyzes the variation in the sample data. To examine how this procedure works, let’s suppose that independent random samples are taken from two populations – say, Populations 1 and 2 – with means 𝜇1 and 𝜇2. Further, let’s suppose that the means of the two samples are xbar1 = 20 and xbar2 = 25. Can we reasonably conclude from these statistics that 𝜇1 ≠ 𝜇2, that is, that the population means are (significantly) different? To answer this question, we must consider the variation within the samples.

The basic idea for performing a one-way analysis of variance to compare the means of several populations:

• Take independent simple random samples from the populations.
• Compute the sample means.
• If the variation among the sample means is large relative to the variation within the samples, conclude that the means of the populations are not all equal (significantly different).

To make this process precise, we need quantitative measures of the variation among the sample means and the variation within the samples. We also need an objective method for deciding whether the variation among the sample means is large relative to the variation within the samples.

Mean Squares and F-Statistic in One-Way ANOVA

As before, when dealing with several populations, we use subscripts on parameters and statistics. Thus, for Population j, we use 𝜇j, xbarj, sj, and nj to denote the population mean, sample mean, sample standard deviation, and sample size, respectively.

We first consider the measure of variation among the sample means. In hypothesis tests for two population means, we measure the variation between the two sample means by calculating their difference, xbar1 − xbar2. When more than two populations are involved, we cannot measure the variation among the sample means simply by taking a difference. However, we can measure that variation by computing the standard deviation or variance of the sample means or by computing any descriptive statistic that measures variation.

In one-way ANOVA, we measure the variation among the sample means by a weighted average of their squared deviations about the mean, xbar, of all the sample data. That measure of variation is called the treatment mean square, MSTR, and is defined as

MSTR = SSTR / (k − 1)

where k denotes the number of populations being sampled and

SSTR = n1(xbar1 − xbar)^2 + n2(xbar2 − xbar)^2 + … + nk(xbark − xbar)^2

The quantity SSTR is called the treatment sum of squares.

We note that MSTR is similar to the sample variance of the sample means. In fact, if all the sample sizes are identical, then MSTR equals that common sample size times the sample variance of the sample means.

Next we consider the measure of variation within the samples. This measure is the pooled estimate of the common population variance, 𝜎^2. It is called the error mean square, MSE, and is defined as

MSE = SSE / (n – k)

where n denotes the total number of observations and

SSE = (n1 − 1)s1^2 + (n2 − 1)s2^2 + … + (nk − 1)sk^2

The quantity SSE is called the error sum of squares. Finally, we consider how to compare the variation among the sample means, MSTR, to the variation within the samples, MSE. To do so, we use the statistic F = MSTR/MSE, which we refer to as the F-statistic. Large values of F indicate that the variation among the sample means is large relative to the variation within the samples and hence that the null hypothesis of equal population means should be rejected.

In summary, one-way ANOVA compares the variation among the sample means with the variation within the samples: we compute MSTR and MSE, form the F-statistic F = MSTR/MSE, and reject the null hypothesis of equal population means when F is too large to be reasonably attributed to sampling error.
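The quantities defined above (SSTR, SSE, MSTR, MSE, and the F-statistic) can be computed directly from their defining formulas. A minimal sketch, with hypothetical sample data:

```python
from statistics import mean, variance

def one_way_anova_f(samples):
    """Compute the one-way ANOVA F-statistic, F = MSTR / MSE,
    from a list of samples (one list of observations per population)."""
    k = len(samples)                       # number of populations
    n = sum(len(s) for s in samples)       # total number of observations
    grand_mean = mean(x for s in samples for x in s)

    # Treatment sum of squares: SSTR = sum of nj * (xbarj - xbar)^2.
    sstr = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Error sum of squares: SSE = sum of (nj - 1) * sj^2.
    sse = sum((len(s) - 1) * variance(s) for s in samples)

    mstr = sstr / (k - 1)  # treatment mean square
    mse = sse / (n - k)    # error mean square
    return mstr / mse

# Hypothetical samples from three populations.
samples = [[20, 22, 19, 21], [25, 27, 24, 26], [30, 28, 33, 29]]
f_stat = one_way_anova_f(samples)
# A large F indicates that the variation among the sample means is
# large relative to the variation within the samples.
```

Here `variance` is the sample variance (with an n − 1 denominator), matching the sj^2 in the SSE formula.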