
Type I and Type II Error in Statistics


We often use inferential statistics to make decisions or judgments about the value of a parameter, such as a population mean. For example, we might need to decide whether the mean weight, 𝜇, of all bags of pretzels packaged by a particular company differs from the advertised weight of 454 grams, or we might want to determine whether the mean age, 𝜇, of all cars in use has increased from the year 2000 mean of 9.0 years. One of the most commonly used methods for making such decisions or judgments is to perform a hypothesis test. A hypothesis is a statement that something is true. For example, the statement “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g” is a hypothesis.


Typically, a hypothesis test involves two hypotheses: the null hypothesis and the alternative hypothesis (or research hypothesis). For instance, in the pretzel-packaging example, the null hypothesis might be “the mean weight of all bags of pretzels packaged equals the advertised weight of 454 g,” and the alternative hypothesis might be “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g.”

The first step in setting up a hypothesis test is to decide on the null hypothesis and the alternative hypothesis. Generally, the null hypothesis for a hypothesis test concerning a population mean, 𝜇, always specifies a single value for that parameter. Hence, we can express the null hypothesis as

H0: 𝜇 = 𝜇0

The choice of the alternative hypothesis depends on and should reflect the purpose of the hypothesis test. Three choices are possible for the alternative hypothesis.

  • If the primary concern is deciding whether a population mean, 𝜇, is different from a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 ≠ 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a two-tailed test.
  • If the primary concern is deciding whether a population mean, 𝜇, is less than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 < 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a left-tailed test.
  • If the primary concern is deciding whether a population mean, 𝜇, is greater than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 > 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a right-tailed test.

Note: A hypothesis test is called a one-tailed test if it is either left-tailed or right-tailed.

After we have chosen the null and alternative hypotheses, we must decide whether to reject the null hypothesis in favor of the alternative hypothesis. In practice, we need a precise criterion for making that decision, which involves a test statistic – a statistic calculated from the sample data that serves as the basis for deciding whether the null hypothesis should be rejected.

Type I and Type II Errors

In statistics, a type I error is rejecting the null hypothesis when it is in fact true, whereas a type II error is failing to reject the null hypothesis when it is in fact false. The probabilities of type I and type II errors are essential for evaluating the effectiveness of a hypothesis test, which involves analyzing the chances of making an incorrect decision. A type I error occurs if a true null hypothesis is rejected. The probability of that happening, the type I error probability, commonly called the significance level of the hypothesis test, is denoted 𝛼. A type II error occurs if a false null hypothesis is not rejected. The probability of that happening, the type II error probability, is denoted 𝛽.

Ideally, both type I and type II errors should have small probabilities. Then the chance of making an incorrect decision would be small, regardless of whether the null hypothesis is true or false. We can design a hypothesis test to have any specified significance level. So, for instance, if not rejecting a true null hypothesis is important, we should specify a small value for 𝛼. However, in making our choice for 𝛼, we must keep in mind that, for a fixed sample size, the smaller we specify 𝛼, the larger the type II error probability, 𝛽, will be. Consequently, we must always assess the risks involved in committing both types of errors and use that assessment to balance the type I and type II error probabilities.

The significance level, 𝛼, is the probability of making a type I error, that is, of rejecting a true null hypothesis. Therefore, if the hypothesis test is conducted at a small significance level (e.g., 𝛼 = 0.05), the chance of rejecting a true null hypothesis will be small. Thus, if we do reject the null hypothesis, we can be reasonably confident that the null hypothesis is false. In other words, if we do reject the null hypothesis, we conclude that the data provide sufficient evidence to support the alternative hypothesis.

However, we usually do not know the probability, 𝛽, of making a type II error, that is, of not rejecting a false null hypothesis. Consequently, if we do not reject the null hypothesis, we simply reserve judgment about which hypothesis is true. In other words, we conclude only that the data do not provide sufficient evidence to support the alternative hypothesis; we do not conclude that the data provide sufficient evidence to support the null hypothesis. In short, there may be a true difference that the statistical procedure lacks the power to detect.
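To make the significance level concrete, here is a minimal Python sketch (the settings 𝜇0 = 454, 𝜎 = 7.8, n = 25 are hypothetical, echoing the pretzel example): it repeatedly draws samples from a population for which the null hypothesis is actually true and counts how often a two-tailed z-test at 𝛼 = 0.05 rejects. The empirical rejection rate – the type I error rate – should come out close to 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    mu0, sigma, n, alpha = 454.0, 7.8, 25, 0.05   # hypothetical settings
    z_crit = stats.norm.ppf(1 - alpha / 2)        # two-tailed critical value

    trials = 100_000
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(mu0, sigma, n)        # H0 is true: population mean is mu0
        z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
        if abs(z) > z_crit:                       # rejecting here is a type I error
            rejections += 1

    print(rejections / trials)                    # approximately 0.05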

Missing or Poor Quality Data in Clinical Trials


In most trials, participants have data missing for a variety of reasons. Perhaps they were not able to keep their scheduled clinic visits or were unable to perform or undergo the particular procedures or assessments. In some cases, follow-up of the participant was not completed as outlined in the protocol. The challenge is how to deal with missing data or data of such poor quality that they are in essence missing. One approach is to withdraw participants who have poor data completely from the analysis. However, the remaining subset may no longer be representative of the population randomized and there is no guarantee that the validity of the randomization has been maintained in this process.

Many methods to deal with this issue assume that the data are missing at random; that is, the probability of a measurement not being observed does not depend on what its value would have been. In some contexts, this may be a reasonable assumption, but for clinical trials, and clinical research in general, it would be difficult to confirm. It is, in fact, probably not a valid assumption, as the reason the data are missing is often associated with the health status of the participant. Thus, during trial design and conduct, every effort must be made to minimize missing data. If the amount of missing data is relatively small, then the available analytic methods will probably be helpful. If the amount of missing data is substantial, there may be no method capable of rescuing the trial. Here, we discuss some of the issues that must be kept in mind when analyzing a trial with missing data.

Rubin provided a definition of missing data mechanisms. If data are missing for reasons unrelated to the measurement that would have been observed and unrelated to covariates, then the data are “missing completely at random.” Statistical analyses based on likelihood inference are valid when the data are missing at random or missing completely at random. If a measure or index allows a researcher to estimate the probability of having missing data, say in a participant with poor adherence to the protocol, then using methods proposed by Rubin and others might allow some adjustment to reduce bias. However, adherence, as indicated earlier, is often associated with a participant’s outcome and attempts to adjust for adherence can lead to misleading results.

If participants do not adhere to the intervention and also do not return for follow-up visits, the primary outcome measured may not be obtained unless it is survival or some easily ascertained event. In this situation, an intention-to-treat analysis is not feasible and no analysis is fully satisfactory. Because withdrawal of participants from the analysis is known to be problematic, one approach is to “impute” or fill in the missing data such that standard analyses can be conducted. This is appealing if the imputation process can be done without introducing bias. There are many procedures for imputation. Those based on multiple imputation are more robust than those based on single imputation.

A commonly used single imputation method is to carry the last observed value forward. This method, also known as an endpoint analysis, requires the very strong and unverifiable assumption that all future observations, if they were available, would remain constant. Although commonly used, the last observation carried forward method is not generally recommended. Using the average value for all participants with available data, or using a regression model to predict the missing value, are alternatives; in either case, however, the data must be missing at random for proper inference.
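As a concrete illustration, here is a minimal pandas sketch of last observation carried forward; the visit columns and measurement values are hypothetical.

    import numpy as np
    import pandas as pd

    # Hypothetical outcome measurements at three scheduled visits; NaN = missing
    visits = pd.DataFrame({
        "visit1": [140.0, 152.0, 138.0],
        "visit2": [135.0, np.nan, 136.0],
        "visit3": [np.nan, np.nan, 133.0],
    })

    # Carry each participant's last observed value forward across later visits
    locf = visits.ffill(axis=1)
    print(locf)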

A more complex approach is to conduct multiple imputations, typically using regression methods, and then perform a standard analysis for each imputation. The final analysis should take into consideration the variability across the imputations. As with single imputation, the inference based on multiple imputation depends on the assumption that the data are missing at random. Other technical approaches are not described here.
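As a sketch of that workflow, the example below uses scikit-learn's IterativeImputer with sample_posterior=True to generate m completed datasets, repeats the same (deliberately simple) analysis on each, and inspects the spread across imputations. The data are simulated, and a full analysis would pool the estimates and their variances using Rubin's rules.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))              # hypothetical data, 3 variables
    X[rng.random(X.shape) < 0.1] = np.nan      # ~10% of values missing

    m = 5
    estimates = []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        X_completed = imputer.fit_transform(X)       # one imputed dataset
        estimates.append(X_completed[:, 0].mean())   # same analysis each time

    # Between-imputation spread feeds into the pooled variance (Rubin's rules)
    print(np.mean(estimates), np.var(estimates, ddof=1))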

If the number of participants lost to follow-up differs in the study groups, the analysis of the data could be biased. For example, participants who are taking a new drug that has adverse effects may, as a consequence, miss scheduled clinic visits. Events may occur but be unobserved. These losses to follow-up would probably not be the same in the control group. In this situation, there may be a bias favoring the new drug. Even if the number lost to follow-up is the same in each study group, the possibility of bias still exists because the participants who are lost in one group may have quite different prognoses and outcomes than those in the other group.

An outlier is an extreme value significantly different from the remaining values. The concern is whether extreme values in the sample should be included in the analysis. This question may apply to a laboratory result, or to the data from one of several areas in a hospital or from a clinic in a multicenter trial. Removing outliers is not recommended unless the data can be clearly shown to be erroneous. Even though a value may be an outlier, it could be correct, indicating that on occasion an extreme result is possible. This fact could be very important and should not be ignored.

Systematic Review – Defining the Question


Eligibility Criteria

The acronym PICO (Participants, Interventions, Comparisons, Outcomes) serves as a reminder of the essential components of a review question. One of the features that distinguish a systematic review from a narrative review is the pre-specification of criteria for including and excluding studies in the review (eligibility criteria). Eligibility criteria are a combination of aspects of the clinical question plus specification of the types of studies that have addressed these questions. The participants, interventions and comparisons in the clinical question usually translate directly into eligibility criteria for the review. Outcomes usually are not part of the criteria for including studies: a Cochrane review would typically seek all rigorous studies of a particular comparison of interventions in a particular population of participants, irrespective of the outcomes measured or reported. However, some reviews do legitimately restrict eligibility to specific outcomes.

The criteria for considering types of people included in studies in a review should be sufficiently broad to encompass the likely diversity of studies, but sufficiently narrow to ensure that a meaningful answer can be obtained when studies are considered in aggregate. It is often helpful to consider the types of people that are of interest in two steps. First, the diseases or conditions of interest should be defined using explicit criteria for establishing their presence or absence. Criteria that would force unnecessary exclusion of studies should be avoided. For example, diagnostic criteria that were developed more recently – which may be viewed as the current gold standard for diagnosing the condition of interest – will not have been used in earlier studies. Expensive or recent diagnostic tests may not be available in many countries or settings.

Second, the broad population and setting of interest should be defined. This involves deciding whether a special population group is of interest, determined by factors such as age, sex, race, educational status or the presence of a particular condition such as angina or shortness of breath. Interest may focus on a particular setting such as a community, hospital, nursing home, chronic care institution, or outpatient setting.

The types of participants of interest usually determine directly the participant-related eligibility criteria for including studies. However, pre-specification of rules for dealing with studies that only partially address the population of interest can be challenging.

Any restrictions with respect to specific population characteristics or settings should be based on a sound rationale. Focusing a review on a particular subgroup of people on the basis of their age, sex or ethnicity simply because of personal interests when there is no underlying biologic or sociological justification for doing so should be avoided.


The second key component of a well-formulated question is to specify the interventions of interest and the interventions against which these will be compared (comparisons). In particular, are the interventions to be compared with an inactive control intervention, or with an active control intervention? When specifying drug interventions, factors such as the drug preparation, route of administration, dose, duration, and frequency should be considered. For more complex interventions (such as educational or behavioral interventions), the common or core features of the interventions will need to be defined. In general, it is useful to consider exactly what is delivered, at what intensity, how often it is delivered, who delivers it, and whether people involved in delivery of the intervention need to be trained. Review authors should also consider whether variation in the intervention (e.g., dosage/intensity, mode of delivery, frequency, duration) is so great that it would have substantially different effects on the participants and outcomes of interest, and hence may be important to restrict.



Although reporting of outcomes should rarely determine eligibility of studies for a review, the third key component of a well-formulated question is the delineation of particular outcomes that are of interest. In general, Cochrane reviews should include all outcomes that are likely to be meaningful to clinicians, patients, the general public, administrators and policy makers, but should not include outcomes reported in included studies if they are trivial or meaningless to decision makers. Outcomes considered to be meaningful and therefore addressed in a review will not necessarily have been reported in individual studies. For example, quality of life is an important outcome, perhaps the most important outcome, for people considering whether or not to use chemotherapy for advanced cancer, even if the available studies are found to report only survival. Including all important outcomes in a review will highlight gaps in the primary research and encourage researchers to address these gaps in future studies.

Outcomes may include survival (mortality), clinical events (e.g., strokes or myocardial infarction), patient-reported outcomes (e.g., symptoms, quality of life), adverse events, burdens (e.g., demands on caregivers, frequency of tests, restrictions on lifestyle) and economic outcomes (e.g., cost and resource use). It is critical that outcomes used to assess adverse effects as well as outcomes used to assess beneficial effects are among those addressed by a review. If combinations of outcomes will be considered, these need to be specified. For example, if a study fails to make a distinction between non-fatal and fatal strokes, will these data be included in a meta-analysis if the question specifically relates to stroke death?

Review authors should consider how outcomes may be measured, both in terms of the type of scale likely to be used and the timing of measurement. Outcomes may be measured objectively (e.g., blood pressure, number of strokes) or subjectively as rated by a clinician, patient, or carer (e.g., disability scales). It may be important to specify whether measurement scales have been published or validated. When defining the timing of outcome measurement, authors may consider whether all time frames or only selected time-points will be included in the review. One strategy is to group time-points into pre-specified intervals to represent “short-term”, “medium-term” and “long-term” outcomes and to take no more than one of each from each study for any particular outcome. It is important to give the timing of outcome measurement considerable thought, as it can influence the results of the review.

While all important outcomes should be included in Cochrane reviews, trivial outcomes should not be included. Authors need to avoid overwhelming and potentially misleading readers with data that are of little or no importance. In addition, indirect or surrogate outcome measures, such as laboratory results or radiologic results, are potentially misleading and should be avoided or interpreted with caution because they may not predict clinically important outcomes accurately. Surrogate outcomes may provide information on how a treatment might work but not whether it actually does work. Many interventions reduce the risk for a surrogate outcome but have no effect or have harmful effects on clinically relevant outcomes, and some interventions have no effect on surrogate measures but improve clinical outcomes.

Main Outcomes

Once a full list of relevant outcomes has been compiled for the review, authors should prioritize the outcomes and select the main outcomes of relevance to the review question. The main outcomes are the essential outcomes for decision-making, and are those that would form the basis of a “Summary of findings” table. “Summary of findings” tables provide key information about the amount of evidence for important comparisons and outcomes, the quality of the evidence and the magnitude of effect. There should be no more than seven main outcomes, which should generally not include surrogate or interim outcomes. They should not be chosen on the basis of any anticipated or observed magnitude of effect, or because they are likely to have been addressed in the studies to be reviewed.

Primary Outcomes

Primary outcomes for the review should be identified from among the main outcomes. Primary outcomes are the outcomes that would be expected to be analyzed should the review identify relevant studies, and conclusions about the effects of the interventions under review will be based largely on these outcomes. There should in general be no more than three primary outcomes and they should include at least one desirable and at least one undesirable outcome (to assess beneficial and adverse effects respectively).

Secondary Outcomes

Main outcomes not selected as primary outcomes would be expected to be listed as secondary outcomes. In addition, secondary outcomes may include a limited number of additional outcomes the review intends to address. These may be specific to only some comparisons in the review. For example, laboratory tests and other surrogate measures may not be considered as main outcomes as they are less important than clinical endpoints in informing decisions, but they may be helpful in explaining effects or determining intervention integrity.

Types of Study

Certain study designs are more appropriate than others for answering particular questions. Authors should consider a priori what study designs are likely to provide reliable data with which to address the objectives of their review.

Because Cochrane reviews address questions about the effects of health care, they focus primarily on randomized trials. Randomization is the only way to prevent systematic differences between the baseline characteristics of participants in different intervention groups with respect to both known and unknown (or unmeasured) confounders. For clinical interventions, deciding who receives an intervention and who does not is influenced by many factors, including prognostic factors. Empirical evidence suggests that, on average, non-randomized studies produce effect estimates indicating more extreme benefits of health care interventions than randomized trials. However, the extent, and even the direction, of the bias is difficult to predict.

Specific aspects of study design and conduct should also be considered when defining eligibility criteria, even if the review is restricted to randomized trials. For example, decisions over whether cluster-randomized trials and cross-over trials are eligible should be made, as should thresholds for eligibility based on aspects such as use of a placebo comparison group, evaluation of outcomes blinded to allocation, or a minimum period of follow-up. There will always be a trade-off between restrictive study design criteria (which might result in the inclusion of studies with low risk of bias, but which are very small in number) and more liberal design criteria (which might result in the inclusion of more studies, but which are at a higher risk of bias). Furthermore, excessively broad criteria might result in the inclusion of misleading evidence. If, for example, interest focuses on whether a therapy improves survival in patients with a chronic condition, it might be inappropriate to look at studies of very short duration, except to make explicit the point that they cannot address the question of interest.

Scope of Review Question

The questions addressed by a review may be broad or narrow in scope. For example, a review might address a broad question regarding whether antiplatelet agents in general are effective in preventing all thrombotic events in humans. Alternatively, a review might address whether a particular antiplatelet agent, such as aspirin, is effective in decreasing the risk of a particular thrombotic event, stroke, in elderly persons with a previous history of stroke.

Determining the scope of a review question is a decision dependent upon multiple factors including perspectives regarding a question’s relevance and potential impact; supporting theoretical, biologic and epidemiological information; the potential generalizability and validity of answers to the questions; and available resources.

Assumptions for Common Statistical Procedures


Observational Studies and Designed Experiments

Besides classifying statistical studies as either descriptive or inferential, we often need to classify them as either observational studies or designed experiments. In an observational study, researchers simply observe characteristics and take measurements, as in a sample survey. In a designed experiment, researchers impose treatments and controls and then observe characteristics and take measurements. Observational studies can reveal only association, whereas designed experiments can help establish causation.

Census, Sampling, and Experimentation

If the information you need is not already available from a previous study, you might acquire it by conducting a census – that is, by obtaining information for the entire population of interest. However, conducting a census may be time consuming, costly, impractical, or even impossible.

Two methods other than a census for obtaining information are sampling and experimentation. If sampling is appropriate, you must decide how to select the sample; that is, you must choose the method for obtaining a sample from the population. Because the sample will be used to draw conclusions about the entire population, it should be a representative sample – that is, it should reflect as closely as possible the relevant characteristics of the population under consideration.

Three basic principles of experimental design are: control, randomization, and replication. In a designed experiment, the individuals or items on which the experiment is performed are called experimental units. When the experimental units are humans, the term subject is often used in place of experimental unit. Generally, each experimental condition is called a treatment, of which there may be several.

Most modern sampling procedures involve the use of probability sampling. In probability sampling, a random device – such as tossing a coin, consulting a table of random numbers, or employing a random-number generator – is used to decide which members of the population will constitute the sample, instead of leaving such decisions to human judgment. The use of probability sampling may still yield a nonrepresentative sample. However, probability sampling helps eliminate unintentional selection bias and permits the researcher to control the chance of obtaining a nonrepresentative sample. Furthermore, the use of probability sampling guarantees that the techniques of inferential statistics can be applied.

Simple Random Sampling

Simple random sampling is a sampling procedure for which each possible sample of a given size is equally likely to be the one obtained. There are two types of simple random sampling. One is simple random sampling with replacement (SRSWR), whereby a member of the population can be selected more than once; the other is simple random sampling without replacement (SRS), whereby a member of the population can be selected at most once.

Simple random sampling is the most natural and easily understood method of probability sampling – it corresponds to our intuitive notion of random selection by lot. However, simple random sampling does have drawbacks. For instance, it may fail to provide sufficient coverage when information about subpopulations is required and may be impractical when the members of the population are widely scattered geographically.
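Both flavors are easy to demonstrate with Python's standard library; the population of 20 numbered members is hypothetical. random.sample can never select a member twice, whereas random.choices can.

    import random

    random.seed(0)
    population = list(range(1, 21))          # hypothetical population of 20 members

    srs = random.sample(population, k=5)     # without replacement: no repeats
    srswr = random.choices(population, k=5)  # with replacement: repeats possible

    print(srs)
    print(srswr)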

Systematic Random Sampling

One method that takes less effort to implement than simple random sampling is systematic random sampling.

  • Divide the population size, N, by the sample size, n, and round the result down to the nearest whole number, m.
  • Use a random-number table or generator to obtain a number, k, between 1 and m.
  • Select for the sample those members of the population that are numbered k, k + m, k + 2m, and so on.

Cluster Sampling

Another sampling method is cluster sampling, which is particularly useful when the members of the population are widely scattered geographically.

  • Divide the population into groups (clusters).
  • Obtain a simple random sample of the clusters.
  • Use all the members of the clusters obtained in the previous step as the sample.

Stratified Sampling

Another sampling method, known as stratified sampling, is often more reliable than cluster sampling. In stratified sampling, the population is first divided into subpopulations, called strata, and then sampling is done from each stratum. Ideally, the members of each stratum should be homogeneous relative to the characteristic under consideration. In stratified sampling, the strata are often sampled in proportion to their size, which is called proportional allocation.

  • Divide the population into subpopulations (strata).
  • From each stratum, obtain a simple random sample of size proportional to the size of the stratum (proportional allocation).
  • Use all the members obtained in the previous step as the sample.
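A minimal sketch of proportional allocation, with hypothetical strata sizes: each stratum contributes a simple random sample in proportion to its share of the population.

    import random

    random.seed(0)
    strata = {"urban": 6000, "suburban": 3000, "rural": 1000}  # hypothetical sizes
    n_total = 100
    N = sum(strata.values())

    for name, size in strata.items():
        # Proportional allocation; rounding may need a small adjustment
        # so that the stratum sample sizes sum exactly to n_total
        n_stratum = round(n_total * size / N)
        sample = random.sample(range(size), k=n_stratum)  # SRS within the stratum
        print(name, n_stratum)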

Statistical Designs

Once we have chosen the treatments, we must decide how the experimental units are to be assigned to the treatments (or vice versa). In a completely randomized design, all the experimental units are assigned randomly among all the treatments. In a randomized block design, experimental units that are similar in ways expected to affect the response variable are grouped in blocks; the random assignment of experimental units to the treatments is then made block by block, that is, the experimental units are assigned randomly among all the treatments separately within each block.


One-Mean z-Interval Procedure

Assumptions: simple random sample; normal population or large sample; 𝜎 known. The confidence interval for 𝜇 is x̄ ± z𝛼/2 · 𝜎/√n, where z𝛼/2 is the z-value having area 𝛼/2 to its right under the standard normal curve, and the confidence level is 1 − 𝛼.


One-Mean t-Interval Procedure

Assumptions: simple random sample; normal population or large sample; 𝜎 unknown. The confidence interval for 𝜇 is x̄ ± t𝛼/2 · s/√n, where t𝛼/2 is found using the t-distribution with df = n − 1.

One-Mean z-Test

Assumptions: simple random sample; normal population or large sample; 𝜎 known. The test statistic for H0: 𝜇 = 𝜇0 is z = (x̄ − 𝜇0) / (𝜎/√n), which has the standard normal distribution when the null hypothesis is true. Reject H0 when z falls in the rejection region determined by 𝛼 and the form of the alternative hypothesis (two-, left-, or right-tailed).


One-Mean t-Test

Assumptions: simple random sample; normal population or large sample; 𝜎 unknown. The test statistic for H0: 𝜇 = 𝜇0 is t = (x̄ − 𝜇0) / (s/√n), which has the t-distribution with df = n − 1 when the null hypothesis is true.
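In practice, the one-mean t-test is a one-liner in scipy; the sample values below are hypothetical.

    import numpy as np
    from scipy import stats

    sample = np.array([452.1, 455.3, 451.0, 453.8, 456.2, 450.5, 454.9, 452.7])
    t_stat, p_value = stats.ttest_1samp(sample, popmean=454.0)  # H0: mu = 454 g
    print(t_stat, p_value)  # two-tailed p-value by default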


Wilcoxon Signed-Rank Test

Assumptions: simple random sample from a symmetric distribution. To test H0: 𝜇 = 𝜇0, subtract 𝜇0 from each observation, rank the absolute differences, and sum the ranks of the positive differences; that sum, W, is the test statistic.

Note: The following points may be relevant when performing a Wilcoxon signed-rank test (a minimal code sketch follows the list):

  • If an observation equals 𝜇0 (the value for the mean in the null hypothesis), that observation should be removed and the sample size reduced by 1.
  • If two or more absolute differences are tied, each should be assigned the mean of the ranks they would have had if there were no ties.
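Both conventions are built into scipy. In this minimal sketch (hypothetical data), the observation equal to 𝜇0 yields a zero difference, which zero_method="wilcox" removes, and the tied absolute differences receive midranks internally.

    import numpy as np
    from scipy import stats

    mu0 = 9.0
    sample = np.array([9.0, 10.2, 8.1, 11.5, 9.8, 7.9, 10.2, 12.0])
    diffs = sample - mu0   # the first observation gives a zero difference

    # zero_method="wilcox" drops zero differences and reduces n accordingly
    w_stat, p_value = stats.wilcoxon(diffs, zero_method="wilcox")
    print(w_stat, p_value)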


Pooled t-Test

Assumptions: simple random samples; independent samples; normal populations or large samples; equal population standard deviations. The test statistic for H0: 𝜇1 = 𝜇2 is t = (x̄1 − x̄2) / (sp · √(1/n1 + 1/n2)), where the pooled standard deviation is sp = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)), and df = n1 + n2 − 2.


Pooled t-Interval Procedure

Under the same assumptions as the pooled t-test, the confidence interval for 𝜇1 − 𝜇2 is (x̄1 − x̄2) ± t𝛼/2 · sp · √(1/n1 + 1/n2), with df = n1 + n2 − 2.


Nonpooled t-Test

Assumptions: simple random samples; independent samples; normal populations or large samples; the population standard deviations need not be equal. The test statistic for H0: 𝜇1 = 𝜇2 is t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), with degrees of freedom given by the Welch–Satterthwaite approximation, df = (s1²/n1 + s2²/n2)² / ((s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)), rounded down to the nearest integer.


Nonpooled t-Interval Procedure

Under the same assumptions as the nonpooled t-test, the confidence interval for 𝜇1 − 𝜇2 is (x̄1 − x̄2) ± t𝛼/2 · √(s1²/n1 + s2²/n2), with the same df as the nonpooled t-test.
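Both two-sample procedures are available through scipy's ttest_ind, toggled by the equal_var flag; the samples below are hypothetical.

    import numpy as np
    from scipy import stats

    x = np.array([23.1, 25.4, 22.8, 26.0, 24.3])
    y = np.array([20.2, 21.9, 19.8, 23.0, 22.4, 20.7])

    pooled = stats.ttest_ind(x, y, equal_var=True)   # pooled t-test
    welch = stats.ttest_ind(x, y, equal_var=False)   # nonpooled (Welch) t-test
    print(pooled)
    print(welch)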


Mann-Whitney Test (Wilcoxon rank-sum test, Mann-Whitney-Wilcoxon test)

Assumptions: simple random samples; independent samples; the two distributions have the same shape. Rank all n1 + n2 observations as a single combined sample; the test statistic M is the sum of the ranks of the observations in the first sample.

Note: When there are ties in the sample data, ranks are assigned in the same way as in the Wilcoxon signed-rank test. Namely, if two or more observations are tied, each is assigned the mean of the ranks they would have had if there had been no ties.
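scipy's mannwhitneyu applies this midrank convention automatically; a minimal sketch with hypothetical samples (note the tie at 4.1 across the two samples):

    import numpy as np
    from scipy import stats

    x = np.array([3.2, 4.1, 2.8, 5.0, 4.1])
    y = np.array([5.5, 6.1, 4.1, 7.2])

    u_stat, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
    print(u_stat, p_value)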


Paired t-Test

Assumptions: a simple random paired sample; the paired differences are normally distributed or the sample is large. Compute the difference d for each pair; the test statistic for H0: 𝜇1 = 𝜇2 is t = d̄ / (sd/√n), which has the t-distribution with df = n − 1 when the null hypothesis is true.


Paired t-Interval Procedure

Under the same assumptions as the paired t-test, the confidence interval for 𝜇1 − 𝜇2 is d̄ ± t𝛼/2 · sd/√n, with df = n − 1.


Paired Wilcoxon Signed-Rank Test

Assumptions: a simple random paired sample; the distribution of the paired differences is symmetric. Compute the difference for each pair and apply the Wilcoxon signed-rank procedure to those differences, testing the null hypothesis that the median difference is zero.


One-Proportion z-Interval Procedure

Assumptions: a simple random sample; the number of successes, x, and the number of failures, n − x, are both at least 5. The confidence interval for p is p̂ ± z𝛼/2 · √(p̂(1 − p̂)/n), where p̂ = x/n is the sample proportion.


One-Proportion z-Test

Assumptions: a simple random sample; np0 and n(1 − p0) are both at least 5. The test statistic for H0: p = p0 is z = (p̂ − p0) / √(p0(1 − p0)/n), which has approximately the standard normal distribution when the null hypothesis is true.


Two-Proportions z-Test

Assumptions: simple random samples; independent samples; the numbers of successes and failures are each at least 5 in both samples. The test statistic for H0: p1 = p2 is z = (p̂1 − p̂2) / √(p̂p(1 − p̂p)(1/n1 + 1/n2)), where p̂p = (x1 + x2)/(n1 + n2) is the pooled sample proportion.

Meta-Analysis: Which Model Should We Use?

Fixed-effect model

It makes sense to use the fixed-effect model if two conditions are met. First, we believe that all the studies included in the analysis are functionally identical. Second, our goal is to compute the common effect size for the identified population, and not to generalize to other populations. For example, suppose that a pharmaceutical company will use a thousand patients to compare a drug versus placebo. Because the staff can work with only 100 patients at a time, the company will run a series of ten trials with 100 patients in each. The studies are identical in the sense that any variable that can have an impact on the outcome is the same across the ten studies. Specifically, the studies draw patients from a common pool, using the same researchers, dose, measure, and so on.
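Computationally, the fixed-effect model weights each study's estimate by the inverse of its variance; a minimal numpy sketch with hypothetical effect sizes and within-study variances:

    import numpy as np

    effects = np.array([0.42, 0.38, 0.45, 0.40])        # hypothetical study estimates
    variances = np.array([0.010, 0.012, 0.008, 0.011])  # hypothetical within-study variances

    weights = 1.0 / variances                           # inverse-variance weights
    pooled = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    print(pooled, pooled_se)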

Random-effects model

By contrast, when the researcher is accumulating data from a series of studies that had been performed by researchers operating independently, it would be unlikely that all the studies were functionally equivalent. Typically, the subjects or interventions in these studies would have differed in ways that would have impacted the results, and therefore we should not assume a common effect size. In these cases, the random-effects model is more easily justified than the fixed-effect model. Additionally, the goal of this analysis is usually to generalize to a range of scenarios. If one did make the argument that all the studies used an identical, narrowly defined population, then it would not be possible to extrapolate from this population to others, and the utility of the analysis would be severely limited.


To understand the problem, suppose for a moment that all studies in the analysis shared the same true effect size, so that the (true) heterogeneity is zero. Under this assumption, we would not expect the observed effects to be identical to one another. Rather, because of within-study error, we would expect each to fall within some range of the common effect. Now, assume instead that the true effect size does vary from one study to the next. In this case, the observed effects vary from one another for two reasons: one is the real heterogeneity in effect size, and the other is the within-study error. If we want to quantify the heterogeneity, we need to partition the observed variation into these two components and then focus on the former.

The mechanism that we use to extract the true between-studies variation from the observed variation is as follows:

  • We compute the total amount of study-to-study variation actually observed.
  • We estimate how much the observed effects would be expected to vary from each other if the true effect was actually the same in all studies.
  • The excess variation (if any) is assumed to reflect real differences in effect size, that is, the heterogeneity; a sketch of this computation follows the list.
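This partition is what Cochran's Q statistic and the DerSimonian–Laird estimator of the between-study variance 𝜏² formalize; a minimal sketch reusing the hypothetical inputs from the fixed-effect example above:

    import numpy as np

    effects = np.array([0.42, 0.38, 0.45, 0.40])
    variances = np.array([0.010, 0.012, 0.008, 0.011])
    w = 1.0 / variances
    k = len(effects)

    fixed = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
    Q = np.sum(w * (effects - fixed) ** 2)    # total observed variation
    df = k - 1                                # variation expected from within-study error alone
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)             # excess variation = heterogeneity
    print(Q, tau2)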

Clinical Trials


The functions of randomization include:

  • Randomization removes the potential for bias in the allocation of participants to the intervention group or to the control group. Such selection bias could easily occur, and cannot necessarily be prevented, in a non-randomized concurrent or historical control study because the investigator or the participant may influence the choice of intervention. The direction of the allocation bias may go either way and can easily invalidate the comparison. This advantage of randomization assumes that the procedure is performed in a valid manner and that the assignment cannot be predicted.
  • Somewhat related to the first, randomization tends to produce comparable groups; that is, measured as well as unknown or unmeasured prognostic factors and other characteristics of the participants at the time of randomization will be, on the average, evenly balanced between the intervention and control groups. This does not mean that in any single experiment all such characteristics, sometimes called baseline variables or covariates, will be perfectly balanced between the two groups. However, it does mean that for independent covariates, whatever detected or undetected differences exist between the groups, the overall magnitude and direction of the differences will tend to be equally divided between the two groups. Of course, many covariates are strongly associated; thus, any imbalance in one would tend to produce imbalances in the others.
  • The validity of statistical tests of significance is guaranteed. The process of randomization makes it possible to ascribe a probability distribution to the difference in outcome between treatment groups receiving equally effective treatments and thus to assign significance levels to observed differences. The validity of the statistical tests of significance is not dependent on the balance of prognostic factors between the randomized groups. The chi-square test for two-by-two tables and Student's t-test for comparing two means can be justified on the basis of randomization alone, without making further assumptions concerning the distribution of baseline variables. If randomization is not used, further assumptions concerning the comparability of the groups and the appropriateness of the statistical models must be made before the comparisons will be valid. Establishing the validity of these assumptions may be difficult.

In the simplest case, randomization is a process by which each participant has the same chance of being assigned to either intervention or control. An example would be the toss of a coin, in which heads indicates intervention group and tails indicates control group. Even in the more complex randomization strategies, the element of chance underlies the allocation process. Of course, neither trial participant nor investigator should know what the assignment will be before the participant’s decision to enter the study. Otherwise, the benefits of randomization can be lost.

The Randomization Process

Two forms of experimental bias are of concern. The first, selection bias, occurs if the allocation process is predictable. In this case, the decision to enter a participant into a trial may be influenced by the anticipated treatment assignment. If any bias exists as to what treatment particular types of participants should receive, then a selection bias might occur. A second bias, accidental bias, can arise if the randomization procedure does not achieve balance on risk factors or prognostic covariates. Some of the allocation procedures are more vulnerable to accidental bias, especially for small studies. For large studies, however, the chance of accidental bias is negligible.

Fixed Allocation Randomization

Fixed allocation procedures assign the interventions to participants with a prespecified probability, usually equal (e.g., 50% for two arms, 33% for three, 25% for four, etc.), and that allocation probability is not altered as the study progresses. Three methods of randomization belong to fixed allocation: simple, blocked, and stratified randomization.

The most elementary form of randomization is referred to as simple or complete randomization. One simple method is to toss an unbiased coin each time a participant is eligible to be randomized (for two treatment combinations). Using this procedure, approximately one half of the participants will be in group A and one half in group B. In practice, for small studies, instead of tossing a coin to generate a randomization schedule, a random digit table on which the equally likely digits 0 to 9 are arranged by rows and columns is usually used to accomplish simple randomization. For large studies, a more convenient method for producing a randomization schedule is to use a random-number-producing algorithm, available on most computer systems.

Another simple randomization method is to use a uniform random number algorithm to produce random numbers in the interval from 0.0 to 1.0. Using a uniform random number generator, a random number can be produced for each participant. If the random number is between 0 and p, the participant is assigned to group A; otherwise, to group B. For equal allocation, the probability cut point, p, is one half (i.e., p = 0.50). If equal allocation between A and B is not desired, then p can be set to the desired proportion, and the study will have, on average, a proportion p of the participants in group A. This strategy can also be adapted easily to more than two groups.
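A minimal sketch of the uniform-random-number scheme just described, using the cut point p = 0.50 for equal allocation:

    import random

    random.seed(0)
    p = 0.50   # allocation probability for group A
    assignments = []
    for participant in range(20):
        u = random.random()                   # uniform random number in [0, 1)
        assignments.append("A" if u < p else "B")
    print(assignments)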

Blocked randomization, sometimes called permuted block randomization, avoids serious imbalance in the number of participants assigned to each group, an imbalance which could occur in the simple randomization procedure. More importantly, blocked randomization guarantees that at no time during randomization will the imbalance be large and that at certain points the number of participants in each group will be equal. This protects against temporal trends during enrollment, which is often a concern for larger trials with long enrollment phases. If participants are randomly assigned with equal probability to groups A or B, then for each block of even size (for example, 4, 6, or 8) one half of the participants will be assigned to A and the other half to B. The order in which the interventions are assigned in each block is randomized, and this process is repeated for consecutive blocks of participants until all participants are randomized.
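A minimal sketch of permuted-block randomization with a block size of 4: each block contains equal numbers of A's and B's in a shuffled order, so the imbalance never exceeds half the block size.

    import random

    random.seed(0)

    def permuted_blocks(n_participants, block_size=4):
        # Each block holds block_size/2 A's and block_size/2 B's
        schedule = []
        while len(schedule) < n_participants:
            block = ["A", "B"] * (block_size // 2)
            random.shuffle(block)             # random order within the block
            schedule.extend(block)
        return schedule[:n_participants]

    print(permuted_blocks(10))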

Chi-Square Goodness-of-Fit Test


The statistical-inference procedures discussed in this thread rely on a distribution called the chi-square distribution. A variable has a chi-square distribution if its distribution has the shape of a special type of right-skewed curve, called a chi-square curve. Actually, there are infinitely many chi-square distributions, and we identify the chi-square distribution in question by its number of degrees of freedom, just as we did for t-distributions.

Basic properties of chi-square curves

  • The total area under a chi-square curve equals 1.
  • A chi-square curve starts at 0 on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis.
  • A chi-square curve is right skewed.
  • As the number of degrees of freedom becomes larger, chi-square curves look increasingly like normal curves.

Chi-Square Goodness-of-Fit Test

Our first chi-square procedure is called the chi-square goodness-of-fit test. We can use this procedure to perform a hypothesis test about the distribution of a qualitative (categorical) variable or of a discrete quantitative variable that has only finitely many possible values. Next, we describe the logic behind the chi-square goodness-of-fit test by means of an example.

The FBI compiles data on crimes and crime rates and publishes the information in Crime in the United States. A violent crime is classified by the FBI as murder, forcible rape, robbery, or aggravated assault. Table 13.1 gives a relative-frequency distribution for (reported) violent crimes in 2010. For instance, in 2010, 29.5% of violent crimes were robberies.

A simple random sample of 500 violent-crime reports from last year yielded the frequency distribution shown in Table 13.2. Suppose that we want to use the data in Tables 13.1 and 13.2 to decide whether last year's distribution of violent crimes has changed from the 2010 distribution.


The idea behind the chi-square goodness-of-fit test is to compare the observed frequencies in the second column of Table 13.2 to the frequencies that would be expected – the expected frequencies – if last year’s violent-crime distribution is the same as the 2010 distribution. If the observed and expected frequencies match fairly well (i.e., each observed frequency is roughly equal to its corresponding expected frequency), we do not reject the null hypothesis; otherwise, we reject the null hypothesis.

To formulate a precise procedure for carrying out the hypothesis test, we need to answer two questions: 1) What frequencies should we expect from a random sample of 500 violent-crime reports from last year if last year’s violent-crime distribution is the same as the 2010 distribution? 2) How do we decide whether the observed and expected frequencies match fairly well?

The first question is easy to answer, which we illustrate with robberies. If last year's violent-crime distribution is the same as the 2010 distribution, then, according to Table 13.1, 29.5% of last year's violent crimes would have been robberies. Therefore, in a random sample of 500 violent-crime reports from last year, we would expect about 29.5% of the 500 to be robberies. In other words, we would expect the number of robberies to be 500 * 0.295, or 147.5.

In general, we compute each expected frequency, denoted E, by using the formula, E = np, where n is the sample size and p is the appropriate relative frequency from the second column of Table 13.1. Using this formula, we calculated the expected frequencies for all four types of violent crime. The results are displayed in the second column of Table 13.3.

The second column of Table 13.3 answers the first question: it gives the frequencies that we would expect if last year's violent-crime distribution is the same as the 2010 distribution. The second question – whether the observed and expected frequencies match fairly well – is harder to answer. We need to calculate a number that measures the goodness of fit.

In Table 13.4, the second column repeats the observed frequencies from the second column of Table 13.2. The third column of Table 13.4 reports the expected frequencies from the second column of Table 13.3. To measure the goodness of fit of the observed and expected frequencies, we look at the differences, O − E, shown in the fourth column of Table 13.4. Summing these differences to obtain a measure of goodness of fit isn't very useful because the sum is 0. Instead, we square each difference (shown in the fifth column) and then divide by the corresponding expected frequency. Doing so gives the values (O − E)² / E, called chi-square subtotals, shown in the sixth column. The sum of the chi-square subtotals, 𝛴(O − E)² / E = 6.529, is the statistic used to measure the goodness of fit of the observed and expected frequencies.

[Table 13.4: observed frequencies, expected frequencies, differences O − E, squared differences, and chi-square subtotals]

If the null hypothesis is true, the observed and expected frequencies should be roughly equal, resulting in a small value of the test statistic, 𝛴(O − E)² / E. As we have seen, that test statistic is 6.529. Can this value be reasonably attributed to sampling error, or is it large enough to suggest that the null hypothesis is false? To answer this question, we need to know the distribution of the test statistic 𝛴(O − E)² / E.

For a sufficiently large sample – the usual guideline is that all expected frequencies are at least 1 and at most 20% of them are less than 5 – the test statistic 𝛴(O − E)² / E has approximately a chi-square distribution with df = k − 1, where k is the number of possible values of the variable. The test is right-tailed: we reject H0 if the test statistic exceeds the chi-square critical value with k − 1 degrees of freedom.
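Putting the pieces together in code: scipy's chisquare carries out the whole test. Only the 29.5% robbery proportion is given in the text above, so the other 2010 relative frequencies and last year's observed counts used below are assumed purely for illustration.

    from scipy import stats

    n = 500
    # 2010 relative frequencies (robbery's 0.295 is from the text; the
    # murder, forcible rape, and aggravated assault values are assumed)
    p_2010 = [0.012, 0.068, 0.295, 0.625]
    expected = [n * p for p in p_2010]     # E = np, e.g., 500 * 0.295 = 147.5

    observed = [9, 29, 160, 302]           # hypothetical counts for last year
    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    print(chi2, p_value)                   # df = 4 - 1 = 3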