Type I and Type II Error in Statistics

We often use inferential statistics to make decisions or judgements about the value of a parameter, such as a population mean. For example, we might need to decide whether the mean weight, 𝜇, of all bags of pretzels packaged by a particular company differs from the advertised weight of 454 grams, or we might want to determine whether the mean age, 𝜇, of all cars in use has increased from the year 2000 mean of 9.0 years. One of the most commonly used methods for making such decisions or judgments is to perform a hypothesis test. A hypothesis is a statement that something is true. For example, the statement “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g” is a hypothesis.

Typically, a hypothesis test involves two hypotheses: the null hypothesis, which is the hypothesis to be tested, and the alternative hypothesis (or research hypothesis), which is the hypothesis to be considered as an alternative to the null hypothesis. For instance, in the pretzel packaging example, the null hypothesis might be “the mean weight of all bags of pretzels packaged equals the advertised weight of 454 g,” and the alternative hypothesis might be “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g.”

The first step in setting up a hypothesis test is to decide on the null hypothesis and the alternative hypothesis. Generally, the null hypothesis for a hypothesis test concerning a population mean, 𝜇, specifies a single value for that parameter. Hence, we can express the null hypothesis as

H0: 𝜇 = 𝜇0

The choice of the alternative hypothesis depends on and should reflect the purpose of the hypothesis test. Three choices are possible for the alternative hypothesis.

• If the primary concern is deciding whether a population mean, 𝜇, is different from a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 ≠ 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a two-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is less than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 < 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a left-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is greater than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 > 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a right-tailed test.

PS: A hypothesis test is called a one-tailed test if it is either left tailed or right tailed.

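After we have chosen the null and alternative hypotheses, we must decide whether to reject the null hypothesis in favor of the alternative hypothesis. The procedure for deciding is roughly as follows: take a random sample from the population; if the sample data are consistent with the null hypothesis, do not reject the null hypothesis; if the sample data are inconsistent with the null hypothesis and supportive of the alternative hypothesis, reject the null hypothesis in favor of the alternative hypothesis. In practice, of course, we must have a precise criterion for deciding whether to reject the null hypothesis. That criterion involves a test statistic, that is, a statistic calculated from the data that is used as a basis for deciding whether the null hypothesis should be rejected.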
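As a minimal sketch of such a criterion, the Python snippet below computes a z test statistic for a one-mean test and the corresponding P-value for each of the three forms of alternative hypothesis. It assumes the population standard deviation is known (a one-mean z-test). The function name, the sample values, and the 0.05 threshold are illustrative assumptions, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def one_mean_z_test(sample_mean, mu0, sigma, n, alternative="two-tailed", alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: mu = mu0.

    Assumes a known population standard deviation `sigma` (z-test).
    """
    # Test statistic: standardized distance of the sample mean from mu0.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    phi = NormalDist().cdf  # standard normal cumulative distribution function

    if alternative == "two-tailed":      # Ha: mu != mu0
        p_value = 2 * (1 - phi(abs(z)))
    elif alternative == "left-tailed":   # Ha: mu < mu0
        p_value = phi(z)
    elif alternative == "right-tailed":  # Ha: mu > mu0
        p_value = 1 - phi(z)
    else:
        raise ValueError("alternative must be two-tailed, left-tailed, or right-tailed")

    # Reject H0 when the P-value does not exceed the significance level.
    return z, p_value, p_value <= alpha


# Hypothetical pretzel data: 25 bags, sample mean 450 g, assumed sigma = 8 g.
z, p, reject = one_mean_z_test(450, 454, 8, 25, alternative="two-tailed")
print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject}")
```

The same data can lead to different decisions under different alternatives, which is why the alternative hypothesis should be chosen from the purpose of the test before looking at the data.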

Type I and Type II Errors

A type I error occurs if a true null hypothesis is rejected; a type II error occurs if a false null hypothesis is not rejected. The probabilities of both types of error are essential for evaluating the effectiveness of a hypothesis test, which involves analyzing the chances of making an incorrect decision. The probability of a type I error, commonly called the significance level of the hypothesis test, is denoted 𝛼. The probability of a type II error is denoted 𝛽.

Ideally, both type I and type II errors should have small probabilities. Then the chance of making an incorrect decision would be small, regardless of whether the null hypothesis is true or false. We can design a hypothesis test to have any specified significance level. So, for instance, if not rejecting a true null hypothesis is important, we should specify a small value for 𝛼. However, in making our choice for 𝛼, we must keep the following trade-off (Key Fact 9.1) in mind: for a fixed sample size, the smaller the significance level, 𝛼, the larger the probability, 𝛽, of not rejecting a false null hypothesis. Consequently, we must always assess the risks involved in committing both types of errors and use that assessment as a method for balancing the type I and type II error probabilities.

The significance level, 𝛼, is the probability of making a type I error, that is, of rejecting a true null hypothesis. Therefore, if the hypothesis test is conducted at a small significance level (e.g., 𝛼 = 0.05), the chance of rejecting a true null hypothesis will be small. Thus, if we do reject the null hypothesis, we can be reasonably confident that the null hypothesis is false. In other words, if we do reject the null hypothesis, we conclude that the data provide sufficient evidence to support the alternative hypothesis.
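One way to see that the significance level is the type I error probability is to simulate many samples from a population in which the null hypothesis is true and count how often the test rejects it. The sketch below does this for a two-tailed one-mean z-test. The population values, the number of trials, and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(mu0, sigma, n, alpha=0.05, trials=5000, seed=1):
    """Estimate how often a two-tailed z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_(alpha/2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
        if abs(z) >= critical:
            rejections += 1  # a type I error: a true H0 was rejected
    return rejections / trials


rate = type_i_error_rate(mu0=454, sigma=8, n=25)
print(f"Estimated type I error rate: {rate:.3f}")
```

With enough trials the estimated rate should settle close to the chosen 𝛼, up to simulation noise.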

However, we usually do not know the probability, 𝛽, of making a type II error, that is, of not rejecting a false null hypothesis. Consequently, if we do not reject the null hypothesis, we simply reserve judgement about which hypothesis is true. In other words, if we do not reject the null hypothesis, we conclude only that the data do not provide sufficient evidence to support the alternative hypothesis; we do not conclude that the data provide sufficient evidence to support the null hypothesis. In short, a real difference might exist, but the power of the statistical procedure may not be high enough to detect it.
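Although 𝛽 is usually unknown in practice, it can be computed for any assumed true value of the mean. The sketch below does so for a two-tailed one-mean z-test with known population standard deviation, and shows that, for a fixed sample size, lowering 𝛼 raises 𝛽. The function name and all numeric values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_probability(mu_true, mu0, sigma, n, alpha=0.05):
    """Beta for a two-tailed one-mean z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # When the true mean is mu_true, the test statistic is normal with this mean
    # and standard deviation 1.
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # H0 is not rejected when the test statistic falls between -critical and critical.
    return std_normal.cdf(critical - shift) - std_normal.cdf(-critical - shift)


for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_probability(mu_true=451, mu0=454, sigma=8, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

The quantity 1 − 𝛽 is the power of the test: the probability of rejecting a false null hypothesis for that particular true mean.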

Systematic Review – Defining the Question

Eligibility Criteria

The acronym PICO (participants, interventions, comparisons, outcomes) serves as a reminder of the essential components of a review question. One of the features that distinguish a systematic review from a narrative review is the pre-specification of criteria for including and excluding studies in the review (eligibility criteria). Eligibility criteria are a combination of aspects of the clinical question plus specification of the types of studies that have addressed these questions. The participants, interventions and comparisons in the clinical question usually translate directly into eligibility criteria for the review. Outcomes usually are not part of the criteria for including studies: a Cochrane review would typically seek all rigorous studies of a particular comparison of interventions in a particular population of participants, irrespective of the outcomes measured or reported. However, some reviews do legitimately restrict eligibility to specific outcomes.

Population

The criteria for considering types of people included in studies in a review should be sufficiently broad to encompass the likely diversity of studies, but sufficiently narrow to ensure that a meaningful answer can be obtained when studies are considered in aggregate. It is often helpful to consider the types of people that are of interest in two steps. First, the diseases or conditions of interest should be defined using explicit criteria for establishing their presence or not. Criteria that will force unnecessary exclusion of studies should be avoided. For example, diagnostic criteria that were developed more recently – which may be viewed as the current gold standard for diagnosing the condition of interest – will not have been used in earlier studies. Expensive or recent diagnostic tests may not be available in many countries or settings.

Second, the broad population and setting of interest should be defined. This involves deciding whether a special population group is of interest, determined by factors such as age, sex, race, educational status or the presence of a particular condition such as angina or shortness of breath. Interest may focus on particular settings such as a community, hospital, nursing home, chronic care institution, or outpatient setting.

The types of participants of interest usually determine directly the participant-related eligibility criteria for including studies. However, pre-specification of rules for dealing with studies that only partially address the population of interest can be challenging.

Any restrictions with respect to specific population characteristics or settings should be based on a sound rationale. Focusing a review on a particular subgroup of people on the basis of their age, sex or ethnicity simply because of personal interests when there is no underlying biologic or sociological justification for doing so should be avoided.

Interventions

The second key component of a well-formulated question is to specify the interventions of interest and the interventions against which these will be compared (comparisons). In particular, are the interventions to be compared with an inactive control intervention, or with an active control intervention? When specifying drug interventions, factors such as the drug preparation, route of administration, dose, duration, and frequency should be considered. For more complex interventions (such as educational or behavioral interventions), the common or core features of the interventions will need to be defined. In general, it is useful to consider exactly what is delivered, at what intensity, how often it is delivered, who delivers it, and whether people involved in delivery of the intervention need to be trained. Review authors should also consider whether variation in the intervention (i.e., based on dosage/intensity, mode of delivery, frequency, duration, etc.) is so great that it would have substantially different effects on the participants and outcomes of interest, and hence may be important to restrict.

Outcomes

Although reporting of outcomes should rarely determine eligibility of studies for a review, the third key component of a well-formulated question is the delineation of particular outcomes that are of interest. In general, Cochrane reviews should include all outcomes that are likely to be meaningful to clinicians, patients, the general public, administrators and policy makers, but should not include outcomes reported in included studies if they are trivial or meaningless to decision makers. Outcomes considered to be meaningful and therefore addressed in a review will not necessarily have been reported in individual studies. For example, quality of life is an important outcome, perhaps the most important outcome, for people considering whether or not to use chemotherapy for advanced cancer, even if the available studies are found to report only survival. Including all important outcomes in a review will highlight gaps in the primary research and encourage researchers to address these gaps in future studies.

Outcomes may include survival (mortality), clinical events (e.g., strokes or myocardial infarction), patient-reported outcomes (e.g., symptoms, quality of life), adverse events, burdens (e.g., demands on caregivers, frequency of tests, restrictions on lifestyle) and economic outcomes (e.g., cost and resource use). It is critical that outcomes used to assess adverse effects as well as outcomes used to assess beneficial effects are among those addressed by a review. If combinations of outcomes will be considered, these need to be specified. For example, if a study fails to make a distinction between non-fatal and fatal strokes, will these data be included in a meta-analysis if the question specifically relates to stroke death?

Review authors should consider how outcomes may be measured, both in terms of the type of scale likely to be used and the timing of measurement. Outcomes may be measured objectively (e.g., blood pressure, number of strokes) or subjectively as rated by a clinician, patient, or carer (e.g., disability scales). It may be important to specify whether measurement scales have been published or validated. When defining the timing of outcome measurement, authors may consider whether all time frames or only selected time-points will be included in the review. One strategy is to group time-points into pre-specified intervals to represent “short-term”, “medium-term” and “long-term” outcomes and to take no more than one of each from each study for any particular outcome. It is important to give the timing of outcome measurement considerable thought as it can influence the results of the review.

While all important outcomes should be included in Cochrane reviews, trivial outcomes should not be included. Authors need to avoid overwhelming and potentially misleading readers with data that are of little or no importance. In addition, indirect or surrogate outcome measures, such as laboratory results or radiologic results, are potentially misleading and should be avoided or interpreted with caution because they may not predict clinically important outcomes accurately. Surrogate outcomes may provide information on how a treatment might work but not whether it actually does work. Many interventions reduce the risk for a surrogate outcome but have no effect or have harmful effects on clinically relevant outcomes, and some interventions have no effect on surrogate measures but improve clinical outcomes.

Main Outcomes

Once a full list of relevant outcomes has been compiled for the review, authors should prioritize the outcomes and select the main outcomes of relevance to the review question. The main outcomes are the essential outcomes for decision-making, and are those that would form the basis of a “Summary of findings” table. “Summary of findings” tables provide key information about the amount of evidence for important comparisons and outcomes, the quality of the evidence and the magnitude of effect. There should be no more than seven main outcomes, which should generally not include surrogate or interim outcomes. They should not be chosen on the basis of any anticipated or observed magnitude of effect, or because they are likely to have been addressed in the studies to be reviewed.

Primary Outcomes

Primary outcomes for the review should be identified from among the main outcomes. Primary outcomes are the outcomes that would be expected to be analyzed should the review identify relevant studies, and conclusions about the effects of the interventions under review will be based largely on these outcomes. There should in general be no more than three primary outcomes and they should include at least one desirable and at least one undesirable outcome (to assess beneficial and adverse effects respectively).

Secondary Outcomes

Main outcomes not selected as primary outcomes would be expected to be listed as secondary outcomes. In addition, secondary outcomes may include a limited number of additional outcomes the review intends to address. These may be specific to only some comparisons in the review. For example, laboratory tests and other surrogate measures may not be considered as main outcomes as they are less important than clinical endpoints in informing decisions, but they may be helpful in explaining effect or determining intervention integrity.

Types of Study

Certain study designs are more appropriate than others for answering particular questions. Authors should consider a priori what study designs are likely to provide reliable data with which to address the objectives of their review.

Because Cochrane reviews address questions about the effects of health care, they focus primarily on randomized trials. Randomization is the only way to prevent systematic differences between baseline characteristics of participants in different intervention groups in terms of both known and unknown (or unmeasured) confounders. For clinical interventions, deciding who receives an intervention and who does not is influenced by many factors, including prognostic factors. Empirical evidence suggests that, on average, non-randomized studies produce effect estimates that indicate more extreme benefits of the effects of health care than randomized trials. However, the extent, and even the direction, of the bias is difficult to predict.

Specific aspects of study design and conduct should also be considered when defining eligibility criteria, even if the review is restricted to randomized trials. For example, decisions over whether cluster-randomized trials and cross-over trials are eligible should be made, as should thresholds for eligibility based on aspects such as use of a placebo comparison group, evaluation of outcomes blinded to allocation, or a minimum period of follow-up. There will always be a trade-off between restrictive study design criteria (which might result in the inclusion of studies with low risk of bias, but which are very small in number) and more liberal design criteria (which might result in the inclusion of more studies, but which are at a higher risk of bias). Furthermore, excessively broad criteria might result in the inclusion of misleading evidence. If, for example, interest focuses on whether a therapy improves survival in patients with a chronic condition, it might be inappropriate to look at studies of very short duration, except to make explicit the point that they cannot address the question of interest.

Scope of Review Question

The questions addressed by a review may be broad or narrow in scope. For example, a review might address a broad question regarding whether antiplatelet agents in general are effective in preventing all thrombotic events in humans. Alternatively, a review might address whether a particular antiplatelet agent, such as aspirin, is effective in decreasing the risk of a particular thrombotic event, stroke, in elderly persons with a previous history of stroke.

Determining the scope of a review question is a decision dependent upon multiple factors including perspectives regarding a question’s relevance and potential impact; supporting theoretical, biologic and epidemiological information; the potential generalizability and validity of answers to the questions; and available resources.

The Logic Behind Meta-analysis – Random-effects Model

The fixed-effect model starts with the assumption that the true effect size is the same in all studies. However, in many systematic reviews this assumption is implausible. When we decide to incorporate a group of studies in a meta-analysis, we assume that the studies have enough in common that it makes sense to synthesize the information, but there is generally no reason to assume that they are identical in the sense that the true effect size is exactly the same in all the studies. For example, suppose that we are working with studies that compare the proportion of patients developing a disease in two groups (vaccinated versus placebo). If the treatment works we would expect the effect size (say, the risk ratio) to be similar but not identical across studies. The effect size might be higher (or lower) when the participants are older, or more educated, or healthier than others, or when a more intensive variant of an intervention is used, and so on. Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be different effect sizes underlying different studies.

Or suppose that we are working with studies that assess the impact of an educational intervention. The magnitude of the impact might vary depending on the other resources available to the children, the class size, the age, and other factors, which are likely to vary from study to study. We might not have assessed these covariates in each study. Indeed, we might not even know what covariates actually are related to the size of the effect. Nevertheless, logic dictates that such factors do exist and will lead to variations in the magnitude of the effect.

One way to address this variation across studies is to perform a random-effects meta-analysis. In a random-effects meta-analysis we usually assume that the true effects are normally distributed. For example, in Figure 12.1 the mean of all true effect sizes is 0.60 but the individual effect sizes are distributed about this mean, as indicated by the normal curve. The width of the curve suggests that most of the true effects fall in the range of 0.50 to 0.70.

Suppose that our meta-analysis includes three studies drawn from the distribution of studies depicted by the normal curve, and that the true effects in these studies happen to be 0.50, 0.55, and 0.65. If each study had an infinite sample size the sampling error would be zero and the observed effect for each study would be the same as the true effect for that study. If we were to plot the observed effects rather than the true effects, the observed effects would exactly coincide with the true effects.

Of course, the sample size in any study is not infinite and therefore the sampling error is not zero. If the true effect size for a study is 𝜗i, then the observed effect for that study will be less than or greater than 𝜗i, because of sampling error. This figure also highlights the fact that the distance between the overall mean and the observed effect in any given study consists of two distinct parts: true variation in effect sizes (𝜁i) and sampling error (𝜀i). More generally, the observed effect Yi for any study is given by the grand mean, the deviation of the study’s true effect from the grand mean, and the deviation of the study’s observed effect from the study’s true effect. That is,

Yi = 𝜇 + 𝜁i + 𝜀i

Therefore, to predict how far the observed effect Yi is likely to fall from 𝜇 in any given study we need to consider both the variance of 𝜁i and the variance of 𝜀i. The distance from 𝜇 to each 𝜗i depends on the standard deviation of the distribution of the true effects across studies, called 𝜏 (or 𝜏² for its variance). The same value of 𝜏² applies to all studies in the meta-analysis, and in Figure 12.4 is represented by the normal curve at the bottom, which extends roughly from 0.50 to 0.70. The distance from 𝜗i to Yi depends on the sampling distribution of the sample effects about 𝜗i. This depends on the variance of the observed effect size from each study, VYi, and so will vary from one study to the next. In Figure 12.4 the curve for Study 1 is relatively wide while the curve for Study 2 is relatively narrow.
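The two-part decomposition can be illustrated with a small simulation in which each observed effect is built as the grand mean plus a true-effect deviation plus a sampling error. This is only a sketch under simplifying assumptions: both components are drawn from normal distributions, every study is given the same within-study standard deviation (in a real meta-analysis it varies from study to study), and the numeric values are hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate_observed_effects(mu, tau, within_sd, n_studies, seed=0):
    """Simulate observed effects as grand mean + true-effect deviation + sampling error."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_studies):
        zeta = rng.gauss(0.0, tau)           # deviation of the true effect from mu
        epsilon = rng.gauss(0.0, within_sd)  # sampling error within the study
        observed.append(mu + zeta + epsilon)
    return observed


effects = simulate_observed_effects(mu=0.60, tau=0.05, within_sd=0.10, n_studies=20000)
print(f"mean of observed effects:      {mean(effects):.3f}")
print(f"variance of observed effects:  {pvariance(effects):.4f}")
print(f"tau^2 + within-study variance: {0.05 ** 2 + 0.10 ** 2:.4f}")
```

Across many simulated studies, the variance of the observed effects about the grand mean should be close to the sum of the between-studies variance and the within-study variance.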

Performing A Random-Effects Meta-Analysis

In an actual meta-analysis, of course, rather than start with the population effect and make projections about the observed effects, we start with the observed effects and try to estimate the population effect. In other words, our goal is to use the collection of Yi to estimate the overall mean, 𝜇. In order to obtain the most precise estimate of the overall mean (to minimize the variance) we compute a weighted mean, where the weight assigned to each study is the inverse of that study’s variance. To compute a study’s variance under the random-effects model, we need to know both the within-study variance and 𝜏², since the study’s total variance is the sum of these two values.

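The parameter 𝜏² (tau-squared) is the between-studies variance (the variance of the effect size parameters across the population of studies). In other words, if we somehow knew the true effect size for each study, and computed the variance of these effect sizes (across an infinite number of studies), this variance would be 𝜏². One method for estimating 𝜏² is the method of moments (or DerSimonian and Laird method), which gives the estimate T² as follows.

T² = (Q − df) / C

where

Q = Σ(Wi × Yi²) − (Σ(Wi × Yi))² / Σ Wi,

df = k − 1,

where k is the number of studies, and

C = Σ Wi − (Σ Wi²) / Σ Wi.

Here Wi = 1 / VYi is the fixed-effect weight for study i, all sums run over the k studies, and a negative value of T² is set to zero.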

In the fixed-effect analysis each study was weighted by the inverse of its variance. In the random-effects analysis, each study is again weighted by the inverse of its variance. The difference is that the variance now includes the original (within-studies) variance plus the estimate of the between-studies variance, T². To highlight the parallel between the formulas here (random effects) and those in the previous threads (fixed effect) we use the same notation but add an asterisk (*) to represent the random-effects version. Under the random-effects model the weight assigned to each study is

Wi* = 1 / VYi*

where VYi* is the within-study variance for study i plus the between-studies variance, T². That is,

VYi* = VYi + T²

The weighted mean, M*, is then computed as

M* = Σ(Wi* × Yi) / Σ Wi*

that is, the sum of the products (effect size multiplied by weight) divided by the sum of the weights.

The variance of the summary effect is estimated as the reciprocal of the sum of the weights, or

VM* = 1 / Σ Wi*

and the estimated standard error of the summary effect is then the square root of the variance,

SEM* = √(VM*)
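The whole random-effects computation can be sketched in a few lines of Python. The function below follows the method-of-moments (DerSimonian and Laird) estimator as it is commonly stated, with a negative estimate of the between-studies variance truncated to zero. The function name and the example data are hypothetical.

```python
from math import sqrt


def random_effects_meta_analysis(effects, variances):
    """Random-effects summary using the DerSimonian and Laird estimate of tau^2.

    effects   -- observed effect sizes Y_i
    variances -- within-study variances V_Yi
    Returns (T2, M_star, V_M_star, SE_M_star).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies, with one variance per effect")

    # Fixed-effect weights W_i = 1 / V_Yi are used to estimate tau^2.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    sum_wy = sum(wi * yi for wi, yi in zip(w, effects))
    sum_wy2 = sum(wi * yi ** 2 for wi, yi in zip(w, effects))
    sum_w2 = sum(wi ** 2 for wi in w)

    q = sum_wy2 - sum_wy ** 2 / sum_w   # weighted sum of squared deviations
    df = k - 1
    c = sum_w - sum_w2 / sum_w
    t2 = max(0.0, (q - df) / c)         # estimate of tau^2, truncated at zero

    # Random-effects weights use the total variance: within-study plus T^2.
    w_star = [1.0 / (v + t2) for v in variances]
    m_star = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    v_m_star = 1.0 / sum(w_star)
    return t2, m_star, v_m_star, sqrt(v_m_star)


# Hypothetical data: four observed effects with their within-study variances.
t2, m_star, v_m_star, se_m_star = random_effects_meta_analysis(
    [0.10, 0.30, 0.35, 0.65], [0.03, 0.03, 0.05, 0.01]
)
print(f"T^2 = {t2:.4f}, M* = {m_star:.4f}, SE = {se_m_star:.4f}")
```

When the estimated between-studies variance is zero, the random-effects weights reduce to the fixed-effect weights and the two analyses give the same summary effect.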

Summary

• Under the random-effects model, the true effects in the studies are assumed to have been sampled from a distribution of true effects.
• The summary effect is our estimate of the mean of all relevant true effects, and the null hypothesis is that the mean of these effects is 0.0 (equivalent to a ratio of 1.0 for ratio measures).
• Since our goal is to estimate the mean of the distribution, we need to take account of two sources of variance. First, there is within-study error in estimating the effect in each study. Second (even if we knew the true mean for each of our studies), there is variation in the true effects across studies. Study weights are assigned with the goal of minimizing both sources of variance.

The Logic Behind Meta-analysis – Fixed-effect Model

Effect Size (Based on Means)

When the studies report means and standard deviations (from which the standard error of each mean can be derived), the preferred effect size is usually the raw mean difference, the standardized mean difference, or the response ratio. When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw difference in means.

Consider a study that reports means for two groups (Treated and Control), and suppose we wish to compare the means of these two groups. The population mean difference (effect size) is defined as

Population mean difference = 𝜇1 – 𝜇2

Estimated standard error of the mean difference (pooled) = Spooled × √(1/n1 + 1/n2)

Overview

Most meta-analyses are based on one of two statistical models, the fixed-effect model or the random-effects model. Under the fixed-effect model we assume that there is one true effect size (hence the term fixed effect) which underlies all the studies in the analysis, and that all differences in observed effects are due to sampling error. While we follow the practice of calling this a fixed-effect model, a more descriptive term would be a common-effect model.

By contrast, under the random-effects model we allow that the true effect could vary from study to study. For example, the effect size might be higher (or lower) in studies where the participants are older, or more educated, or healthier than in others, or when a more intensive variant of an intervention is used, and so on. Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be different effect sizes underlying different studies.

Under the fixed-effect model, since all studies share the same true effect, it follows that the observed effect size varies from one study to the next only because of the random error inherent in each study. If each study had an infinite sample size the sampling error would be zero and the observed effect for each study would be the same as the true effect. If we were to plot the observed effects rather than the true effects, the observed effects would exactly coincide with the true effects.

In practice, of course, the sample size in each study is not infinite, and so there is sampling error and the effect observed in the study is not the same as the true effect. In Figure 11.2 the true effect for each study is still 0.60 but the observed effect differs from one study to the next.

While the error in any given study is random, we can estimate the sampling distribution of the errors. In Figure 11.3 we have placed a normal curve about the true effect size for each study, with the width of the curve being based on the variance in that study. In Study 1 the sample size is small, the variance is large, and the observed effect is likely to fall anywhere in the relatively wide range of 0.20 to 1.00. By contrast, in Study 2 the sample size is relatively large, the variance is small, and the observed effect is likely to fall in the relatively narrow range of 0.40 to 0.80. Note that the width of the normal curve is based on the square root of the variance, or standard error.

Meta-analysis Procedure

In an actual meta-analysis, of course, rather than starting with the population effect and making projections about the observed effects, we work backwards, starting with the observed effects and trying to estimate the population effect. In order to obtain the most precise estimate of the population effect (to minimize the variance) we compute a weighted mean, where the weight assigned to each study is the inverse of that study’s variance. Concretely, the weight assigned to each study in a fixed-effect meta-analysis is

Wi = 1 / VYi

where VYi is the within-study variance for study i. The weighted mean (M) is then computed as

M = Σ(Wi × Yi) / Σ Wi

that is, the sum of the products WiYi (effect size multiplied by weight) divided by the sum of the weights.

The variance of the summary effect is estimated as the reciprocal of the sum of the weights, or

VM = 1 / Σ Wi

Once VM is estimated, the standard error of the weighted mean is computed as the square root of the variance of the summary effect, SEM = √(VM). We now know the sampling distribution, the point estimate, and the standard error of the weighted mean. Thus, the confidence interval of the summary effect can be computed by the Z-procedure; for a 95% confidence interval this is M ± 1.96 × SEM.
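The fixed-effect procedure described above can be sketched as follows: inverse-variance weights, the weighted mean, its variance and standard error, and a Z-based confidence interval. The function name and the example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta_analysis(effects, variances, confidence=0.95):
    """Inverse-variance weighted summary effect under the fixed-effect model.

    Returns (M, V_M, SE_M, (lower, upper)).
    """
    w = [1.0 / v for v in variances]  # W_i = 1 / V_Yi
    m = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    v_m = 1.0 / sum(w)                # variance of the summary effect
    se_m = sqrt(v_m)                  # standard error of the summary effect
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m, v_m, se_m, (m - z * se_m, m + z * se_m)


# Hypothetical data: four observed effects with their within-study variances.
m, v_m, se_m, (lower, upper) = fixed_effect_meta_analysis(
    [0.10, 0.30, 0.35, 0.65], [0.03, 0.03, 0.05, 0.01]
)
print(f"M = {m:.4f}, SE = {se_m:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Because the weights are inverse variances, the most precise study (here the one with variance 0.01) pulls the summary effect most strongly toward its own estimate.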

Effect Sizes Measurements

Raw Mean Difference

When the studies report means and standard deviations (continuous variables), the preferred effect size is usually the raw mean difference, the standardized mean difference (SMD), or the response ratio. When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw difference in means, or the raw mean difference. The primary advantage of the raw mean difference is that it is intuitively meaningful, either inherently or because of widespread use. Examples of outcomes suited to the raw mean difference include systolic blood pressure (mm Hg), serum LDL-C level (mg/dL), body surface area (m²), and so on.

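We can estimate the mean difference D from a study that used two independent groups by using the inference procedure for two population means (independent samples). Let’s briefly recall that procedure. For independent samples of sizes n1 and n2 from populations with means 𝜇1 and 𝜇2 and standard deviations 𝜎1 and 𝜎2, the sampling distribution of the difference between the two sample means, X̄1 − X̄2, has these characteristics:

• Its mean is 𝜇1 − 𝜇2.
• Its standard deviation is √(𝜎1²/n1 + 𝜎2²/n2).
• It is normally distributed if the variable is normally distributed in both populations, and approximately normally distributed for large samples.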

PS: All of this is based on the central limit theorem: if the sample size is large, the sample mean is approximately normally distributed, regardless of the distribution of the variable under consideration.

The sample mean difference, D = X̄1 − X̄2, estimates the population mean difference, and its variance, VD, is the square of its standard error. Given each group's mean, standard deviation (S1, S2), and size (n1, n2), VD can be computed in one of two ways. If the two population standard deviations are assumed equal, we use the pooled sample standard deviation, Sp, and VD = Sp² × (1/n1 + 1/n2); otherwise we use the nonpooled method, VD = S1²/n1 + S2²/n2. The variance of D is then used by the meta-analysis procedures (fixed-effect or random-effects model) to compute the weight (Wi = 1 / VYi). And once the standard error is known, the confidence interval of the synthesized effect can be computed.
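As a sketch of this step, the function below computes the raw mean difference for one study together with its variance and standard error, using either the pooled or the nonpooled method. The formulas are the standard two-independent-samples ones; the function name and the summary statistics in the example are hypothetical.

```python
from math import sqrt


def raw_mean_difference(mean1, mean2, s1, s2, n1, n2, pooled=True):
    """Return (D, V_D, SE_D) for two independent groups."""
    d = mean1 - mean2
    if pooled:
        # Assumes the two population standard deviations are equal.
        sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
        v_d = sp2 * (1 / n1 + 1 / n2)
    else:
        # Nonpooled method: no equal-variance assumption.
        v_d = s1 ** 2 / n1 + s2 ** 2 / n2
    return d, v_d, sqrt(v_d)


# Hypothetical study: treated mean 103 (SD 5.5, n = 50), control mean 100 (SD 4.5, n = 50).
d, v_d, se_d = raw_mean_difference(103, 100, 5.5, 4.5, 50, 50)
print(f"D = {d:.2f}, V_D = {v_d:.4f}, SE_D = {se_d:.4f}")
print(f"inverse-variance weight W = {1 / v_d:.4f}")
```

With equal group sizes the pooled and nonpooled variances coincide; they differ when the group sizes and standard deviations are both unequal.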

Standardized Mean Difference, d and g

As noted, the raw mean difference is a useful index when the measure is meaningful, either inherently or because of widespread use. By contrast, when the measure is less well known, the use of a raw mean difference has less to recommend it. In any event, the raw mean difference is an option only if all the studies in the meta-analysis use the same scale. If different studies use different instruments to assess the outcome, then the scale of measurement will differ from study to study and it would not be meaningful to combine raw mean differences.

In such cases we can divide the mean difference in each study by that study’s standard deviation to create an index (the standardized mean difference, SMD) that would be comparable across studies. This is the same approach suggested by Cohen in connection with describing the magnitude of effects in statistical power analysis. The standardized mean difference can be considered as being comparable across studies based on either of two arguments (Hedges and Olkin, 1985). If the outcome measures in all studies are linear transformations of each other, the standardized mean difference can be seen as the mean difference that would have been obtained if all data were transformed to a scale where the standard deviation within-groups was equal to 1.0.

The other argument for comparability of standardized mean differences is the fact that the standardized mean difference is a measure of overlap between distributions. In this telling, the standardized mean difference reflects the difference between the distributions in the two groups (and how each represents a distinct cluster of scores) even if they do not measure exactly the same outcome.

Computing d and g from studies that use independent groups

We can estimate the standardized mean difference from studies that used two independent groups as

d = (x̄1 − x̄2) / Swithin, with Swithin = √[((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)],

where x̄1 and x̄2 are the sample means in the two groups, Swithin is the pooled standard deviation across groups, n1 and n2 are the sample sizes in the two groups, and S1 and S2 are the standard deviations in the two groups. The reason that we pool the two sample estimates of the standard deviation is that even if we assume that the underlying population standard deviations are the same, it is unlikely that the sample estimates S1 and S2 will be identical. By pooling the two estimates of the standard deviation, we obtain a more accurate estimate of their common value.

The sample estimate of the standardized mean difference is often called Cohen’s d in research synthesis. Some confusion about the terminology has resulted from the fact that the index 𝛿, originally proposed by Cohen as a population parameter for describing the size of effects for statistical power analysis, is also sometimes called d. The variance of d is given by Vd = (n1 + n2) / (n1n2) + d² / (2(n1 + n2)).

Again, with the standardized mean difference and its variance known, we can compute the confidence interval of the standardized mean difference. However, it turns out that d has a slight bias, tending to overestimate the absolute value of 𝛿 in small samples. This bias can be removed by a simple correction that yields an unbiased estimate of 𝛿, with the unbiased estimate sometimes called Hedges’ g (Hedges, 1981). To convert from d to Hedges’ g we use a correction factor, which is called J. Hedges (1981) gives the exact formula for J, but in common practice researchers use an approximation, J = 1 − 3 / (4df − 1), where df = n1 + n2 − 2 is the degrees of freedom used to estimate Swithin. Then g = J × d, and the variance of g is Vg = J² × Vd.
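As a rough sketch of how d and g might be computed for independent groups, the Python below uses the usual textbook expressions: the pooled standard deviation, the large-sample variance of d, and the approximate correction factor J = 1 − 3 / (4df − 1). The function names and the group summaries are hypothetical.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Return (d, variance of d) for two independent groups."""
    s_within = math.sqrt(
        ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / s_within
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var_d

def hedges_g(d, var_d, n1, n2):
    """Apply the approximate small-sample correction J to d and its variance."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)
    return j * d, j ** 2 * var_d

# Hypothetical group summaries (mean, standard deviation, size).
d, var_d = cohens_d(103.0, 5.5, 50, 100.0, 4.5, 50)
g, var_g = hedges_g(d, var_d, 50, 50)
print(round(d, 4), round(var_d, 4), round(g, 4), round(var_g, 4))
```

Because J is always slightly below 1, g is always a little closer to zero than d, and the correction matters most when n1 + n2 is small.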

Summary

• Under the fixed-effect model all studies in the analysis share a common true effect.
• The summary effect is our estimate of this common effect size, and the null hypothesis is that this common effect is zero (for a difference) or one (for a ratio).
• All observed dispersion reflects sampling error, and study weights are assigned with the goal of minimizing this within-study error.
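The summary points above can be made concrete with a small sketch of inverse-variance weighting under the fixed-effect model. The effect sizes and variances are invented for illustration, and the 1.96 multiplier assumes a normal approximation for the 95% interval.

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted summary effect, its variance, and a 95% interval."""
    weights = [1 / v for v in variances]          # Wi = 1 / VYi
    total_weight = sum(weights)
    summary = sum(w * y for w, y in zip(weights, effects)) / total_weight
    var_summary = 1 / total_weight
    se = math.sqrt(var_summary)
    return summary, var_summary, (summary - 1.96 * se, summary + 1.96 * se)

# Three hypothetical studies: effect sizes and their within-study variances.
effects = [0.1, 0.3, 0.5]
variances = [0.04, 0.01, 0.04]
summary, var_summary, ci = fixed_effect_summary(effects, variances)
z = summary / math.sqrt(var_summary)  # test of H0: common effect = 0
print(round(summary, 4), round(var_summary, 5), ci, round(z, 3))
```

The second study has the smallest variance, so it receives the largest weight and dominates the summary effect.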

Converting Among Effect Sizes

Even when a widely used outcome measure exists, it is not uncommon for the individual studies under investigation to report different outcome measures. When we convert between different measures we make certain assumptions about the nature of the underlying traits or effects. Even if these assumptions do not hold exactly, the decision to use these conversions is often better than the alternative, which is to simply omit the studies that happened to use an alternate metric. Omitting them would involve loss of information, and possibly the systematic loss of information, resulting in a biased sample of studies. A sensitivity analysis that compares the meta-analysis results with and without the converted studies is therefore important. Figure 7.1 outlines the mechanism for incorporating multiple kinds of data in the same meta-analysis. First, each study is used to compute an effect size and variance in its native index: the log odds ratio for binary data, d for continuous data, and r for correlational data. Then, we convert all of these indices to a common index, which would be either the log odds ratio, d, or r. If the final index is d, we can move from there to Hedges’ g. This common index and its variance are then used in the analysis.

We can convert from a log odds ratio to the standardized mean difference d using d = LogOddsRatio × √3 / 𝜋,

where 𝜋 is the mathematical constant (approximately 3.14159). The variance of d would then be Vd = VlogOddsRatio × 3 / 𝜋²,

where VlogOddsRatio is the variance of the log odds ratio. This method was originally proposed by Hasselblad and Hedges (1995) but variations have been proposed. It assumes that an underlying continuous trait exists and has a logistic distribution (which is similar to a normal distribution) in each group. In practice, it will be difficult to test this assumption.
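A minimal sketch of this conversion, assuming the logistic-distribution argument described above holds; the function name and the input values are illustrative only.

```python
import math

def log_odds_ratio_to_d(log_or, var_log_or):
    """Convert a log odds ratio and its variance to d and the variance of d."""
    d = log_or * math.sqrt(3) / math.pi
    var_d = var_log_or * 3 / math.pi ** 2
    return d, var_d

# Hypothetical study reporting a log odds ratio and its variance.
d, var_d = log_odds_ratio_to_d(0.9069, 0.0676)
print(round(d, 4), round(var_d, 4))
```

The factor √3 / 𝜋 is roughly 0.55, so a log odds ratio corresponds to a d of a little more than half its size.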

Linear Regression

The Regression Equation

When analyzing data, it is essential to first construct a graph of the data. A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called bivariate quantitative data.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. In general, an error, e, is the signed vertical distance from the line to a data point. To decide which line fits the data better, we compute the sum of the squared errors for each line. The least-squares criterion is that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note that although we have not used Syy in Formula 14.1, we will use it later.
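For readers who want to see the notation in action, here is a small sketch using the standard definitions Sxx = 𝛴(xi − x̄)², Sxy = 𝛴(xi − x̄)(yi − ȳ), and Syy = 𝛴(yi − ȳ)², with slope b1 = Sxy / Sxx and intercept b0 = ȳ − b1x̄. These are the usual least-squares formulas and are assumed to match Formula 14.1; the data are made up.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1, Sxx, Sxy, Syy) for the least-squares line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # y-intercept
    return b0, b1, sxx, sxy, syy

# Made-up bivariate data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, sxx, sxy, syy = least_squares_line(xs, ys)
print(round(b0, 4), round(b1, 4), sxx, sxy, syy)
```

The function requires at least two distinct x-values, since Sxx would otherwise be zero and the slope undefined.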

For a linear regression y = b0 + b1x, y is the dependent variable and x is the independent variable. However, in the context of regression analysis, we usually call y the response variable and x the predictor variable or explanatory variable (because it is used to predict or explain the values of the response variable).

Extrapolation

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside the range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

Outliers and Influential Observations

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

A Warning on the Use of Linear Regression

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of b0 and b1 to obtain a regression line for these data points. The result, however, will be an inappropriate fit by a line when in fact a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

The Coefficient of Determination

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, SST. Thus, SST = 𝛴(yi − ȳ)². If we divide SST by n − 1, we get the sample variance of the observed values of the response variable. So, SST really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, corresponding to the data point (xi, yi). The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, yi − ȳ. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, ŷi − ȳ, and the remaining unexplained deviation, yi − ŷi. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is 𝛴(ŷi − ȳ)². This measure of variation is called the regression sum of squares, SSR. Thus, SSR = 𝛴(ŷi − ȳ)².

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, SSR / SST. This quantity is called the coefficient of determination and is denoted r². Thus, r² = SSR / SST. Similarly, the deviation not explained by the regression is yi − ŷi. The amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is 𝛴(yi − ŷi)². This measure of variation is called the error sum of squares, SSE. Thus, SSE = 𝛴(yi − ŷi)².

In summary, see Definition 14.6.

The coefficient of determination, r², is the proportion of variation in the observed values of the response variable explained by the regression. The coefficient of determination always lies between 0 and 1. A value of r² near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of r² near 1 suggests that the regression equation is quite useful for making predictions.

Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. Because of the regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: r² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST. This formula shows that, when expressed as a percentage, the coefficient of determination can also be interpreted as the percentage reduction obtained in the total squared error by using the regression equation instead of the mean, ȳ, to predict the observed values of the response variable.
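The three sums of squares and the regression identity can be checked numerically. The sketch below fits the least-squares line with the usual formulas and then computes SST, SSR, and SSE directly from their definitions; the data and the function name are illustrative assumptions.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line fitted to the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]                     # predicted values
    sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation
    return sst, ssr, sse

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For these data the regression explains about 60% of the variation in y, and SSR + SSE reproduces SST up to floating-point rounding.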

Correlation and Causation

Two variables may have a high correlation without being causally related. From a high correlation we can infer only that the two variables have a strong tendency to increase (or decrease) together and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called lurking variables, that cause the changes in the two variables under consideration.

The Regression Model; Analysis of Residuals

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line y = 𝛽0 + 𝛽1x, on which the conditional means of the response variable lie, as the population regression line and to its equation as the population regression equation. Observe that 𝛽0 is the y-intercept of the population regression line and 𝛽1 is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don't violate any of those assumptions too badly.

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants 𝛽0, 𝛽1, and 𝜎 so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎.

Because the parameters 𝛽0, 𝛽1, and 𝜎 are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 to estimate 𝛽0 and we use b1 to estimate 𝛽1. We note that b0 is an unbiased estimator of 𝛽0 and that b1 is an unbiased estimator of 𝛽1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

The statistic used to obtain a point estimate for the common conditional standard deviation 𝜎 is called the standard error of the estimate. It is computed as se = √(SSE / (n − 2)), where SSE = 𝛴(yi − ŷi)² is the error sum of squares.

Analysis of Residuals

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,

Residual = ei = yi − ŷi

We can show that the sum of the residuals is always 0, which, in turn, implies that ē = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by n − 1 instead of n − 2). Thus, the standard error of the estimate is sometimes called the residual standard deviation.
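The relationship between the residuals, the standard error of the estimate, and the standard deviation of the residuals can be illustrated with a short sketch. It assumes the usual formula se = √(SSE / (n − 2)); the data and the function name are made up.

```python
import math

def residual_summary(xs, ys):
    """Return (residuals, standard error of the estimate, SD of the residuals)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # ei = yi - y-hat_i
    sse = sum(e ** 2 for e in residuals)
    se = math.sqrt(sse / (n - 2))        # standard error of the estimate
    sd_resid = math.sqrt(sse / (n - 1))  # valid because the residuals average 0
    return residuals, se, sd_resid

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
residuals, se, sd_resid = residual_summary(xs, ys)
print([round(e, 2) for e in residuals], round(se, 4), round(sd_resid, 4))
```

With only five points the two divisors give noticeably different values; as n grows, the standard error of the estimate and the standard deviation of the residuals become nearly identical.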

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis. In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation 𝜎. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, let's consider the three plots in Figure 15.6. In Figure 15.6 (a), the residuals are scattered about the x-axis (residuals = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6 (b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6 (c) suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.

Inferences for the Slope of the Population Regression Line

Suppose that the variables x and y satisfy the assumptions for regression inferences. Then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎. Of particular interest is whether the slope, 𝛽1, of the population regression line equals 0. If 𝛽1 = 0, then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean 𝛽0 and standard deviation 𝜎. Because x does not appear in either of those two parameters, it is useless as a predictor of y.

Of note, although x alone may not be useful for predicting y, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is not useful for predicting y, we really mean that the regression equation with x as the only predictor variable is not useful for predicting y. Conversely, although x alone may be useful for predicting y, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is useful for predicting y, we really mean that the regression equation with x as the only predictor variable is useful for predicting y.

We can decide whether x is useful as a (linear) predictor of y, that is, whether the regression equation has utility, by performing the hypothesis test of H0: 𝛽1 = 0 (x is not useful for predicting y) against Ha: 𝛽1 ≠ 0 (x is useful for predicting y).

We base the hypothesis test for 𝛽1 on the statistic b1. From the assumptions for regression inferences, we can show that the sampling distribution of the slope of the regression line is a normal distribution whose mean is the slope, 𝛽1, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standardized variable

z = (b1 − 𝛽1) / (𝜎 / √Sxx)

has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, 𝜎, is unknown. We therefore replace 𝜎 with its sample estimate se, the standard error of the estimate. As you might suspect, the resulting variable, t = (b1 − 𝛽1) / (se / √Sxx), has a t-distribution, with df = n − 2.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis H0: 𝛽1 = 0, we can use the variable t as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the regression t-test.
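A minimal sketch of the regression t-test follows, using the standard test statistic t = b1 / (se / √Sxx) with df = n − 2. The data are made up, and the critical value is simply read from a t-table rather than computed, to keep the example free of external libraries.

```python
import math

def regression_t_statistic(xs, ys):
    """Return (t, df) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))      # standard error of the estimate
    t = b1 / (se / math.sqrt(sxx))     # undefined if the fit is perfect (se = 0)
    return t, n - 2

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
t, df = regression_t_statistic(xs, ys)
t_crit = 3.182  # two-tailed 5% critical value for df = 3, from a t-table
reject_h0 = abs(t) >= t_crit
print(round(t, 4), df, reject_h0)
```

Here the slope is positive but, with only five observations, the test does not reject H0 at the 5% level; the sample is too small to conclude that x is useful for predicting y.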

Confidence Intervals for the Slope of the Population Regression Line

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for 𝛽1 is provided by b1. To determine a confidence-interval estimate for 𝛽1, we apply Key Fact 15.4 to obtain Procedure 15.2, called the regression t-interval procedure.
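The interval has the familiar form b1 ± t × se / √Sxx, where t is the t-table value for df = n − 2 at the chosen confidence level. The sketch below assumes that form; the summary statistics are hypothetical values that would come from a fitted regression with n = 5.

```python
import math

def slope_confidence_interval(b1, se, sxx, t_crit):
    """Return the (lower, upper) endpoints of b1 +/- t_crit * se / sqrt(Sxx)."""
    margin = t_crit * se / math.sqrt(sxx)
    return b1 - margin, b1 + margin

# Hypothetical summary statistics: b1 = 0.6, se = sqrt(0.8), Sxx = 10, n = 5.
# t_crit = 3.182 is the t-table value for a 95% interval with df = 3.
low, high = slope_confidence_interval(b1=0.6, se=math.sqrt(0.8), sxx=10.0, t_crit=3.182)
print(round(low, 3), round(high, 3))
```

An interval that contains 0, as this one does, agrees with a two-tailed test at the matching significance level that fails to reject H0: 𝛽1 = 0.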

Estimation and Prediction

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Recalling that 𝛽0 + 𝛽1xp is the conditional mean of the response variable corresponding to the value xp of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the conditional mean t-interval procedure.

Prediction Intervals

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. The term confidence is usually reserved for interval estimates of parameters. The term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable yp − ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a prediction-interval formula. So we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the predicted value t-interval procedure.
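To contrast the two procedures, the sketch below computes both intervals at the same value xp. It assumes the standard forms ŷp ± t × se × √(1/n + (xp − x̄)² / Sxx) for the conditional mean and ŷp ± t × se × √(1 + 1/n + (xp − x̄)² / Sxx) for a predicted value, with df = n − 2; all numeric inputs are hypothetical.

```python
import math

def regression_intervals(xp, n, x_bar, sxx, b0, b1, se, t_crit):
    """Return (y_hat, conditional-mean interval, prediction interval) at xp."""
    y_hat = b0 + b1 * xp
    core = 1 / n + (xp - x_bar) ** 2 / sxx
    mean_margin = t_crit * se * math.sqrt(core)        # uncertainty in the mean
    pred_margin = t_crit * se * math.sqrt(1 + core)    # adds one observation's scatter
    mean_interval = (y_hat - mean_margin, y_hat + mean_margin)
    pred_interval = (y_hat - pred_margin, y_hat + pred_margin)
    return y_hat, mean_interval, pred_interval

# Hypothetical fit: n = 5, x_bar = 3, Sxx = 10, b0 = 2.2, b1 = 0.6, se = sqrt(0.8).
# t_crit = 3.182 is the t-table value for 95% confidence with df = 3.
y_hat, mean_ci, pred_pi = regression_intervals(
    xp=4, n=5, x_bar=3.0, sxx=10.0, b0=2.2, b1=0.6, se=math.sqrt(0.8), t_crit=3.182
)
print(round(y_hat, 3), mean_ci, pred_pi)
```

The prediction interval is always wider than the conditional mean interval at the same xp, because it must allow for the variation of a single observation about its mean. Both intervals widen as xp moves away from x̄.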

Inferences in Correlation

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, 𝜌. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, 𝜌 actually describes the strength of the linear relationship between two variables; r is only an estimate of 𝜌 obtained from sample data.

The population linear correlation coefficient of two variables x and y always lies between -1 and 1. Values of 𝜌 near -1 or 1 indicate a strong linear relationship between the variables, whereas values of 𝜌 near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, 𝜌. Consequently, we can use r as a basis for performing a hypothesis test for 𝜌.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis H0: 𝜌 = 0, we use the t-score as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the correlation t-test.
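As an illustration of the correlation t-test, the sketch below computes r = Sxy / √(Sxx × Syy) and the usual test statistic t = r / √((1 − r²) / (n − 2)) with df = n − 2. The formula is the standard one and is assumed to match Key Fact 15.9; the data are made up.

```python
import math

def correlation_t_statistic(xs, ys):
    """Return (r, t, df) for testing H0: rho = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)                 # sample correlation coefficient
    t = r / math.sqrt((1 - r ** 2) / (n - 2))      # undefined when |r| = 1
    return r, t, n - 2

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, t, df = correlation_t_statistic(xs, ys)
print(round(r, 4), round(t, 4), df)
```

For simple linear regression this t-value equals the one from the regression t-test on the same data, which is why the two tests always lead to the same decision.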