Confidence Interval

Inferences for Population Proportions

September 24, 2017 · Evidence-Based Medicine, Medical Statistics

Confidence Intervals for One Population Proportion

Statisticians often need to determine the proportion (percentage) of a population that has a specific attribute. Some examples are:

  • the percentage of U.S. adults who have health insurance
  • the percentage of cars in the United States that are imports
  • the percentage of U.S. adults who favor stricter clean air health standards
  • the percentage of Canadian women in the labor force

In the first case, the population consists of all U.S. adults and the specified attribute is "has health insurance." For the second case, the population consists of all cars in the United States and the specific attribute is "is an import." The population in the third case is all U.S. adults and the specified attribute is "favors stricter clean air health standards." In the fourth case, the population consists of all Canadian women and the specified attribute is "is in the labor force."

We know that it is often impractical or impossible to take a census of a large population. In practice, therefore, we use data from a sample to make inferences about the population proportion.

A sample proportion, p^, is computed by using the formula

p^ = x / n

where x denotes the number of members in the sample that have the specified attribute and, as usual, n denotes the sample size. For convenience, we sometimes refer to x as the number of successes and to n – x as the number of failures.
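As a quick sketch (the counts below are hypothetical), the sample proportion follows directly from this definition:

```python
def sample_proportion(x, n):
    """p^ = x / n, where x is the number of successes
    and n - x is the number of failures."""
    return x / n

# hypothetical sample: 8 of 40 sampled members have the attribute
p_hat = sample_proportion(8, 40)   # 0.2
```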

The Sampling Distribution of the Sample Proportion

To make inferences about a population mean, 𝜇, we must know the sampling distribution of the sample mean, that is, the distribution of the variable x(bar) (see the thread "Statistic Procedure – Confidence Interval" for details on the confidence interval for one population mean). The same is true for proportions: To make inferences about a population proportion, p, we need to know the sampling distribution of the sample proportion, that is, the distribution of the variable p^. Because a proportion can always be regarded as a mean, we can use our knowledge of the sampling distribution of the sample mean to derive the sampling distribution of the sample proportion. In practice, the sample size usually is large, so we concentrate on that case.

The accuracy of the normal approximation depends on n and p. If p is close to 0.5, the approximation is quite accurate, even for moderate n. The farther p is from 0.5, the larger n must be for the approximation to be accurate. As a rule of thumb, we use the normal approximation when np and n(1 – p) are both 5 or greater. Alternatively, another commonly used rule of thumb is that np and n(1 – p) are both 10 or greater; still another is that np(1 – p) is 25 or greater.

Below is the one-proportion z-interval procedure, which is also known as the one-sample z-interval procedure for a population proportion and the one-variable proportion interval procedure. Of note, as stated in Assumption 2 of Procedure 12.1, a condition for using that procedure is that "the number of successes, x, and the number of failures, n – x, are both 5 or greater." We can restate this condition as "np^ and n(1 – p^) are both 5 or greater," which, for an unknown p, corresponds to the rule of thumb for using the normal approximation given after Key Fact 12.1.
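The one-proportion z-interval procedure can be sketched in a few lines of Python using only the standard library; this is an illustrative implementation with hypothetical data, not Procedure 12.1 verbatim:

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_interval(x, n, confidence=0.95):
    """One-proportion z-interval: p^ +/- z_{alpha/2} * sqrt(p^(1 - p^)/n).

    Assumption 2: x and n - x should both be 5 or greater."""
    assert x >= 5 and n - x >= 5, "normal approximation may be inaccurate"
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    e = z * sqrt(p_hat * (1 - p_hat) / n)               # margin of error, E
    return p_hat - e, p_hat + e

# hypothetical poll: 450 of 1,000 sampled adults have the attribute
lo, hi = one_proportion_z_interval(450, 1000)           # about (0.419, 0.481)
```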

Determining the Required Sample Size

If the margin of error (E) and confidence level are specified in advance, then we must determine the sample size required to meet those specifications. Solving for n in the formula for margin of error, we get

n = p^(1 – p^)(z𝛼/2 / E)²

This formula cannot be used to obtain the required sample size because the sample proportion, p^, is not known prior to sampling. There are two ways around this problem. To begin, we examine the graph of p^(1 – p^) versus p^ shown in Figure 12.1. The graph reveals that the largest p^(1 – p^) can be is 0.25, which occurs when p^ = 0.5. The farther p^ is from 0.5, the smaller will be the value of p^(1 – p^). Because the largest possible value of p^(1 – p^) is 0.25, the most conservative approach for determining sample size is to use that value in the above equation. The sample size obtained then will generally be larger than necessary and the margin of error less than required. Nonetheless, this approach guarantees that the specifications will at least be met. In the same vein, if we have in mind a likely range for the observed value of p^, then, in light of Figure 12.1, we should take as our educated guess for p^ the value in the range closest to 0.5. In either case, we should be aware that, if the observed value of p^ is closer to 0.5 than is our educated guess, the margin of error will be larger than desired.
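Both approaches can be sketched in one function (the margin of error below is a hypothetical specification); with the default p_guess = 0.5 it uses the conservative value p^(1 – p^) = 0.25, and it rounds up because n must be a whole number:

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(E, confidence=0.95, p_guess=0.5):
    """n = p(1 - p) * (z_{alpha/2} / E)^2, rounded up.

    p_guess = 0.5 is the conservative choice (p(1 - p) = 0.25);
    pass an educated guess for p^ to get a smaller n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p_guess * (1 - p_guess) * (z / E) ** 2)

required_sample_size(0.03)                 # conservative: 1068
required_sample_size(0.03, p_guess=0.8)    # educated guess p^ ~ 0.8: 683
```

Note how the educated guess, being farther from 0.5, cuts the required sample size substantially, at the risk of a larger-than-desired margin of error if the observed p^ lands closer to 0.5.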

Hypothesis Tests for One Population Proportion

Earlier, we showed how to obtain confidence intervals for a population proportion. Now we show how to perform hypothesis tests for a population proportion. This procedure is actually a special case of the one-mean z-test. From Key Fact 12.1, we deduce that, for large n, the standardized version of p^,

z = (p^ – p) / √(p(1 – p)/n),

has approximately the standard normal distribution. Consequently, to perform a large-sample hypothesis test with null hypothesis H0: p = p0, we can use the variable

z = (p^ – p0) / √(p0(1 – p0)/n)

as the test statistic and obtain the critical value(s) or P-value from the standard normal table. We call this hypothesis-testing procedure the one-proportion z-test.
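A minimal sketch of the one-proportion z-test (the counts are hypothetical), returning both the test statistic and a two-tailed P-value:

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(x, n, p0):
    """One-proportion z-test of H0: p = p0.

    Returns z and the two-tailed P-value from the standard normal."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)          # note: p0 in the SE
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# hypothetical example: do 530 successes in 1,000 trials contradict H0: p = 0.5?
z, p = one_proportion_z_test(530, 1000, 0.5)            # z ~ 1.90, P ~ 0.058
```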

Hypothesis Tests for Two Population Proportions

For independent samples of sizes n1 and n2 from the two populations, Key Fact 12.2 states that, for large, independent samples, the standardized variable

z = (p^1 – p^2 – (p1 – p2)) / √(p1(1 – p1)/n1 + p2(1 – p2)/n2)

has approximately the standard normal distribution.

Now we can develop a hypothesis-testing procedure for comparing two population proportions. Our immediate goal is to identify a variable that we can use as the test statistic. The null hypothesis for a hypothesis test to compare two population proportions is H0: p1 = p2. If the null hypothesis is true, then p1 – p2 = 0 and the two proportions share a common value, p, so the variable reduces to

z = (p^1 – p^2) / √(p(1 – p)(1/n1 + 1/n2))

However, because p is unknown, we cannot use this variable as the test statistic. Consequently, we must estimate p by using sample information. The best estimate of p is obtained by pooling the data to get the proportion of successes in both samples combined; that is, we estimate p by

p^p = (x1 + x2) / (n1 + n2)

where p^p is called the pooled sample proportion. After replacing p by p^p, we get the final variable, which can be used as the test statistic and has approximately the standard normal distribution for large samples if the null hypothesis is true. Hence we have Procedure 12.3, the two-proportions z-test. It is also known as the two-sample z-test for two population proportions and the two-variable proportions test.
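The two-proportions z-test can be sketched as follows (the counts are hypothetical); note the pooled sample proportion inside the standard error:

```python
from math import sqrt
from statistics import NormalDist

def two_proportions_z_test(x1, n1, x2, n2):
    """Two-proportions z-test of H0: p1 = p2, using the pooled
    sample proportion p^p = (x1 + x2) / (n1 + n2)."""
    p1, p2 = x1 / n1, x2 / n2
    pp = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se = sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# hypothetical trial: 60/200 events in one group vs 40/200 in the other
z, p = two_proportions_z_test(60, 200, 40, 200)      # z ~ 2.31
```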

Fortunately, confidence intervals for the difference between two population proportions can also be computed: we can use Key Fact 12.2 to derive a confidence-interval procedure for the difference between two population proportions, called the two-proportions z-interval procedure. Note the following: 1) The two-proportions z-interval procedure is also known as the two-sample z-interval procedure for two population proportions and the two-variable proportions interval procedure. 2) Guidelines for interpreting confidence intervals for the difference, p1 – p2, between two population proportions are similar to those for interpreting confidence intervals for the difference, 𝜇1 – 𝜇2, between two population means, as described in other related threads.
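A sketch of the two-proportions z-interval (reusing the same hypothetical counts); unlike the test, the standard error here uses the two sample proportions separately rather than the pooled estimate, since we are not assuming p1 = p2:

```python
from math import sqrt
from statistics import NormalDist

def two_proportions_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Two-proportions z-interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    e = z * se
    return (p1 - p2) - e, (p1 - p2) + e

# hypothetical counts: interval for the difference p1 - p2
lo, hi = two_proportions_z_interval(60, 200, 40, 200)   # about (0.016, 0.184)
```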

Update on Oct 2 2017

Supplemental Data – Confidence Intervals of Odds Ratio (OR) and Relative Risk (RR)


The sampling distribution of the odds ratio is positively skewed. However, it is approximately normally distributed on the natural log scale. After finding the limits on the LN scale, use the EXP function to find the limits on the original scale. The standard deviation of LN(OR) is

SD of LN(OR) = square root of (1/a + 1/b + 1/c + 1/d)

Now we know the distribution of LN(OR) and its standard deviation, so a z-interval procedure can be conducted to compute the confidence limits of LN(OR); exponentiating those limits gives the confidence interval for the OR.
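This can be sketched with a hypothetical 2×2 table (a, b = exposed cases/non-cases; c, d = unexposed cases/non-cases): find the limits on the LN scale, then exponentiate back to the original scale:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """CI for the odds ratio from a 2x2 table.

    Limits are found on the natural-log scale, then exponentiated."""
    or_ = (a * d) / (b * c)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    sd = sqrt(1 / a + 1 / b + 1 / c + 1 / d)     # SD of LN(OR)
    return exp(log(or_) - z * sd), exp(log(or_) + z * sd)

# hypothetical table: a=20, b=80, c=10, d=90 -> OR = 2.25
lo, hi = odds_ratio_ci(20, 80, 10, 90)           # about (0.99, 5.09)
```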


As with the OR, the sampling distribution of the relative risk is positively skewed but is approximately normally distributed on the natural log scale. Constructing a confidence interval for the relative risk is similar to constructing a CI for the odds ratio except that there is a different formula for the SD.

SD of LN(RR) = square root of [ b/(a(a+b)) + d/(c(c+d)) ]
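The same log-scale approach, sketched with the same hypothetical 2×2 table; a/(a+b) and c/(c+d) are the risks in the exposed and unexposed groups:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk_ci(a, b, c, d, confidence=0.95):
    """CI for the relative risk from a 2x2 table, via the LN scale."""
    rr = (a / (a + b)) / (c / (c + d))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    sd = sqrt(b / (a * (a + b)) + d / (c * (c + d)))   # SD of LN(RR)
    return exp(log(rr) - z * sd), exp(log(rr) + z * sd)

# hypothetical table: risks 20/100 vs 10/100 -> RR = 2.0
lo, hi = relative_risk_ci(20, 80, 10, 90)              # about (0.99, 4.05)
```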

Statistic Procedures – Confidence Interval

August 26, 2017 · Medical Statistics

Confidence Intervals for One Population Mean

A common problem in statistics is to obtain information about the mean, μ, of a population. One way to obtain information about a population mean μ without taking a census is to estimate it by a sample mean x(bar). So, a point estimate of a parameter is the value of a statistic used to estimate the parameter. More generally, a statistic is called an unbiased estimator of a parameter if the mean of all its possible values equals the parameter; otherwise, the statistic is called a biased estimator of the parameter. Ideally, we want our statistic to be unbiased and have small standard error. In that case, chances are good that our point estimate (the value of the statistic) will be close to the parameter.

However, a sample mean is usually not equal to the population mean, especially when the standard error is not small, as stated previously. Therefore, we should accompany any point estimate of μ with information that indicates the accuracy of that estimate. This information is called a confidence-interval estimate for μ. By definition, the confidence interval (CI) is an interval of numbers obtained from a point estimate of a parameter. The confidence level is the confidence we have that the parameter lies in the confidence interval. And the confidence-interval estimate is the confidence level together with the confidence interval. A confidence interval for a population mean depends on the sample mean, x(bar), which in turn depends on the sample selected.

The margin of error, E, indicates how accurate the sample mean, x(bar), is as an estimate of the unknown parameter μ. With the point estimate and the confidence-interval estimate (at a 95% confidence level), we can be 95% confident that μ is within E of the sample mean. Simply put, μ = point estimate ± E.


  • Point estimate
  • Confidence-interval estimate
  • Margin of error

Computing the Confidence-Interval for One Population Mean (σ known)

We now develop a step-by-step procedure to obtain a confidence interval for a population mean when the population standard deviation is known. In doing so, we assume that the variable under consideration is normally distributed. Because of the central limit theorem, however, the procedure will also work to obtain an approximately correct confidence interval when the sample size is large, regardless of the distribution of the variable. The basis of our confidence-interval procedure is the sampling distribution of the sample mean for a normally distributed variable: Suppose that a variable x of a population is normally distributed with mean μ and standard deviation σ. Then, for samples of size n, the variable x(bar) is also normally distributed and has mean μ and standard deviation σ/√n. As a consequence, we have the procedure to compute the confidence interval.
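The procedure can be sketched as follows (the data and σ below are hypothetical; in practice σ is rarely known):

```python
from math import sqrt
from statistics import NormalDist, mean

def one_mean_z_interval(sample, sigma, confidence=0.95):
    """One-mean z-interval: x(bar) +/- z_{alpha/2} * sigma / sqrt(n),
    for a normal population (or large n) with sigma known."""
    n = len(sample)
    xbar = mean(sample)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    e = z * sigma / sqrt(n)                       # margin of error, E
    return xbar - e, xbar + e

# hypothetical measurements with known sigma = 2.0
lo, hi = one_mean_z_interval([12, 15, 11, 14, 13, 16, 12, 14, 15], sigma=2.0)
```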

PS: The one-mean z-interval procedure is also known as the one-sample z-interval procedure and the one-variable z-interval procedure. We prefer "one-mean" because it makes clear the parameter being estimated.

PS: By saying that the confidence interval is exact, we mean that the true confidence level equals 1 – α; by saying that the confidence interval is approximately correct, we mean that the true confidence level only approximately equals 1 – α.

Before applying Procedure 8.1, we need to make several comments about it and the assumptions for its use, including:

  • We use the term normal population as an abbreviation for "the variable under consideration is normally distributed."
  • The z-interval procedure works reasonably well even when the variable is not normally distributed and the sample size is small or moderate, provided the variable is not too far from being normally distributed. Thus we say that the z-interval procedure is robust to moderate violations of the normality assumption.
  • Watch for outliers because their presence calls into question the normality assumption. Moreover, even for large samples, outliers can sometimes unduly affect a z-interval because the sample mean is not resistant to outliers.
  • A statistical procedure that works reasonably well even when one of its assumptions is violated (or moderately violated) is called a robust procedure relative to that assumption.


Key Fact 8.1 makes it clear that you should conduct preliminary data analyses before applying the z-interval procedure. More generally, the following fundamental principle of data analysis is relevant to all inferential procedures: Before performing a statistical-inference procedure, examine the sample data. If any of the conditions required for using the procedure appear to be violated, do not apply the procedure. Instead use a different, more appropriate procedure, if one exists. Even for small samples, where graphical displays must be interpreted carefully, it is far better to examine the data than not to. Remember, though, to proceed cautiously when conducting graphical analyses of small samples, especially very small samples – say, of size 10 or less.

Sample Size Estimation

If the margin of error and confidence level are specified in advance, then we must determine the sample size needed to meet those specifications. To find the formula for the required sample size, we solve the margin-of-error formula, E = zα/2 · σ/√n, for n. See the computing formula in Formula 8.2.
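Solving E = zα/2 · σ/√n for n gives n = (zα/2 · σ / E)². A sketch (σ and E below are hypothetical specifications), rounding up because n must be a whole number:

```python
from math import ceil
from statistics import NormalDist

def required_sample_size_mean(sigma, E, confidence=0.95):
    """Sample size for estimating a mean: n = (z_{alpha/2} * sigma / E)^2,
    rounded up to the next whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / E) ** 2)

# hypothetical specs: sigma = 15, margin of error at most 3, 95% confidence
n = required_sample_size_mean(sigma=15, E=3)    # 97
```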

Computing the Confidence-Interval for One Population Mean (σ unknown)

So far, we have discussed how to obtain the confidence-interval estimate when the population standard deviation, σ, is known. What if, as is usual in practice, the population standard deviation is unknown? Then we cannot base our confidence-interval procedure on the standardized version of x(bar). The best we can do is estimate the population standard deviation, σ, by the sample standard deviation, s; in other words, we replace σ by s in Procedure 8.1 and base our confidence-interval procedure on the resulting variable t (the studentized version of x(bar)). Unlike the standardized version, the studentized version of x(bar) does not have a normal distribution.

Suppose that a variable x of a population is normally distributed with mean μ. Then, for samples of size n, the variable t has the t-distribution with n – 1 degrees of freedom. A variable with a t-distribution has an associated curve, called a t-curve. Although there is a different t-curve for each number of degrees of freedom, all t-curves are similar and resemble the standard normal curve. As the number of degrees of freedom becomes larger, t-curves look increasingly like the standard normal curve.

Having discussed t-distributions and t-curves, we can now develop a procedure for obtaining a confidence interval for a population mean when the population standard deviation is unknown. The procedure is called the one-mean t-interval procedure or, when no confusion can arise, simply the t-interval procedure.

Properties and guidelines for use of the t-interval procedure are the same as those for the z-interval procedure. In particular, the t-interval procedure is robust to moderate violations of the normality assumption but, even for large samples, can sometimes be unduly affected by outliers because the sample mean and sample standard deviation are not resistant to outliers.
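A sketch of the t-interval procedure (the data are hypothetical). Python's standard library has no t quantile function, so the critical value is supplied from a t table; for n = 15, df = 14 and t ≈ 2.145 at 95% confidence:

```python
from math import sqrt
from statistics import mean, stdev

def one_mean_t_interval(sample, t_crit):
    """One-mean t-interval: x(bar) +/- t * s / sqrt(n), where t is the
    table critical value for n - 1 degrees of freedom."""
    n = len(sample)
    xbar, s = mean(sample), stdev(sample)   # stdev = sample standard deviation
    e = t_crit * s / sqrt(n)
    return xbar - e, xbar + e

# 15 hypothetical observations -> df = 14 -> t = 2.145 for 95% confidence
sample = [12, 15, 11, 14, 13, 16, 12, 14, 15, 13, 14, 12, 15, 13, 14]
lo, hi = one_mean_t_interval(sample, t_crit=2.145)
```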

What If the Assumptions Are Not Satisfied?

Suppose you want to obtain a confidence interval for a population mean based on a small sample, but preliminary data analyses indicate either the presence of outliers or that the variable under consideration is far from normally distributed. As neither the z-interval procedure nor the t-interval procedure is appropriate, what can you do? Under certain conditions, you can use a nonparametric method. Most nonparametric methods do not require even approximate normality, are resistant to outliers and other extreme values, and can be applied regardless of sample size. However, parametric methods, such as the z-interval and t-interval procedures, tend to give more accurate results than nonparametric methods when the normality assumption and other requirements for their use are met.

How to compute the expected 95% CI

June 22, 2017 · Medical Statistics


The Random Sampling Distribution of Means

Imagine you have a hat containing 100 cards, numbered from 0 to 99. At random, you take out five cards, record the number written on each one, and find the mean of these five numbers. Then you put the cards back in the hat and draw another random sample, repeating the same process for about 10 minutes.

Do you expect that the means of each of these samples will be exactly the same? Of course not. Because of sampling error, they vary somewhat. If you plot all the means on a frequency distribution, the sample means form a distribution, called the random sampling distribution of means. If you actually try this, you will note that this distribution looks pretty much like a normal distribution. If you continued drawing samples and plotting their means ad infinitum, you would find that the distribution actually becomes a normal distribution! This holds true even if the underlying population was not normally distributed: in our population of cards in the hat, there is just one card with each number, so the shape of the distribution is actually rectangular, yet the random sampling distribution of its means still tends to be normal.

These principles are stated by the central limit theorem, which states that the random sampling distribution of means will always tend to be normal, irrespective of the shape of the population distribution from which the samples were drawn. According to the theorem, the mean of the random sampling distribution of means is equal to the mean of the original population.

Like all distributions, the random sampling distribution of means not only has a mean but also a standard deviation. This particular standard deviation, the standard deviation of the random sampling distribution of means, is the standard deviation of the population of all the sample means. It has its own name: standard error, or standard error of the mean. It is a measure of the extent to which the sample means deviate from the true population mean.

When repeated random samples are drawn from a population, most of the means of those samples are going to cluster around the original population mean. If the samples each consisted of just two cards, what would happen to the shape of the random sampling distribution of means? Clearly, with an n of just 2, there would be quite a high chance of any particular sample mean falling out toward the tails of the distribution, giving a broader, fatter shape to the curve, and hence a higher standard error. On the other hand, if the samples consisted of 25 cards each (n = 25), it would be very unlikely for many of their means to lie far from the center of the curve. Therefore, there would be a much thinner, narrower curve and a lower standard error.

So the shape of the random sampling distribution of means, as reflected by its standard error, is affected by the size of the samples. In fact, the standard error is equal to the population standard deviation (σ) divided by the square root of the size of the samples (n).
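The card experiment can be simulated to check that the standard error tracks σ/√n (drawing with replacement to keep the algebra exact; the sample size and repetition count below are arbitrary choices):

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)
population = list(range(100))        # the hat: one card for each number 0..99
sigma = pstdev(population)           # population standard deviation, ~28.87
n = 25                               # cards drawn per sample

# draw many samples, recording each sample mean
sample_means = [mean(random.choices(population, k=n)) for _ in range(20000)]

observed_se = pstdev(sample_means)   # spread of the sampling distribution
predicted_se = sigma / sqrt(n)       # sigma / sqrt(n), ~5.77
```

The observed standard error should land close to σ/√n ≈ 5.77; rerunning with n = 2 fattens the distribution and raises the standard error, exactly as described above.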

Using the Standard Error

Because the random sampling distribution of means is normal, the z score can be expressed as

z = (x(bar) – μ) / SE

It is possible to find the limits between which 95% of all possible random sample means would be expected to fall (z score = ±1.96): these limits are μ ± 1.96 × SE.

Estimating the Mean of a Population

It has been shown that 95% of all possible sample means will lie within approximately ±2 (or, more exactly, ±1.96) standard errors of the population mean. The sample mean lies within ±1.96 standard errors of the population mean 95% of the time; conversely, the population mean lies within ±1.96 standard errors of the sample mean 95% of the time. These limits of ±1.96 standard errors are called the confidence limits.

95% confidence limits = x(bar) ± 1.96 × SE

Therefore, 95% confidence limits are approximately equal to the sample mean plus or minus two standard errors. The difference between the upper and lower confidence limits is called the confidence interval – sometimes abbreviated as CI. Researchers obviously want the confidence interval to be as narrow as possible. The formula for confidence limits shows that to make the confidence interval narrower (for a given level of confidence, such as 95%), the standard error must be made smaller.

Estimating the Standard Error

According to the formula above, we cannot calculate standard error unless we know population standard deviation (σ). In practice, σ will not be known: researchers hardly ever know the standard deviation of the population (and if they did, they would probably not need to use inferential statistics anyway).

As a result, standard error cannot be calculated, and so z scores cannot be used. However, the standard error can be estimated using data that are available from the sample alone. The resulting statistic is the estimated standard error of the mean, usually called estimated standard error, as shown by formula below.

estimated SE = S / √n

where S is the sample standard deviation.

t Scores

The estimated standard error is used to find a statistic, t, that can be used in place of z score. The t score, rather than the z score, must be used when making inferences about means that are based on estimates of population parameters rather than on the population parameters themselves. The t score is Student’s t, which is calculated in much the same way as z score. But while z was expressed in terms of the number of standard errors by which a sample mean lies above or below the population mean, t is expressed in terms of the number of estimated standard errors by which the sample mean lies above or below the population mean.

t = (x(bar) – μ) / estimated SE

Just as z score tables give the proportions of the normal distribution that lie above and below any given z score, t score tables provide the same information for any given t score. However, there is one difference: while the value of z for any given proportion of the distribution is constant, the value of t for any given proportion is not constant – it varies according to sample size. When the sample size is large (n > 100), the values of t and z are similar, but as samples get smaller, t and z scores become increasingly different.

Degree of Freedom and t Tables

Table 2-1 (right-upper) is an abbreviated t score table that shows the values of t corresponding to different areas under the normal distribution for various sample sizes. Sample size (n) is not stated directly in t score tables; instead, the tables express sample size in terms of degrees of freedom (df). The mathematical concept behind degrees of freedom is complex and not needed for the purposes of USMLE or understanding statistics in medicine: for present purposes, df can be defined as simply equal to n – 1. Therefore, to determine the values of t that delineate the central 95% of the sampling distribution of means based on a sample size of 15, we would look in the table for the appropriate value of t for df = 14; this is sometimes written as t14. Table 2-1 shows that this value is 2.145.

As n becomes larger (100 or more), the values of t are very close to the corresponding values of z.

Evaluate The Article About Therapy (Randomized Trials)

January 28, 2016 · Clinical Trials, Evidence-Based Medicine

Section 1 How Serious Is The Risk of Bias?

Did Intervention and Control Groups Start With The Same Prognosis?

Consider the question of whether hospital care prolongs life. A study finds that more sick people die in the hospital than in the community. We would easily reject the naive conclusion that hospital care kills people because we recognize that hospitalized patients are sicker (worse prognosis) than patients in the community. Although the logic of prognostic balance is vividly clear in comparing hospitalized patients with those in the community, it may be less obvious in other contexts.

Were Patients Randomized?

The purpose of randomization is to create groups whose prognosis, with respect to the target outcomes, is similar. The reason that studies in which patient or physician preference determines whether a patient receives treatment or control (observational studies) often yield misleading results is that morbidity and mortality result from many causes. Treatment studies attempt to determine the impact of an intervention on events such as stroke, myocardial infarction, and death – occurrences that we call the trial's target outcomes. A patient's age, the underlying severity of illness, the presence of comorbidity, and a host of other factors typically determine the frequency with which a trial's target outcome occurs (prognostic factors or determinants of outcome). If prognostic factors – either those we know about or those we do not know about – prove unbalanced between a trial's treatment and control groups, the study's outcome will be biased, either underestimating or overestimating the treatment's effect. Because known prognostic factors often influence clinicians' recommendations and patients' decisions about taking treatment, observational studies often yield biased results that may get the magnitude or even the direction of the effect wrong.

Observational studies can theoretically match patients, either in the selection of patients for study or in the subsequent statistical analysis, for known prognostic factors. However, not all prognostic factors are easily measured or characterized, and in many diseases only a few are known. Therefore, even the most careful patient selection and statistical methods are unable to completely address the bias in the estimated treatment effect. The power of randomization is that treatment and control groups are more likely to have a balanced distribution of known and unknown prognostic factors. However, although randomization is a powerful technique, it does not always succeed in creating groups with similar prognosis. Investigators may make mistakes that compromise randomization, or randomization may fail because of chance – unlikely events sometimes happen.

Was Randomization Concealed?

When those enrolling patients are unaware of and cannot control the arm to which the patient is allocated, we refer to randomization as concealed. In unconcealed trials, those responsible for recruitment may systematically enroll sicker – or less sick – patients to either a treatment or control group. This behavior will compromise the purpose of randomization, and the study will yield a biased result (imbalance in prognosis).

Were Patients in the Treatment and Control Groups Similar With Respect to Known Prognostic Factors? (The Importance of Sample Size)

The purpose of randomization is to create groups whose prognosis, with respect to the target outcomes, is similar. Sometimes, through bad luck, randomization will fail to achieve this goal. The smaller the sample size, the more likely the trial will have prognostic imbalance.

Picture a trial testing a new treatment for heart failure that is enrolling patients classified as having New York Heart Association functional class III and class IV heart failure. Patients with class IV heart failure have a much worse prognosis than those with class III heart failure. The trial is small, with only 8 patients. One would not be surprised if all 4 patients with class III heart failure were allocated to the treatment group and all 4 patients with class IV heart failure were allocated to the control group. Such a result of the allocation process would seriously bias the study in favor of the treatment. Were the trial to enroll 800 patients, one would be startled if randomization placed all 400 patients with class III heart failure in the treatment arm. The larger the sample size, the more likely randomization will achieve its goal of prognostic balance.

The smaller the sample size, the more likely the trial will have prognostic imbalance. We can check how effectively randomization has balanced known prognostic factors by looking for a display of patient characteristics of the treatment and control groups at the study's commencement – the baseline or entry prognostic features. Although we will never know whether similarity exists for the unknown prognostic factors, we are reassured when the known prognostic factors are well balanced. All is not lost if the treatment groups are not similar at baseline. Statistical techniques permit adjustment of the study result for baseline differences. When both adjusted analyses and unadjusted analyses generate the same conclusion, clinicians gain confidence that the risk of bias is not excessive.

Was Prognostic Balance Maintained as the Study Progressed?

To What Extent Was the Study Blinded?

If randomization succeeds, treatment and control groups begin with a similar prognosis. Randomization, however, provides no guarantees that the 2 groups will remain prognostically balanced. Blinding is the optimal strategy for maintaining prognostic balance. There are five groups that should, if possible, be blind to treatment assignment, including:

  • Patients – to avoid placebo effects
  • Clinicians – to prevent differential administration of therapies that affect the outcome of interest (cointervention)
  • Data collectors – to prevent bias in data collection
  • Adjudicators of outcome – to prevent bias in decisions about whether or not a patient has had an outcome of interest
  • Data analysts – to avoid bias in decisions regarding data analysis

These 5 groups involved in clinical trials should remain unaware of whether patients are receiving the experimental therapy or the control therapy.

Were the Groups Prognostically Balanced at the Study's Completion?

It is possible for investigators to effectively conceal and blind treatment assignment and still fail to achieve an unbiased result.

Was Follow-up Complete?

Ideally, at the conclusion of a trial, investigators will know the status of each patient with respect to the target outcome. The greater the number of patients whose outcome is unknown – patients lost to follow-up – the more a study is potentially compromised. The reason is that patients who are lost often have different prognoses from those who are retained – they may disappear because they have adverse outcomes or because they are doing well and so did not return for assessment. The magnitude of the bias may be substantial. See two examples in the Pharmacy Profession Forum.

Loss to follow-up may substantially increase the risk of bias. If assuming a worst-case scenario does not change the inferences arising from study results, then loss to follow-up is unlikely to be a problem. If such an assumption would significantly alter the results, the extent to which bias is introduced depends on how likely it is that treatment patients lost to follow-up fared badly, whereas control patients lost to follow-up fared well. That decision is a matter of judgment.

Was the Trial Stopped Too Early?

Stopping a trial early (i.e., before enrolling the planned sample size) when one sees an apparent large benefit is risky and may compromise randomization. Trials stopped early run the risk of greatly overestimating the treatment effect.

A trial designed with too short a follow-up may also miss crucial information that an adequate length of follow-up would reveal. For example, consider a trial that randomly assigned patients with an abdominal aortic aneurysm to either open surgical repair or a less invasive endovascular repair technique. At the end of the 30-day follow-up, mortality was significantly lower in the endovascular technique group. The investigators followed participants for an additional 2 years and found no difference in mortality between the groups after the first year. Had the trial ended earlier, the endovascular technique might have been considered substantially better than the open surgical technique.

Were Patients Analyzed in the Groups to Which They Were Randomized?

Investigators will undermine the benefits of randomization if they omit from the analysis patients who do not receive their assigned treatment or, worse yet, count events that occur in nonadherent patients who were assigned to treatment against the control group. Such analyses will bias the results if the reasons for nonadherence are related to prognosis. In a number of randomized trials, patients who did not adhere to their assigned drug regimens fared worse than those who took their medication as instructed, even after taking into account all known prognostic factors. When adherent patients are destined to have a better outcome, omitting those who do not receive assigned treatment undermines the unbiased comparison provided by randomization. Investigators prevent this bias when they follow the intention-to-treat principle and analyze all patients in the group to which they were randomized, irrespective of what treatment they actually received. Following the intention-to-treat principle does not, however, reduce bias associated with loss to follow-up.
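The bias described above can be made concrete with a small sketch on hypothetical data: a trial in which the treatment has no real effect, but 10 high-risk patients assigned to treatment cross over to control. Counting patients by the arm they actually received (an "as-treated" analysis) manufactures a spurious benefit that the intention-to-treat analysis avoids.

```python
# Intention-to-treat vs "as-treated" counting on hypothetical data.
# Each record is (assigned_arm, received_arm, had_event).
patients = (
    [("treatment", "treatment", False)] * 80
    + [("treatment", "treatment", True)] * 10
    + [("treatment", "control", True)] * 10   # nonadherent, poor-prognosis patients
    + [("control", "control", False)] * 80
    + [("control", "control", True)] * 20
)

def event_rates(records, idx):
    """Event proportion per arm, grouping by column idx (0=assigned, 1=received)."""
    rates = {}
    for arm in ("treatment", "control"):
        group = [r for r in records if r[idx] == arm]
        rates[arm] = sum(r[2] for r in group) / len(group)
    return rates

itt = event_rates(patients, 0)         # by assigned arm: 20% vs 20% (no effect)
as_treated = event_rates(patients, 1)  # by received arm: ~11% vs ~27% (spurious benefit)
print(itt, as_treated)
```

The intention-to-treat comparison correctly shows identical 20% event rates, while the as-treated comparison makes an ineffective treatment look strongly beneficial, because the crossovers carried their poor prognosis into the control column.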

Section 2 What Are the Results?

How Large Was the Treatment Effect?

Most frequently, RCTs monitor dichotomous outcomes (e.g., "yes" or "no" classifications for cancer recurrence, myocardial infarction, or death). Patients either have such an event or they do not, and the article reports the proportion of patients who develop such events. Consider, for example, a study in which 20% of a control group died but only 15% of those receiving a new treatment died. How might one express these results?

One possibility is the absolute difference (known as the absolute risk reduction [ARR] or risk difference) between the proportion who died in the control group (control group risk [CGR]) and the proportion who died in the experimental group (experimental group risk [EGR]), or CGR – EGR = 0.20 – 0.15 = 0.05. Another way to express the impact of treatment is as the relative risk (RR): the risk of events among patients receiving the new treatment relative to that risk among patients in the control group, or EGR/CGR = 0.15/0.20 = 0.75.

The most commonly reported measure of dichotomous treatment effects is the complement of the RR, the relative risk reduction (RRR). It is expressed as a percentage: [1 – (EGR/CGR)] x 100% = (1 – 0.75) x 100% = 25%. An RRR of 25% means that of those who would have died had they been in the control group, 25% will not die if they receive treatment; the greater the RRR, the more effective the therapy. Investigators may compute the RR during a specific period, as in a survival analysis; the relative measure of effect in such a time-to-event analysis is called the hazard ratio. When people do not specify whether they are talking about RRR or ARR – for instance, "Drug X was 30% effective in reducing the risk of death" or "The efficacy of the vaccine was 92%" – they are almost invariably talking about RRR.
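The arithmetic for the three measures can be checked directly, using the risks from the example above:

```python
# Effect measures for the text's example: 20% control risk, 15% treatment risk.
cgr = 0.20   # control group risk
egr = 0.15   # experimental group risk

arr = cgr - egr   # absolute risk reduction (risk difference)
rr = egr / cgr    # relative risk
rrr = 1 - rr      # relative risk reduction

print(f"ARR = {arr:.2f}")   # 0.05
print(f"RR  = {rr:.2f}")    # 0.75
print(f"RRR = {rrr:.0%}")   # 25%
```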

How Precise Was the Estimate of the Treatment Effect?

We can never be sure of the true risk reduction; the best estimate of the true treatment effect is what we observe in a well-designed randomized trial. This estimate is called a point estimate to remind us that, although the true value lies somewhere in its neighborhood, it is unlikely to be precisely correct. Investigators often tell us the neighborhood within which the true effect likely lies by calculating CIs, a range of values within which one can be confident the true effect lies.

We usually use the 95% CI. You can consider the 95% CI as defining the range that – assuming the study has low risk of bias – includes the true RRR 95% of the time. The true RRR will generally lie beyond these extremes only 5% of the time, a property of the CI that relates closely to the conventional level of statistical significance of P <0.05.


If a trial randomized 100 patients each to experimental and control groups, and there were 20 deaths in the control group and 15 deaths in the experimental group, the authors would calculate a point estimate for the RRR of 25% [(1 – 0.15/0.20) x 100 = 25%]. You might guess, however, that the true RRR might be much smaller or much greater than 25%, based on a difference of only 5 deaths. In fact, you might surmise that the treatment might provide no benefit (an RRR of 0%) or might even do harm (a negative RRR). And you would be right; in fact, these results are consistent with both an RRR of -38% and an RRR of nearly 59%. In other words, the 95% CI on this RRR is -38% to 59%, and the trial really has not helped us decide whether or not to offer the new treatment.
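The article does not state how its interval was computed; one common approximation (an assumption here) is the Katz log method, which puts a normal-approximation CI on log(RR) and converts back. Applied to the 100-per-group trial it approximately reproduces the -38% to 59% interval quoted above:

```python
import math

def rrr_ci(events_exp, n_exp, events_ctrl, n_ctrl, z=1.96):
    """Approximate 95% CI for the RRR via the Katz log method on the RR."""
    rr = (events_exp / n_exp) / (events_ctrl / n_ctrl)
    se_log_rr = math.sqrt(1/events_exp - 1/n_exp + 1/events_ctrl - 1/n_ctrl)
    rr_lo = math.exp(math.log(rr) - z * se_log_rr)
    rr_hi = math.exp(math.log(rr) + z * se_log_rr)
    # RRR = 1 - RR, so the RR bounds swap roles.
    return 1 - rr_hi, 1 - rr_lo

lo, hi = rrr_ci(15, 100, 20, 100)
print(f"RRR 95% CI: {lo:.0%} to {hi:.0%}")   # -38% to 59%
```

Running the same function on the larger trial, `rrr_ci(150, 1000, 200, 1000)`, yields an interval entirely above 0 (roughly 9% to 38% with this approximation; the article quotes 9% to 41%, presumably from a slightly different method).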

Suppose instead that the trial enrolled 1000 patients per group rather than 100, and that the same event rates were observed: 200 deaths in the control group and 150 deaths in the experimental group. Again, the point estimate of the RRR is 25%. In this larger trial, you might expect much greater confidence that the true reduction in risk is close to 25%, and you would be right: in the larger trial, the 95% CI on the RRR for this set of results is all on the positive side of 0 and runs from 9% to 41%.

These two examples show that the larger the sample size and the higher the number of outcome events in a trial, the greater our confidence that the true RRR (or any other measure of effect) is close to what we observed. As one considers values farther and farther from the point estimate, they become less and less likely to represent the truth. By the time one crosses the upper or lower boundaries of the 95% CI, the values are unlikely to represent the true RRR. All of this assumes the study is at low risk of bias.

Section 3 How Can I Apply the Results to Patient Care?

Were the Study Patients Similar to the Patient in My Practice?

If the patient before you would have qualified for enrollment in the study, you can apply the results with considerable confidence or consider the results generalizable. Often, your patient has different attributes or characteristics from those enrolled in the trial and would not have met a study's eligibility criteria. Patients may be older or younger, may be sicker or less sick, or may have comorbid disease that would have excluded them from participation in the study.

A study result probably applies even if, for example, adult patients are 2 years too old for enrollment in the study, had more severe disease, had previously been treated with a competing therapy, or had a comorbid condition. A better approach than rigidly applying the study inclusion and exclusion criteria is to ask whether there is some compelling reason why the results do not apply to the patient. You usually will not find a compelling reason, in which case you can generalize the results to your patient with confidence.

A related issue has to do with the extent to which we can generalize findings from a study using a particular drug to another closely (or not so closely) related agent. The issue of drug class effects and how conservative one should be in assuming class effects remains controversial. Generalizing findings of surgical treatment may be even riskier. Randomized trials of carotid endarterectomy, for instance, demonstrate much lower perioperative rates of stroke and death than one might expect in one's own community, which may reflect on either the patients or surgeons (and their relative expertise) selected to participate in randomized trials.

A final issue arises when a patient fits the features of a subgroup of patients in the trial report. We encourage you to be skeptical of subgroup analyses. The treatment is likely to benefit the subgroup more or less than the other patients only if the difference in the effects of treatment in the subgroups is large and unlikely to occur by chance. Even when these conditions apply, the results may be misleading, particularly when investigators did not specify their hypotheses before the study began, if they had a large number of hypotheses, or if other studies fail to replicate the finding.

Were All Patient-Important Outcomes Considered?

Treatments are indicated when they provide important benefits. Demonstrating that a bronchodilator produces small increments in forced expiratory volume in patients with chronic airflow limitation, that a vasodilator improves cardiac output in heart failure patients, or that a lipid-lowering agent improves lipid profiles does not provide sufficient justification for administering these drugs. In these instances, investigators have chosen substitute outcomes or surrogate outcomes rather than those that patients would consider important. What clinicians and patients require is evidence that treatments improve outcomes that are important to patients, such as reducing shortness of breath during the activities required for daily living, avoiding hospitalization for heart failure, or decreasing the risk of a major stroke.

Substitute/Surrogate Outcomes

Trials of the impact of antiarrhythmic drugs after myocardial infarction illustrate the danger of using substitute outcomes or end points. Because abnormal ventricular depolarizations were associated with a high risk of death and antiarrhythmic drugs demonstrated a reduction in abnormal ventricular depolarizations (the substitute end point), it made sense that they should reduce death. A group of investigators performed randomized trials of 3 agents (encainide, flecainide, and moricizine) that had previously been found effective in suppressing the substitute end point of abnormal ventricular depolarizations. The investigators had to stop the trials when they discovered that mortality was substantially higher in patients receiving antiarrhythmic treatment than in those receiving placebo. Clinicians relying on the substitute end point of arrhythmia suppression would have continued to administer the 3 drugs, to the considerable detriment of their patients.

Even when investigators report favorable effects of treatment on a patient-important outcome, you must consider whether there may be deleterious effects on other outcomes. For instance, cancer chemotherapy may lengthen life but decrease its quality. Randomized trials often fail to adequately document the toxicity or adverse effects of the experimental intervention.

Composite End Points

Composite end points represent a final dangerous trend in presenting outcomes. Like surrogate outcomes, composite end points are attractive for reducing sample size and decreasing length of follow-up. Unfortunately, they can mislead. For example, a trial that reported a reduction in a composite outcome of death, nonfatal myocardial infarction, and admission for an acute coronary syndrome actually demonstrated a trend toward increased mortality with the experimental therapy and convincing effects only on admission for an acute coronary syndrome. The composite outcome most strongly reflects the treatment effect on the most common of its components, admission for an acute coronary syndrome, even though there is no convincing evidence that the treatment reduces the risk of death or myocardial infarction.

Another long-neglected outcome is the resource implications of alternative management strategies. Health care systems face increasing resource constraints that mandate careful attention to economic analysis.

PS: Substitute/surrogate end points

In clinical trials, a surrogate endpoint (or marker) is a measure of effect of a specific treatment that may correlate with a real clinical endpoint but does not necessarily have a guaranteed relationship. The National Institutes of Health (USA) defines surrogate endpoint as "a biomarker intended to substitute for a clinical endpoint".[1][2]

Surrogate markers are used when the primary endpoint is undesired (e.g., death), or when the number of events is very small, thus making it impractical to conduct a clinical trial to gather a statistically significant number of endpoints. The FDA and other regulatory agencies will often accept evidence from clinical trials that show a direct clinical benefit to surrogate markers. [3]

A surrogate endpoint of a clinical trial is a laboratory measurement or a physical sign used as a substitute for a clinically meaningful endpoint that measures directly how a patient feels, functions or survives. Changes induced by a therapy on a surrogate endpoint are expected to reflect changes in a clinically meaningful endpoint. [6]

A commonly used example is cholesterol. While elevated cholesterol levels increase the likelihood for heart disease, the relationship is not linear – many people with normal cholesterol develop heart disease, and many with high cholesterol do not. "Death from heart disease" is the endpoint of interest, but "cholesterol" is the surrogate marker. A clinical trial may show that a particular drug (for example, simvastatin (Zocor)) is effective in reducing cholesterol, without showing directly that simvastatin prevents death.

Are the Likely Treatment Benefits Worth the Potential Harm and Costs?

If the results of a study apply to your patient and the outcomes are important to your patient, the next question concerns whether the probable treatment benefits are worth the associated risks, burden, and resource requirements. A 25% reduction in the RR of death may sound impressive, but its impact on your patient may nevertheless be minimal. This notion is illustrated by using a concept called number needed to treat (NNT), the number of patients who must receive an intervention or therapy during a specific period to prevent 1 adverse outcome or produce 1 positive outcome. See here for how to calculate NNT:

The impact of a treatment is related not only to its RRR but also to the risk of the adverse outcome it is designed to prevent. One large trial in myocardial infarction suggests that clopidogrel in addition to aspirin reduces the RR of death from a cardiovascular cause, nonfatal myocardial infarction, or stroke by approximately 20% in comparison to aspirin alone. Table 6-3 considers 2 patients presenting with acute myocardial infarction without elevation of ST segments on their electrocardiograms. Compared with aspirin alone, both patients have an RRR of approximately 20%, but the ARR is quite different between the two patients, which results in a significantly different NNT.

A key element of the decision to start therapy, therefore, is to consider the patient's risk of the event if left untreated. For any given RRR, the higher the probability that a patient will experience an adverse outcome if we do not treat, the more likely the patient will benefit from treatment and the fewer such patients we need to treat to prevent 1 adverse outcome. Knowing the NNT assists clinicians in helping patients weigh the benefits and downsides associated with their management options. What if the situation is reversed? Treatments usually induce some harm compared with control (adverse events are in the nature of drugs); in this example, the harm is an increased risk of bleeding. The answer is that, for any given RRI (relative risk increase), the higher the probability that a patient will experience the adverse outcome if we treat, the more likely the patient will be harmed by treatment and the fewer such patients we need to treat to cause 1 adverse outcome.
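Because Table 6-3 is not reproduced here, the point can be illustrated with two hypothetical patients who share the trial's roughly 20% RRR but differ in baseline risk. The NNT is simply the reciprocal of the ARR, and the ARR is the baseline (control) risk multiplied by the RRR:

```python
import math

def nnt(control_risk, rrr):
    """Number needed to treat = 1 / ARR, where ARR = control_risk * RRR."""
    arr = control_risk * rrr
    return math.ceil(1 / arr)

# Hypothetical patients, both with a 20% RRR from adding clopidogrel to aspirin:
print(nnt(0.10, 0.20))   # high-risk patient: ARR 2%   -> NNT 50
print(nnt(0.01, 0.20))   # low-risk patient:  ARR 0.2% -> NNT 500
```

The identical relative effect translates into a tenfold difference in NNT, which is why the baseline risk, not just the RRR, must drive the treatment decision. The same reciprocal logic applied to an absolute risk increase gives the number needed to harm (NNH).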

Trading off benefits and risk also requires an accurate assessment of the adverse effects of treatment. Randomized trials with relatively small sample sizes are unsuitable for detecting rare but catastrophic adverse effects of therapy. Clinicians often must look to other sources of information – often characterized by higher risk of bias – to obtain an estimate of the adverse effects of therapy.

When determining the optimal treatment choice based on the relative benefits and harms of a therapy, the values and preferences of each individual patient must be considered. How best to communicate information to patients and how to incorporate their values into clinical decision making remain areas of active investigation in evidence-based medicine.

(The End)