**Confidence Intervals for One Population Proportion**

Statisticians often need to determine the proportion (percentage) of a population that has a specific attribute. Some examples are:

- the percentage of U.S. adults who have health insurance
- the percentage of cars in the United States that are imports
- the percentage of U.S. adults who favor stricter clean air health standards
- the percentage of Canadian women in the labor force

In the first case, the population consists of all U.S. adults and the specified attribute is "has health insurance." For the second case, the population consists of all cars in the United States and the specific attribute is "is an import." The population in the third case is all U.S. adults and the specified attribute is "favors stricter clean air health standards." In the fourth case, the population consists of all Canadian women and the specified attribute is "is in the labor force."

We know that it is often impractical or impossible to take a census of a large population. In practice, therefore, we use data from a sample to make inferences about the population proportion.

A sample proportion, *p*^{^}, is computed by using the formula

*p*^{^} = *x* / *n*,

where *x* denotes the number of members in the sample that have the specified attribute and, as usual, *n* denotes the sample size. For convenience, we sometimes refer to *x* as the number of successes and to *n* – *x* as the number of failures.

**The Sampling Distribution of the Sample Proportion**

To make inferences about a population mean, μ, we must know the sampling distribution of the sample mean, that is, the distribution of the variable *x*(bar) (for details on confidence intervals for one population mean, see the thread "Statistic Procedure – Confidence Interval" http://www.tomhsiung.com/wordpress/2017/08/statistic-procedures-confidence-interval/). The same is true for proportions: to make inferences about a population proportion, *p*, we need to know the sampling distribution of the **sample proportion**, that is, the distribution of the variable *p*^{^}. Because a proportion can always be regarded as a mean, we can use our knowledge of the sampling distribution of the sample mean to derive the sampling distribution of the sample proportion. In practice, the sample size usually is large, so we concentrate on that case.

For large *n*, the sample proportion *p*^{^} is approximately normally distributed with mean *p* and standard deviation √(*p*(1 – *p*) / *n*) (Key Fact 12.1). The accuracy of this normal approximation depends on *n* and *p*. If *p* is close to 0.5, the approximation is quite accurate, even for moderate *n*. The farther *p* is from 0.5, the larger *n* must be for the approximation to be accurate. As a rule of thumb, **we use the normal approximation when np and n(1 – p) are both 5 or greater**. Another commonly used rule of thumb is that *np* and *n*(1 – *p*) are both 10 or greater; still another is that *np*(1 – *p*) is 25 or greater.
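The three rules of thumb can be compared side by side with a small helper. This is only an illustrative sketch: the function name `normal_approx_checks` and the dictionary keys are made up for this example, and in practice *p* is unknown, so a hypothesized or estimated value would be plugged in.

```python
def normal_approx_checks(n: int, p: float) -> dict:
    """Evaluate three common rules of thumb for the normal approximation.

    n is the sample size and p is the (assumed) population proportion.
    """
    successes = n * p        # expected number of successes, np
    failures = n * (1 - p)   # expected number of failures, n(1 - p)
    return {
        "both_at_least_5": successes >= 5 and failures >= 5,
        "both_at_least_10": successes >= 10 and failures >= 10,
        "npq_at_least_25": n * p * (1 - p) >= 25,
    }


# p near 0.5 versus p far from 0.5, same sample size
print(normal_approx_checks(100, 0.5))
print(normal_approx_checks(100, 0.07))
```

With the same sample size, a proportion far from 0.5 passes only the most lenient rule, which matches the remark that *n* must grow as *p* moves away from 0.5.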

Procedure 12.1 is the one-proportion z-interval procedure, which is also known as the one-sample z-interval procedure for a population proportion and the one-variable proportion interval procedure. Note that, as stated in Assumption 2 of Procedure 12.1, a condition for using that procedure is that "the number of successes, *x*, and the number of failures, *n* – *x*, are both 5 or greater." We can restate this condition as "*np*^{^} and *n*(1 – *p*^{^}) are both 5 or greater," which, for an unknown *p*, corresponds to the rule of thumb for using the normal approximation given after Key Fact 12.1.
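Procedure 12.1 itself is not reproduced in this thread. As a rough sketch, the standard large-sample interval is *p*^{^} ± *z*_{α/2} · √(*p*^{^}(1 – *p*^{^}) / *n*), and it could be computed as follows. The function name and the example counts (520 successes out of 1000) are hypothetical, and the check on *x* and *n* – *x* mirrors the assumption quoted above.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_interval(x: int, n: int, confidence: float = 0.95):
    """Large-sample z-interval for a population proportion.

    x is the number of successes, n the sample size.
    Returns (lower, upper).
    """
    if x < 5 or n - x < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = x / n
    # z_{alpha/2}: the value with area alpha/2 to its right under the standard normal curve
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 520 of 1000 sampled adults have the attribute
lower, upper = one_proportion_z_interval(520, 1000)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

`statistics.NormalDist` from the Python standard library stands in for the standard normal table here.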

**Determining the Required Sample Size**

If the margin of error (*E*) and confidence level are specified in advance, then we must determine the sample size required to meet those specifications. Solving for *n* in the formula for the margin of error, *E* = *z*_{α/2} · √(*p*^{^}(1 – *p*^{^}) / *n*), we get

*n* = *p*^{^}(1 – *p*^{^})(*z*_{α/2} / *E*)^{2}

**This formula cannot be used to obtain the required sample size because the sample proportion, p^{^}, is not known prior to sampling**. There are two ways around this problem. To begin, we examine the graph of *p*^{^}(1 – *p*^{^}) versus *p*^{^} shown in Figure 12.1. The graph reveals that the largest *p*^{^}(1 – *p*^{^}) can be is 0.25, which occurs when *p*^{^} = 0.5. The farther *p*^{^} is from 0.5, the smaller will be the value of *p*^{^}(1 – *p*^{^}).

Because the largest possible value of *p*^{^}(1 – *p*^{^}) is 0.25, the most conservative approach for determining sample size is to use that value in the above equation. The sample size obtained then will generally be larger than necessary and the margin of error less than required. Nonetheless, this approach guarantees that the specifications will at least be met.

In the same vein, if we have in mind a likely range for the observed value of *p*^{^}, then, in light of Figure 12.1, we should take as our educated guess for *p*^{^} the value in the range closest to 0.5. In either case, we should be aware that, if the observed value of *p*^{^} is closer to 0.5 than is our educated guess, the margin of error will be larger than desired.
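Both approaches amount to plugging a chosen value into the sample-size formula and rounding up to the next whole number. The sketch below assumes a hypothetical function name `required_sample_size` and uses 0.5 (the conservative choice) as the default guess.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin_of_error: float,
                         confidence: float = 0.95,
                         p_guess: float = 0.5) -> int:
    """Sample size needed for a given margin of error E.

    p_guess = 0.5 is the conservative choice, since p(1 - p) is largest there;
    otherwise pass the value in the likely range that is closest to 0.5.
    """
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = p_guess * (1 - p_guess) * (z / margin_of_error) ** 2
    return ceil(n)  # round up so the margin of error is not exceeded


# Conservative approach: no prior idea about the sample proportion
print(required_sample_size(0.03))               # 1068
# Educated guess: sample proportion expected to be at most 0.3
print(required_sample_size(0.03, p_guess=0.3))  # 897
```

The educated guess gives a smaller sample size, but only delivers the desired margin of error if the observed *p*^{^} is no closer to 0.5 than the guess.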

**Hypothesis Tests for One Population Proportion**

Earlier, we showed how to obtain confidence intervals for a population proportion. Now we show how to perform hypothesis tests for a population proportion. This procedure is actually a special case of the one-mean z-test. From Key Fact 12.1, we deduce that, for large *n*, the standardized version of *p*^{^},

*z* = (*p*^{^} – *p*) / √(*p*(1 – *p*) / *n*),

has approximately the standard normal distribution. Consequently, to perform a large-sample hypothesis test with null hypothesis *H*_{0}: *p* = *p*_{0}, we can use the variable

*z* = (*p*^{^} – *p*_{0}) / √(*p*_{0}(1 – *p*_{0}) / *n*)

as the test statistic and obtain the critical value(s) or P-value from the standard normal table. We call this hypothesis-testing procedure the **one-proportion z-test**.
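A minimal sketch of the one-proportion z-test, with the standard normal table replaced by `statistics.NormalDist`. The function name, the `alternative` labels, and the example counts are assumptions made for illustration; the assumption checks of the full procedure are omitted.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x: int, n: int, p0: float,
                          alternative: str = "two-sided"):
    """Large-sample z-test of H0: p = p0. Returns (z, p_value)."""
    p_hat = x / n
    # Standardize the sample proportion using the null value p0
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")
    return z, p_value


# Hypothetical data: 520 successes in 1000 trials, testing H0: p = 0.5
z, p_value = one_proportion_z_test(520, 1000, 0.5)
print(f"z = {z:.3f}, P-value = {p_value:.3f}")
```

The standard deviation in the denominator uses *p*_{0}, not *p*^{^}, because the test statistic is computed under the assumption that the null hypothesis is true.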

**Hypothesis Tests for Two Population Proportions**

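For independent samples of sizes *n*_{1} and *n*_{2} from the two populations, we have Key Fact 12.2: the difference *p*^{^}_{1} – *p*^{^}_{2} between the two sample proportions has mean *p*_{1} – *p*_{2} and standard deviation √(*p*_{1}(1 – *p*_{1}) / *n*_{1} + *p*_{2}(1 – *p*_{2}) / *n*_{2}) and, for large samples, is approximately normally distributed.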

Now we can develop a hypothesis-testing procedure for comparing two population proportions. Our immediate goal is to identify a variable that we can use as the test statistic. From Key Fact 12.2, we know that, for large, independent samples, the standardized variable

*z* = ((*p*^{^}_{1} – *p*^{^}_{2}) – (*p*_{1} – *p*_{2})) / √(*p*_{1}(1 – *p*_{1}) / *n*_{1} + *p*_{2}(1 – *p*_{2}) / *n*_{2})

has approximately the standard normal distribution. The null hypothesis for a hypothesis test to compare two population proportions is *H*_{0}: *p*_{1} = *p*_{2}. If the null hypothesis is true, then *p*_{1} – *p*_{2} = 0 and, writing *p* for the common value of *p*_{1} and *p*_{2}, the variable above becomes

*z* = (*p*^{^}_{1} – *p*^{^}_{2}) / √(*p*(1 – *p*)(1 / *n*_{1} + 1 / *n*_{2})).

However, because *p* is unknown, we cannot use this variable as the test statistic. Consequently, **we must estimate p by using sample information**. The best estimate of *p* is obtained by pooling the data to get the proportion of successes in both samples combined; that is, we estimate *p* by

*p*^{^}_{p} = (*x*_{1} + *x*_{2}) / (*n*_{1} + *n*_{2}),

where *x*_{1} and *x*_{2} denote the numbers of successes in the two samples and *p*^{^}_{p} is called the **pooled sample proportion**. Replacing *p* by *p*^{^}_{p} gives the final test statistic,

*z* = (*p*^{^}_{1} – *p*^{^}_{2}) / √(*p*^{^}_{p}(1 – *p*^{^}_{p})(1 / *n*_{1} + 1 / *n*_{2})),

which has approximately the standard normal distribution for large samples if the null hypothesis is true. Hence we have Procedure 12.3, the **two-proportions z-test**, which is also known as the two-sample z-test for two population proportions and the two-variable proportions test.
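The pooled test statistic can be sketched in a few lines. The function name and the example counts below are hypothetical, only the two-sided P-value is computed, and the assumption checks of Procedure 12.3 are left out.

```python
from math import sqrt
from statistics import NormalDist


def two_proportions_z_test(x1: int, n1: int, x2: int, n2: int):
    """Large-sample z-test of H0: p1 = p2 using the pooled sample proportion.

    Returns (z, two_sided_p_value).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pool both samples to estimate the common proportion under H0
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 200 successes in sample 1, 45 of 200 in sample 2
z, p_value = two_proportions_z_test(60, 200, 45, 200)
print(f"z = {z:.3f}, P-value = {p_value:.3f}")
```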

Key Fact 12.2 can also be used to derive a confidence-interval procedure for the difference between two population proportions, called the **two-proportions z-interval procedure**. Note the following: 1) The two-proportions z-interval procedure is also known as the two-sample z-interval procedure for two population proportions and the two-variable proportions interval procedure. 2) Guidelines for interpreting confidence intervals for the difference, *p*_{1} – *p*_{2}, between two population proportions are similar to those for interpreting confidence intervals for the difference, μ_{1} – μ_{2}, between two population means, as described in other related threads.
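The procedure itself is not shown in this thread. As a sketch of the standard large-sample interval, the endpoints are (*p*^{^}_{1} – *p*^{^}_{2}) ± *z*_{α/2} · √(*p*^{^}_{1}(1 – *p*^{^}_{1}) / *n*_{1} + *p*^{^}_{2}(1 – *p*^{^}_{2}) / *n*_{2}). Unlike the test, the interval does not pool the samples, because it does not assume *p*_{1} = *p*_{2}. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportions_z_interval(x1: int, n1: int, x2: int, n2: int,
                               confidence: float = 0.95):
    """Large-sample z-interval for p1 - p2 (unpooled standard error).

    Returns (lower, upper).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se


# Hypothetical data: 60 of 200 successes in sample 1, 45 of 200 in sample 2
lower, upper = two_proportions_z_interval(60, 200, 45, 200)
print(f"95% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

An interval that contains 0 is consistent with the two population proportions being equal at the chosen confidence level.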

**Update** on Oct 2 2017

**Supplemental Data – Confidence Intervals of Odds Ratio (OR) and Relative Risk (RR)**

**OR**

The sampling distribution of the odds ratio is positively skewed. However, **it is approximately normally distributed on the natural log scale**. After finding the limits on the LN scale, use the EXP function to find the limits on the original scale. The standard deviation of LN(OR) is

SD of LN(OR) = square root of (1/a + 1/b + 1/c + 1/d),

where a, b, c, and d are the four cell counts of the 2 × 2 table.

Because LN(OR) is approximately normally distributed with this standard deviation, a z-interval can be computed on the log scale as LN(OR) ± *z*_{α/2} × SD of LN(OR). Applying EXP to both limits then gives the confidence interval for the OR itself.
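The log-scale steps can be sketched as follows. The cell layout is an assumption made for this example: a and b are the counts with and without the outcome in the first group, c and d the corresponding counts in the second group, so that OR = (a·d)/(b·c). The function name and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a: int, b: int, c: int, d: int, confidence: float = 0.95):
    """Confidence interval for the odds ratio, built on the natural log scale.

    Assumed 2x2 layout: group 1 has a events and b non-events,
    group 2 has c events and d non-events.
    Returns (odds_ratio, lower, upper).
    """
    odds_ratio = (a * d) / (b * c)
    sd_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    log_or = log(odds_ratio)
    # Limits on the log scale, then back-transformed with exp
    return odds_ratio, exp(log_or - z * sd_log_or), exp(log_or + z * sd_log_or)


# Hypothetical table: 20/80 in group 1, 10/90 in group 2
or_, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_:.2f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

Because the interval is symmetric on the log scale, it is asymmetric around the OR on the original scale, which reflects the positive skew.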

**RR**

As with the OR, the sampling distribution of the relative risk is positively skewed but is approximately normally distributed on the natural log scale. Constructing a confidence interval for the relative risk is similar to constructing a CI for the odds ratio except that there is a different formula for the SD.

SD of LN(RR) = square root of [ b/(a(a+b)) + d/(c(c+d)) ],

where a and b are the counts with and without the outcome in the first group, and c and d are the corresponding counts in the second group.
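A matching sketch for the relative risk, under the same assumed 2 × 2 layout (a events and b non-events in the first group, c events and d non-events in the second), so that RR = [a/(a+b)] / [c/(c+d)]. The function name and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(a: int, b: int, c: int, d: int, confidence: float = 0.95):
    """Confidence interval for the relative risk, built on the natural log scale.

    Assumed 2x2 layout: group 1 has a events and b non-events,
    group 2 has c events and d non-events.
    Returns (relative_risk, lower, upper).
    """
    risk1 = a / (a + b)
    risk2 = c / (c + d)
    relative_risk = risk1 / risk2
    sd_log_rr = sqrt(b / (a * (a + b)) + d / (c * (c + d)))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    log_rr = log(relative_risk)
    # Limits on the log scale, then back-transformed with exp
    return relative_risk, exp(log_rr - z * sd_log_rr), exp(log_rr + z * sd_log_rr)


# Hypothetical table: 20/80 in group 1, 10/90 in group 2
rr, lower, upper = relative_risk_ci(20, 80, 10, 90)
print(f"RR = {rr:.2f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```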