
Nonparametric Methods of Estimating Survival Functions


Of the three survival functions, the survivorship function, or its graphical presentation, the survival curve, is the most widely used. Examples include the product-limit (PL) method of estimating the survivorship function by Kaplan and Meier, life-table analysis, the relative survival rate, the five-year survival rate, and the corrected survival rate. The product-limit method is applicable to small, moderate, and large samples. However, if the data have already been grouped into intervals, or the sample size is very large, say in the thousands, or the interest is in a large population, it may be more convenient to perform a life-table analysis. The PL estimates and life-table estimates of the survivorship function are essentially the same. Many authors use the term life-table estimates for the PL estimates. The only difference is that the PL estimate is based on individual survival times, whereas in the life-table method, survival times are grouped into intervals. The PL estimate can be considered a special case of the life-table estimate in which each interval contains only one observation.

Product-Limit Estimates of Survivorship Function

Let us first consider the simple case where all the patients are observed to death, so that the survival times are exact and known. Let t1, t2, …, tn be the exact survival times of the n individuals under study. Conceptually, we consider this group of patients as a random sample from a much larger population of similar patients. We relabel the n survival times t1, t2, …, tn in ascending order such that t(1) <= t(2) <= … <= t(n). As a consequence (by the definition of the survival function), the survivorship function at t(i) can be estimated as
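
S^(t(i)) = ni/n = (n – i)/n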

where ni is the number of people in the sample surviving longer than t(i). If two or more t(i) are equal (tied observations), the largest i value is used. This gives a conservative estimate for the tied observations. In practice, the sample survivorship function is computed at every distinct survival time. We do not have to worry about the intervals between the distinct survival times, in which no one dies and the survivorship function remains constant. The survivorship function in this case is a step function starting at 1.0 (100%) and decreasing in steps of 1/n (if there are no ties) to zero. When the survivorship function is plotted against t, the various percentiles of survival time can be read from the graph or calculated from the survivorship function.

This method can be applied only if all patients are followed to death. If some of the patients are still alive at the end of the study, a different method of estimating the survivorship function, such as the PL estimate given by Kaplan and Meier, is required. The rationale can be illustrated by the following simple example. Suppose that 10 patients join a clinical study at the beginning of 2000; during that year 6 patients die and 4 survive. At the end of the year, 20 additional patients join the study. In 2001, 3 patients who entered at the beginning of 2000 and 15 patients who entered later die, leaving one and five survivors, respectively. Suppose that the study terminates at the end of 2001 and you want to estimate the proportion of patients in the population surviving for two years or more, that is, S(2).

The first group of patients in the example is followed for two years; the second group is followed for only one year. One possible estimate, the reduced-sample estimate, is S^(2) = 1/10 = 0.1, which ignores the 20 patients who are followed only for one year. Kaplan and Meier believe that the second sample, under observation for only one year, can contribute to the estimate of S(2).

Patients who survive two years may be considered as having survived the first year and then having survived one more year. Thus, the probability of surviving for two years or more is equal to the probability of surviving the first year and then surviving one more year. That is,

S(2) = P(surviving first year and then surviving one more year)

which can be written as

S(2) = P(surviving two years given patient has survived first year) x P(surviving first year)

The Kaplan-Meier estimate of S(2) follows this relation, with each probability replaced by the corresponding observed proportion.

For the data given above, one of the four patients who survived the first year survived two years, so the first proportion is 1/4. Four of the 10 patients who entered at the beginning of 2000 and 5 of the 20 patients who entered at the end of 2000 survived one year. Therefore, the second proportion is (4 + 5) / (10 + 20). The PL estimate of S(2) is
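
S^(2) = (1/4) x (9/30) = 0.075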

This simple rule may be generalized as follows: The probability of surviving k (>=2) or more years from the beginning of the study is a product of k observed survival rates:
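
S^(k) = p1 x p2 x … x pk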

where p1 denotes the proportion of patients surviving at least one year, p2 the proportion of patients surviving the second year after they have survived one year, p3 the proportion of patients surviving the third year after they have survived two years, and pk the proportion of patients surviving the kth year after they have survived k – 1 years.

Therefore, the PL estimate of the probability of surviving any particular number of years from the beginning of the study is the product of the same estimate up to the preceding year and the observed survival rate for the particular year, that is,
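
S^(t) = S^(t – 1) x pt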

The PL estimates are maximum likelihood estimates. In practice, the PL estimates can be calculated by constructing a table with five columns following the outline below.

  • Column 1 contains all the survival times, both censored and uncensored, in order from smallest to largest. Affix a plus sign to each censored observation. If a censored observation has the same value as an uncensored observation, the latter should appear first.
  • The second column, labeled i, consists of the corresponding rank of each observation in column 1.
  • The third, labeled r, pertains to uncensored observations only. Let r = i.
  • Compute (n – r) / (n – r + 1), or pi, for every uncensored observation t(i) in column 4 to give the proportion of patients surviving up to and then through t(i).
  • In column 5, S^(t) is the product of all values of (n – r) / (n – r + 1) up to and including t. If some uncensored observations are tied, the smallest S^(t) should be used.
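
The following is a minimal computational sketch of the PL estimate in Python; the function name, the censoring convention (event = 1 for death, 0 for censored), and the toy data are illustrative assumptions, not part of the procedure above.

    def kaplan_meier(times, events):
        """Return (t, S_hat) pairs: the PL estimate at each distinct death time."""
        data = sorted(zip(times, events))        # order observations by survival time
        n_at_risk = len(data)
        s_hat = 1.0
        estimates = []
        i = 0
        while i < len(data):
            t = data[i][0]
            deaths = sum(1 for x, e in data if x == t and e == 1)
            tied = sum(1 for x, e in data if x == t)
            if deaths > 0:
                s_hat *= (n_at_risk - deaths) / n_at_risk   # conditional survival through t
                estimates.append((t, s_hat))
            n_at_risk -= tied                    # drop deaths and censorings at t from the risk set
            i += tied
        return estimates

    # Example: survival times 3, 4+, 5.7, 6.5, 8.4+ (+ = censored)
    print(kaplan_meier([3, 4, 5.7, 6.5, 8.4], [1, 0, 1, 1, 0]))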

The Kaplan-Meier method provides very useful estimates of survival probabilities and a graphical presentation of the survival distribution. It is the most widely used method in survival data analysis. Breslow and Crowley and Meier have shown that under certain conditions the estimate is consistent and asymptotically normal. However, a few critical features should be mentioned.

  • The Kaplan-Meier estimates are limited to the time interval in which the observations fall. If the largest observation is uncensored, the PL estimate at that time equals zero. Although the estimate may not be welcomed by physicians, it is correct since no one in the sample lives longer. If the largest observation is censored, the PL estimate can never equal zero and is undefined beyond the largest observation.
  • The most commonly used summary statistic in survival analysis is the median survival time. A simple estimate of the median can be read from a survival curve estimated by the PL method as the time t at which S^(t) = 0.5. However, the solution may not be unique: if the survival curve is horizontal at S^(t) = 0.5 over some interval, any t value in that interval is a reasonable estimate of the median. A practical solution is to take the midpoint of the interval as the PL estimate of the median.
  • If less than 50% of the observations are uncensored and the largest observation is censored, the median survival time cannot be estimated. A practical way to handle the situation is to use probabilities of surviving a given length of time, say 1, 3, or 5 years, or the mean survival time limited to a given time t.
  • The PL method assumes that the censoring times are independent of the survival times. In other words, the reason an observation is censored is unrelated to the cause of death. This assumption is satisfied if the patient is still alive at the end of the study period. However, it is violated if the patient develops severe adverse effects from the treatment and is forced to leave the study before death, or if the patient dies of a cause other than the one under study. When there is inappropriate censoring, the PL method is not appropriate. In practice, little can be done about the problem other than designing the study to avoid such censoring or to keep it to a minimum.
  • Similar to other estimators, the standard error (S.E.) of the Kaplan-Meier estimator of S(t) gives an indication of the potential error of S^(t). The confidence interval deserves more attention than just the point estimate S^(t). A 95% confidence interval for S(t) is S^(t) ± 1.96 S.E.[S^(t)].

Life-Table Analysis

The life-table method is one of the oldest techniques for measuring mortality and describing the survival experience of a population. It has been used by actuaries, demographers, governmental agencies, and medical researchers in studies of survival, population growth, fertility, migration, length of married life, length of working life, and so on. There has been a decennial series of life tables on the entire U.S. population since 1900. State and local governments also publish life tables. These life tables, summarizing the mortality experience of a specific population for a specific period of time, are called population life tables. As clinical and epidemiologic research have become more common, the life-table method has been applied to patients with a given disease who have been followed for a period of time. Life tables constructed for patients are called clinical life tables. Although population and clinical life tables are similar in calculation, the sources of the required data are different.

There are two kinds of population life tables: the cohort life table and the current life table. The cohort life table describes the survival or mortality experience from birth to death of a specific cohort of persons who were born at about the same time, for example, all persons born in 1950. The cohort has to be followed from 1950 until all of them die. The proportions dying (surviving) are then used to construct life tables for successive calendar years. This type of table, useful in population projection and prospective studies, is not often constructed since it requires a long follow-up period.

The current life table is constructed by applying the age-specific mortality rates of a population in a given period of time to a hypothetical cohort of 100,000 or 1,000,000 persons. The starting point is birth at year 0. Two sources of data are required for constructing a population life table: 1) census data on the number of living persons at each age for a given year at midyear and 2) vital statistics on the number of deaths in the given year for each age. For example, a current U.S. life table assumes a hypothetical cohort of 100,000 persons subject to the age-specific death rates based on the observed data for the United States in the 1900 census. The current life table, based on the life experience of an actual population over a short period of time, gives a good summary of current mortality. This type of life table is regularly published by government agencies at different levels. One of the most often reported statistics from current life tables is the life expectancy. The term population life table is often used to refer to the current life table.

Current life tables usually have the following columns:

  • Age interval (x to x + t). This is the time interval between two exact ages x and x + t; t is the length of the interval. For example, the interval 20-21 includes the time interval from the 20th birthday up to the 21st birthday (but not including the 21st birthday).
  • Proportion of persons alive at beginning of age interval but dying during the interval (tqx). For example, tqx for the age interval 20-21 is the proportion of persons who died on or after their 20th birthday and before their 21st birthday. It is an estimate of the conditional probability of dying in the interval given that the person is alive at age x. This column is usually calculated from data of the decennial census of population and deaths occurring in the given time interval.
  • Number living at beginning of age interval (lx). The initial value of lx, the size of the hypothetical population, is usually 100,000 or 1,000,000. The successive values are computed using the formula lx = lx-t(1 – tqx-t), where 1 – tqx-t is the proportion of persons who survived the previous age interval.
  • Number dying during age interval (tdx), where tdx = lx(tqx) = lx – lx+t.
  • Stationary population (tLx and Tx). Here tLx is the total number of years lived in the age interval, or the number of person-years that the lx persons, aged exactly x, live through the interval. For those who survive the interval, their contribution to tLx is the length of the interval, t. For those who die during the interval, we may not know the exact time of death, and the survival time must be estimated. The conventional assumption is that they live one-half of the interval and contribute t/2 to the calculation of tLx. Thus, tLx = t(lx+t + 1/2*tdx). The symbol Tx is the total number of person-years lived beyond age x by persons alive at that age.
  • Average remaining lifetime or average number of years of life remaining at beginning of age interval (e°x). This is also known as the life expectancy at a given age, defined as the number of years remaining to be lived by persons at age x: e°x = Tx/lx. The expected age at death of a person aged x is x + e°x. The e°x at x = 0 is the life expectancy at birth.

The clinical life table, or the actuarial life-table method, has been applied to clinical data for many decades. Berkson and Gage and Cutler and Ederer give a life-table method for estimating the survivorship function; Gehan provides methods for estimating all three functions (survivorship, density, and hazard).

The life-table method requires a fairly large number of observations, so that survival times can be grouped into intervals. Like the PL estimate, the life-table method incorporates all survival information accumulated up to the termination of the study. For example, in computing a five-year survival rate of breast cancer patients, one need not restrict oneself only to those patients who have been in the study for five or more years. Patients who have been entered for four, three, two, and even one year contribute useful information to the evaluation of five-year survival. In this way, the life-table technique uses incomplete data, such as losses to follow-up and persons withdrawn alive, as well as complete death data.

The columns of a clinical life table are:

  • Interval (ti to ti+1). The first column gives the intervals into which the survival times and times to loss or withdrawal are distributed. The interval is from ti up to but not including ti+1, i = 1, …, s. The last interval has an infinite length. These intervals are assumed to be fixed.
  • Midpoint (tmi). The midpoint of each interval, designated tmi, i = 1, …, s-1, is included for convenience in plotting the hazard and probability density functions. Both functions are plotted at tmi.
  • Width (bi). The width of each interval, bi = ti+1 – ti, i = 1, …, s-1, is needed for calculation of the hazard and density functions. The width of the last interval, bs, is theoretically infinite; no estimate of the hazard or density function can be obtained for this interval.
  • Number lost to follow-up (li). This is the number of people who are lost to observation and whose survival status is thus unknown in the ith interval (i = 1, …, s).
  • Number withdrawn alive (wi). People withdrawn alive in the ith interval are those known to be alive at the closing date of the study. The survival time recorded for such persons is the length of time from entrance to the closing date of the study.
  • Number dying (di). This is the number of people who die in the ith interval. The survival time of these people is the time from entrance to death.
  • Number entering the ith interval (n'i). The number of people entering the first interval n'1 is the total sample size. Other entries are determined from n'i = n'i-1 – li-1 – wi-1 – di-1. That is, the number of persons entering the ith interval is equal to the number studied at the beginning of the preceding interval minus those who are lost to follow-up, withdrawn alive, or have died in the preceding interval.
  • Number exposed to risk (ni). This is the number of people who are exposed to risk in the ith interval and is defined as ni = n'i – 1/2*(li + wi). It is assumed that the times to loss or withdrawal are approximately uniformly distributed in the interval. Therefore, people lost or withdrawn in the interval are exposed to the risk of death for one-half the interval. If there are no losses or withdrawals, ni = n'i.
  • Conditional proportion dying (q^i). This is defined as q^i = di/ni for i = 1, …, s-1, and q^s = 1. It is an estimate of the conditional probability of death in the ith interval given exposure to the risk of death in the ith interval.
  • Conditional proportion surviving (p^i). This is given by p^i = 1 – q^i, which is an estimate of the conditional probability of surviving in the ith interval.
  • Cumulative proportion surviving [S^(ti)]. This is an estimate of the survivorship function at time ti; it is often referred to as the cumulative survival rate. For i = 1, S^(t1) = 1 and for i = 2, …, s, S^(ti) = p^i-1 S^(ti-1). It is the usual life-table estimate and is based on the fact that surviving to the start of the ith interval means surviving to the start of and then through the (i – 1)th interval.
  • Estimated probability density function [f^(tm)]. This is defined as the probability of dying in the ith interval per unit width. Thus, a natural estimate at the midpoint of the interval is
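
f^(tmi) = [S^(ti) – S^(ti+1)] / bi = S^(ti) q^i / bi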

  • Hazard function [h^(tmi)]. The hazard function for the ith interval, estimated at the midpoint, is
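
h^(tmi) = di / [bi(ni – (1/2)di)]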

It is the number of deaths per unit time in the interval divided by the average number of survivors at the midpoint of the interval. That is, h^(tmi) is derived from f^(tmi)/S^(tmi) with S^(tmi) = 1/2*[S^(ti+1) + S^(ti)], since S^(ti) is defined as the probability of surviving at the beginning, not the midpoint, of the ith interval:
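
h^(tmi) = f^(tmi) / {1/2*[S^(ti+1) + S^(ti)]} = 2q^i / [bi(1 + p^i)] = di / [bi(ni – (1/2)di)]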

Several Major Distributions of Survival Function


Exponential Distribution

The simplest and most important distribution in survival studies is the exponential distribution. In the late 1940s, researchers began to choose the exponential distribution to describe the life pattern of electronic systems. The exponential distribution has since continued to play a role in lifetime studies analogous to that of the normal distribution in other areas of statistics. The exponential distribution is often referred to as a purely random failure pattern. It is famous for its unique "lack of memory," which requires that the age of the animal or person does not affect future survival. Although many survival data cannot be described adequately by the exponential distribution, an understanding of it facilitates the treatment of more general situations.

The exponential distribution is characterized by a constant hazard rate λ, its only parameter. A high λ value indicates high risk and short survival; a low λ value indicates low risk and long survival. When the survival time T follows the exponential distribution with a parameter λ, the probability density function is defined as
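
f(t) = λ exp(–λt),   t >= 0, λ > 0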

The cumulative distribution function is
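
F(t) = 1 – exp(–λt)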

and the survivorship function is then
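
S(t) = exp(–λt)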

and the hazard function is
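
h(t) = λ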

Note that the hazard function is a constant, λ, independent of t. Because the exponential distribution is characterized by a constant hazard rate, independent of the age of the person, there is no aging or wearing out, and failure or death is a random event independent of time. When natural logarithms of the survivorship function are taken, log S(t) = –λt, which is a linear function of t.

Weibull Distribution

The Weibull distribution is a generalization of the exponential distribution. However, unlike the exponential distribution, it does not assume a constant hazard rate and therefore has broader application. The Weibull distribution is characterized by two parameters, γ and λ. The value of γ determines the shape of the distribution curve and the value of λ determines its scaling. Consequently, γ and λ are called the shape and scale parameters, respectively. When γ = 1, the hazard rate remains constant as time increases; this is the exponential case. The hazard rate increases when γ > 1 and decreases when γ < 1 as t increases. Thus, the Weibull distribution may be used to model the survival distribution of a population with increasing, decreasing, or constant risk.

The probability density function, cumulative distribution function, survivorship function, and hazard function are:
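
In one common parameterization (chosen so that γ = 1 reduces to the exponential case with hazard λ), these take the form

f(t) = λγ(λt)^(γ–1) exp[–(λt)^γ],   t >= 0, γ > 0, λ > 0
F(t) = 1 – exp[–(λt)^γ]
S(t) = exp[–(λt)^γ]
h(t) = λγ(λt)^(γ–1)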

The Weibull distribution is named after the Swedish mathematician Waloddi Weibull, who described it in detail in 1951, although it was first identified by Fréchet and first applied by Rosin and Rammler to describe a particle size distribution.

Lognormal Distribution

In its simplest form the lognormal distribution can be defined as the distribution of a variable whose logarithm follows the normal distribution. Its origin may be traced as far back as 1879, when McAlister described explicitly a theory of the distribution. Most of its aspects have since been studied. Gaddum gave a review of its application in biology, followed by Boag's applications in cancer research. Its history, properties, estimation problems, and uses in economics have been discussed in detail by Aitchison and Brown. Later, other investigators also observed that the age at onset of Alzheimer's disease and the distribution of survival times of several diseases such as Hodgkin's disease and chronic leukemia could be rather closely approximated by a lognormal distribution, since they are markedly skewed to the right and the logarithms of survival times are approximately normally distributed.

Consider the survival time T such that log T is normally distributed with mean μ and variance σ². We then say that T is lognormally distributed and write T as Λ(μ, σ²). It should be noted that μ and σ² are not the mean and variance of the lognormal distribution. The hazard function of the lognormal distribution increases initially to a maximum and then decreases (almost as soon as the median is passed) to zero as time approaches infinity. Therefore, the lognormal distribution is suitable for survival patterns with an initially increasing and then decreasing hazard rate. By a central limit theorem, it can be shown that the distribution of the product of n independent positive variates approaches a lognormal distribution under very general conditions: for example, the distribution of the size of an organism whose growth is subject to many small impulses, the effect of each of which is proportional to the momentary size of the organism.

Gamma Distributions

The gamma distribution, which includes the exponential and chi-square distributions, was used long ago by Brown and Flood to describe the life of glass tumblers circulating in a cafeteria and by Birnbaum and Saunders as a statistical model for the life length of materials. Since then, this distribution has been used frequently as a model for industrial reliability problems and human survival.

Suppose that failure or death takes place in n stages or as soon as n subfailures have happened. At the end of the first stage, after time T1, the first subfailure occurs; after that the second stage begins and the second subfailure occurs after time T2; and so on. Total failure or death occurs at the end of the nth stage, when the nth subfailure happens. The survival time, T, is then T1 + T2 + … + Tn. The times T1, T2, …, Tn spent in each stage are assumed to be independently exponentially distributed with probability density function λ exp(–λti), i = 1, …, n. That is, the subfailures occur independently at a constant rate λ. The distribution of T is then called the Erlangian distribution. There is no need for the stages to have physical significance, since we can always assume that death occurs in the n-stage process just described. This idea, introduced by A. K. Erlang in his study of congestion in telephone systems, has been used widely in queuing theory and life processes.

The gamma distribution is characterized by two parameters, γ and λ. When 0 < γ < 1, there is negative aging and the hazard rate decreases monotonically from infinity to λ as time increases from 0 to infinity. When γ > 1, there is positive aging and the hazard rate increases monotonically from 0 to λ as time increases from 0 to infinity. When γ = 1, the hazard rate equals λ, a constant, as in the exponential case.

Log-logistic Distribution

The survival time T has a log-logistic distribution if log(T) has a logistic distribution. The density, survivorship, hazard, and cumulative hazard functions of the log-logistic distribution are, respectively,
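
In the parameterization consistent with the properties described in the next paragraph (parameters α and γ, median α^(–1/γ)), these are

f(t) = αγt^(γ–1) / (1 + αt^γ)^2
S(t) = 1 / (1 + αt^γ)
h(t) = αγt^(γ–1) / (1 + αt^γ)
H(t) = log(1 + αt^γ)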

The log-logistic distribution is characterized by two parameters, α and γ. The median of the log-logistic distribution is α^(–1/γ). When γ > 1, the log-logistic hazard has the value 0 at time 0, increases to a peak at a specific t, and then declines, which is similar to the lognormal hazard. When γ = 1, the hazard starts at α^(1/γ) and then declines monotonically. When γ < 1, the hazard starts at infinity and then declines, which is similar to the Weibull distribution. The hazard function declines toward 0 as t approaches infinity. Thus, the log-logistic distribution may be used to describe a first increasing and then decreasing hazard or a monotonically decreasing hazard.

Other Survival Distributions

Many other distributions can be used as models of survival time, three of which we discuss briefly in this section: the linear exponential, the Gompertz, and a distribution whose hazard rate is a step function. The linear-exponential model and the Gompertz distribution are extensions of the exponential distribution. Both describe survival patterns that have a constant initial hazard rate. The hazard rate varies as a linear function of time or age in the linear-exponential model and as an exponential function of time or age in the Gompertz distribution.

In demonstrating the use of the linear-exponential model, Broadbent uses as an example the service of milk bottles that are filled in a dairy, circulated to customers, and returned empty to the dairy. The model was also used by Carbone et al. to describe the survival pattern of patients with plasmacytic myeloma. The hazard function of the linear-exponential distribution is
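
h(t) = λ + γt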

where λ and γ can take any values such that h(t) is nonnegative. The hazard rate increases from λ with time if γ > 0, decreases if γ < 0, and remains constant (the exponential case) if γ = 0. The probability density function and the survivorship function are, respectively,
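
f(t) = (λ + γt) exp[–(λt + 1/2*γt^2)]
S(t) = exp[–(λt + 1/2*γt^2)]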

The Gompertz distribution is also characterized by two parameters, λ and γ. The hazard function, survivorship function, and probability density function are, respectively,
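
One common form (the exact parameterization varies by author), with λ > 0, is

h(t) = λ exp(γt)
S(t) = exp[(λ/γ)(1 – exp(γt))]
f(t) = λ exp(γt) exp[(λ/γ)(1 – exp(γt))]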

Finally, we consider a distribution whose hazard rate is a step function. The hazard rate, survivorship function, and probability density function are, respectively,
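
Writing ai for the constant hazard in the ith interval [ti-1, ti), with t0 = 0, a general form is

h(t) = ai,   ti-1 <= t < ti, i = 1, …, k
S(t) = exp[–ai(t – ti-1) – Σj<i aj(tj – tj-1)],   ti-1 <= t < ti
f(t) = h(t)S(t)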


One application of this distribution is life-table analysis. In a life-table analysis, time is divided into intervals and the hazard rate is assumed to be constant in each interval. However, the overall hazard rate is not necessarily constant.


The nine distributions described above are, among others, reasonable models for survival time distributions. All have been developed by considering a biological failure, a death process, or an aging property. They may or may not be appropriate for many practical situations, but the objective here is to illustrate the various possible techniques, assumptions, and arguments that can be used to choose the most appropriate model. If none of these distributions fits the data, investigators might have to derive an original model to suit the particular data, perhaps by using some of the ideas presented here.

Statistical Procedures – Hypothesis Tests for One Population Mean


We often use inferential statistics to make decisions or judgments about the value of a parameter, such as a population mean. One of the most commonly used methods for making such decisions or judgments is to perform a hypothesis test. A hypothesis is a statement that something is true. Typically, a hypothesis test involves two hypotheses: the null hypothesis and the alternative hypothesis (or research hypothesis), which we define as follows.

  • Null hypothesis: A hypothesis to be tested. We use the symbol H0 to represent the null hypothesis.
  • Alternative hypothesis: A hypothesis to be considered as an alternative to the null hypothesis. We use the symbol Ha to represent the alternative hypothesis.
  • Hypothesis test: The problem in a hypothesis test is to decide whether the null hypothesis should be rejected in favor of the alternative hypothesis.

The first step in setting up a hypothesis test is to decide on the null hypothesis and the alternative hypothesis. The following are some guidelines for choosing these two hypotheses. Although the guidelines refer specifically to hypothesis tests for one population mean, μ, they apply to any hypothesis test concerning one parameter.

The null hypothesis for a hypothesis test concerning a population mean, μ, always specifies a single value for that parameter. Hence we can express the null hypothesis as H0: μ = μ0, where μ0 is some number. The choice of the alternative hypothesis depends on and should reflect the purpose of the hypothesis test. Three choices are possible for the alternative hypothesis:

  • If the primary concern is deciding whether a population mean, μ, is different from a specified value μ0, we express the alternative hypothesis as Ha: μ != μ0. A hypothesis test whose alternative hypothesis has this form is called a two-tailed test.
  • If the primary concern is deciding whether a population mean, μ, is less than a specified value μ0, we express the alternative hypothesis as Ha: μ < μ0. A hypothesis test whose alternative hypothesis has this form is called a left-tailed test.
  • If the primary concern is deciding whether a population mean, μ, is greater than a specified value μ0, we express the alternative hypothesis as Ha: μ > μ0. A hypothesis test whose alternative hypothesis has this form is called a right-tailed test.

A hypothesis test is called a one-tailed test if it is either left tailed or right tailed. It is not uncommon for a sample mean to fall within the area of acceptance for a two-tailed test but within the area of rejection for a one-tailed test. Therefore, a researcher who wishes to reject the null hypothesis may sometimes find that using a one-tailed rather than a two-tailed test allows a previously nonsignificant result to become significant. For this reason, the choice of a one-tailed test must depend on the nature of the hypothesis being tested and should be decided at the outset of the research, rather than afterward according to how the results turn out. One-tailed tests can be used only when there is a directional alternative hypothesis, that is, only when results in one direction are of interest and the possibility of results in the opposite direction is of no interest or consequence to the researcher.

PS: Results from Wikipedia

A two-tailed test is appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average. A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, for example, whether a machine produces more than one percent defective products.

The basic logic of hypothesis testing is as follows: Take a random sample from the population. If the sample data are consistent with the null hypothesis, do not reject the null hypothesis; if the sample data are inconsistent with the null hypothesis and supportive of the alternative hypothesis, reject the null hypothesis in favor of the alternative hypothesis. Suppose that a hypothesis test is conducted at a small significance level: If the null hypothesis is rejected, we conclude that the data provide sufficient evidence to support the alternative hypothesis. If the null hypothesis is not rejected, we conclude that the data do not provide sufficient evidence to support the alternative hypothesis. Another way of viewing the use of a small significance level is as follows: The null hypothesis gets the benefit of the doubt; the alternative hypothesis has the burden of proof.

When the null hypothesis is rejected in a hypothesis test performed at the significance level α, we frequently express that fact with the phrase "the test results are statistically significant at the α level." Similarly, when the null hypothesis is not rejected in a hypothesis test performed at the significance level α, we often express that fact with the phrase "the test results are not statistically significant at the α level."

One-Mean z-Test (σ known)

The one-mean z-test is also known as the one-sample z-test and the one-variable z-test. We prefer "one-mean" because it makes clear the parameter being tested. Procedure 9.1 provides a step-by-step method for performing a one-mean z-test. As you can see, Procedure 9.1 includes options for either the critical-value approach or the P-value approach.

Properties and guidelines for use of the one-mean z-test are similar to those for the one-mean z-interval procedure. In particular, the one-mean z-test is robust to moderate violations of the normality assumption but, even for large samples, can sometimes be unduly affected by outliers because the sample mean is not resistant to outliers.

PS: By saying that the hypothesis test is exact, we mean that the true significance level equals α; by saying that it is approximately correct, we mean that the true significance level only approximately equals α.
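
As an illustration, here is a minimal sketch of the one-mean z-test in Python using the P-value approach; the sample values, the null value mu0, and sigma are invented for the example and are not taken from Procedure 9.1.

    from math import sqrt
    from statistics import mean, NormalDist

    def one_mean_z_test(sample, mu0, sigma, alpha=0.05, tail="two"):
        n = len(sample)
        z = (mean(sample) - mu0) / (sigma / sqrt(n))     # test statistic
        nd = NormalDist()
        if tail == "two":
            p_value = 2 * (1 - nd.cdf(abs(z)))
        elif tail == "right":
            p_value = 1 - nd.cdf(z)
        else:                                            # left-tailed
            p_value = nd.cdf(z)
        return z, p_value, p_value <= alpha              # reject H0 when P-value <= alpha

    # Test H0: mu = 5.0 against Ha: mu != 5.0, with sigma assumed known to be 0.3
    print(one_mean_z_test([5.2, 4.8, 5.5, 5.1, 4.9, 5.3], mu0=5.0, sigma=0.3))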

One-Mean t-Test

Type II Error

Hypothesis tests do not always yield correct conclusions; they have built-in margins of error. An important part of planning a study is to consider both types of errors that can be made and their effects. Recall that two types of errors are possible with hypothesis tests. One is a Type I error: rejecting a true null hypothesis. The other is a Type II error: not rejecting a false null hypothesis. Also recall that the probability of making a Type I error is called the significance level of the hypothesis test and is denoted α, and that the probability of making a Type II error is denoted β.

Computing Type II Error Probabilities

The probability of making a Type II error depends on the sample size, the significance level, and the true value of the parameter under consideration.

Power Curve for a One-Mean z-Test

In modern statistical practice, analysts generally use the probability of not making a Type II error, called the power, to appraise the performance of a hypothesis test. Once we know the Type II error probability, β, obtaining the power is simple – we just subtract β from 1. The power of a hypothesis test is between 0 and 1 and measures the ability of the hypothesis test to detect a false null hypothesis. If the power is near 0, the hypothesis test is not very good at detecting a false null hypothesis; if the power is near 1, the hypothesis test is extremely good at detecting a false null hypothesis.

In reality, the true value of the parameter in question will be unknown. Consequently, constructing a table of powers for various values of the parameter consistent with the alternative hypothesis is helpful in evaluating the overall effectiveness of a hypothesis test. Even more helpful is a visual display of the effectiveness of the hypothesis test, obtained by plotting points of power against various values of the parameter and then connecting the points with a smooth curve. The resulting curve is called a power curve. In general, the closer the power is to 1, the better the hypothesis test is at detecting a false null hypothesis. Procedure 9.5 provides a step-by-step method for obtaining a power curve for a one-mean z-test.
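
A minimal sketch of the power calculation behind such a curve, for a two-tailed one-mean z-test, is given below; mu0, sigma, n, and the candidate true means are illustrative assumptions.

    from math import sqrt
    from statistics import NormalDist

    def power_two_tailed(mu_true, mu0, sigma, n, alpha=0.05):
        nd = NormalDist()
        z_crit = nd.inv_cdf(1 - alpha / 2)               # critical value for the two-tailed test
        shift = (mu_true - mu0) * sqrt(n) / sigma        # how far the true mean shifts the test statistic
        # probability of rejecting H0 when the true mean is mu_true
        return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

    # Points on a power curve for H0: mu = 5.0, with sigma = 0.3 and n = 36
    for mu in [5.0, 5.1, 5.2, 5.3, 5.4]:
        print(mu, round(power_two_tailed(mu, mu0=5.0, sigma=0.3, n=36), 3))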

Sample Size and Power

For a fixed significance level, increasing the sample size increases the power. By using a sufficiently large sample size, we can obtain a hypothesis test with as much power as we want. However, in practice, larger sample sizes tend to increase the cost of a study. Consequently, we must balance, among other things, the cost of a large sample against the cost of possible errors. As we have indicated, power is a useful way to evaluate the overall effectiveness of a hypothesis-testing procedure. Additionally, power can be used to compare different procedures. For example, a researcher might decide between two hypothesis-testing procedures on the basis of which test is more powerful for the situation under consideration.

Statistical Procedures – Confidence Intervals


Confidence Intervals for One Population Mean

A common problem in statistics is to obtain information about the mean, μ, of a population. One way to obtain information about a population mean μ without taking a census is to estimate it by a sample mean x(bar). So, a point estimate of a parameter is the value of a statistic used to estimate the parameter. More generally, a statistic is called an unbiased estimator of a parameter if the mean of all its possible values equals the parameter; otherwise, the statistic is called a biased estimator of the parameter. Ideally, we want our statistic to be unbiased and have small standard error. In that case, chances are good that our point estimate (the value of the statistic) will be close to the parameter.

However, a sample mean is usually not equal to the population mean, especially when the standard error is not small, as stated previously. Therefore, we should accompany any point estimate of μ with information that indicates the accuracy of that estimate. This information is called a confidence-interval estimate for μ. By definition, a confidence interval (CI) is an interval of numbers obtained from a point estimate of a parameter. The confidence level is the confidence we have that the parameter lies in the confidence interval. And the confidence-interval estimate is the confidence level and confidence interval. A confidence interval for a population mean depends on the sample mean, x(bar), which in turn depends on the sample selected.

The margin of error E indicates how accurate the sample mean x(bar) is as an estimate of the unknown parameter μ. With the point estimate and the confidence-interval estimate (at the 95% confidence level), we can be 95% confident that μ is within E of the sample mean. Simply put, the interval runs from the point estimate – E to the point estimate + E.


  • Point estimate
  • Confidence-interval estimate
  • Margin of error

Computing the Confidence Interval for One Population Mean (σ known)

We now develop a step-by-step procedure to obtain a confidence interval for a population mean when the population standard deviation is known. In doing so, we assume that the variable under consideration is normally distributed. Because of the central limit theorem, however, the procedure will also work to obtain an approximately correct confidence interval when the sample size is large, regardless of the distribution of the variable. The basis of our confidence-interval procedure is the sampling distribution of the sample mean for a normally distributed variable: Suppose that a variable x of a population is normally distributed with mean μ and standard deviation σ. Then, for samples of size n, the variable x(bar) is also normally distributed and has mean μ and standard deviation σ/√n. As a consequence, we have the following procedure to compute the confidence interval.
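
For a confidence level of 1 – α, the confidence interval for μ has endpoints x(bar) ± zα/2 · σ/√n, where zα/2 is the value with area α/2 to its right under the standard normal curve.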

PS: The one-mean z-interval procedure is also known as the one-sample z-interval procedure and the one-variable z-interval procedure. We prefer "one-mean" because it makes clear the parameter being estimated.

PS: By saying that the confidence interval is exact, we mean that the true confidence level equals 1 – α; by saying that the confidence interval is approximately correct, we mean that the true confidence level only approximately equals 1 – α.

Before applying Procedure 8.1, we need to make several comments about it and the assumptions for its use, including:

  • We use the term normal population as an abbreviation for "the variable under consideration is normally distributed."
  • The z-interval procedure works reasonably well even when the variable is not normally distributed and the sample size is small or moderate, provided the variable is not too far from being normally distributed. Thus we say that the z-interval procedure is robust to moderate violations of the normality assumption.
  • Watch for outliers because their presence calls into question the normality assumption. Moreover, even for large samples, outliers can sometimes unduly affect a z-interval because the sample mean is not resistant to outliers.
  • A statistical procedure that works reasonably well even when one of its assumptions is violated (or moderately violated) is called a robust procedure relative to that assumption.


Key Fact 8.1 makes it clear that you should conduct preliminary data analyses before applying the z-interval procedure. More generally, the following fundamental principle of data analysis is relevant to all inferential procedures: Before performing a statistical-inference procedure, examine the sample data. If any of the conditions required for using the procedure appear to be violated, do not apply the procedure. Instead use a different, more appropriate procedure, if one exists. Even for small samples, where graphical displays must be interpreted carefully, it is far better to examine the data than not to. Remember, though, to proceed cautiously when conducting graphical analyses of small samples, especially very small samples – say, of size 10 or less.

Sample Size Estimation

If the margin of error and confidence level are specified in advance, then we must determine the sample size needed to meet those specifications. To find the formula for the required sample size, we solve the margin-of-error formula, E = zα/2 · σ/√n, for n. See the computing formula in Formula 8.2.
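
Solving for n gives n = (zα/2 · σ/E)^2, rounded up to the nearest whole number.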

Computing the Confidence Interval for One Population Mean (σ unknown)

So far, we have discussed how to obtain the confidence-interval estimate when the population standard deviation, σ, is known. What if, as is usual in practice, the population standard deviation is unknown? Then we cannot base our confidence-interval procedure on the standardized version of x(bar). The best we can do is estimate the population standard deviation, σ, by the sample standard deviation, s; in other words, we replace σ by s in Procedure 8.1 and base our confidence-interval procedure on the resulting variable t (the studentized version of x(bar)). Unlike the standardized version, the studentized version of x(bar) does not have a normal distribution.

Suppose that a variable x of a population is normally distributed with mean μ. Then, for samples of size n, the variable t has the t-distribution with n – 1 degrees of freedom. A variable with a t-distribution has an associated curve, called a t-curve. Although there is a different t-curve for each number of degrees of freedom, all t-curves are similar and resemble the standard normal curve. As the number of degrees of freedom becomes larger, t-curves look increasingly like the standard normal curve.

Having discussed t-distributions and t-curves, we can now develop a procedure for obtaining a confidence interval for a population mean when the population standard deviation is unknown. The procedure is called the one-mean t-interval procedure or, when no confusion can arise, simply the t-interval procedure.
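
For a confidence level of 1 – α, the t-interval for μ has endpoints x(bar) ± tα/2 · s/√n, where tα/2 is based on n – 1 degrees of freedom.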

Properties and guidelines for use of the t-interval procedure are the same as those for the z-interval procedure. In particular, the t-interval procedure is robust to moderate violations of the normality assumption but, even for large samples, can sometimes be unduly affected by outliers because the sample mean and sample standard deviation are not resistant to outliers.

What If the Assumptions Are Not Satisfied?

Suppose you want to obtain a confidence interval for a population mean based on a small sample, but preliminary data analyses indicate either the presence of outliers or that the variable under consideration is far from normally distributed. As neither the z-interval procedure nor the t-interval procedure is appropriate, what can you do? Under certain conditions, you can use a nonparametric method. Most nonparametric methods do not require even approximate normality, are resistant to outliers and other extreme values, and can be applied regardless of sample size. However, parametric methods, such as the z-interval and t-interval procedures, tend to give more accurate results than nonparametric methods when the normality assumption and other requirements for their use are met.

Inherited Variation and Polymorphism in DNA


The original Human Genome Project and the subsequent study of now many thousands of individuals worldwide have provided a vast amount of DNA sequence information. With this information in hand, one can begin to characterize the types and frequencies of polymorphic variation found in the human genome and to generate catalogues of human DNA sequence diversity around the globe. DNA polymorphisms can be classified according to how the DNA sequence varies between the different alleles.

Single Nucleotide Polymorphisms

The simplest and most common of all polymorphisms are single nucleotide polymorphisms (SNPs). A locus characterized by a SNP usually has only two alleles, corresponding to the two different bases occupying that particular location in the genome. As mentioned previously, SNPs are common and are observed on average once every 1000 bp in the genome. However, the distribution of SNPs is uneven around the genome; many more SNPs are found in noncoding parts of the genome, in introns and in sequences that are some distance from known genes. Nonetheless, there is still a significant number of SNPs that do occur in genes and other known functional elements in the genome. For the set of protein-coding genes, over 100,000 exonic SNPs have been documented to date. Approximately half of these do not alter the predicted amino acid sequence of the encoded protein and are thus termed synonymous, whereas the other half do alter the amino acid sequence and are said to be nonsynonymous. Other SNPs introduce or change a stop codon, and yet others alter a known splice site; such SNPs are candidates to have significant functional consequences.

The significance for health of the vast majority of SNPs is unknown and is the subject of ongoing research. The fact that SNPs are common does not mean that they are without effect on health or longevity. What it does mean is that any effect of common SNPs is likely to involve a relatively subtle altering of disease susceptibility rather than a direct cause of serious illness.

Insertion-Deletion Polymorphisms

A second class of polymorphism is the result of variations caused by insertion or deletion (in/dels or simply indels) of anywhere from a single base pair up to approximately 1000 bp, although larger indels have been documented as well. Over a million indels have been described, numbering in the hundreds of thousands in any one individual’s genome. Approximately half of all indels are referred to as “simple” because they have only two alleles – that is, the presence or absence of the inserted or deleted segment.

Microsatellite Polymorphisms

Other indels, however, are multiallelic due to variable numbers of the segment of DNA that is inserted in tandem at a particular location, thereby constituting what is referred to as a microsatellite. They consist of stretches of DNA composed of units of two, three, or four nucleotides, such as TGTGTG, CAACAACAA, or AAATAAATAAAT, repeated between one and a few dozen times at a particular site in the genome. The different alleles in a microsatellite polymorphism are the result of differing numbers of repeated nucleotide units contained within any one microsatellite and are therefore sometimes also referred to as short tandem repeat (STR) polymorphisms. A microsatellite locus often has many alleles (repeat lengths) that can be rapidly evaluated by standard laboratory procedures to distinguish different individuals and to infer familial relationships. Many tens of thousands of microsatellite polymorphic loci are known throughout the human genome. Finally, microsatellites are a particularly useful group of indels. Determining the alleles at multiple microsatellite loci is currently the method of choice for DNA fingerprinting used for identity testing.

Mobile Element Insertion Polymorphisms

Nearly half of the human genome consists of families of repetitive elements that are dispersed around the genome. Although most of the copies of these repeats are stationary, some of them are mobile and contribute to human genetic diversity through the process of retrotransposition, a process that involves transcription into an RNA, reverse transcription into a DNA sequence, and insertion into another site in the genome. Most mobile element polymorphisms are found in nongenic regions of the genome; a small proportion are found within genes. At least 5000 of these polymorphic loci have an insertion frequency of greater than 10% in various populations.

Copy Number Variants

Another important type of human polymorphism includes copy number variants (CNVs). CNVs are conceptually related to indels and microsatellites but consist of variation in the number of copies of larger segments of the genome, ranging in size from 1000 bp to many hundreds of kilobase pairs. Variants larger than 500 kb are found in 5% to 10% of individuals in the general population, whereas variants encompassing more than 1 Mb are found in 1% to 2%. The largest CNVs are sometimes found in regions of the genome characterized by repeated blocks of homologous sequences called segmental duplications (or segdups).

Smaller CNVs in particular may have only two alleles (i.e., the presence or absence of a segment), similar to indels in that regard. Larger CNVs tend to have multiple alleles due to the presence of different numbers of copies of a segment of DNA in tandem. In terms of genome diversity between individuals, the amount of DNA involved in CNVs vastly exceeds the amount that differs because of SNPs. The content of any two human genomes can differ by as much as 50 to 100 Mb because of copy number differences at CNV loci.

Notably, the variable segment at many CNV loci can include one to several dozen genes, and thus CNVs are frequently implicated in traits that involve altered gene dosage. When a CNV is frequent enough to be polymorphic, it represents a background of common variation that must be understood if alterations in copy number observed in patients are to be interpreted properly. As with all DNA polymorphisms, the significance of different CNV alleles in health and disease susceptibility is the subject of intensive investigation.

Inversion Polymorphisms

A final group of polymorphisms to be discussed is inversions, which differ in size from a few base pairs to large regions of the genome (up to several megabase pairs) that can be present in either of two orientations in the genomes of different individuals. Most inversions are characterized by regions of sequence homology at the edges of the inverted segment, implicating a process of homologous recombination in the origin of the inversions. In their balanced form, inversions, regardless of orientation, do not involve a gain or loss of DNA, and the inversion polymorphisms (with two alleles corresponding to the two orientations) can achieve substantial frequencies in the general population.