Month: September 2017

Inferences for Population Proportions

September 24, 2017 Evidence-Based Medicine, Medical Statistics

Confidence Intervals for One Population Proportion

Statisticians often need to determine the proportion (percentage) of a population that has a specific attribute. Some examples are:

  • the percentage of U.S. adults who have health insurance
  • the percentage of cars in the United States that are imports
  • the percentage of U.S. adults who favor stricter clean air health standards
  • the percentage of Canadian women in the labor force

In the first case, the population consists of all U.S. adults and the specified attribute is "has health insurance." For the second case, the population consists of all cars in the United States and the specific attribute is "is an import." The population in the third case is all U.S. adults and the specified attribute is "favors stricter clean air health standards." In the fourth case, the population consists of all Canadian women and the specified attribute is "is in the labor force."

We know that it is often impractical or impossible to take a census of a large population. In practice, therefore, we use data from a sample to make inferences about the population proportion.

A sample proportion, p^, is computed by using the formula

p^ = x / n

where x denotes the number of members in the sample that have the specified attribute and, as usual, n denotes the sample size. For convenience, we sometimes refer to x as the number of successes and to n – x as the number of failures.

The Sampling Distribution of the Sample Proportion

To make inferences about a population mean, 𝜇, we must know the sampling distribution of the sample mean, that is, the distribution of the variable x̄ (for details on the confidence interval for one population mean, see the thread "Statistic Procedure – Confidence Interval"). The same is true for proportions: to make inferences about a population proportion, p, we need to know the sampling distribution of the sample proportion, that is, the distribution of the variable p^. Because a proportion can always be regarded as a mean, we can use our knowledge of the sampling distribution of the sample mean to derive the sampling distribution of the sample proportion. In practice, the sample size usually is large, so we concentrate on that case. The result, Key Fact 12.1, is that for samples of size n the mean of p^ equals p, the standard deviation of p^ equals the square root of p(1 – p)/n, and p^ is approximately normally distributed for large n.

The accuracy of the normal approximation depends on n and p. If p is close to 0.5, the approximation is quite accurate, even for moderate n. The farther p is from 0.5, the larger n must be for the approximation to be accurate. As a rule of thumb, we use the normal approximation when np and n(1 – p) are both 5 or greater. Alternatively, another commonly used rule of thumb is that np and n(1 – p) are both 10 or greater; still another is that np(1 – p) is 25 or greater.

Below is the one-proportion z-interval procedure, which is also known as the one-sample z-interval procedure for a population proportion and the one-variable proportion interval procedure. Of note, as stated in Assumption 2 of Procedure 12.1, a condition for using that procedure is that "the number of successes, x, and the number of failures, n – x, are both 5 or greater." We can restate this condition as "np^ and n(1 – p^) are both 5 or greater," which, for an unknown p, corresponds to the rule of thumb for using the normal approximation given after Key Fact 12.1. For a confidence level of 1 – 𝛼, the endpoints of the confidence interval for p are p^ ± z𝛼/2 · square root of [p^(1 – p^)/n], and the margin of error is E = z𝛼/2 · square root of [p^(1 – p^)/n].
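As a minimal sketch of the one-proportion z-interval, the Python function below computes p^ and the interval endpoints from x and n. The function name, the argument names, and the sample counts are illustrative assumptions, not part of Procedure 12.1 itself.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_interval(x, n, confidence=0.95):
    """Return (p_hat, lower, upper) for a population proportion."""
    # Assumption 2: at least 5 successes and 5 failures in the sample.
    if x < 5 or n - x < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = x / n
    # z_{alpha/2}: upper alpha/2 critical value of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


# Hypothetical sample: 120 successes among 400 sampled members.
p_hat, lower, upper = one_proportion_z_interval(x=120, n=400)
print(f"p^ = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The critical value comes from the standard library's `statistics.NormalDist`, so no external package is needed.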

Determining the Required Sample Size

If the margin of error (E) and confidence level are specified in advance, then we must determine the sample size required to meet those specifications. Solving for n in the formula for margin of error, we get

n = p^(1 – p^)(z𝛼/2 / E)^2

This formula cannot be used to obtain the required sample size because the sample proportion, p^, is not known prior to sampling. There are two ways around this problem. To begin, we examine the graph of p^(1 – p^) versus p^ shown in Figure 12.1. The graph reveals that the largest p^(1 – p^) can be is 0.25, which occurs when p^ = 0.5. The farther p^ is from 0.5, the smaller will be the value of p^(1 – p^). Because the largest possible value of p^(1 – p^) is 0.25, the most conservative approach for determining sample size is to use that value in the above equation. The sample size obtained then will generally be larger than necessary and the margin of error less than required. Nonetheless, this approach guarantees that the specifications will at least be met. In the same vein, if we have in mind a likely range for the observed value of p^, then, in light of Figure 12.1, we should take as our educated guess for p^ the value in the range closest to 0.5. In either case, we should be aware that, if the observed value of p^ is closer to 0.5 than is our educated guess, the margin of error will be larger than desired.
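The two approaches described above (the conservative value 0.25, or an educated guess for p^) can be sketched in Python as follows. The function and parameter names are hypothetical, and rounding the result up to the next whole number is an assumption made so that the margin-of-error requirement is not missed.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin, confidence=0.95, p_guess=0.5):
    """Sample size for a given margin of error E.

    p_guess = 0.5 gives p^(1 - p^) = 0.25, the most conservative choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p_guess * (1 - p_guess) * (z / margin) ** 2)


# Margin of error 0.03 at the 95% confidence level.
n_conservative = required_sample_size(0.03)             # uses 0.25
n_educated = required_sample_size(0.03, p_guess=0.3)    # guess closest to 0.5
print(n_conservative, n_educated)
```

The educated guess yields a smaller sample, at the price of a larger-than-desired margin of error if the observed p^ turns out closer to 0.5 than the guess.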

Hypothesis Tests for One Population Proportion

Earlier, we showed how to obtain confidence intervals for a population proportion. Now we show how to perform hypothesis tests for a population proportion. This procedure is actually a special case of the one-mean z-test. From Key Fact 12.1, we deduce that, for large n, the standardized version of p^,

z = (p^ – p) / square root of [p(1 – p)/n],

has approximately the standard normal distribution. Consequently, to perform a large-sample hypothesis test with null hypothesis H0: p = p0, we can use the variable

z = (p^ – p0) / square root of [p0(1 – p0)/n]

as the test statistic and obtain the critical value(s) or P-value from the standard normal table. We call this hypothesis-testing procedure the one-proportion z-test.
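A short Python sketch of the one-proportion z-test follows. It standardizes p^ under the null value p0 and reads the P-value from the standard normal distribution; the function name, the `alternative` argument, and the example counts are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alternative="two-sided"):
    """Return (z, p_value) for H0: p = p0."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std = NormalDist()
    if alternative == "two-sided":
        p_value = 2 * (1 - std.cdf(abs(z)))
    elif alternative == "greater":
        p_value = 1 - std.cdf(z)
    elif alternative == "less":
        p_value = std.cdf(z)
    else:
        raise ValueError("alternative must be two-sided, greater, or less")
    return z, p_value


# Hypothetical data: 120 successes in 400 trials, testing H0: p = 0.25.
z_one, p_one = one_proportion_z_test(x=120, n=400, p0=0.25)
print(f"z = {z_one:.3f}, P-value = {p_one:.4f}")
```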

Hypothesis Tests for Two Population Proportions

For independent samples of sizes n1 and n2 from the two populations, we have Key Fact 12.2: the mean of p^1 – p^2 equals p1 – p2, its standard deviation equals the square root of [p1(1 – p1)/n1 + p2(1 – p2)/n2], and p^1 – p^2 is approximately normally distributed for large n1 and n2.

Now we can develop a hypothesis-testing procedure for comparing two population proportions. Our immediate goal is to identify a variable that we can use as the test statistic. From Key Fact 12.2, we know that, for large, independent samples, the standardized variable

z = [(p^1 – p^2) – (p1 – p2)] / square root of [p1(1 – p1)/n1 + p2(1 – p2)/n2]

has approximately the standard normal distribution. The null hypothesis for a hypothesis test to compare two population proportions is H0: p1 = p2. If the null hypothesis is true, then p1 – p2 = 0, and, consequently, the variable above becomes

z = (p^1 – p^2) / square root of [p(1 – p)(1/n1 + 1/n2)],

where p denotes the common value of p1 and p2.

However, because p is unknown, we cannot use this variable as the test statistic. Consequently, we must estimate p by using sample information. The best estimate of p is obtained by pooling the data to get the proportion of successes in both samples combined; that is, we estimate p by

p^p = (x1 + x2) / (n1 + n2),

where p^p is called the pooled sample proportion. After replacing p by p^p, we get the final test statistic,

z = (p^1 – p^2) / square root of [p^p(1 – p^p)(1/n1 + 1/n2)],

which has approximately the standard normal distribution for large samples if the null hypothesis is true. Hence we have Procedure 12.3, the two-proportions z-test. It is also known as the two-sample z-test for two population proportions and the two-variable proportions test.
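The pooled test statistic can be sketched in Python as below. This is an illustration under the stated large-sample assumptions, computing a two-sided P-value only; the function name and the sample counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportions_z_test(x1, n1, x2, n2):
    """Return (z, two-sided P-value) for H0: p1 = p2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled sample proportion: successes in both samples combined.
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical samples: 60 of 200 versus 40 of 200 successes.
z_two, p_two = two_proportions_z_test(60, 200, 40, 200)
print(f"z = {z_two:.3f}, P-value = {p_two:.4f}")
```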

Confidence intervals for the difference between two population proportions can also be computed. We can use Key Fact 12.2 to derive a confidence-interval procedure for the difference between two population proportions, called the two-proportions z-interval procedure; for a confidence level of 1 – 𝛼, its endpoints are (p^1 – p^2) ± z𝛼/2 · square root of [p^1(1 – p^1)/n1 + p^2(1 – p^2)/n2]. Note the following: 1) The two-proportions z-interval procedure is also known as the two-sample z-interval procedure for two population proportions and the two-variable proportions interval procedure. 2) Guidelines for interpreting confidence intervals for the difference, p1 – p2, between two population proportions are similar to those for interpreting confidence intervals for the difference, 𝜇1 – 𝜇2, between two population means, as described in other related threads.
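For completeness, here is a hedged Python sketch of the two-proportions z-interval. Unlike the test, the interval uses the unpooled standard deviation from Key Fact 12.2 with the sample proportions plugged in; names and data are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportions_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Return (difference, lower, upper) for p1 - p2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Unpooled standard error: each sample keeps its own proportion.
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff, diff - z * se, diff + z * se


# Hypothetical samples: 60 of 200 versus 40 of 200 successes.
diff, diff_lower, diff_upper = two_proportions_z_interval(60, 200, 40, 200)
print(f"p^1 - p^2 = {diff:.3f}, 95% CI = ({diff_lower:.4f}, {diff_upper:.4f})")
```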

Update on Oct 2 2017

Supplemental Data – Confidence Intervals of Odds Ratio (OR) and Relative Risk (RR)


The sampling distribution of the odds ratio is positively skewed. However, it is approximately normally distributed on the natural log scale. After finding the limits on the LN scale, use the EXP function to find the limits on the original scale. The standard deviation of LN(OR) is

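SD of LN(OR) = square root of (1/a + 1/b + 1/c + 1/d)

where a, b, c, and d are the four cell counts of the 2 × 2 table: a and b are the numbers with and without the outcome in the first group, and c and d are the corresponding numbers in the second group.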

Because LN(OR) is approximately normally distributed and its standard deviation is known, a z-interval can be computed on the log scale as LN(OR) ± z𝛼/2 × SD of LN(OR); exponentiating the two limits gives the confidence interval for the OR.


As with the OR, the sampling distribution of the relative risk is positively skewed but is approximately normally distributed on the natural log scale. Constructing a confidence interval for the relative risk is similar to constructing a CI for the odds ratio except that there is a different formula for the SD.

SD of LN(RR) = square root of [ b/(a(a + b)) + d/(c(c + d)) ]
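The log-scale procedure for both measures can be sketched in Python as follows. The cell layout is an assumption consistent with the two SD formulas: a and b are the counts with and without the outcome in the first group, c and d the same in the second group. The function names and counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def log_scale_interval(estimate, sd_log, confidence=0.95):
    """Build the interval on the LN scale, then exponentiate the limits."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = log(estimate)
    return exp(center - z * sd_log), exp(center + z * sd_log)


def odds_ratio_ci(a, b, c, d, confidence=0.95):
    odds_ratio = (a * d) / (b * c)
    sd_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (odds_ratio, *log_scale_interval(odds_ratio, sd_log, confidence))


def relative_risk_ci(a, b, c, d, confidence=0.95):
    relative_risk = (a / (a + b)) / (c / (c + d))
    sd_log = sqrt(b / (a * (a + b)) + d / (c * (c + d)))
    return (relative_risk, *log_scale_interval(relative_risk, sd_log, confidence))


# Hypothetical 2 x 2 table: 20/80 in the first group, 10/90 in the second.
or_est, or_low, or_high = odds_ratio_ci(20, 80, 10, 90)
rr_est, rr_low, rr_high = relative_risk_ci(20, 80, 10, 90)
print(f"OR = {or_est:.2f} ({or_low:.2f}, {or_high:.2f})")
print(f"RR = {rr_est:.2f} ({rr_low:.2f}, {rr_high:.2f})")
```

Because the limits are exponentiated, the resulting intervals are not symmetric around the point estimate, which reflects the positive skew described above.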

Nonparametric Methods of Estimating Survival Functions

September 10, 2017 Clinical Trials, Medical Statistics

Of the three survival functions, survivorship or its graphical presentation, the survival curve, is the most widely used. Examples include the product-limit (PL) method of estimating the survivorship function by Kaplan and Meier, life-table analysis, relative survival rate, five-year survival rate, and corrected survival rate. The product-limit method is applicable to small, moderate, and large samples. However, if the data have already been grouped into intervals, or the sample size is very large, say in the thousands, or the interest is in a large population, it may be more convenient to perform a life-table analysis. The PL estimates and life-table estimates of the survivorship function are essentially the same. Many authors use the term life-table estimates for the PL estimates. The only difference is that the PL estimate is based on individual survival times, whereas in the life-table method, survival times are grouped into intervals. The PL estimate can be considered as a special case of the life-table estimate where each interval contains only one observation.

Product-Limit Estimates of Survivorship Function

Let us first consider the simple case where all the patients are observed to death so that the survival times are exact and known. Let t1, t2, …, tn be the exact survival times of the n individuals under study. Conceptually, we consider this group of patients as a random sample from a much larger population of similar patients. We relabel the n survival times t1, t2, …, tn in ascending order such that t(1) ≤ t(2) ≤ … ≤ t(n). As a consequence (by the definition of the survival function), the survivorship function at t(i) can be estimated as

S^(t(i)) = (n – i) / n = 1 – i/n,

where n – i is the number of people in the sample surviving longer than t(i). If two or more t(i) are equal (tied observations), the largest i value is used. This gives a conservative estimate for the tied observations. In practice, the sample survivorship function is computed at every distinct survival time. We do not have to worry about the intervals between the distinct survival times in which no one dies and the survivorship function remains constant. The survivorship function in this example is a step function starting at 1.0 (100%) and decreasing in steps of 1/n (if there are no ties) to zero. When the survivorship function is plotted versus t, the various percentiles of survival time can be read from the graph or calculated from the survivorship function.
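A minimal Python sketch of this estimate for fully observed (uncensored) data is shown below. The function name and the four survival times are made up for illustration; ties are handled by letting the largest rank i overwrite earlier ones.

```python
def empirical_survivorship(times):
    """Estimate S^(t(i)) = (n - i) / n at each distinct survival time.

    Assumes every survival time is exact (no censored observations).
    """
    ordered = sorted(times)
    n = len(ordered)
    estimate = {}
    for i, t in enumerate(ordered, start=1):
        # For tied times the later (largest) i overwrites the earlier ones.
        estimate[t] = (n - i) / n
    return estimate


# Hypothetical survival times in months; the value 3 is tied.
s_hat_uncensored = empirical_survivorship([3, 2, 5, 3])
print(s_hat_uncensored)
```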

This method can be applied only if all patients are followed to death. If some of the patients are still alive at the end of the study, a different method of estimating survivorship function, such as the PL estimate given by Kaplan and Meier, is required. The rationale can be illustrated by the following simple example. Suppose that 10 patients join a clinical study at the beginning of 2000; during that year 6 patients die and 4 survive. At the end of the year, 20 additional patients join the study. In 2001, 3 patients who entered in the beginning of 2000 and 15 patients who entered later die, leaving one and five survivors, respectively. Suppose that the study terminates at the end of 2001 and you want to estimate the proportion of patients in the population surviving for two years or more, that is, S(2).

The first group of patients in the example is followed for two years; the second group is followed for only one year. One possible estimate, the reduced-sample estimate, is S^(2) = 1/10 = 0.1, which ignores the 20 patients who are followed only for one year. Kaplan and Meier believe that the second sample, under observation for only one year, can contribute to the estimate of S(2).

Patients who survived two years may be considered as surviving the first year and then surviving one more year. Thus, the probability of surviving for two years or more is equal to the probability of surviving the first year and then surviving one more year. That is,

S(2) = P(surviving first year and then surviving one more year)

which can be written as

S(2) = P(surviving two years given patient has survived first year) x P(surviving first year)

The Kaplan-Meier estimate of S(2) following the above is

S^(2) = (proportion of patients surviving two years given that they survived the first year) x (proportion of patients surviving the first year)

For the data given above, one of the four patients who survived the first year survived two years, so the first proportion is 1/4. Four of the 10 patients who entered at the beginning of 2000 and 5 of the 20 patients who entered at the end of 2000 survived one year. Therefore, the second proportion is (4 + 5) / (10 + 20). The PL estimate of S(2) is

S^(2) = 1/4 x (4 + 5)/(10 + 20) = 0.075

This simple rule may be generalized as follows: The probability of surviving k (≥ 2) or more years from the beginning of the study is a product of k observed survival rates:

S^(k) = p1 x p2 x p3 x … x pk

where p1 denotes the proportion of patients surviving at least one year, p2 the proportion of patients surviving the second year after they have survived one year, p3 the proportion of patients surviving the third year after they have survived two years, and pk the proportion of patients surviving the kth year after they have survived k – 1 years.
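The product rule can be checked with exact fractions in Python. The sketch below reuses the counts from the two-cohort example; the helper name `pl_estimate` is a hypothetical label, not a library function.

```python
from fractions import Fraction
from math import prod


def pl_estimate(conditional_rates):
    """Multiply the observed conditional survival rates p1, p2, ..., pk."""
    return prod(conditional_rates, start=Fraction(1))


# p1: survived the first year, pooled over both cohorts.
p1 = Fraction(4 + 5, 10 + 20)
# p2: survived the second year, among first-cohort patients alive after year 1.
p2 = Fraction(1, 4)

s2_pl = pl_estimate([p1, p2])
s2_reduced = Fraction(1, 10)   # reduced-sample estimate, first cohort only
print(f"PL estimate: {float(s2_pl)}, reduced-sample estimate: {float(s2_reduced)}")
```

The two estimates differ because the PL estimate also uses the one-year experience of the 20 patients who entered later.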

Therefore, the PL estimate of the probability of surviving any particular number of years from the beginning of study is the product of the same estimate up to the preceding year, and the observed survival rate for the particular year, that is,

S^(t) = S^(t – 1) x pt

The PL estimates are maximum likelihood estimates. In practice, the PL estimates can be calculated by constructing a table with five columns following the outline below.

  • Column 1 contains all the survival times, both censored and uncensored, in order from smallest to largest. Affix a plus sign to each censored observation. If a censored observation has the same value as an uncensored observation, the latter should appear first.
  • The second column, labeled i, consists of the corresponding rank of each observation in column 1.
  • The third, labeled r, pertains to uncensored observations only. Let r = i.
  • Compute (n – r) / (n – r + 1), or pi, for every uncensored observation t(i) in column 4 to give the proportion of patients surviving up to and then through t(i).
  • In column 5, S^(t) is the product of all values of (n – r) / (n – r + 1) up to and including t. If some uncensored observations are ties, the smallest S^(t) should be used.
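The five-column outline above can be sketched in Python as follows. The data are hypothetical remission times (a `True` flag marks a censored observation, the "+" in column 1), and the function names are illustrative.

```python
def product_limit_table(observations):
    """Build the five-column PL table from (time, censored) pairs.

    Each row is (time, censored, i, r, factor, s_hat); r, factor and s_hat
    are None for censored observations.
    """
    # Column 1: sort by time; at equal times uncensored (False) comes first.
    ordered = sorted(observations, key=lambda obs: (obs[0], obs[1]))
    n = len(ordered)
    rows = []
    s_hat = 1.0
    for i, (time, censored) in enumerate(ordered, start=1):  # column 2: rank i
        if censored:
            rows.append((time, True, i, None, None, None))
            continue
        r = i                                   # column 3: r = i
        factor = (n - r) / (n - r + 1)          # column 4
        s_hat *= factor                         # column 5: running product
        rows.append((time, False, i, r, factor, s_hat))
    return rows


def survivorship_at_event_times(rows):
    """Keep one S^(t) per uncensored time; ties keep the last (smallest) value."""
    estimate = {}
    for time, censored, _i, _r, _factor, s_hat in rows:
        if not censored:
            estimate[time] = s_hat
    return estimate


# Hypothetical data in months: (time, censored?)
data = [(3.0, False), (4.0, True), (5.7, True), (6.5, False), (6.5, False),
        (8.4, True), (10.0, False), (10.0, True), (12.0, False), (15.0, False)]
pl_rows = product_limit_table(data)
pl_curve = survivorship_at_event_times(pl_rows)
for t, s in pl_curve.items():
    print(f"S^({t}) = {s:.3f}")
```

Since the largest observation here is uncensored, the estimate reaches zero at the last time point.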

The Kaplan-Meier method provides very useful estimates of survival probabilities and a graphical presentation of the survival distribution. It is the most widely used method in survival data analysis. Breslow and Crowley, and Meier, have shown that under certain conditions, the estimate is consistent and asymptotically normal. However, a few critical features should be mentioned.

  • The Kaplan-Meier estimates are limited to the time interval in which the observations fall. If the largest observation is uncensored, the PL estimate at that time equals zero. Although the estimate may not be welcomed by physicians, it is correct since no one in the sample lives longer. If the largest observation is censored, the PL estimate can never equal zero and is undefined beyond the largest observation.
  • The most commonly used summary statistic in survival analysis is the median survival time. A simple estimate of the median can be read from survival curves estimated by the PL method as the time t at which S^(t) = 0.5. However, the solution may not be unique if the survival curve is horizontal at S^(t) = 0.5; any t value in the interval t1 to t2 is a reasonable estimate of the median. A practical solution is to take the midpoint of the interval as the PL estimate of the median.
  • If less than 50% of the observations are uncensored and the largest observation is censored, the median survival time cannot be estimated. A practical way to handle the situation is to use probabilities of surviving a given length of time, say 1, 3, or 5 years, or the mean survival time limited to a given time t.
  • The PL method assumes that the censoring times are independent of the survival times. In other words, the reason an observation is censored is unrelated to the cause of death. This assumption is true if the patient is still alive at the end of the study period. However, the assumption is violated if the patient develops severe adverse effects from the treatment and is forced to leave the study before death or if the patient died of a cause other than the one under study. When there is inappropriate censoring, the PL method is not appropriate. In practice, one way to alleviate the problem is to avoid it or to reduce it to a minimum.
  • Similar to other estimators, the standard error (S.E.) of the Kaplan-Meier estimator of S(t) gives an indication of the potential error of S^(t). The confidence interval deserves more attention than just the point estimate S^(t). A 95% confidence interval for S(t) is S^(t) ± 1.96 S.E.[S^(t)].

Life-Table Analysis

The life-table method is one of the oldest techniques for measuring mortality and describing the survival experience of a population. It has been used by actuaries, demographers, governmental agencies, and medical researchers in studies of survival, population growth, fertility, migration, length of married life, length of working life, and so on. There has been a decennial series of life tables on the entire U.S. population since 1900. States and local governments also publish life tables. These life tables, summarizing the mortality experience of a specific population for a specific period of time, are called population life tables. As clinical and epidemiologic research has become more common, the life-table method has been applied to patients with a given disease who have been followed for a period of time. Life tables constructed for patients are called clinical life tables. Although population and clinical life tables are similar in calculation, the sources of required data are different.

There are two kinds of population life tables: the cohort life table and the current life table. The cohort life table describes the survival or mortality experience from birth to death of a specific cohort of persons who were born at about the same time, for example, all persons born in 1950. The cohort has to be followed from 1950 until all of them die. The proportion of deaths (survivors) is then used to construct life tables for successive calendar years. This type of table, useful in population projection and prospective studies, is not often constructed since it requires a long follow-up period.

The current life table is constructed by applying the age-specific mortality rates of a population in a given period of time to a hypothetical cohort of 100,000 or 1,000,000 persons. The starting point is birth at year 0. Two sources of data are required for constructing a population life table: 1) census data on the number of living persons at each age for a given year at midyear and 2) vital statistics on the number of deaths in the given year for each age. For example, a current U.S. life table assumes a hypothetical cohort of 100,000 persons that is subject to the age-specific death rates based on the observed data for the United States in the 1900 census. The current life table, based on the life experience of an actual population over a short period of time, gives a good summary of current mortality. This type of life table is regularly published by government agencies of different levels. One of the most often reported statistics from current life tables is the life expectancy. The term population life table is often used to refer to the current life table.

Current life tables usually have the following columns:

  • Age interval (x to x + t). This is the time interval between two exact ages x and x + t; t is the length of the interval. For example, the interval 20-21 includes the time interval from the 20th birthday up to the 21st birthday (but not including the 21st birthday).
  • Proportion of persons alive at beginning of age interval but dying during the interval (tqx). The information is obtained from census data. For example, tqx for age interval 20-21 is the proportion of persons who died on or after their 20th birthday and before their 21st birthday. It is an estimate of the conditional probability of dying in the interval given the person is alive at age x. This column is usually calculated from data of the decennial census of population and deaths occurring in the given time interval.
  • Number living at beginning of age interval (lx). The initial value of lx, the size of the hypothetical population, is usually 100,000 or 1,000,000. The successive values are computed using the formula lx = lx–t(1 – tqx–t), where the subscript x – t refers to the previous age interval and 1 – tqx–t is the proportion of persons who survived that interval.
  • Number dying during age interval (tdx): tdx = lx(tqx) = lx – lx+t.
  • Stationary population (tLx and Tx). Here tLx is the total number of years lived in the age interval, or the number of person-years that lx persons, aged x exactly, live through the interval. For those who survive the interval, their contribution to tLx is the length of the interval, t. For those who die during the interval, we may not know exactly the time of death and the survival time must be estimated. The conventional assumption is that they live one-half of the interval and contribute t/2 to the calculation of tLx. Thus, tLx = t(lx+t + 1/2*tdx). The symbol Tx is the total number of person-years lived beyond age x by persons alive at that age, that is, the sum of tLx over this and all later age intervals.
  • Average remaining lifetime or average number of years of life remaining at beginning of age interval (e0i). This is also known as the life expectancy at a given age, which is defined as the number of years remaining to be lived by persons at age x: e0i = Tx/lx. The expected age at death of a person aged x is x + e0i. The e0i at x = 0 is the life expectancy at birth.
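The column definitions above translate into a short Python sketch. It assumes equal-width age intervals and a table that is closed by a final interval with tqx = 1; the function name, the tiny radix, and the q values are purely hypothetical.

```python
def current_life_table(q, width=1, radix=100_000):
    """Return the lx, tdx, tLx, Tx and life-expectancy columns.

    q: conditional proportions dying (tqx), one per age interval, in order.
    """
    living, dying, person_years = [], [], []
    alive = float(radix)
    for qx in q:
        deaths = alive * qx                      # tdx = lx * tqx
        survivors = alive - deaths               # lx+t
        living.append(alive)
        dying.append(deaths)
        # Survivors contribute t years each; those dying contribute t/2.
        person_years.append(width * (survivors + deaths / 2))
        alive = survivors
    # Tx: person-years lived in this interval and all later ones.
    total_years = []
    running = 0.0
    for years in reversed(person_years):
        running += years
        total_years.append(running)
    total_years.reverse()
    expectancy = [tx / lx if lx > 0 else 0.0
                  for tx, lx in zip(total_years, living)]
    return living, dying, person_years, total_years, expectancy


# Toy two-interval table with a radix of 100 persons.
lx, dx, Lx, Tx, ex = current_life_table([0.5, 1.0], width=1, radix=100)
print(lx, dx, Lx, Tx, ex)
```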

The clinical life table, or actuarial life-table method, has been applied to clinical data for many decades. Berkson and Gage and Cutler and Ederer give a life-table method for estimating the survivorship function; Gehan provides methods for estimating all three functions (survivorship, density, and hazard).

The life-table method requires a fairly large number of observations, so that survival times can be grouped into intervals. Similar to the PL estimate, the life-table method incorporates all survival information accumulated up to the termination of the study. For example, in computing a five-year survival rate of breast cancer patients, one need not restrict oneself only to those patients who have entered on study for five or more years. Patients who have entered for four, three, two, and even one year contribute useful information to the evaluation of five-year survival. In this way, the life-table technique uses incomplete data such as losses to follow-up and persons withdrawn alive as well as complete death data.

Columns of clinical life table:

  • Interval (ti to ti+1). The first column gives the intervals into which the survival times and times to loss or withdrawal are distributed. The interval is from ti up to but not including ti+1, i = 1, …, s. The last interval has an infinite length. These intervals are assumed to be fixed.
  • Midpoint (tmi). The midpoint of each interval, designated tmi, i = 1, …, s – 1, is included for convenience in plotting the hazard and probability density functions. Both functions are plotted at tmi.
  • Width (bi). The width of each interval, bi = ti+1 – ti, i = 1, …, s – 1, is needed for calculation of the hazard and density functions. The width of the last interval, bs, is theoretically infinite; no estimate of the hazard or density function can be obtained for this interval.
  • Number lost to follow-up (li). This is the number of people who are lost to observation and whose survival status is thus unknown in the ith interval (i = 1, …, s).
  • Number withdrawn alive (wi). People withdrawn alive in the ith interval are those known to be alive at the closing date of the study. The survival time recorded for such persons is the length of time from entrance to the closing date of the study.
  • Number dying (di). This is the number of people who die in the ith interval. The survival time of these people is the time from entrance to death.
  • Number entering the ith interval (n'i). The number of people entering the first interval n'1 is the total sample size. Other entries are determined from n'i = n'i-1 – li-1 – wi-1 – di-1. That is, the number of persons entering the ith interval is equal to the number studied at the beginning of the preceding interval minus those who are lost to follow-up, withdrawn alive, or have died in the preceding interval.
  • Number exposed to risk (ni). This is the number of people who are exposed to risk in the ith interval and is defined as ni = n'i – 1/2*(li + wi). It is assumed that the times to loss or withdrawal are approximately uniformly distributed in the interval. Therefore, people lost or withdrawn in the interval are exposed to risk of death for one-half the interval. If there are no losses or withdrawals, ni = n'i.
  • Conditional proportion dying (q^i). This is defined as q^i = di/ni for i = 1, …, s – 1, and q^s = 1. It is an estimate of the conditional probability of death in the ith interval given exposure to the risk of death in the ith interval.
  • Conditional proportion surviving (p^i). This is given by p^i = 1 – q^i, which is an estimate of the conditional probability of surviving in the ith interval.
  • Cumulative proportion surviving [S^(ti)]. This is an estimate of the survivorship function at time ti; it is often referred to as the cumulative survival rate. For i = 1, S^(t1) = 1 and for i = 2, …, s, S^(ti) = p^i–1 x S^(ti–1), where the subscript i – 1 refers to the preceding interval. It is the usual life-table estimate and is based on the fact that surviving to the start of the ith interval means surviving to the start of and then through the (i – 1)th interval.
  • Estimated probability density function [f^(tmi)]. This is defined as the probability of dying in the ith interval per unit width. Thus, a natural estimate at the midpoint of the interval is

f^(tmi) = [S^(ti) – S^(ti+1)] / bi = S^(ti) x q^i / bi, i = 1, …, s – 1

  • Hazard function [h^(tmi)]. The hazard function for the ith interval, estimated at the midpoint, is

h^(tmi) = di / [bi(ni – 1/2*di)] = 2q^i / [bi(1 + p^i)], i = 1, …, s – 1

It is the number of deaths per unit time in the interval divided by the average number of survivors at the midpoint of the interval. That is, h^(tmi) is derived from f^(tmi)/S^(tmi) with S^(tmi) = 1/2*[S^(ti+1) + S^(ti)], since S^(ti) is defined as the probability of surviving at the beginning, not the midpoint, of the ith interval; substituting S^(ti+1) = p^i x S^(ti) gives h^(tmi) = 2q^i / [bi(1 + p^i)].
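A hedged Python sketch of these clinical life-table columns is given below. It covers only the finite-width intervals (i = 1, …, s – 1), assumes each interval has a positive number exposed to risk, and uses invented counts; the function and key names are illustrative.

```python
def clinical_life_table(entering_first, widths, lost, withdrawn, died):
    """Compute life-table columns for the finite-width intervals.

    entering_first: n'1, the total sample size.
    widths, lost, withdrawn, died: per-interval bi, li, wi and di.
    """
    rows = []
    entering = entering_first          # n'i
    s_start = 1.0                      # S^(t1) = 1
    for b, l_i, w_i, d_i in zip(widths, lost, withdrawn, died):
        exposed = entering - (l_i + w_i) / 2      # ni
        q = d_i / exposed                         # conditional proportion dying
        p = 1 - q                                 # conditional proportion surviving
        rows.append({
            "entering": entering,
            "exposed": exposed,
            "q": q,
            "p": p,
            "S_start": s_start,                   # S^(ti)
            "f_mid": s_start * q / b,             # density at the midpoint
            "h_mid": 2 * q / (b * (1 + p)),       # hazard at the midpoint
        })
        entering = entering - l_i - w_i - d_i     # n'(i+1)
        s_start *= p                              # S^(t(i+1))
    return rows


# Hypothetical study: 100 patients, two one-year intervals.
table = clinical_life_table(
    entering_first=100, widths=[1, 1],
    lost=[0, 0], withdrawn=[0, 10], died=[20, 16],
)
for row in table:
    print({key: round(value, 4) for key, value in row.items()})
```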

Several Major Distributions of Survival Function

September 3, 2017 Medical Statistics, Oncology, Research

Exponential Distribution

The simplest and most important distribution in survival studies is the exponential distribution. In the late 1940s, researchers began to choose the exponential distribution to describe the life pattern of electronic systems. The exponential distribution has since continued to play a role in lifetime studies analogous to that of the normal distribution in other areas of statistics. The exponential distribution is often referred to as a purely random failure pattern. It is famous for its unique "lack of memory," which requires that the age of the animal or person does not affect future survival. Although many survival data cannot be described adequately by the exponential distribution, an understanding of it facilitates the treatment of more general situations.

The exponential distribution is characterized by a constant hazard rate 𝜆, its only parameter. A high 𝜆 value indicates high risk and short survival; a low 𝜆 value indicates low risk and long survival. When the survival time T follows the exponential distribution with a parameter 𝜆, the probability density function is defined as

f(t) = 𝜆exp(–𝜆t) for t ≥ 0 and 𝜆 > 0, and f(t) = 0 for t < 0

The cumulative distribution function is

F(t) = 1 – exp(–𝜆t), t ≥ 0

and the survivorship function is then

S(t) = exp(–𝜆t), t ≥ 0

and the hazard function is

h(t) = f(t)/S(t) = 𝜆, t ≥ 0

Note that the hazard function is a constant, 𝜆, independent of t. Because the exponential distribution is characterized by a constant hazard rate, independent of the age of the person, there is no aging or wearing out, and failure or death is a random event independent of time. When natural logarithms of the survivorship function are taken, log S(t) = –𝜆t, which is a linear function of t.
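The constant hazard and the "lack of memory" property can be checked numerically with a few lines of Python. The function names and the value 𝜆 = 0.2 are arbitrary choices for illustration.

```python
from math import exp, log


def exp_pdf(t, lam):
    return lam * exp(-lam * t)


def exp_survival(t, lam):
    return exp(-lam * t)


def exp_hazard(t, lam):
    # h(t) = f(t) / S(t); for the exponential this is lam at every t.
    return exp_pdf(t, lam) / exp_survival(t, lam)


lam = 0.2
hazards = [exp_hazard(t, lam) for t in (0.5, 5.0, 50.0)]

# Lack of memory: P(T > s + t | T > s) equals P(T > t).
s, t = 3.0, 4.0
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
unconditional = exp_survival(t, lam)

print(hazards, conditional, unconditional, log(exp_survival(t, lam)))
```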

Weibull Distribution

The Weibull distribution is a generalization of the exponential distribution. However, unlike the exponential distribution, it does not assume a constant hazard rate and therefore has broader application. The Weibull distribution is characterized by two parameters, 𝛾 and 𝜆. The value of 𝛾 determines the shape of the distribution curve and the value of 𝜆 determines its scaling. Consequently, 𝛾 and 𝜆 are called the shape and scale parameters, respectively. When 𝛾 = 1, the hazard rate remains constant as time increases; this is the exponential case. The hazard rate increases when 𝛾 > 1 and decreases when 𝛾 < 1 as t increases. Thus, the Weibull distribution may be used to model the survival distribution of a population with increasing, decreasing, or constant risk.

The probability density function, cumulative distribution function, survivorship function, and hazard function are:
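As a sketch only, the Python code below assumes one common parameterization, in which the survivorship function is S(t) = exp[–(𝜆t)^𝛾]; other texts write the scale parameter differently, so the exact forms here are an assumption rather than the post's own formulas. The function names are hypothetical.

```python
from math import exp


def weibull_survival(t, gamma, lam):
    """Assumed form: S(t) = exp(-(lam * t) ** gamma), t > 0."""
    return exp(-((lam * t) ** gamma))


def weibull_cdf(t, gamma, lam):
    return 1 - weibull_survival(t, gamma, lam)


def weibull_hazard(t, gamma, lam):
    """Assumed form: h(t) = lam * gamma * (lam * t) ** (gamma - 1)."""
    return lam * gamma * (lam * t) ** (gamma - 1)


def weibull_pdf(t, gamma, lam):
    # f(t) = h(t) * S(t)
    return weibull_hazard(t, gamma, lam) * weibull_survival(t, gamma, lam)


lam = 0.5
for gamma in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, gamma, lam)
    late = weibull_hazard(4.0, gamma, lam)
    print(f"gamma = {gamma}: h(1) = {early:.3f}, h(4) = {late:.3f}")
```

Under this assumed form, the printed hazards decrease for 𝛾 = 0.5, stay at 𝜆 for 𝛾 = 1, and increase for 𝛾 = 2, matching the behavior described above.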

The Weibull distribution is named after the Swedish mathematician Waloddi Weibull, who described it in detail in 1951, although it was first identified by Fréchet and first applied by Rosin and Rammler to describe a particle size distribution.

Lognormal Distribution

In its simplest form the lognormal distribution can be defined as the distribution of a variable whose logarithm follows the normal distribution. Its origin may be traced as far back as 1879, when McAlister described explicitly a theory of the distribution. Most of its aspects have since been under study. Gaddum gave a review of its application in biology, followed by Boag's applications in cancer research. Its history, properties, estimation problems, and uses in economics have been discussed in detail by Aitchison and Brown. Later, other investigators also observed that the age at onset of Alzheimer's disease and the distribution of survival time of several diseases such as Hodgkin's disease and chronic leukemia could be rather closely approximated by a lognormal distribution since they are markedly skewed to the right and the logarithms of survival times are approximately normally distributed.

Consider the survival time T such that log T is normally distributed with mean 𝜇 and variance 𝜎². We then say that T is lognormally distributed and write T as 𝛬(𝜇, 𝜎²). It should be noted that 𝜇 and 𝜎² are not the mean and variance of the lognormal distribution. The hazard function of the lognormal distribution increases initially to a maximum and then decreases (almost as soon as the median is passed) to zero as time approaches infinity. Therefore, the lognormal distribution is suitable for survival patterns with an initially increasing and then decreasing hazard rate. By a central limit theorem, it can be shown that the distribution of the product of n independent positive variates approaches a lognormal distribution under very general conditions: for example, the distribution of the size of an organism whose growth is subject to many small impulses, the effect of each of which is proportional to the momentary size of the organism.
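The rise-then-fall shape of the lognormal hazard can be illustrated with a small Python sketch. It computes h(t) = f(t)/S(t) from the standard normal density and distribution function applied to (log t – 𝜇)/𝜎; the function name and the parameter values 𝜇 = 0, 𝜎 = 1 are arbitrary assumptions.

```python
from math import log
from statistics import NormalDist


def lognormal_hazard(t, mu, sigma):
    """Hazard of T when log T is normal with mean mu and s.d. sigma (t > 0)."""
    std = NormalDist()
    z = (log(t) - mu) / sigma
    density = std.pdf(z) / (sigma * t)     # f(t)
    survival = 1 - std.cdf(z)              # S(t)
    return density / survival


mu, sigma = 0.0, 1.0
h_early = lognormal_hazard(0.1, mu, sigma)
h_middle = lognormal_hazard(1.0, mu, sigma)
h_late = lognormal_hazard(100.0, mu, sigma)
print(f"h(0.1) = {h_early:.3f}, h(1) = {h_middle:.3f}, h(100) = {h_late:.3f}")
```

For very large t the term `1 - std.cdf(z)` loses floating-point precision, so this direct formula is only a sketch for moderate values of t.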

Gamma Distributions

The gamma distribution, which includes the exponential and chi-square distributions, was used long ago by Brown and Flood to describe the life of glass tumblers circulating in a cafeteria and by Birnbaum and Saunders as a statistical model for the life length of materials. Since then, this distribution has been used frequently as a model for industrial reliability problems and human survival.

Suppose that failure or death takes place in n stages or as soon as n subfailures have happened. At the end of the first stage, after time T₁, the first subfailure occurs; after that the second stage begins and the second subfailure occurs after time T₂; and so on. Total failure or death occurs at the end of the nth stage, when the nth subfailure happens. The survival time, T, is then T₁ + T₂ + … + Tₙ. The times T₁, T₂, …, Tₙ spent in each stage are assumed to be independently exponentially distributed with probability density function 𝜆 exp(−𝜆tᵢ), i = 1, …, n. That is, the subfailures occur independently at a constant rate 𝜆. The distribution of T is then called the Erlangian distribution. There is no need for the stages to have physical significance since we can always assume that death occurs in the n-stage process just described. This idea, introduced by A. K. Erlang in his study of congestion in telephone systems, has been used widely in queuing theory and life processes.
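
The n-stage construction can be sketched as a simulation: draw n independent exponential stage times and add them. The function names, seed, and parameter values below are hypothetical choices for illustration. The sample mean and variance should land near n/𝜆 and n/𝜆², the known moments of a sum of n independent exponentials with rate 𝜆.

```python
import random
import statistics


def simulate_erlang(n_stages, rate, rng):
    """One survival time T = T1 + ... + Tn, each stage time Exponential(rate)."""
    return sum(rng.expovariate(rate) for _ in range(n_stages))


def simulate_many(n_stages, rate, n_samples, seed=0):
    """Draw n_samples independent survival times with a reproducible seed."""
    rng = random.Random(seed)
    return [simulate_erlang(n_stages, rate, rng) for _ in range(n_samples)]


if __name__ == "__main__":
    n_stages, rate = 3, 2.0  # illustrative values, not taken from the text
    samples = simulate_many(n_stages, rate, 20000)
    print("sample mean:", statistics.mean(samples), "theory:", n_stages / rate)
    print("sample variance:", statistics.variance(samples),
          "theory:", n_stages / rate ** 2)
```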

The gamma distribution is characterized by two parameters, 𝛾 and 𝜆. When 0 < 𝛾 < 1, there is negative aging and the hazard rate decreases monotonically from infinity to 𝜆 as time increases from 0 to infinity. When 𝛾 > 1, there is positive aging and the hazard rate increases monotonically from 0 to 𝜆 as time increases from 0 to infinity. When 𝛾 = 1, the hazard rate equals 𝜆, a constant, as in the exponential case.
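
These three regimes can be checked numerically in special cases where the survivorship function has a closed form. The sketch below assumes the rate parameterization, with density 𝜆(𝜆t)^(𝛾−1) exp(−𝜆t)/Γ(𝛾), and covers only integer 𝛾 (the Erlangian case) and 𝛾 = 1/2. Both are choices made for convenience here, not a general gamma implementation, and the function names are hypothetical.

```python
import math


def erlang_hazard(t, n, lam):
    """Hazard of a gamma distribution with integer shape n and rate lam.

    Uses S(t) = exp(-lam*t) * sum_{k<n} (lam*t)^k / k!, so the factor
    exp(-lam*t) cancels in h(t) = f(t) / S(t).
    """
    x = lam * t
    numerator = lam * x ** (n - 1) / math.factorial(n - 1)
    denominator = sum(x ** k / math.factorial(k) for k in range(n))
    return numerator / denominator


def half_shape_hazard(t, lam):
    """Hazard for shape 1/2, where S(t) = erfc(sqrt(lam*t)); requires t > 0."""
    x = lam * t
    density = lam * math.exp(-x) / math.sqrt(math.pi * x)
    return density / math.erfc(math.sqrt(x))


if __name__ == "__main__":
    lam = 2.0  # illustrative rate
    for t in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(t,
              half_shape_hazard(t, lam),  # shape 1/2: falls toward lam
              erlang_hazard(t, 1, lam),   # shape 1: constant lam
              erlang_hazard(t, 3, lam))   # shape 3: rises toward lam
```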

Log-logistic Distribution

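The survival time T has a log-logistic distribution if log(T) has a logistic distribution. The density, survivorship, hazard, and cumulative hazard functions of the log-logistic distribution are, respectively,

$$f(t) = \frac{\alpha\gamma t^{\gamma-1}}{(1+\alpha t^{\gamma})^{2}}, \qquad S(t) = \frac{1}{1+\alpha t^{\gamma}}, \qquad h(t) = \frac{\alpha\gamma t^{\gamma-1}}{1+\alpha t^{\gamma}}, \qquad H(t) = \log(1+\alpha t^{\gamma})$$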

The log-logistic distribution is characterized by two parameters, 𝛼 and 𝛾. The median of the log-logistic distribution is 𝛼^(−1/𝛾). When 𝛾 > 1, the log-logistic hazard has the value 0 at time 0, increases to a peak at a specific t, and then declines, which is similar to the lognormal hazard. When 𝛾 = 1, the hazard starts at 𝛼^(1/𝛾) and then declines monotonically. When 𝛾 < 1, the hazard starts at infinity and then declines, which is similar to the Weibull distribution. The hazard function declines toward 0 as t approaches infinity. Thus, the log-logistic distribution may be used to describe a first increasing and then decreasing hazard or a monotonically decreasing hazard.
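
The three hazard shapes can be explored with a short sketch. It assumes the parameterization S(t) = 1/(1 + 𝛼t^𝛾), which is consistent with the median 𝛼^(−1/𝛾) stated above. The function names and parameter values are hypothetical.

```python
def loglogistic_survival(t, alpha, gamma):
    """S(t) = 1 / (1 + alpha * t^gamma), the assumed parameterization."""
    return 1.0 / (1.0 + alpha * t ** gamma)


def loglogistic_hazard(t, alpha, gamma):
    """h(t) = alpha * gamma * t^(gamma - 1) / (1 + alpha * t^gamma).

    For gamma < 1 the hazard is unbounded at t = 0, so pass t > 0 there.
    """
    return alpha * gamma * t ** (gamma - 1.0) / (1.0 + alpha * t ** gamma)


def loglogistic_median(alpha, gamma):
    """Time at which S(t) = 1/2."""
    return alpha ** (-1.0 / gamma)


if __name__ == "__main__":
    alpha = 1.0  # illustrative value
    for gamma in (0.5, 1.0, 2.0):
        hazards = [loglogistic_hazard(t, alpha, gamma)
                   for t in (0.01, 0.1, 1.0, 10.0, 100.0)]
        print("gamma =", gamma, "median =", loglogistic_median(alpha, gamma))
        print("  hazards:", hazards)
```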

Other Survival Distributions

Many other distributions can be used as models of survival time, three of which we discuss briefly in this section: the linear-exponential, the Gompertz, and a distribution whose hazard rate is a step function. The linear-exponential model and the Gompertz distribution are extensions of the exponential distribution. Both describe survival patterns that have a constant initial hazard rate. The hazard rate varies as a linear function of time or age in the linear-exponential model and as an exponential function of time or age in the Gompertz distribution.

In demonstrating the use of the linear-exponential model, Broadbent uses as an example the service of milk bottles that are filled in a dairy, circulated to customers, and returned empty to the dairy. The model was also used by Carbone et al. to describe the survival pattern of patients with plasmacytic myeloma. The hazard function of the linear-exponential distribution is

$$h(t) = \lambda + \gamma t$$

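where 𝜆 and 𝛾 can take any values such that h(t) is nonnegative. The hazard rate increases from 𝜆 with time if 𝛾 > 0, decreases if 𝛾 < 0, and remains constant (an exponential case) if 𝛾 = 0. The probability density function and the survivorship function are, respectively,

$$f(t) = (\lambda + \gamma t)\exp\!\left[-\left(\lambda t + \tfrac{1}{2}\gamma t^{2}\right)\right], \qquad S(t) = \exp\!\left[-\left(\lambda t + \tfrac{1}{2}\gamma t^{2}\right)\right]$$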

The Gompertz distribution is also characterized by two parameters, 𝜆 and 𝛾. The hazard function, survival function, and probability density function are, respectively,
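
A common parameterization, adopted here as an assumption, writes the hazard as exp(𝜆 + 𝛾t), so that the initial hazard is exp(𝜆). The survival and density functions then follow by integrating the hazard, for 𝛾 ≠ 0:

```latex
\begin{aligned}
h(t) &= \exp(\lambda + \gamma t), \\
S(t) &= \exp\!\left[-\frac{e^{\lambda}}{\gamma}\left(e^{\gamma t} - 1\right)\right], \\
f(t) &= \exp\!\left[(\lambda + \gamma t) - \frac{e^{\lambda}}{\gamma}\left(e^{\gamma t} - 1\right)\right].
\end{aligned}
```

Under this form the hazard rises exponentially with time when 𝛾 > 0, and as 𝛾 approaches 0 the expressions reduce to the exponential case with constant hazard exp(𝜆).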

Finally, we consider a distribution where the hazard rate is a step function. The hazard rate, survival function, and probability density function are, respectively,
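
As a sketch under assumed notation, let 0 = t_0 < t_1 < … < t_{k−1} < t_k = ∞ be the cut points and c_1, …, c_k the hazard values on the k intervals. These labels are introduced here for illustration only. For t in the j-th interval the three functions can be written as:

```latex
\begin{aligned}
h(t) &= c_j, \qquad t_{j-1} \le t < t_j,\quad j = 1, \dots, k, \\
S(t) &= \exp\!\left[-\sum_{i=1}^{j-1} c_i\,(t_i - t_{i-1}) - c_j\,(t - t_{j-1})\right], \\
f(t) &= c_j\, S(t).
\end{aligned}
```

The exponent in S(t) is the cumulative hazard: the hazard accumulated over each completed interval plus the part accrued in the current one.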


One application of this distribution is life-table analysis. In a life-table analysis, time is divided into intervals and the hazard rate is assumed to be constant in each interval. However, the overall hazard rate is not necessarily constant.
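
The life-table idea can be sketched numerically: with a constant hazard in each interval, the cumulative hazard is a sum of rate × time pieces and the survival probability is its negative exponential. The names `cuts` and `rates` and the example values are hypothetical.

```python
import math


def cumulative_hazard(t, cuts, rates):
    """Cumulative hazard at time t for a piecewise-constant hazard.

    cuts  : increasing interior cut points t1 < ... < t_{k-1}
    rates : k hazard values, one per interval [0, t1), [t1, t2), ..., [t_{k-1}, inf)
    """
    if len(rates) != len(cuts) + 1:
        raise ValueError("need exactly one more rate than cut points")
    total, start = 0.0, 0.0
    for end, rate in zip(list(cuts) + [math.inf], rates):
        if t <= end:
            # t falls in this interval: add the partial contribution and stop.
            return total + rate * (t - start)
        total += rate * (end - start)
        start = end
    return total  # not reached for finite t, kept for completeness


def survival(t, cuts, rates):
    """S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t, cuts, rates))


if __name__ == "__main__":
    cuts = [1.0, 2.0]        # illustrative interval boundaries
    rates = [0.5, 1.0, 2.0]  # illustrative hazard in each interval
    for t in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(t, survival(t, cuts, rates))
```

With a single interval (no cut points) the function reduces to the exponential survival curve exp(−𝜆t).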


The nine distributions described above are, among others, reasonable models for survival time distributions. All have been designed by considering a biological failure, a death process, or an aging property. They may or may not be appropriate for many practical situations, but the objective here is to illustrate the various possible techniques, assumptions, and arguments that can be used to choose the most appropriate model. If none of these distributions fits the data, investigators might have to derive an original model to suit the particular data, perhaps by using some of the ideas presented here.