
Nonparametric Methods of Estimating Survival Functions

September 10, 2017 Clinical Trials, Medical Statistics

Of the three survival functions, the survivorship function, or its graphical presentation, the survival curve, is the most widely used. Nonparametric approaches to estimating it include the product-limit (PL) method of Kaplan and Meier, life-table analysis, the relative survival rate, the five-year survival rate, and the corrected survival rate. The product-limit method is applicable to small, moderate, and large samples. However, if the data have already been grouped into intervals, or the sample size is very large, say in the thousands, or the interest is in a large population, it may be more convenient to perform a life-table analysis. The PL estimates and life-table estimates of the survivorship function are essentially the same, and many authors use the term life-table estimates for the PL estimates. The only difference is that the PL estimate is based on individual survival times, whereas in the life-table method survival times are grouped into intervals. The PL estimate can be considered a special case of the life-table estimate in which each interval contains only one observation.

Product-Limit Estimates of Survivorship Function

Let us first consider the simple case where all the patients are observed to death so that the survival times are exact and known. Let t1, t2, …, tn be the exact survival times of the n individuals under study. Conceptually, we consider this group of patients as a random sample from a much larger population of similar patients. We relabel the n survival times t1, t2, …, tn in ascending order such that t(1) ≤ t(2) ≤ … ≤ t(n). As a consequence (by the definition of the survivorship function), the survivorship function at t(i) can be estimated as

$$\hat{S}(t_{(i)}) = \frac{n - i}{n} = 1 - \frac{i}{n}$$

where n – i is the number of people in the sample surviving longer than t(i). If two or more t(i) are equal (tied observations), the largest i value is used. This gives a conservative estimate for the tied observations. In practice, the sample survivorship function is computed at every distinct survival time. We do not have to worry about the intervals between the distinct survival times, in which no one dies and the survivorship function remains constant. The survivorship function in this case is a step function starting at 1.0 (100%) and decreasing in steps of 1/n (if there are no ties) to zero. When the survivorship function is plotted versus t, the various percentiles of survival time can be read from the graph or calculated from the survivorship function.
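As a rough illustration of the uncensored case, the sketch below computes the estimate (n – i)/n at each distinct survival time, using the largest rank i among tied values. The function name and the survival times are hypothetical and chosen only for demonstration.

```python
def empirical_survival(times):
    """Return [(t, S_hat(t))] at each distinct survival time, assuming no censoring.

    S_hat(t_(i)) = (n - i) / n, where i is the largest rank among tied values.
    """
    ordered = sorted(times)
    n = len(ordered)
    curve = []
    for i, t in enumerate(ordered, start=1):
        # With ties, keep only the last (largest) rank for each distinct time.
        if i == n or ordered[i] != t:
            curve.append((t, (n - i) / n))
    return curve


# Hypothetical survival times in months (illustrative only).
curve = empirical_survival([3, 5, 5, 8, 12])
for t, s in curve:
    print(t, s)
```

With five patients and one tie at 5 months, the curve drops by 1/5 at an untied death and by 2/5 at the tied time, ending at zero after the last death.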

This method can be applied only if all patients are followed to death. If some of the patients are still alive at the end of the study, a different method of estimating survivorship function, such as the PL estimate given by Kaplan and Meier, is required. The rationale can be illustrated by the following simple example. Suppose that 10 patients join a clinical study at the beginning of 2000; during that year 6 patients die and 4 survive. At the end of the year, 20 additional patients join the study. In 2001, 3 patients who entered in the beginning of 2000 and 15 patients who entered later die, leaving one and five survivors, respectively. Suppose that the study terminates at the end of 2001 and you want to estimate the proportion of patients in the population surviving for two years or more, that is, S(2).

The first group of patients in the example is followed for two years; the second group is followed for only one year. One possible estimate, the reduced-sample estimate, is S^(2) = 1/10 = 0.1, which ignores the 20 patients who are followed only for one year. Kaplan and Meier believe that the second sample, under observation for only one year, can contribute to the estimate of S(2).

Patients who survived two years may be considered as surviving the first year and then surviving one more year. Thus, the probability of surviving for two years or more is equal to the probability of surviving the first year and then surviving one more year. That is,

S(2) = P(surviving first year and then surviving one more year)

which can be written as

S(2) = P(surviving two years given patient has survived first year) x P(surviving first year)

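Following the decomposition above, the Kaplan-Meier estimate of S(2) is

$$\hat{S}(2) = (\text{proportion surviving two years given survival of the first year}) \times (\text{proportion surviving the first year})$$

For the data given above, one of the four patients who survived the first year survived two years, so the first proportion is 1/4. Four of the 10 patients who entered at the beginning of 2000 and 5 of the 20 patients who entered at the end of 2000 survived one year. Therefore, the second proportion is (4 + 5) / (10 + 20). The PL estimate of S(2) is

$$\hat{S}(2) = \frac{1}{4} \times \frac{4 + 5}{10 + 20} = 0.25 \times 0.3 = 0.075$$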
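The arithmetic of this example can be checked with a few lines of Python. The counts come from the example in the text; the variable names are illustrative only. Exact fractions are used so that the two estimates, the reduced-sample value 1/10 and the product of the two conditional proportions, can be compared without rounding.

```python
from fractions import Fraction

# Counts taken from the example in the text.
first_cohort, second_cohort = 10, 20   # entered at the start and at the end of 2000
survived_year1 = 4 + 5                 # survivors of the first year on study, both cohorts
at_risk_year2 = 4                      # only first-cohort survivors are observed a second year
survived_year2 = 1

# Reduced-sample estimate: ignores the 20 patients followed for one year only.
reduced_sample = Fraction(survived_year2, first_cohort)

# Product-limit estimate: product of the two conditional proportions.
p1 = Fraction(survived_year1, first_cohort + second_cohort)
p2 = Fraction(survived_year2, at_risk_year2)
pl_estimate = p1 * p2

print(float(reduced_sample), float(pl_estimate))
```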

This simple rule may be generalized as follows: the probability of surviving k (≥ 2) or more years from the beginning of the study is a product of k observed survival rates:

$$\hat{S}(k) = p_1 \times p_2 \times \cdots \times p_k$$

where p1 denotes the proportion of patients surviving at least one year, p2 the proportion of patients surviving the second year after they have survived one year, p3 the proportion of patients surviving the third year after they have survived two years, and pk the proportion of patients surviving the kth year after they have survived k – 1 years.

Therefore, the PL estimate of the probability of surviving any particular number of years from the beginning of the study is the product of the same estimate up to the preceding year and the observed survival rate for the particular year, that is,

$$\hat{S}(t) = \hat{S}(t - 1)\, p_t$$

The PL estimates are maximum likelihood estimates. In practice, the PL estimates can be calculated by constructing a table with five columns following the outline below.

  • Column 1 contains all the survival times, both censored and uncensored, in order from smallest to largest. Affix a plus sign to each censored observation. If a censored observation has the same value as an uncensored observation, the latter should appear first.
  • The second column, labeled i, consists of the corresponding rank of each observation in column 1.
  • The third column, labeled r, pertains to uncensored observations only. Let r = i.
  • In column 4, compute (n – r) / (n – r + 1), or pi, for every uncensored observation t(i). This gives the proportion of patients surviving through t(i) among those who survived up to t(i).
  • In column 5, S^(t) is the product of all values of (n – r) / (n – r + 1) up to and including t. If some uncensored observations are tied, the smallest S^(t) should be used.
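The five-column outline above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name, the (time, is_death) input format, and the data are assumptions made for the example. Censored rows carry the current S^(t) forward, and for tied deaths the last row of the tie holds the smallest value, which is the one to report.

```python
def product_limit_table(observations):
    """Build the five-column PL table from (time, is_death) pairs.

    Returns rows (time, i, r, p, s_hat); r and p are None for censored rows.
    """
    # Column 1: sort by time; at equal times, deaths come before censored values.
    ordered = sorted(observations, key=lambda obs: (obs[0], not obs[1]))
    n = len(ordered)
    rows = []
    s_hat = 1.0
    for i, (time, is_death) in enumerate(ordered, start=1):  # column 2: rank i
        if is_death:
            r = i                                 # column 3: r = i
            p = (n - r) / (n - r + 1)             # column 4
            s_hat *= p                            # column 5: running product
            rows.append((time, i, r, p, s_hat))
        else:
            rows.append((time, i, None, None, s_hat))
    return rows


# Hypothetical data: True marks a death, False a censored observation.
data = [(3, True), (4, False), (6, True), (6, True), (9, False), (10, True)]
table = product_limit_table(data)
for row in table:
    print(row)
```

In this example the two deaths at time 6 produce successive factors 3/4 and 2/3, so the value reported at time 6 is 5/6 × 3/4 × 2/3 = 5/12, the same as treating the tie as 2 deaths among 4 patients at risk.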

The Kaplan-Meier method provides very useful estimates of survival probabilities and a graphical presentation of the survival distribution. It is the most widely used method in survival data analysis. Breslow and Crowley, and Meier, have shown that under certain conditions the estimate is consistent and asymptotically normal. However, a few critical features should be mentioned.

  • The Kaplan-Meier estimates are limited to the time interval in which the observations fall. If the largest observation is uncensored, the PL estimate at that time equals zero. Although the estimate may not be welcomed by physicians, it is correct since no one in the sample lives longer. If the largest observation is censored, the PL estimate can never equal zero and is undefined beyond the largest observation.
  • The most commonly used summary statistic in survival analysis is the median survival time. A simple estimate of the median can be read from survival curves estimated by the PL method as the time t at which S^(t) = 0.5. However, the solution may not be unique: if the survival curve is horizontal at S^(t) = 0.5 over an interval from t1 to t2, any t value in that interval is a reasonable estimate of the median. A practical solution is to take the midpoint of the interval as the PL estimate of the median.
  • If fewer than 50% of the observations are uncensored and the largest observation is censored, the median survival time cannot be estimated. A practical way to handle the situation is to use probabilities of surviving a given length of time, say 1, 3, or 5 years, or the mean survival time limited to a given time t.
  • The PL method assumes that the censoring times are independent of the survival times. In other words, the reason an observation is censored is unrelated to the cause of death. This assumption is true if the patient is still alive at the end of the study period. However, the assumption is violated if the patient develops severe adverse effects from the treatment and is forced to leave the study before death or if the patient died of a cause other than the one under study. When there is inappropriate censoring, the PL method is not appropriate. In practice, one way to alleviate the problem is to avoid it or to reduce it to a minimum.
  • Similar to other estimators, the standard error (S.E.) of the Kaplan-Meier estimator of S(t) gives an indication of the potential error of S^(t). The confidence interval deserves more attention than the point estimate S^(t) alone. A 95% confidence interval for S(t) is S^(t) ± 1.96 S.E.[S^(t)].
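Two of the points above, reading the median from the step curve and forming the interval S^(t) ± 1.96 S.E., can be sketched as below. The helper names and the input format (a list of (death time, S^) pairs in time order) are assumptions made for this illustration. The standard error is taken as given, since no formula for it is stated here, and clipping the interval to [0, 1] is an added convenience rather than part of the rule in the text.

```python
import math


def pl_median(curve):
    """Median from a PL step curve [(death_time, s_hat), ...] in time order.

    Returns None when the curve never drops to 0.5 (median not estimable).
    """
    for k, (t, s) in enumerate(curve):
        if math.isclose(s, 0.5):
            # The curve is flat at 0.5 until the next death: take the midpoint.
            if k + 1 < len(curve):
                return (t + curve[k + 1][0]) / 2
            return t  # no later death observed; fall back to the left end
        if s < 0.5:
            return t
    return None


def normal_ci(s_hat, se, z=1.96):
    """Interval s_hat +/- z * se, clipped to [0, 1] (clipping is an assumption)."""
    return max(0.0, s_hat - z * se), min(1.0, s_hat + z * se)


# Hypothetical curves and standard error, for illustration only.
print(pl_median([(3, 0.8), (5, 0.5), (9, 0.2)]))   # flat at 0.5 between 5 and 9
print(pl_median([(3, 0.8), (5, 0.4)]))             # first drop below 0.5
print(normal_ci(0.5, 0.1))
```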

Life-Table Analysis

The life-table method is one of the oldest techniques for measuring mortality and describing the survival experience of a population. It has been used by actuaries, demographers, governmental agencies, and medical researchers in studies of survival, population growth, fertility, migration, length of married life, length of working life, and so on. There has been a decennial series of life tables on the entire U.S. population since 1900. States and local governments also publish life tables. These life tables, summarizing the mortality experience of a specific population for a specific period of time, are called population life tables. As clinical and epidemiologic research has become more common, the life-table method has been applied to patients with a given disease who have been followed for a period of time. Life tables constructed for patients are called clinical life tables. Although population and clinical life tables are similar in calculation, the sources of required data are different.

There are two kinds of population life tables: the cohort life table and the current life table. The cohort life table describes the survival or mortality experience from birth to death of a specific cohort of persons who were born at about the same time, for example, all persons born in 1950. The cohort has to be followed from 1950 until all of them die. The proportions dying (or surviving) are then used to construct life tables for successive calendar years. This type of table, useful in population projection and prospective studies, is not often constructed since it requires a long follow-up period.

The current life table is constructed by applying the age-specific mortality rates of a population in a given period of time to a hypothetical cohort of 100,000 or 1,000,000 persons. The starting point is birth at year 0. Two sources of data are required for constructing a population life table: 1) census data on the number of living persons at each age for a given year at midyear and 2) vital statistics on the number of deaths in the given year for each age. For example, a current U.S. life table assumes a hypothetical cohort of 100,000 persons that is subject to the age-specific death rates based on the observed data for the United States in the 1900 census. The current life table, based on the life experience of an actual population over a short period of time, gives a good summary of current mortality. This type of life table is regularly published by government agencies at different levels. One of the most often reported statistics from current life tables is the life expectancy. The term population life table is often used to refer to the current life table.

Current life tables usually have the following columns:

  • Age interval (x to x + t). This is the time interval between two exact ages x and x + t; t is the length of the interval. For example, the interval 20-21 includes the time interval from the 20th birthday up to the 21st birthday (but not including the 21st birthday).
  • Proportion of persons alive at beginning of age interval but dying during the interval (tq_x). For example, tq_x for age interval 20-21 is the proportion of persons who died on or after their 20th birthday and before their 21st birthday. It is an estimate of the conditional probability of dying in the interval given that the person is alive at age x. This column is usually calculated from data of the decennial census of population and deaths occurring in the given time interval.
  • Number living at beginning of age interval (l_x). The initial value of l_x, the size of the hypothetical population, is usually 100,000 or 1,000,000. The successive values are computed using the formula l_x = l_{x–t}(1 – tq_{x–t}), where 1 – tq_{x–t} is the proportion of persons who survived the previous age interval.
  • Number dying during age interval (td_x): td_x = l_x(tq_x) = l_x – l_{x+t}.
  • Stationary population (tL_x and T_x). Here tL_x is the total number of years lived in the age interval beginning at x, that is, the number of person-years that l_x persons, aged x exactly, live through the interval. For those who survive the interval, their contribution to tL_x is the length of the interval, t. For those who die during the interval, we may not know exactly the time of death, and the survival time must be estimated. The conventional assumption is that they live one-half of the interval and contribute t/2 to the calculation of tL_x. Thus, tL_x = t(l_{x+t} + 1/2 td_x). The symbol T_x is the total number of person-years lived beyond age x by persons alive at that age.
  • Average remaining lifetime or average number of years of life remaining at beginning of age interval (e_x). This is also known as the life expectancy at a given age, which is defined as the number of years remaining to be lived by persons at age x: e_x = T_x/l_x. The expected age at death of a person aged x is x + e_x. The e_x at x = 0 is the life expectancy at birth.
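The column definitions above translate into a short calculation. The sketch below assumes equal-width intervals and a final interval with q = 1 so that the hypothetical cohort is closed out; published tables treat the open-ended last age group differently. The function name, the radix of 1,000, and the q values are invented for illustration. It uses d = l·q, L = t(l_next + d/2), T as the sum of L from the current age onward, and e = T/l.

```python
def current_life_table(q, width=1, radix=100_000):
    """Life-table columns from conditional death probabilities q[k] per interval."""
    rows = []
    alive = float(radix)
    for k, qx in enumerate(q):
        deaths = alive * qx                          # d: number dying in the interval
        alive_next = alive - deaths                  # l at the start of the next interval
        years = width * (alive_next + 0.5 * deaths)  # L: person-years lived in the interval
        rows.append({"x": k * width, "l": alive, "d": deaths, "L": years})
        alive = alive_next
    # T: person-years lived at age x and beyond (reverse cumulative sum of L).
    total = 0.0
    for row in reversed(rows):
        total += row["L"]
        row["T"] = total
        row["e"] = total / row["l"] if row["l"] > 0 else float("nan")
    return rows


# Hypothetical probabilities of dying in four one-year intervals.
life_table = current_life_table([0.1, 0.2, 0.5, 1.0], width=1, radix=1000)
for row in life_table:
    print(row)
```

For these made-up inputs the cohort sizes are 1000, 900, 720, and 360, and the life expectancy at the start is 2480/1000 = 2.48 years.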

The clinical life table, or actuarial life-table method, has been applied to clinical data for many decades. Berkson and Gage, and Cutler and Ederer, give a life-table method for estimating the survivorship function; Gehan provides methods for estimating all three functions (survivorship, density, and hazard).

The life-table method requires a fairly large number of observations, so that survival times can be grouped into intervals. Similar to the PL estimate, the life-table method incorporates all survival information accumulated up to the termination of the study. For example, in computing a five-year survival rate of breast cancer patients, one need not restrict oneself only to those patients who have entered on study for five or more years. Patients who have entered for four, three, two, and even one year contribute useful information to the evaluation of five-year survival. In this way, the life-table technique uses incomplete data such as losses to follow-up and persons withdrawn alive as well as complete death data.

Columns of clinical life table:

  • Interval (t_i to t_{i+1}). The first column gives the intervals into which the survival times and times to loss or withdrawal are distributed. The interval is from t_i up to but not including t_{i+1}, i = 1, …, s. The last interval has an infinite length. These intervals are assumed to be fixed.
  • Midpoint (t_mi). The midpoint of each interval, designated t_mi, i = 1, …, s – 1, is included for convenience in plotting the hazard and probability density functions. Both functions are plotted at t_mi.
  • Width (b_i). The width of each interval, b_i = t_{i+1} – t_i, i = 1, …, s – 1, is needed for calculation of the hazard and density functions. The width of the last interval, b_s, is theoretically infinite; no estimate of the hazard or density function can be obtained for this interval.
  • Number lost to follow-up (l_i). This is the number of people who are lost to observation and whose survival status is thus unknown in the ith interval (i = 1, …, s).
  • Number withdrawn alive (w_i). People withdrawn alive in the ith interval are those known to be alive at the closing date of the study. The survival time recorded for such persons is the length of time from entrance to the closing date of the study.
  • Number dying (d_i). This is the number of people who die in the ith interval. The survival time of these people is the time from entrance to death.
  • Number entering the ith interval (n'_i). The number of people entering the first interval, n'_1, is the total sample size. Other entries are determined from n'_i = n'_{i–1} – l_{i–1} – w_{i–1} – d_{i–1}. That is, the number of persons entering the ith interval is equal to the number studied at the beginning of the preceding interval minus those who are lost to follow-up, withdrawn alive, or have died in the preceding interval.
  • Number exposed to risk (n_i). This is the number of people who are exposed to risk in the ith interval and is defined as n_i = n'_i – 1/2 (l_i + w_i). It is assumed that the times to loss or withdrawal are approximately uniformly distributed in the interval. Therefore, people lost or withdrawn in the interval are exposed to risk of death for one-half the interval. If there are no losses or withdrawals, n_i = n'_i.
  • Conditional proportion dying (q^_i). This is defined as q^_i = d_i/n_i for i = 1, …, s – 1, and q^_s = 1. It is an estimate of the conditional probability of death in the ith interval given exposure to the risk of death in the ith interval.
  • Conditional proportion surviving (p^_i). This is given by p^_i = 1 – q^_i, which is an estimate of the conditional probability of surviving in the ith interval.
  • Cumulative proportion surviving [S^(t_i)]. This is an estimate of the survivorship function at time t_i; it is often referred to as the cumulative survival rate. For i = 1, S^(t_1) = 1, and for i = 2, …, s, S^(t_i) = p^_{i–1} S^(t_{i–1}). It is the usual life-table estimate and is based on the fact that surviving to the start of the ith interval means surviving to the start of and then through the (i – 1)th interval.
  • Estimated probability density function [f^(t_mi)]. This is defined as the probability of dying in the ith interval per unit width. Thus, a natural estimate at the midpoint of the interval is

$$\hat{f}(t_{mi}) = \frac{\hat{S}(t_i) - \hat{S}(t_{i+1})}{b_i} = \frac{\hat{S}(t_i)\,\hat{q}_i}{b_i}, \quad i = 1, \ldots, s - 1$$

  • Hazard function [h^(t_mi)]. The hazard function for the ith interval, estimated at the midpoint, is

$$\hat{h}(t_{mi}) = \frac{d_i}{b_i \left(n_i - \tfrac{1}{2} d_i\right)} = \frac{2\,\hat{q}_i}{b_i \left(1 + \hat{p}_i\right)}, \quad i = 1, \ldots, s - 1$$

It is the number of deaths per unit time in the interval divided by the average number of survivors at the midpoint of the interval. That is, h^(t_mi) is derived from f^(t_mi)/S^(t_mi) with S^(t_mi) = 1/2 [S^(t_{i+1}) + S^(t_i)], since S^(t_i) is defined as the probability of surviving at the beginning, not the midpoint, of the ith interval:

$$\hat{h}(t_{mi}) = \frac{2\,\hat{f}(t_{mi})}{\hat{S}(t_i) + \hat{S}(t_{i+1})}$$
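The clinical life-table columns can be computed in one pass over the intervals. The sketch below is illustrative only: the function name, the dictionary keys, and the counts are assumptions, and every subject is assumed to be recorded as lost, withdrawn, or dead in some interval so that the total sample size is the sum of the three count lists. The density and hazard use f = S·q/b and h = 2q / [b(1 + p)], which equals d / [b(n – d/2)], and are omitted for the open-ended last interval.

```python
def clinical_life_table(starts, lost, withdrawn, died):
    """starts[i] is the left end of interval i; the last interval is open-ended."""
    s = len(starts)
    n_enter = sum(lost) + sum(withdrawn) + sum(died)  # number entering interval 1
    surv = 1.0                                        # cumulative survival at first start
    rows = []
    for i in range(s):
        exposed = n_enter - 0.5 * (lost[i] + withdrawn[i])  # number exposed to risk
        last = i == s - 1
        q = 1.0 if last else died[i] / exposed        # conditional proportion dying
        p = 1.0 - q
        row = {"t": starts[i], "n_enter": n_enter, "n_exposed": exposed,
               "q": q, "p": p, "S": surv}
        if not last:
            width = starts[i + 1] - starts[i]
            row["midpoint"] = starts[i] + width / 2
            row["f"] = surv * q / width               # density at the midpoint
            row["h"] = 2 * q / (width * (1 + p))      # hazard at the midpoint
        rows.append(row)
        n_enter -= lost[i] + withdrawn[i] + died[i]
        surv *= p
    return rows


# Hypothetical yearly counts for 50 patients (illustration only).
clinical = clinical_life_table(
    starts=[0, 1, 2, 3],
    lost=[2, 2, 0, 0],
    withdrawn=[0, 4, 4, 0],
    died=[20, 10, 5, 3],
)
for row in clinical:
    print(row)
```

In the first interval of this example, 49 of the 50 patients count as exposed, so q = 20/49 and the hazard is 20 / (49 – 10) = 20/39 per year.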

Several Major Distributions of Survival Function

September 3, 2017 Medical Statistics, Oncology, Research

Exponential Distribution

The simplest and most important distribution in survival studies is the exponential distribution. In the late 1940s, researchers began to choose the exponential distribution to describe the life pattern of electronic systems. The exponential distribution has since continued to play a role in lifetime studies analogous to that of the normal distribution in other areas of statistics. The exponential distribution is often referred to as a purely random failure pattern. It is famous for its unique "lack of memory," which requires that the age of the animal or person does not affect future survival. Although many survival data cannot be described adequately by the exponential distribution, an understanding of it facilitates the treatment of more general situations.

The exponential distribution is characterized by a constant hazard rate λ, its only parameter. A high λ value indicates high risk and short survival; a low λ value indicates low risk and long survival. When the survival time T follows the exponential distribution with parameter λ, the probability density function is defined as

$$f(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0,\ \lambda > 0 \\ 0, & t < 0 \end{cases}$$

The cumulative distribution function is

$$F(t) = 1 - e^{-\lambda t}, \quad t \ge 0$$

and the survivorship function is then

$$S(t) = e^{-\lambda t}, \quad t \ge 0$$

and the hazard function is

$$h(t) = \frac{f(t)}{S(t)} = \lambda, \quad t \ge 0$$

Note that the hazard function is a constant, λ, independent of t. Because the exponential distribution is characterized by a constant hazard rate, independent of the age of the person, there is no aging or wearing out, and failure or death is a random event independent of time. When natural logarithms of the survivorship function are taken, log S(t) = –λt, which is a linear function of t.
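A small numerical check can make the constant hazard and the lack of memory concrete. The sketch below uses the standard exponential forms S(t) = exp(–λt) and f(t) = λ exp(–λt); the rate λ = 0.2 and the times are hypothetical values chosen for illustration.

```python
import math


def exp_survival(t, lam):
    """S(t) = exp(-lam * t) for t >= 0."""
    return math.exp(-lam * t)


def exp_density(t, lam):
    """f(t) = lam * exp(-lam * t) for t >= 0."""
    return lam * math.exp(-lam * t)


def exp_hazard(t, lam):
    """h(t) = f(t) / S(t), which reduces to lam for every t."""
    return exp_density(t, lam) / exp_survival(t, lam)


lam = 0.2          # hypothetical hazard rate per year
s, t = 3.0, 5.0    # hypothetical ages/times in years

# Lack of memory: P(T > s + t | T > s) equals P(T > t).
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
print(conditional, exp_survival(t, lam))

# log S(t) is linear in t with slope -lam.
print(math.log(exp_survival(t, lam)), -lam * t)
```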

Weibull Distribution

The Weibull distribution is a generalization of the exponential distribution. However, unlike the exponential distribution, it does not assume a constant hazard rate and therefore has broader application. The Weibull distribution is characterized by two parameters, γ and λ. The value of γ determines the shape of the distribution curve and the value of λ determines its scaling. Consequently, γ and λ are called the shape and scale parameters, respectively. When γ = 1, the hazard rate remains constant as time increases; this is the exponential case. The hazard rate increases when γ > 1 and decreases when γ < 1 as t increases. Thus, the Weibull distribution may be used to model the survival distribution of a population with increasing, decreasing, or constant risk.

The probability density function, cumulative distribution function, survivorship function, and hazard function are, for t ≥ 0 and γ, λ > 0:

$$f(t) = \lambda\gamma(\lambda t)^{\gamma - 1} \exp\left[-(\lambda t)^{\gamma}\right]$$

$$F(t) = 1 - \exp\left[-(\lambda t)^{\gamma}\right]$$

$$S(t) = \exp\left[-(\lambda t)^{\gamma}\right]$$

$$h(t) = \lambda\gamma(\lambda t)^{\gamma - 1}$$

Weibull distribution is named after Swedish mathematician Waloddi Weibull, who described it in detail in 1951, although it was first identified by Frechet and first applied by Rosin & Rammler to describe a particle size distribution.
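The effect of the shape parameter on the hazard can be seen numerically. The sketch below assumes the parameterization S(t) = exp[–(λt)^γ] with hazard h(t) = λγ(λt)^(γ–1), one common way of writing the Weibull model; the parameter values are hypothetical. Times are kept strictly positive because the hazard is unbounded at t = 0 when γ < 1.

```python
import math


def weibull_hazard(t, lam, gamma):
    """h(t) = lam * gamma * (lam * t) ** (gamma - 1), for t > 0."""
    return lam * gamma * (lam * t) ** (gamma - 1)


def weibull_survival(t, lam, gamma):
    """S(t) = exp(-(lam * t) ** gamma), for t >= 0."""
    return math.exp(-((lam * t) ** gamma))


lam = 0.5  # hypothetical scale parameter
for gamma in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, lam, gamma) for t in (1.0, 2.0, 4.0)]
    print(gamma, [round(h, 4) for h in hazards])
```

The three printed rows show a decreasing hazard for γ = 0.5, a constant hazard equal to λ for γ = 1, and an increasing hazard for γ = 2.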

Lognormal Distribution

In its simplest form the lognormal distribution can be defined as the distribution of a variable whose logarithm follows the normal distribution. Its origin may be traced as far back as 1879, when McAlister described explicitly a theory of the distribution. Most of its aspects have since been under study. Gaddum gave a review of its application in biology, followed by Boag's applications in cancer research. Its history, properties, estimation problems, and uses in economics have been discussed in detail by Aitchison and Brown. Later, other investigators also observed that the age at onset of Alzheimer's disease and the distribution of survival time of several diseases such as Hodgkin's disease and chronic leukemia could be rather closely approximated by a lognormal distribution, since they are markedly skewed to the right and the logarithms of survival times are approximately normally distributed.

Consider the survival time T such that log T is normally distributed with mean μ and variance σ². We then say that T is lognormally distributed and write T as Λ(μ, σ²). It should be noted that μ and σ² are not the mean and variance of the lognormal distribution. The hazard function of the lognormal distribution increases initially to a maximum and then decreases (almost as soon as the median is passed) to zero as time approaches infinity. Therefore, the lognormal distribution is suitable for survival patterns with an initially increasing and then decreasing hazard rate. By a central limit theorem, it can be shown that the distribution of the product of n independent positive variates approaches a lognormal distribution under very general conditions: for example, the distribution of the size of an organism whose growth is subject to many small impulses, the effect of each of which is proportional to the momentary size of the organism.
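The rise and fall of the lognormal hazard can be checked numerically. The sketch below uses the standard forms S(t) = 1 – Φ((log t – μ)/σ) and f(t) = φ((log t – μ)/σ)/(tσ), where Φ and φ are the standard normal distribution and density functions, and evaluates Φ through the complementary error function. The values μ = 0 and σ = 1 are hypothetical.

```python
import math


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi(z), with z = (log t - mu) / sigma, for t > 0."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def lognormal_density(t, mu, sigma):
    """f(t) = phi(z) / (t * sigma), for t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))


def lognormal_hazard(t, mu, sigma):
    return lognormal_density(t, mu, sigma) / lognormal_survival(t, mu, sigma)


mu, sigma = 0.0, 1.0  # hypothetical parameters of log T
for t in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(t, round(lognormal_hazard(t, mu, sigma), 4))

# The median of T is exp(mu), whereas the mean is exp(mu + sigma**2 / 2).
print(lognormal_survival(math.exp(mu), mu, sigma))
```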

Gamma Distributions

The gamma distribution, which includes the exponential and chi-square distribution, was used a long time ago by Brown and Flood to describe the life of glass tumblers circulating in a cafeteria and by Birnbaum and Saunders as a statistical model for life length of materials. Since then, this distribution has been used frequently as a model for industrial reliability problems and human survival.

Suppose that failure or death takes place in n stages or as soon as n subfailures have happened. At the end of the first stage, after time T1, the first subfailure occurs; after that the second stage begins and the second subfailure occurs after time T2; and so on. Total failure or death occurs at the end of the nth stage, when the nth subfailure happens. The survival time, T, is then T1 + T2 + … + Tn. The times T1, T2, …, Tn spent in each stage are assumed to be independently exponentially distributed with probability density function λ exp(–λt_i), i = 1, …, n. That is, the subfailures occur independently at a constant rate λ. The distribution of T is then called the Erlangian distribution. There is no need for the stages to have physical significance since we can always assume that death occurs in the n-stage process just described. This idea, introduced by A. K. Erlang in his study of congestion in telephone systems, has been used widely in queuing theory and life processes.

The gamma distribution is characterized by two parameters, γ and λ. When 0 < γ < 1, there is negative aging and the hazard rate decreases monotonically from infinity to λ as time increases from 0 to infinity. When γ > 1, there is positive aging and the hazard rate increases monotonically from 0 to λ as time increases from 0 to infinity. When γ = 1, the hazard rate equals λ, a constant, as in the exponential case.
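For an integer number of stages n (the Erlangian case, γ = n), the density and survivorship function have closed forms, f(t) = λ(λt)^(n–1) exp(–λt)/(n–1)! and S(t) = exp(–λt) Σ (λt)^k/k! for k = 0, …, n – 1, so the hazard can be computed with the standard library alone. The sketch below is limited to that integer case, and the parameter values are hypothetical.

```python
import math


def erlang_density(t, n, lam):
    """Density of the sum of n independent exponential(lam) stage times."""
    return lam * (lam * t) ** (n - 1) * math.exp(-lam * t) / math.factorial(n - 1)


def erlang_survival(t, n, lam):
    """P(T > t): fewer than n subfailures have occurred by time t."""
    return math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k) for k in range(n))


def erlang_hazard(t, n, lam):
    return erlang_density(t, n, lam) / erlang_survival(t, n, lam)


lam = 2.0  # hypothetical subfailure rate
for n in (1, 3):
    print(n, [round(erlang_hazard(t, n, lam), 4) for t in (0.5, 2.0, 50.0)])
```

With n = 1 the hazard equals λ at every time; with n = 3 it climbs from 0.4 at t = 0.5 toward λ = 2 without exceeding it.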

Log-logistic Distribution

The survival time T has a log-logistic distribution if log(T) has a logistic distribution. The density, survivorship, hazard, and cumulative hazard functions of the log-logistic distribution are, respectively,

$$f(t) = \frac{\alpha\gamma t^{\gamma - 1}}{\left(1 + \alpha t^{\gamma}\right)^{2}}, \qquad S(t) = \frac{1}{1 + \alpha t^{\gamma}}, \qquad h(t) = \frac{\alpha\gamma t^{\gamma - 1}}{1 + \alpha t^{\gamma}}, \qquad H(t) = \log\left(1 + \alpha t^{\gamma}\right)$$

The log-logistic distribution is characterized by two parameters, α and γ. The median of the log-logistic distribution is α^(–1/γ). When γ > 1, the log-logistic hazard has the value 0 at time 0, increases to a peak at a specific t, and then declines, which is similar to the lognormal hazard. When γ = 1, the hazard starts at α^(1/γ) and then declines monotonically. When γ < 1, the hazard starts at infinity and then declines, which is similar to the Weibull distribution. The hazard function declines toward 0 as t approaches infinity. Thus, the log-logistic distribution may be used to describe a first increasing and then decreasing hazard or a monotonically decreasing hazard.
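The stated median and hazard shapes can be verified with a short script. It assumes the parameterization S(t) = 1/(1 + αt^γ) and h(t) = αγt^(γ–1)/(1 + αt^γ), which is consistent with the median α^(–1/γ) quoted above; the parameter values are hypothetical.

```python
def loglogistic_survival(t, alpha, gamma):
    """S(t) = 1 / (1 + alpha * t ** gamma), for t >= 0."""
    return 1.0 / (1.0 + alpha * t ** gamma)


def loglogistic_hazard(t, alpha, gamma):
    """h(t) = alpha * gamma * t ** (gamma - 1) / (1 + alpha * t ** gamma)."""
    return alpha * gamma * t ** (gamma - 1) / (1.0 + alpha * t ** gamma)


alpha, gamma = 0.25, 2.0            # hypothetical parameters
median = alpha ** (-1.0 / gamma)    # time at which S(t) = 0.5
print(median, loglogistic_survival(median, alpha, gamma))

# gamma > 1: hazard starts at 0, rises to a peak, then declines.
print([round(loglogistic_hazard(t, alpha, gamma), 4) for t in (0.0, 1.0, 2.0, 4.0)])

# gamma = 1: hazard starts at alpha and declines monotonically.
print([round(loglogistic_hazard(t, alpha, 1.0), 4) for t in (0.0, 1.0, 4.0)])
```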

Other Survival Distributions

Many other distributions can be used as models of survival time, three of which we discuss briefly in this section: the linear exponential, the Gompertz, and a distribution whose hazard rate is a step function. The linear-exponential model and the Gompertz distribution are extensions of the exponential distribution. Both describe survival patterns that have a constant initial hazard rate. The hazard rate varies as a linear function of time or age in the linear-exponential model and as an exponential function of time or age in the Gompertz distribution.

In demonstrating the use of the linear-exponential model, Broadbent uses as an example the service of milk bottles that are filled in a dairy, circulated to customers, and returned empty to the dairy. The model was also used by Carbone et al. to describe the survival pattern of patients with plasmacytic myeloma. The hazard function of the linear-exponential distribution is

$$h(t) = \lambda + \gamma t$$

where λ and γ can be any values such that h(t) is nonnegative. The hazard rate increases from λ with time if γ > 0, decreases if γ < 0, and remains constant (the exponential case) if γ = 0. The probability density function and the survivorship function are, respectively,

$$f(t) = (\lambda + \gamma t) \exp\left[-\left(\lambda t + \tfrac{1}{2}\gamma t^{2}\right)\right], \qquad S(t) = \exp\left[-\left(\lambda t + \tfrac{1}{2}\gamma t^{2}\right)\right]$$

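The Gompertz distribution is also characterized by two parameters, λ and γ. The hazard function, survivorship function, and probability density function are, respectively,

$$h(t) = \exp(\lambda + \gamma t)$$

$$S(t) = \exp\left[-\frac{e^{\lambda}}{\gamma}\left(e^{\gamma t} - 1\right)\right]$$

$$f(t) = \exp\left[(\lambda + \gamma t) - \frac{1}{\gamma}\left(e^{\lambda + \gamma t} - e^{\lambda}\right)\right]$$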
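For both extensions of the exponential model, the survivorship function follows from the hazard through S(t) = exp[–∫ h(u) du], with the integral taken from 0 to t. The sketch below checks this numerically with the trapezoid rule. It assumes the hazards h(t) = λ + γt (linear-exponential) and h(t) = exp(λ + γt) (one common way of writing the Gompertz hazard), with the closed-form survivorship functions that integrate them; all parameter values are hypothetical, and the Gompertz closed form requires γ ≠ 0.

```python
import math


def linear_exp_hazard(t, lam, gamma):
    return lam + gamma * t


def linear_exp_survival(t, lam, gamma):
    return math.exp(-(lam * t + 0.5 * gamma * t * t))


def gompertz_hazard(t, lam, gamma):
    return math.exp(lam + gamma * t)


def gompertz_survival(t, lam, gamma):
    # Closed form of exp(-integral of exp(lam + gamma * u) du); needs gamma != 0.
    return math.exp(-(math.exp(lam) / gamma) * (math.exp(gamma * t) - 1.0))


def survival_from_hazard(hazard, t, steps=2000):
    """Approximate S(t) = exp(-integral of hazard over [0, t]) by the trapezoid rule."""
    width = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t))
    area += sum(hazard(k * width) for k in range(1, steps))
    return math.exp(-area * width)


t = 3.0
print(survival_from_hazard(lambda u: linear_exp_hazard(u, 0.1, 0.05), t),
      linear_exp_survival(t, 0.1, 0.05))
print(survival_from_hazard(lambda u: gompertz_hazard(u, -2.0, 0.3), t),
      gompertz_survival(t, -2.0, 0.3))
```

Each printed pair should agree closely: exactly up to rounding for the linear hazard, and to several decimal places for the Gompertz hazard.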

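Finally, we consider a distribution where the hazard rate is a step function. Writing c_j for the constant hazard on the jth interval, with cut points 0 = t_0 < t_1 < … < t_{k–1} and t_k = ∞, the hazard rate, survival function, and probability density function are, respectively,

$$h(t) = c_j, \quad t_{j-1} \le t < t_j, \quad j = 1, \ldots, k$$

$$S(t) = \exp\left[-\sum_{m=1}^{j-1} c_m \left(t_m - t_{m-1}\right) - c_j \left(t - t_{j-1}\right)\right], \quad t_{j-1} \le t < t_j$$

$$f(t) = h(t)\,S(t) = c_j\,S(t), \quad t_{j-1} \le t < t_j$$

One application of this distribution is life-table analysis. In a life-table analysis, time is divided into intervals and the hazard rate is assumed to be constant in each interval. However, the overall hazard rate is not necessarily constant.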
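A step-function hazard leads to a survivorship function that is exponential within each interval, with the cumulative hazard accumulated across the intervals already passed. The sketch below illustrates this; the function names, the cut point, and the rates are hypothetical.

```python
import math


def piecewise_hazard(t, breakpoints, rates):
    """Hazard at time t; rates[j] applies on [breakpoints[j-1], breakpoints[j])."""
    for end, rate in zip(list(breakpoints) + [math.inf], rates):
        if t < end:
            return rate


def piecewise_survival(t, breakpoints, rates):
    """S(t) = exp(-cumulative hazard), summing rate * time spent in each interval."""
    cumulative_hazard = 0.0
    start = 0.0
    for end, rate in zip(list(breakpoints) + [math.inf], rates):
        if t <= end:
            cumulative_hazard += rate * (t - start)
            break
        cumulative_hazard += rate * (end - start)
        start = end
    return math.exp(-cumulative_hazard)


# Hypothetical model: hazard 0.1 per year before year 2, 0.3 per year afterwards.
breaks, rates = [2.0], [0.1, 0.3]
for t in (1.0, 2.0, 3.0):
    print(t, piecewise_hazard(t, breaks, rates), round(piecewise_survival(t, breaks, rates), 4))
```

The hazard is constant within each interval but not overall, while the survivorship function remains continuous at the cut point.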


The distributions described above are, among others, reasonable models for the distribution of survival time. All have been designed by considering a biological failure, a death process, or an aging property. They may or may not be appropriate for many practical situations, but the objective here is to illustrate the various possible techniques, assumptions, and arguments that can be used to choose the most appropriate model. If none of these distributions fits the data, investigators might have to derive an original model to suit the particular data, perhaps by using some of the ideas presented here.