Of the three survival functions, the survivorship function, or its graphical presentation, the survival curve, is the most widely used. Examples include the product-limit (PL) method of estimating the survivorship function by Kaplan and Meier, life-table analysis, the relative survival rate, the five-year survival rate, and the corrected survival rate. The product-limit method is applicable to small, moderate, and large samples. However, if the data have already been grouped into intervals, or the sample size is very large, say in the thousands, or the interest is in a large population, it may be more convenient to perform a life-table analysis. The PL estimates and life-table estimates of the survivorship function are essentially the same. Many authors use the term life-table estimates for the PL estimates. The only difference is that the PL estimate is based on individual survival times, whereas in the life-table method, survival times are grouped into intervals. The PL estimate can be considered a special case of the life-table estimate in which each interval contains only one observation.
Product-Limit Estimates of Survivorship Function
Let us first consider the simple case where all the patients are observed to death so that the survival times are exact and known. Let t_{1}, t_{2}, …, t_{n} be the exact survival times of the n individuals under study. Conceptually, we consider this group of patients as a random sample from a much larger population of similar patients. We relabel the n survival times t_{1}, t_{2}, …, t_{n} in ascending order such that t_{(1)} <= t_{(2)} <= … <= t_{(n)}. By the definition of the survivorship function, its value at t_{(i)} can then be estimated as
S^(t_{(i)}) = (n – i)/n = 1 – i/n
where n – i is the number of people in the sample surviving longer than t_{(i)}. If two or more t_{(i)} are equal (tied observations), the largest i value is used. This gives a conservative estimate for the tied observations. In practice, the sample survivorship function is computed at every distinct survival time. We do not have to worry about the intervals between the distinct survival times, in which no one dies and the survivorship function remains constant. The survivorship function in this setting is a step function starting at 1.0 (100%) and decreasing in steps of 1/n (if there are no ties) to zero. When the survivorship function is plotted versus t, the various percentiles of survival time can be read from the graph or calculated from the survivorship function.
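The estimate for fully observed data can be sketched in a few lines of Python. This is a minimal illustration, assuming every survival time is an exact death time; the function name empirical_survivorship and the sample times are hypothetical and do not come from the text.

```python
from collections import Counter

def empirical_survivorship(times):
    """Return [(t, S_hat(t))] at each distinct survival time.

    Assumes every patient is followed to death (no censoring), so
    S_hat(t_(i)) = (n - i) / n, with the largest rank i used for ties.
    """
    n = len(times)
    counts = Counter(times)
    curve = []
    rank = 0  # rank i of the last observation at the current time
    for t in sorted(counts):
        rank += counts[t]  # largest i among the observations tied at t
        curve.append((t, (n - rank) / n))
    return curve

# Hypothetical survival times in months; the value 6 is tied.
times = [2, 4, 6, 6, 9]
for t, s in empirical_survivorship(times):
    print(t, s)
# 2 0.8
# 4 0.6
# 6 0.2
# 9 0.0
```

Because the curve is reported only at distinct death times, the tie at 6 months produces a single drop of 2/n rather than two drops of 1/n.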
This method can be applied only if all patients are followed to death. If some of the patients are still alive at the end of the study, a different method of estimating survivorship function, such as the PL estimate given by Kaplan and Meier, is required. The rationale can be illustrated by the following simple example. Suppose that 10 patients join a clinical study at the beginning of 2000; during that year 6 patients die and 4 survive. At the end of the year, 20 additional patients join the study. In 2001, 3 patients who entered in the beginning of 2000 and 15 patients who entered later die, leaving one and five survivors, respectively. Suppose that the study terminates at the end of 2001 and you want to estimate the proportion of patients in the population surviving for two years or more, that is, S(2).
The first group of patients in the example is followed for two years; the second group is followed for only one year. One possible estimate, the reduced-sample estimate, is S^(2) = 1/10 = 0.1, which ignores the 20 patients who are followed only for one year. Kaplan and Meier believe that the second sample, under observation for only one year, can contribute to the estimate of S(2).
Patients who survived two years may be considered as surviving the first year and then surviving one more year. Thus, the probability of surviving for two years or more is equal to the probability of surviving the first year and then surviving one more year. That is,
S(2) = P(surviving first year and then surviving one more year)
which can be written as
S(2) = P(surviving two years given patient has survived first year) x P(surviving first year)
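Following the decomposition above, the Kaplan-Meier estimate of S(2) is
S^(2) = (proportion of patients surviving two years given that they survived the first year) x (proportion of patients surviving the first year)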
For the data given above, one of the four patients who survived the first year survived two years, so the first proportion is 1/4. Four of the 10 patients who entered at the beginning of 2000 and 5 of the 20 patients who entered at the end of 2000 survived one year. Therefore, the second proportion is (4 + 5) / (10 + 20). The PL estimate of S(2) is
S^(2) = 1/4 x (4 + 5)/(10 + 20) = 1/4 x 9/30 = 0.075
This simple rule may be generalized as follows: The probability of surviving k (>=2) or more years from the beginning of the study is a product of k observed survival rates:
S^(k) = p_{1} x p_{2} x p_{3} x … x p_{k}
where p_{1} denotes the proportion of patients surviving at least one year, p_{2} the proportion of patients surviving the second year after they have survived one year, p_{3} the proportion of patients surviving the third year after they have survived two years, and p_{k} the proportion of patients surviving the kth year after they have survived k – 1 years.
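As a numerical check of the product rule, the sketch below multiplies the yearly survival proportions from the two-year example, p_{1} = (4 + 5)/(10 + 20) and p_{2} = 1/4, and compares the result with the reduced-sample estimate 1/10. The helper name pl_from_yearly_rates is hypothetical; exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction
from math import prod

def pl_from_yearly_rates(rates):
    """Multiply the conditional yearly survival proportions p_1, ..., p_k."""
    return prod(rates, start=Fraction(1))

# Figures from the two-year example in the text.
p1 = Fraction(4 + 5, 10 + 20)  # survived the first year, both groups pooled
p2 = Fraction(1, 4)            # survived the second year, given the first
s2_pl = pl_from_yearly_rates([p1, p2])
s2_reduced = Fraction(1, 10)   # reduced-sample estimate, first group only

print(s2_pl, float(s2_pl))     # 3/40 0.075
print(float(s2_reduced))       # 0.1
```

The two estimates differ because the PL estimate lets the 20 later entrants contribute to p_{1}, whereas the reduced-sample estimate discards them.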
Therefore, the PL estimate of the probability of surviving any particular number of years from the beginning of the study is the product of the same estimate up to the preceding year and the observed survival rate for that particular year, that is,
S^(k) = S^(k – 1) x p_{k}
The PL estimates are maximum likelihood estimates. In practice, the PL estimates can be calculated by constructing a table with five columns following the outline below.
- Column 1 contains all the survival times, both censored and uncensored, in order from smallest to largest. Affix a plus sign to each censored observation. If a censored observation has the same value as an uncensored observation, the latter should appear first.
- The second column, labeled i, consists of the corresponding rank of each observation in column 1.
- The third, labeled r, pertains to uncensored observations only. Let r = i.
- Compute (n – r) / (n – r + 1), or p_{i}, for every uncensored observation t_{(i)} in column 4 to give the proportion of patients surviving up to and then through t_{(i)}.
- In column 5, S^(t) is the product of all values of (n – r) / (n – r + 1) up to and including t. If some uncensored observations are ties, the smallest S^(t) should be used.
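The five-column outline above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names pl_table and pl_curve, the (time, died) input format, and the sample data are all hypothetical.

```python
def pl_table(observations):
    """Build the five-column product-limit table.

    observations: list of (time, died) pairs; died=False marks a censored time.
    Returns rows (time, died, i, r, p, S_hat); r, p and S_hat are None
    for censored rows.
    """
    n = len(observations)
    # Column 1: sort by time; at tied times the uncensored value comes first.
    ordered = sorted(observations, key=lambda obs: (obs[0], not obs[1]))
    rows = []
    s_hat = 1.0
    for i, (t, died) in enumerate(ordered, start=1):  # column 2: rank i
        if died:
            r = i                      # column 3: r = i, uncensored only
            p = (n - r) / (n - r + 1)  # column 4
            s_hat *= p                 # column 5: running product
            rows.append((t, True, i, r, p, s_hat))
        else:
            rows.append((t, False, i, None, None, None))
    return rows

def pl_curve(rows):
    """Keep one S_hat per distinct death time (the smallest, for ties)."""
    curve = {}
    for t, died, _i, _r, _p, s_hat in rows:
        if died:
            curve[t] = s_hat  # a later tied row overwrites with a smaller value
    return curve

# Illustrative data: (time in months, died?); False means censored ("+").
data = [(3.0, True), (4.0, False), (5.7, False), (6.5, True), (6.5, True),
        (8.4, False), (10.0, True), (10.0, False), (12.0, True), (15.0, True)]
for t, s in pl_curve(pl_table(data)).items():
    print(t, round(s, 3))
```

Because the largest observation in this sample is uncensored, the last factor is 0/1 and the estimate drops to zero at 15.0.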
The Kaplan-Meier method provides very useful estimates of survival probabilities and a graphical presentation of the survival distribution. It is the most widely used method in survival data analysis. Breslow and Crowley and Meier have shown that under certain conditions, the estimate is consistent and asymptotically normal. However, a few critical features should be mentioned.
- The Kaplan-Meier estimates are limited to the time interval in which the observations fall. If the largest observation is uncensored, the PL estimate at that time equals zero. Although the estimate may not be welcomed by physicians, it is correct since no one in the sample lives longer. If the largest observation is censored, the PL estimate can never equal zero and is undefined beyond the largest observation.
- The most commonly used summary statistic in survival analysis is the median survival time. A simple estimate of the median can be read from survival curves estimated by the PL method as the time t at which S^(t) = 0.5. However, the solution may not be unique: if the survival curve is horizontal at S^(t) = 0.5 over an interval from t_{1} to t_{2}, any t value in that interval is a reasonable estimate of the median. A practical solution is to take the midpoint of the interval as the PL estimate of the median.
- If less than 50% of the observations are uncensored and the largest observation is censored, the median survival time cannot be estimated. A practical way to handle the situation is to use probabilities of surviving a given length of time, say 1, 3, or 5 years, or the mean survival time limited to a given time t.
- The PL method assumes that the censoring times are independent of the survival times. In other words, the reason an observation is censored is unrelated to the cause of death. This assumption is true if the patient is still alive at the end of the study period. However, the assumption is violated if the patient develops severe adverse effects from the treatment and is forced to leave the study before death or if the patient died of a cause other than the one under study. When there is inappropriate censoring, the PL method is not appropriate. In practice, one way to alleviate the problem is to avoid it or to reduce it to a minimum.
- Similar to other estimators, the standard error (S.E.) of the Kaplan-Meier estimator of S(t) gives an indication of the potential error of S^(t). The confidence interval deserves more attention than the point estimate S^(t) alone. An approximate 95% confidence interval for S(t) is S^(t) ± 1.96 S.E.[S^(t)].
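Two of the points above, reading the median from the PL curve and forming the interval S^(t) ± 1.96 S.E., can be sketched as follows. The function names pl_median and normal_ci, the input format, and the numbers are hypothetical. Two choices here are assumptions rather than part of the text: the interval is truncated to the range 0 to 1, and the standard error is taken as given rather than computed.

```python
import math

def pl_median(curve):
    """Estimate the median from [(t, S_hat(t))] at ascending death times.

    Returns None when S_hat never falls to 0.5, in which case the
    median cannot be estimated from the curve.
    """
    for j, (t, s) in enumerate(curve):
        if math.isclose(s, 0.5):
            # The curve is flat at 0.5 until the next death time:
            # take the midpoint of that interval.
            if j + 1 < len(curve):
                return (t + curve[j + 1][0]) / 2
            return t  # assumption: no later death time, report t itself
        if s < 0.5:
            return t  # the curve steps across 0.5 at t
    return None

def normal_ci(s_hat, se, z=1.96):
    """S_hat ± z * S.E., truncated to [0, 1] (truncation is an assumption)."""
    return max(0.0, s_hat - z * se), min(1.0, s_hat + z * se)

# Hypothetical PL curves.
print(pl_median([(3.0, 0.9), (6.5, 0.643), (10.0, 0.482), (15.0, 0.0)]))  # 10.0
print(pl_median([(2, 0.75), (5, 0.5), (9, 0.25)]))                        # 7.0
print(pl_median([(2, 0.9), (4, 0.7)]))                                    # None

# Hypothetical point estimate and standard error.
print(normal_ci(0.482, 0.1))
```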
Life-Table Analysis
The life-table method is one of the oldest techniques for measuring mortality and describing the survival experience of a population. It has been used by actuaries, demographers, governmental agencies, and medical researchers in studies of survival, population growth, fertility, migration, length of married life, length of working life, and so on. There has been a decennial series of life tables on the entire U.S. population since 1900. States and local governments also publish life tables. These life tables, summarizing the mortality experience of a specific population for a specific period of time, are called population life tables. As clinical and epidemiologic research has become more common, the life-table method has been applied to patients with a given disease who have been followed for a period of time. Life tables constructed for patients are called clinical life tables. Although population and clinical life tables are similar in calculation, the sources of required data are different.
There are two kinds of population life tables: the cohort life table and the current life table. The cohort life table describes the survival or mortality experience from birth to death of a specific cohort of persons who were born at about the same time, for example, all persons born in 1950. The cohort has to be followed from 1950 until all of them die. The proportion dying (or surviving) is then used to construct life tables for successive calendar years. This type of table, useful in population projection and prospective studies, is not often constructed since it requires a long follow-up period.
The current life table is constructed by applying the age-specific mortality rates of a population in a given period of time to a hypothetical cohort of 100,000 or 1,000,000 persons. The starting point is birth at year 0. Two sources of data are required for constructing a population life table: 1) census data on the number of living persons at each age for a given year at midyear and 2) vital statistics on the number of deaths in the given year for each age. For example, a current U.S. life table assumes a hypothetical cohort of 100,000 persons that is subject to the age-specific death rates based on the observed data for the United States in the 1900 census. The current life table, based on the life experience of an actual population over a short period of time, gives a good summary of current mortality. This type of life table is regularly published by government agencies at different levels. One of the most often reported statistics from current life tables is the life expectancy. The term population life table is often used to refer to the current life table.
Current life tables usually have the following columns:
- Age interval (x to x + t). This is the time interval between two exact ages x and x + t; t is the length of the interval. For example, the interval 20-21 includes the time interval from the 20th birthday up to the 21st birthday (but not including the 21st birthday).
- Proportion of persons alive at beginning of age interval but dying during the interval (_{t}q_{x}). The information is obtained from census data. For example, _{t}q_{x} for age interval 20-21 is the proportion of persons who died on or after their 20th birthday and before their 21st birthday. It is an estimate of the conditional probability of dying in the interval given that the person is alive at age x. This column is usually calculated from data of the decennial census of population and deaths occurring in the given time interval.
- Number living at beginning of age interval (l_{x}). The initial value of l_{x}, the size of the hypothetical population, is usually 100,000 or 1,000,000. The successive values are computed using the formula l_{x} = l_{x-t}(1 – _{t}q_{x-t}), where 1 – _{t}q_{x-t} is the proportion of persons who survived the previous age interval.
- Number dying during age interval (_{t}d_{x}). This is given by _{t}d_{x} = l_{x}(_{t}q_{x}) = l_{x} – l_{x+t}.
- Stationary population (_{t}L_{x} and T_{x}). Here _{t}L_{x} is the total number of years lived in the age interval x to x + t, or the number of person-years that l_{x} persons, aged x exactly, live through the interval. For those who survive the interval, their contribution to _{t}L_{x} is the length of the interval, t. For those who die during the interval, we may not know exactly the time of death and the survival time must be estimated. The conventional assumption is that they live one-half of the interval and contribute t/2 to the calculation of _{t}L_{x}. Thus, _{t}L_{x} = t(l_{x+t} + 1/2*_{t}d_{x}). The symbol T_{x} is the total number of person-years lived beyond age x by persons alive at that age, that is, the sum of _{t}L_{x} over the interval beginning at age x and all later intervals.
- Average remaining lifetime or average number of years of life remaining at beginning of age interval (e^{0}_{x}). This is also known as the life expectancy at a given age, which is defined as the average number of years remaining to be lived by persons at age x: e^{0}_{x} = T_{x}/l_{x}. The expected age at death of a person aged x is x + e^{0}_{x}. The e^{0}_{x} at x = 0 is the life expectancy at birth.
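The columns of a current life table can be generated from the interval death proportions alone. The sketch below is a toy illustration under stated assumptions: the function name current_life_table, the three one-year intervals, and the q values are hypothetical, and the final interval is closed artificially with q = 1 and a finite width so that the half-interval rule still applies.

```python
def current_life_table(ages, widths, q, radix=100_000):
    """Compute l_x, d_x, L_x, T_x and e_x from interval death proportions.

    ages[k] is the starting age x of interval k, widths[k] its length t,
    and q[k] the proportion dying in it; the last q should be 1.
    """
    rows = []
    l_x = float(radix)  # size of the hypothetical cohort
    for x, t, q_x in zip(ages, widths, q):
        d_x = l_x * q_x                   # number dying in the interval
        l_next = l_x - d_x                # number living at age x + t
        L_x = t * (l_next + 0.5 * d_x)    # deaths contribute t/2 each
        rows.append({"x": x, "q": q_x, "l": l_x, "d": d_x, "L": L_x})
        l_x = l_next
    total = 0.0
    for row in reversed(rows):            # T_x: sum of L from age x onward
        total += row["L"]
        row["T"] = total
        row["e"] = total / row["l"] if row["l"] > 0 else float("nan")
    return rows

# Toy table: three one-year intervals with made-up death proportions.
for row in current_life_table([0, 1, 2], [1, 1, 1], [0.1, 0.2, 1.0]):
    print(row["x"], round(row["l"]), round(row["d"]), round(row["L"]),
          round(row["T"]), round(row["e"], 2))
```

In this toy table the life expectancy at birth works out to 212,000/100,000 = 2.12 years. A published table would normally end with an open-ended age interval, which needs a separate rule for its person-years.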
The clinical life table, or actuarial life-table method, has been applied to clinical data for many decades. Berkson and Gage and Cutler and Ederer give a life-table method for estimating the survivorship function; Gehan provides methods for estimating all three functions (survivorship, density, and hazard).
The life-table method requires a fairly large number of observations, so that survival times can be grouped into intervals. Similar to the PL estimate, the life-table method incorporates all survival information accumulated up to the termination of the study. For example, in computing a five-year survival rate of breast cancer patients, one need not restrict oneself only to those patients who have been in the study for five or more years. Patients who have been in the study for four, three, two, or even one year contribute useful information to the evaluation of five-year survival. In this way, the life-table technique uses incomplete data such as losses to follow-up and persons withdrawn alive as well as complete death data.
Columns of clinical life table:
- Interval (t_{i} to t_{i+1}). The first column gives the intervals into which the survival times and times to loss or withdrawal are distributed. The interval is from t_{i} up to but not including t_{i+1}, i = 1, …, s. The last interval has an infinite length. These intervals are assumed to be fixed.
- Midpoint (t_{mi}). The midpoint of each interval, designated t_{mi}, i = 1, …, s – 1, is included for convenience in plotting the hazard and probability density functions. Both functions are plotted at t_{mi}.
- Width (b_{i}). The width of each interval, b_{i} = t_{i+1} – t_{i}, i = 1, …, s – 1, is needed for calculation of the hazard and density functions. The width of the last interval, b_{s}, is theoretically infinite; no estimate of the hazard or density function can be obtained for this interval.
- Number lost to follow-up (l_{i}). This is the number of people who are lost to observation and whose survival status is thus unknown in the ith interval (i = 1, …, s).
- Number withdrawn alive (w_{i}). People withdrawn alive in the ith interval are those known to be alive at the closing date of the study. The survival time recorded for such persons is the length of time from entrance to the closing date of the study.
- Number dying (d_{i}). This is the number of people who die in the ith interval. The survival time of these people is the time from entrance to death.
- Number entering the ith interval (n^{'}_{i}). The number of people entering the first interval n^{'}_{1} is the total sample size. Other entries are determined from n^{'}_{i} = n^{'}_{i-1} – l_{i-1} – w_{i-1} – d_{i-1}. That is, the number of persons entering the ith interval is equal to the number studied at the beginning of the preceding interval minus those who are lost to follow-up, withdrawn alive, or have died in the preceding interval.
- Number exposed to risk (n_{i}). This is the number of people who are exposed to risk in the ith interval and is defined as n_{i} = n^{'}_{i} – 1/2*(l_{i} + w_{i}). It is assumed that the times to loss or withdrawal are approximately uniformly distributed in the interval. Therefore, people lost or withdrawn in the interval are exposed to risk of death for one-half the interval. If there are no losses or withdrawals, n_{i} = n^{'}_{i}.
- Conditional proportion dying (q^{^}_{i}). This is defined as q^{^}_{i} = d_{i}/n_{i} for i = 1, …, s – 1, and q^{^}_{s} = 1. It is an estimate of the conditional probability of death in the ith interval given exposure to the risk of death in the ith interval.
- Conditional proportion surviving (p^{^}_{i}). This is given by p^{^}_{i} = 1 – q^{^}_{i}, which is an estimate of the conditional probability of surviving in the ith interval.
- Cumulative proportion surviving [S^{^}(t_{i})]. This is an estimate of the survivorship function at time t_{i}; it is often referred to as the cumulative survival rate. For i = 1, S^{^}(t_{1}) = 1 and for i = 2, …, s, S^{^}(t_{i}) = p^{^}_{i-1}S^{^}(t_{i-1}). It is the usual life-table estimate and is based on the fact that surviving to the start of the ith interval means surviving to the start of and then through the (i – 1)th interval.
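- Estimated probability density function [f^{^}(t_{mi})]. This is defined as the probability of dying in the ith interval per unit width. Thus, a natural estimate at the midpoint of the interval is
  f^{^}(t_{mi}) = [S^{^}(t_{i}) – S^{^}(t_{i+1})] / b_{i} = S^{^}(t_{i})q^{^}_{i} / b_{i}, i = 1, …, s – 1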
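- Hazard function [h^{^}(t_{mi})]. The hazard function for the ith interval, estimated at the midpoint, is
  h^{^}(t_{mi}) = d_{i} / [b_{i}(n_{i} – 1/2*d_{i})] = 2q^{^}_{i} / [b_{i}(1 + p^{^}_{i})], i = 1, …, s – 1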
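It is the number of deaths per unit time in the interval divided by the average number of survivors at the midpoint of the interval. That is, h^{^}(t_{mi}) is derived from f^{^}(t_{mi})/S^{^}(t_{mi}) with S^{^}(t_{mi}) = 1/2*[S^{^}(t_{i+1}) + S^{^}(t_{i})], since S^{^}(t_{i}) is defined as the probability of surviving at the beginning, not the midpoint, of the ith interval. Substituting f^{^}(t_{mi}) = [S^{^}(t_{i}) – S^{^}(t_{i+1})]/b_{i} and S^{^}(t_{i+1}) = p^{^}_{i}S^{^}(t_{i}) gives:
h^{^}(t_{mi}) = f^{^}(t_{mi}) / S^{^}(t_{mi}) = [S^{^}(t_{i})q^{^}_{i}/b_{i}] / {1/2*S^{^}(t_{i})(1 + p^{^}_{i})} = 2q^{^}_{i} / [b_{i}(1 + p^{^}_{i})]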
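The clinical life-table columns can be computed together in one pass, as in the sketch below. It uses the definitions from the list above, with the midpoint estimates taken as f = S^{^}(t_{i})q^{^}_{i}/b_{i} and h = 2q^{^}_{i}/[b_{i}(1 + p^{^}_{i})], which follow from those definitions. The function name clinical_life_table, the argument layout, and the counts are hypothetical; the sketch also assumes n_{i} > 0 in every interval.

```python
def clinical_life_table(starts, n_total, lost, withdrawn, died):
    """Compute the clinical life-table columns interval by interval.

    starts: interval starting times t_1, ..., t_s (last interval open-ended).
    lost, withdrawn, died: the counts l_i, w_i, d_i for each interval.
    """
    s = len(starts)
    rows = []
    n_enter = n_total  # n'_1 is the total sample size
    surv = 1.0         # S_hat(t_1) = 1
    for i in range(s):
        last = i == s - 1
        n_exposed = n_enter - 0.5 * (lost[i] + withdrawn[i])
        q = 1.0 if last else died[i] / n_exposed  # q_s = 1 by definition
        p = 1.0 - q
        row = {"start": starts[i], "n_enter": n_enter,
               "n_exposed": n_exposed, "q": q, "p": p, "S": surv}
        if not last:  # no density or hazard estimate for the open interval
            b = starts[i + 1] - starts[i]
            row["mid"] = starts[i] + b / 2
            row["f"] = surv * q / b
            row["h"] = 2 * q / (b * (1 + p))
        rows.append(row)
        n_enter = n_enter - lost[i] - withdrawn[i] - died[i]
        surv *= p  # S_hat(t_{i+1}) = p_i * S_hat(t_i)
    return rows

# Hypothetical study of 100 patients, yearly intervals, last one open-ended.
table = clinical_life_table(
    starts=[0, 1, 2, 3], n_total=100,
    lost=[0, 2, 0, 0], withdrawn=[0, 2, 6, 40], died=[20, 13, 12, 5])
for row in table:
    print(row["start"], row["n_exposed"], round(row["q"], 3),
          round(row["S"], 3), round(row.get("h", float("nan")), 3))
```

As a check on the hazard formula, the first interval gives 2(0.2)/[1(1 + 0.8)] = 0.222, which agrees with d_{1}/[b_{1}(n_{1} – 1/2*d_{1})] = 20/90.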