In most trials, participants have missing data for a variety of reasons. Perhaps they were unable to keep their scheduled clinic visits or could not undergo particular procedures or assessments. In some cases, follow-up of the participant was not completed as outlined in the protocol. The challenge is how to deal with missing data, or data of such poor quality that they are in essence missing. One approach is to withdraw participants with poor data from the analysis entirely. However, the remaining subset may no longer be representative of the population randomized, and there is no guarantee that the validity of the randomization has been maintained in this process.

Many methods to deal with this issue assume that the data are missing at random; that is, the probability of a measurement not being observed does not depend on what its value would have been. In some contexts, this may be a reasonable assumption, but for clinical trials, and clinical research in general, it would be difficult to confirm. It is, in fact, probably not a valid assumption, as the reason the data are missing is often associated with the health status of the participant. Thus, during trial design and conduct, every effort must be made to minimize missing data. If the amount of missing data is relatively small, the available analytic methods will probably be helpful. If the amount of missing data is substantial, there may be no method capable of rescuing the trial. Here, we discuss some of the issues that must be kept in mind when analyzing a trial with missing data.

Rubin provided a classification of missing data mechanisms. If data are missing for reasons unrelated to the measurement that would have been observed and unrelated to covariates, the data are “missing completely at random.” If the probability of being missing depends only on data that were observed, and not on the unobserved value itself, the data are “missing at random.” Statistical analyses based on likelihood inference are valid when the data are missing at random or missing completely at random. If a measure or index allows a researcher to estimate the probability of having missing data, say in a participant with poor adherence to the protocol, then using methods proposed by Rubin and others might allow some adjustment to reduce bias. However, adherence, as indicated earlier, is often associated with a participant’s outcome, and attempts to adjust for adherence can lead to misleading results.
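The consequence of the missingness mechanism can be seen in a small simulation (all numbers hypothetical). When values are missing completely at random, the observed data still estimate the true mean without bias; when sicker participants, those with higher values of the unobserved measurement, are more likely to miss visits, the complete-case estimate is biased.

```python
# Illustrative sketch with made-up data: how the missingness mechanism
# affects the complete-case mean.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=100.0, scale=15.0, size=100_000)  # true outcome, mean 100

# Missing completely at random: every value has the same 30% chance
# of being unobserved.
mcar_observed = y[rng.random(y.size) > 0.30]

# Missingness depending on the unobserved value itself: higher (sicker)
# values are preferentially missing.
p_missing = 1 / (1 + np.exp(-(y - 100) / 10))  # increases with y
mnar_observed = y[rng.random(y.size) > p_missing]

print(f"true mean:            {y.mean():.1f}")
print(f"MCAR observed mean:   {mcar_observed.mean():.1f}")  # ~100, unbiased
print(f"biased observed mean: {mnar_observed.mean():.1f}")  # well below 100
```

The second observed mean is pulled down by several points even though no value was recorded incorrectly; the mechanism, not the measurement, causes the bias.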

If participants do not adhere to the intervention and also do not return for follow-up visits, the primary outcome may not be obtained unless it is survival or some easily ascertained event. In this situation, an intention-to-treat analysis is not feasible and no analysis is fully satisfactory. Because withdrawal of participants from the analysis is known to be problematic, one approach is to “impute” or fill in the missing data so that standard analyses can be conducted. This is appealing if the imputation process can be done without introducing bias. There are many procedures for imputation. Methods based on multiple imputation are generally more robust than single imputation.

A commonly used single imputation method is to carry the last observed value forward. This method, also known as an endpoint analysis, requires the very strong and unverifiable assumption that all future observations, if they were available, would remain constant. Although commonly used, the last observation carried forward method is not generally recommended. Using the average value for all participants with available data, or using a regression model to predict the missing value, are alternatives; in either case, the assumption that the data are missing at random is necessary for proper inference.
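The two single-imputation schemes mentioned above can be sketched on made-up longitudinal data (participant rows, visit columns, with NaN marking a missed visit):

```python
# Sketch of two single-imputation schemes; the data are hypothetical.
import numpy as np

# Rows = participants, columns = scheduled visits; NaN = missed visit.
visits = np.array([
    [140.0, 135.0, np.nan, np.nan],  # dropped out after visit 2
    [150.0, 148.0, 145.0, 141.0],    # complete follow-up
    [160.0, np.nan, 150.0, 147.0],   # one missed interim visit
])

def locf(data):
    """Last observation carried forward: each NaN is replaced by the
    most recent observed value for that participant."""
    out = data.copy()
    for row in out:                      # rows are views; edits persist
        for j in range(1, row.size):
            if np.isnan(row[j]):
                row[j] = row[j - 1]
    return out

def mean_impute(data):
    """Replace each NaN with the mean of the observed values at that visit."""
    out = data.copy()
    col_means = np.nanmean(out, axis=0)  # per-visit mean, ignoring NaNs
    nan_idx = np.isnan(out)
    out[nan_idx] = np.take(col_means, np.nonzero(nan_idx)[1])
    return out

print(locf(visits))
print(mean_impute(visits))
```

Both methods produce a complete data set, but neither recovers the information that was lost: LOCF freezes the first participant at the visit-2 value, and mean imputation makes the missing values look like the average participant.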

A more complex approach is to conduct multiple imputations, typically using regression methods, and then perform a standard analysis for each imputation. The final analysis should take into consideration the variability across the imputations. As with single imputation, the inference based on multiple imputation depends on the assumption that the data are missing at random. Other technical approaches are not described here.
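The step of taking the variability across imputations into consideration is commonly done with Rubin's combining rules: the pooled point estimate is the average of the completed-data estimates, and the pooled variance adds a between-imputation component so that the extra uncertainty introduced by imputing is not ignored. A minimal sketch, with hypothetical results from five imputed data sets:

```python
# Sketch of Rubin's rules for pooling multiple-imputation analyses.
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    q_bar = estimates.mean()       # pooled point estimate
    w_bar = variances.mean()       # average within-imputation variance
    b = estimates.var(ddof=1)      # between-imputation variance
    t = w_bar + (1 + 1 / m) * b    # total variance
    return q_bar, t

# Hypothetical treatment-effect estimates from m = 5 imputed data sets.
est = [2.1, 2.4, 2.0, 2.3, 2.2]
var = [0.16, 0.15, 0.17, 0.16, 0.15]
q, t = pool_rubin(est, var)
print(f"pooled estimate {q:.2f}, total variance {t:.3f}, SE {t ** 0.5:.3f}")
```

Note that the total variance (0.188 here) exceeds the average within-imputation variance (0.158); a single imputation would have understated the uncertainty by omitting the between-imputation term.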

If the number of participants lost to follow-up differs in the study groups, the analysis of the data could be biased. For example, participants who are taking a new drug that has adverse effects may, as a consequence, miss scheduled clinic visits. Events may occur but be unobserved. These losses to follow-up would probably not be the same in the control group. In this situation, there may be a bias favoring the new drug. Even if the number lost to follow-up is the same in each study group, the possibility of bias still exists, because the participants who are lost in one group may have quite different prognoses and outcomes than those in the other group.
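This bias can be shown with a toy simulation (all rates hypothetical). The true event rate is identical in both arms, but adverse effects cause heavier loss to follow-up in the drug arm; if events among the lost participants are simply never observed, the drug arm appears to do better:

```python
# Toy simulation of differential loss to follow-up; all rates made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_event_rate = 0.20  # identical in both arms by construction

events_drug = rng.random(n) < true_event_rate
events_ctrl = rng.random(n) < true_event_rate

# Adverse effects cause 25% loss to follow-up on the drug,
# versus 5% unrelated loss in the control group.
lost_drug = rng.random(n) < 0.25
lost_ctrl = rng.random(n) < 0.05

# Events in lost participants occur but go unobserved.
obs_drug = (events_drug & ~lost_drug).mean()  # ~ 0.20 * 0.75 = 0.15
obs_ctrl = (events_ctrl & ~lost_ctrl).mean()  # ~ 0.20 * 0.95 = 0.19

print(f"observed event rate, drug:    {obs_drug:.3f}")
print(f"observed event rate, control: {obs_ctrl:.3f}")
```

The observed rates differ by about four percentage points in favor of the drug, even though by construction the drug has no effect on the outcome at all.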

An outlier is an extreme value markedly different from the remaining values. The concern is whether extreme values in the sample should be included in the analysis. This question may apply to a laboratory result, to the data from one of several areas in a hospital, or to the data from a clinic in a multicenter trial. Removing outliers is not recommended unless the data can be clearly shown to be erroneous. Even though a value may be an outlier, it could be correct, indicating that on occasion an extreme result is possible. This fact could be very important and should not be ignored.
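A practical alternative to deletion is to flag extreme values for review against the source documents while keeping them in the analysis, and to report a robust summary alongside the mean so the influence of the extreme value is visible. A minimal sketch using Tukey's interquartile-range fences on made-up laboratory results:

```python
# Sketch: flag outliers for review rather than deleting them.
import numpy as np

values = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 19.7])  # made-up lab results

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's fences

flagged = values[(values < lo) | (values > hi)]
print("flagged for source-document review:", flagged)
print(f"mean {values.mean():.2f} vs median {np.median(values):.2f}")
```

Here a single extreme value pulls the mean well above every other observation while the median is unaffected; reporting both makes the outlier's influence apparent without discarding a value that may be correct.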