## Type I and Type II Error in Statistics

We often use inferential statistics to make decisions or judgments about the value of a parameter, such as a population mean. For example, we might need to decide whether the mean weight, 𝜇, of all bags of pretzels packaged by a particular company differs from the advertised weight of 454 grams, or we might want to determine whether the mean age, 𝜇, of all cars in use has increased from the year 2000 mean of 9.0 years. One of the most commonly used methods for making such decisions or judgments is to perform a hypothesis test. A hypothesis is a statement that something is true. For example, the statement “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g” is a hypothesis. Typically, a hypothesis test involves two hypotheses: the null hypothesis and the alternative hypothesis (or research hypothesis). For instance, in the pretzel packaging example, the null hypothesis might be “the mean weight of all bags of pretzels packaged equals the advertised weight of 454 g,” and the alternative hypothesis might be “the mean weight of all bags of pretzels packaged differs from the advertised weight of 454 g.”

The first step in setting up a hypothesis test is to decide on the null hypothesis and the alternative hypothesis. Generally, the null hypothesis for a hypothesis test concerning a population mean, 𝜇, always specifies a single value for that parameter. Hence, we can express the null hypothesis as

H0: 𝜇 = 𝜇0

The choice of the alternative hypothesis depends on and should reflect the purpose of the hypothesis test. Three choices are possible for the alternative hypothesis.

• If the primary concern is deciding whether a population mean, 𝜇, is different from a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 ≠ 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a two-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is less than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 < 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a left-tailed test.
• If the primary concern is deciding whether a population mean, 𝜇, is greater than a specified value 𝜇0, we express the alternative hypothesis as Ha: 𝜇 > 𝜇0. A hypothesis test whose alternative hypothesis has this form is called a right-tailed test.

PS: A hypothesis test is called a one-tailed test if it is either left-tailed or right-tailed. After we have chosen the null and alternative hypotheses, we must decide whether to reject the null hypothesis in favor of the alternative hypothesis. The procedure for deciding is roughly this: if the sample data are consistent with the null hypothesis, do not reject it; if the sample data are inconsistent with the null hypothesis, reject it in favor of the alternative hypothesis. In practice, of course, we must have a precise criterion for deciding whether to reject the null hypothesis, which involves a test statistic, that is, a statistic calculated from the data that is used as a basis for deciding whether the null hypothesis should be rejected.

Type I and Type II Errors

In statistics, a type I error is rejecting the null hypothesis when it is in fact true, whereas a type II error is failing to reject the null hypothesis when it is in fact false. The probabilities of both type I and type II errors are useful (and essential) for evaluating the effectiveness of a hypothesis test, which involves analyzing the chances of making an incorrect decision. A type I error occurs if a true null hypothesis is rejected. The probability of that happening, the type I error probability, commonly called the significance level of the hypothesis test, is denoted 𝛼. A type II error occurs if a false null hypothesis is not rejected. The probability of that happening, the type II error probability, is denoted 𝛽. Ideally, both type I and type II errors should have small probabilities. Then the chance of making an incorrect decision would be small, regardless of whether the null hypothesis is true or false. We can design a hypothesis test to have any specified significance level. So, for instance, if not rejecting a true null hypothesis is important, we should specify a small value for 𝛼. However, in making our choice for 𝛼, we must keep in mind that, for a fixed sample size, the smaller we specify 𝛼, the larger will be 𝛽 (Key Fact 9.1). Consequently, we must always assess the risks involved in committing both types of errors and use that assessment as a method for balancing the type I and type II error probabilities.

The significance level, 𝛼, is the probability of making a type I error, that is, of rejecting a true null hypothesis. Therefore, if the hypothesis test is conducted at a small significance level (e.g., 𝛼 = 0.05), the chance of rejecting a true null hypothesis will be small. Thus, if we do reject the null hypothesis, we can be reasonably confident that the null hypothesis is false. In other words, if we do reject the null hypothesis, we conclude that the data provide sufficient evidence to support the alternative hypothesis.

However, we usually do not know the probability, 𝛽, of making a type II error, that is, of not rejecting a false null hypothesis. Consequently, if we do not reject the null hypothesis, we simply reserve judgment about which hypothesis is true. In other words, if we do not reject the null hypothesis, we conclude only that the data do not provide sufficient evidence to support the alternative hypothesis; we do not conclude that the data provide sufficient evidence to support the null hypothesis. In short, it might be true that there is a real difference, but the power of the statistical procedure is not high enough to detect it.
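The interplay of 𝛼 and 𝛽 can be illustrated with a small simulation. The sketch below is not from the text: it applies a two-tailed z-test at 𝛼 = 0.05 to the pretzel example, with an assumed (made-up) population standard deviation of 7.8 g, a sample size of 25, and a hypothetical true mean of 450 g for the type II case.

```python
import random
import statistics
from math import sqrt

def z_test_rejects(sample, mu0, sigma, crit=1.96):
    """Two-tailed z-test of H0: mu = mu0 at the 0.05 significance level."""
    z = (statistics.mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return abs(z) > crit

random.seed(1)
mu0, sigma, n, trials = 454.0, 7.8, 25, 20000

# Type I error rate: data generated under a TRUE null (mu really is 454)
type1 = sum(z_test_rejects([random.gauss(mu0, sigma) for _ in range(n)], mu0, sigma)
            for _ in range(trials)) / trials

# Type II error rate: data generated under a FALSE null (true mu is 450)
type2 = sum(not z_test_rejects([random.gauss(450.0, sigma) for _ in range(n)], mu0, sigma)
            for _ in range(trials)) / trials

print(f"estimated alpha = {type1:.3f}")  # should be close to the nominal 0.05
print(f"estimated beta  = {type2:.3f}")  # power of the test is 1 - beta
```

Rerunning with a smaller 𝛼 (a larger critical value) shows the trade-off directly: the estimated 𝛽 grows as 𝛼 shrinks, for the same sample size.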

## The Logic Behind Meta-analysis – Random-effects Model

The fixed-effect model starts with the assumption that the true effect size is the same in all studies. However, in many systematic reviews this assumption is implausible. When we decide to incorporate a group of studies in a meta-analysis, we assume that the studies have enough in common that it makes sense to synthesize the information, but there is generally no reason to assume that they are identical in the sense that the true effect size is exactly the same in all the studies. For example, suppose that we are working with studies that compare the proportion of patients developing a disease in two groups (vaccinated versus placebo). If the treatment works we would expect the effect size (say, the risk ratio) to be similar but not identical across studies. The effect size might be higher (or lower) when the participants are older, or more educated, or healthier than others, or when a more intensive variant of an intervention is used, and so on. Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be different effect sizes underlying different studies.

Or suppose that we are working with studies that assess the impact of an educational intervention. The magnitude of the impact might vary depending on the other resources available to the children, the class size, the age, and other factors, which are likely to vary from study to study. We might not have assessed these covariates in each study. Indeed, we might not even know what covariates actually are related to the size of the effect. Nevertheless, logic dictates that such factors do exist and will lead to variations in the magnitude of the effect.

One way to address this variation across studies is to perform a random-effects meta-analysis. In a random-effects meta-analysis we usually assume that the true effects are normally distributed. For example, in Figure 12.1 the mean of all true effect sizes is 0.60 but the individual effect sizes are distributed about this mean, as indicated by the normal curve. The width of the curve suggests that most of the true effects fall in the range of 0.50 to 0.70. Suppose that our meta-analysis includes three studies drawn from the distribution of studies depicted by the normal curve, and that the true effects in these studies happen to be 0.50, 0.55, and 0.65. If each study had an infinite sample size the sampling error would be zero and the observed effect for each study would be the same as the true effect for that study. If we were to plot the observed effects rather than the true effects, the observed effects would exactly coincide with the true effects.

Of course, the sample size in any study is not infinite and therefore the sampling error is not zero. If the true effect size for a study is 𝜗i, then the observed effect for that study will be less than or greater than 𝜗i, because of sampling error. This figure also highlights the fact that the distance between the overall mean and the observed effect in any given study consists of two distinct parts: true variation in effect sizes (𝜁i) and sampling error (𝜀i). More generally, the observed effect Yi for any study is given by the grand mean, the deviation of the study’s true effect from the grand mean, and the deviation of the study’s observed effect from the study’s true effect. That is,

Yi = 𝜇 + 𝜁i + 𝜀i.

Therefore, to predict how far the observed effect Yi is likely to fall from 𝜇 in any given study we need to consider both the variance of 𝜁i and the variance of 𝜀i. The distance from 𝜇 to each 𝜗i depends on the standard deviation of the distribution of the true effects across studies, called 𝜏 (or 𝜏² for its variance). The same value of 𝜏² applies to all studies in the meta-analysis, and in Figure 12.4 is represented by the normal curve at the bottom, which extends roughly from 0.50 to 0.70. The distance from 𝜗i to Yi depends on the sampling distribution of the sample effects about 𝜗i. This depends on the variance of the observed effect size from each study, VYi, and so will vary from one study to the next. In Figure 12.4 the curve for Study 1 is relatively wide while the curve for Study 2 is relatively narrow.
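The decomposition Yi = 𝜇 + 𝜁i + 𝜀i can be sketched in a few lines of code. The numbers below (grand mean 0.60, 𝜏 = 0.05, and the within-study standard errors) are illustrative assumptions in the spirit of the figures, not values taken from the text, and `draw_study` is a hypothetical helper name.

```python
import random

def draw_study(mu, tau, se, rng):
    """One study under the random-effects model: Y = mu + zeta + epsilon."""
    zeta = rng.gauss(0.0, tau)    # deviation of this study's true effect from mu
    theta = mu + zeta             # the study's true effect (theta_i)
    eps = rng.gauss(0.0, se)      # sampling error about the true effect
    return theta, theta + eps     # (true effect, observed effect Y_i)

rng = random.Random(1)
for i, se in enumerate([0.10, 0.04, 0.07], start=1):   # assumed within-study SDs
    theta, y = draw_study(0.60, 0.05, se, rng)
    print(f"study {i}: true effect = {theta:.3f}, observed Y = {y:.3f}")
```

Averaged over many draws, the variance of the observed effects is 𝜏² plus the within-study variance, which is exactly the point made above: both sources of variance separate the observed effect from 𝜇.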

Performing A Random-Effects Meta-Analysis

In an actual meta-analysis, of course, rather than start with the population effect and make projections about the observed effects, we start with the observed effects and try to estimate the population effect. In other words, our goal is to use the collection of Yi to estimate the overall mean, 𝜇. In order to obtain the most precise estimate of the overall mean (to minimize the variance) we compute a weighted mean, where the weight assigned to each study is the inverse of that study’s variance. To compute a study’s variance under the random-effects model, we need to know both the within-study variance and 𝜏², since the study’s total variance is the sum of these two values.

The parameter 𝜏² (tau-squared) is the between-studies variance (the variance of the effect size parameters across the population of studies). In other words, if we somehow knew the true effect size for each study, and computed the variance of these effect sizes (across an infinite number of studies), this variance would be 𝜏². One method for estimating 𝜏² is the method of moments (or DerSimonian and Laird) method, as follows:

T² = (Q − df) / C,

where

Q = Σ WiYi² − (Σ WiYi)² / Σ Wi,

df = k − 1, where k is the number of studies, and

C = Σ Wi − (Σ Wi²) / (Σ Wi).

In the fixed-effect analysis each study was weighted by the inverse of its variance. In the random-effects analysis, too, each study will be weighted by the inverse of its variance. The difference is that the variance now includes the original (within-studies) variance plus the estimate of the between-studies variance, T². To highlight the parallel between the formulas here (random effects) and those in the previous sections (fixed effect) we use the same notation but add an asterisk (*) to represent the random-effects version. Under the random-effects model the weight assigned to each study is

Wi* = 1 / VYi*,

where VYi* is the within-study variance for study i plus the between-studies variance, T². That is,

VYi* = VYi + T².

The weighted mean, M*, is then computed as

M* = (Σ Wi*Yi) / (Σ Wi*),

that is, the sum of the products (effect size multiplied by weight) divided by the sum of the weights.
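A minimal sketch of this procedure, using made-up observed effects and within-study variances; `dersimonian_laird` is a hypothetical helper name, and the variance of the summary effect is taken as the reciprocal of the sum of the random-effects weights.

```python
from math import sqrt

def dersimonian_laird(y, v):
    """Random-effects summary via the DerSimonian-Laird estimate of tau^2.

    y: observed effect sizes Yi; v: within-study variances VYi.
    Returns the summary effect M*, its variance VM*, and T^2."""
    k = len(y)
    w = [1.0 / vi for vi in v]                          # fixed-effect weights Wi
    sw = sum(w)
    swy = sum(wi * yi for wi, yi in zip(w, y))
    q = sum(wi * yi * yi for wi, yi in zip(w, y)) - swy ** 2 / sw
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                  # T^2, truncated at zero
    w_star = [1.0 / (vi + tau2) for vi in v]            # Wi* = 1 / (VYi + T^2)
    m_star = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    v_m = 1.0 / sum(w_star)                             # variance of the summary
    return m_star, v_m, tau2

# Illustrative (made-up) observed effects and within-study variances
m_star, v_m, tau2 = dersimonian_laird([0.30, 0.55, 0.75], [0.010, 0.002, 0.005])
print(f"M* = {m_star:.3f}, SE = {sqrt(v_m):.3f}, T^2 = {tau2:.4f}")
```

Note the truncation at zero: when the observed dispersion Q is no larger than its expected value df, the method-of-moments estimate of T² is set to 0 and the analysis reduces to the fixed-effect weighting.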

The variance of the summary effect is estimated as the reciprocal of the sum of the weights,

VM* = 1 / Σ Wi*,

and the estimated standard error of the summary effect is then the square root of the variance,

SEM* = √VM*.

Summary

• Under the random-effects model, the true effects in the studies are assumed to have been sampled from a distribution of true effects.
• The summary effect is our estimate of the mean of all relevant true effects, and the null hypothesis is that the mean of these effects is 0.0 (equivalent to a ratio of 1.0 for ratio measures).
• Since our goal is to estimate the mean of the distribution, we need to take account of two sources of variance. First, there is within-study error in estimating the effect in each study. Second (even if we knew the true mean for each of our studies), there is variation in the true effects across studies. Study weights are assigned with the goal of minimizing both sources of variance.

## The Logic Behind Meta-analysis – Fixed-effect Model

Effect Size (Based on Means)

When the studies report means and standard deviations (from which the standard errors of the means follow), the preferred effect size is usually the raw mean difference, the standardized mean difference, or the response ratio. When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw data.

Consider a study that reports means for two groups (Treated and Control), and suppose we wish to compare the means of these two groups. The population mean difference (effect size) is defined as

Population mean difference = 𝜇1 – 𝜇2

Population standard error of the mean difference (pooled) = Spooled × √(1/n1 + 1/n2)
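The pooled standard error above is easy to compute directly. A minimal sketch (the function name `pooled_se` and the sample values are illustrative, not from the text):

```python
from math import sqrt

def pooled_se(s1, n1, s2, n2):
    """Standard error of a difference in means using the pooled SD.

    Implements Spooled * sqrt(1/n1 + 1/n2)."""
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return s_pooled * sqrt(1.0 / n1 + 1.0 / n2)

# With equal SDs and equal group sizes, the pooled SD is just that common SD:
print(pooled_se(10.0, 50, 10.0, 50))  # 10 * sqrt(2/50) = 2.0
```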

Overview

Most meta-analyses are based on one of two statistical models, the fixed-effect model or the random-effects model. Under the fixed-effect model we assume that there is one true effect size (hence the term fixed effect) which underlies all the studies in the analysis, and that all differences in observed effects are due to sampling error. While we follow the practice of calling this a fixed-effect model, a more descriptive term would be a common-effect model.

By contrast, under the random-effects model we allow that the true effect could vary from study to study. For example, the effect size might be higher (or lower) in studies where the participants are older, or more educated, or healthier than in others, or when a more intensive variant of an intervention is used, and so on. Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be different effect sizes underlying different studies.

Returning to the fixed-effect model: since all studies share the same true effect, it follows that the observed effect size varies from one study to the next only because of the random error inherent in each study. If each study had an infinite sample size the sampling error would be zero and the observed effect for each study would be the same as the true effect. If we were to plot the observed effects rather than the true effects, the observed effects would exactly coincide with the true effects.

In practice, of course, the sample size in each study is not infinite, and so there is sampling error and the effect observed in the study is not the same as the true effect. In Figure 11.2 the true effect for each study is still 0.60 but the observed effect differs from one study to the next.

While the error in any given study is random, we can estimate the sampling distribution of the errors. In Figure 11.3 we have placed a normal curve about the true effect size for each study, with the width of the curve being based on the variance in that study. In Study 1 the sample size was small, the variance large, and the observed effect is likely to fall anywhere in the relatively wide range of 0.20 to 1.00. By contrast, in Study 2 the sample size was relatively large, the variance is small, and the observed effect is likely to fall in the relatively narrow range of 0.40 to 0.80. Note that the width of the normal curve is based on the square root of the variance, or standard error.

Meta-analysis Procedure

In an actual meta-analysis, of course, rather than starting with the population effect and making projections about the observed effects, we work backwards, starting with the observed effects and trying to estimate the population effect. In order to obtain the most precise estimate of the population effect (to minimize the variance) we compute a weighted mean, where the weight assigned to each study is the inverse of that study’s variance. Concretely, the weight assigned to each study in a fixed-effect meta-analysis is

Wi = 1 / VYi,

where VYi is the within-study variance for study i. The weighted mean, M, is then computed as

M = (Σ WiYi) / (Σ Wi),

that is, the sum of the products WiYi (effect size multiplied by weight) divided by the sum of the weights.

The variance of the summary effect is estimated as the reciprocal of the sum of the weights,

VM = 1 / Σ Wi.

Once VM is estimated, the standard deviation of the weighted mean (or, standard error of the weighted mean) is computed as the square root of the variance of the summary effect, SEM = √VM. Now we know the distribution, the point estimate, and the standard error of the weighted mean. Thus, the confidence interval of the summary effect can be computed by the confidence interval Z-procedure.
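The whole fixed-effect procedure fits in a few lines. A minimal sketch, using made-up observed effects and within-study variances; `fixed_effect` is a hypothetical helper name:

```python
from math import sqrt

def fixed_effect(y, v, z=1.96):
    """Fixed-effect summary: inverse-variance weighted mean, SE, and 95% CI.

    y: observed effect sizes Yi; v: within-study variances VYi."""
    w = [1.0 / vi for vi in v]                    # Wi = 1 / VYi
    m = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    v_m = 1.0 / sum(w)                            # VM = 1 / sum(Wi)
    se = sqrt(v_m)                                # SE of the summary effect
    return m, se, (m - z * se, m + z * se)        # Z-procedure interval

# Illustrative (made-up) observed effects and within-study variances
m, se, ci = fixed_effect([0.45, 0.62, 0.58], [0.020, 0.005, 0.010])
print(f"M = {m:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Notice that the most precise study (smallest VYi) dominates the weighted mean, which is exactly the intent of inverse-variance weighting.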

Effect Sizes Measurements

Raw Mean Difference

When the studies report means and standard deviations (continuous variables), the preferred effect size is usually the raw mean difference, the standardized mean difference (SMD), or the response ratio. When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw difference in means, or the raw mean difference. The primary advantage of the raw mean difference is that it is intuitively meaningful, either inherently or because of widespread use. Examples of raw mean difference include systolic blood pressure (mm Hg), serum LDL-C level (mg/dL), body surface area (m2), and so on.

We can estimate the mean difference D from a study that used two independent groups via the inference procedure for two population means (independent samples). Let’s briefly recall that procedure. The sampling distribution of the difference between two sample means has these characteristics: its mean equals 𝜇1 − 𝜇2, and its standard deviation (standard error) is √(𝜎1²/n1 + 𝜎2²/n2). PS: All of this is based on the central limit theorem – if the sample size is large, the mean is approximately normally distributed, regardless of the distribution of the variable under consideration.

Once we know the sample mean difference, D, and the standard deviation of the mean difference (the standard error), then, in light of the central limit theorem, we can compute the variance of D. In addition, knowing each group mean, the standard deviation of each group, and each group size, we can compute the pooled sample standard deviation (Sp) or use the nonpooled method. Either way, we obtain the variance of D, which is used by the meta-analysis procedures (fixed-effect or random-effects model) to compute the weight (Wi = 1 / VYi). And once the standard error is known, the synthesized confidence interval can be computed.
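The steps above can be sketched as follows. The summary data (means, SDs, group sizes) are hypothetical, and `mean_difference` is an illustrative helper name; it uses the pooled method.

```python
from math import sqrt

def mean_difference(m1, s1, n1, m2, s2, n2, z=1.96):
    """Raw mean difference D, its variance VD (pooled SDs), and a 95% CI."""
    d = m1 - m2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    v_d = sp2 * (1.0 / n1 + 1.0 / n2)   # variance of D; meta-analytic weight is 1/v_d
    se = sqrt(v_d)
    return d, v_d, (d - z * se, d + z * se)

# Hypothetical treated vs. control summary data
d, v_d, ci = mean_difference(103.0, 5.5, 50, 100.0, 4.5, 50)
print(f"D = {d:.2f}, VD = {v_d:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The returned VD is precisely the VYi that the fixed-effect and random-effects weighting schemes consume.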

Standardized Mean Difference, d and g

As noted, the raw mean difference is a useful index when the measure is meaningful, either inherently or because of widespread use. By contrast, when the measure is less well known, the use of a raw mean difference has less to recommend it. In any event, the raw mean difference is an option only if all the studies in the meta-analysis use the same scale. If different studies use different instruments to assess the outcome, then the scale of measurement will differ from study to study and it would not be meaningful to combine raw mean differences.

In such cases we can divide the mean difference in each study by that study’s standard deviation to create an index (the standardized mean difference, SMD) that would be comparable across studies. This is the same approach suggested by Cohen in connection with describing the magnitude of effects in statistical power analysis. The standardized mean difference can be considered comparable across studies based on either of two arguments (Hedges and Olkin, 1985). If the outcome measures in all studies are linear transformations of each other, the standardized mean difference can be seen as the mean difference that would have been obtained if all data were transformed to a scale where the within-group standard deviation was equal to 1.0.

The other argument for comparability of standardized mean differences is the fact that the standardized mean difference is a measure of overlap between distributions. In this telling, the standardized mean difference reflects the difference between the distributions in the two groups (and how each represents a distinct cluster of scores) even if they do not measure exactly the same outcome.

Computing d and g from studies that use independent groups

We can estimate the standardized mean difference from studies that used two independent groups as

d = (X̄1 − X̄2) / Swithin,

where Swithin is the pooled standard deviation across groups,

Swithin = √[((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)],

and n1 and n2 are the sample sizes in the two groups, S1 and S2 the standard deviations in the two groups. The reason that we pool the two sample estimates of the standard deviation is that even if we assume that the underlying population standard deviations are the same, it is unlikely that the sample estimates S1 and S2 will be identical. By pooling the two estimates of the standard deviation, we obtain a more accurate estimate of their common value.
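These formulas, together with the standard variance of d and the small-sample correction J = 1 − 3/(4df − 1) that yields Hedges’ g (both standard results from Hedges, 1981, and discussed below), can be sketched as follows. The input numbers are hypothetical.

```python
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference d = (M1 - M2) / Swithin, with its variance."""
    s_within = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_within
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))   # variance of d
    return d, v_d

def hedges_g(d, v_d, n1, n2):
    """Bias correction J = 1 - 3/(4*df - 1); g = J*d and Vg = J^2 * Vd."""
    j = 1.0 - 3.0 / (4 * (n1 + n2 - 2) - 1)
    return j * d, j * j * v_d

# Hypothetical treated vs. control summary data
d, v_d = cohens_d(103.0, 5.5, 50, 100.0, 4.5, 50)
g, v_g = hedges_g(d, v_d, 50, 50)
print(f"d = {d:.3f} (Vd = {v_d:.4f}); g = {g:.3f} (Vg = {v_g:.4f})")
```

Since J is always slightly below 1.0, g is always slightly smaller in absolute value than d, which is exactly the bias removal described below.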

The sample estimate of the standardized mean difference is often called Cohen’s d in research synthesis. Some confusion about the terminology has resulted from the fact that the index 𝛿, originally proposed by Cohen as a population parameter for describing the size of effects for statistical power analysis, is also sometimes called d. The variance of d is given by

Vd = (n1 + n2) / (n1n2) + d² / (2(n1 + n2)).

Again, with the standardized mean difference and its variance known, we can compute the confidence interval of the standardized mean difference. However, it turns out that d has a slight bias, tending to overestimate the absolute value of 𝛿 in small samples. This bias can be removed by a simple correction that yields an unbiased estimate of 𝛿, with the unbiased estimate sometimes called Hedges’ g (Hedges, 1981). To convert from d to Hedges’ g we use a correction factor, which is called J. Hedges (1981) gives the exact formula for J, but in common practice researchers use an approximation,

J = 1 − 3 / (4df − 1),

where df = n1 + n2 − 2, so that g = J × d and Vg = J² × Vd.

Summary

• Under the fixed-effect model all studies in the analysis share a common true effect.
• The summary effect is our estimate of this common effect size, and the null hypothesis is that this common effect is zero (for a difference) or one (for a ratio).
• All observed dispersion reflects sampling error, and study weights are assigned with the goal of minimizing this within-study error.

Converting Among Effect Sizes

Although it would be ideal for the studies under investigation to share one widely used outcome measure, it is not uncommon that the outcome measures among individual studies are different. When we convert between different measures we make certain assumptions about the nature of the underlying traits or effects. Even if these assumptions do not hold exactly, the decision to use these conversions is often better than the alternative, which is to simply omit the studies that happened to use an alternate metric. This would involve loss of information, and possibly the systematic loss of information, resulting in a biased sample of studies. A sensitivity analysis to compare the meta-analysis results with and without the converted studies would be important. Figure 7.1 outlines the mechanism for incorporating multiple kinds of data in the same meta-analysis. First, each study is used to compute an effect size and variance in its native index: the log odds ratio for binary data, d for continuous data, and r for correlational data. Then, we convert all of these indices to a common index, which would be either the log odds ratio, d, or r. If the final index is d, we can move from there to Hedges’ g. This common index and its variance are then used in the analysis.

We can convert from a log odds ratio to the standardized mean difference d using

d = LogOddsRatio × (√3 / 𝜋),

where 𝜋 is the mathematical constant. The variance of d would then be

Vd = VLogOddsRatio × (3 / 𝜋²),

where VLogOddsRatio is the variance of the log odds ratio. This method was originally proposed by Hasselblad and Hedges (1995) but variations have been proposed. It assumes that an underlying continuous trait exists and has a logistic distribution (which is similar to a normal distribution) in each group. In practice, it will be difficult to test this assumption.
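The conversion is a one-liner in code. A minimal sketch, with a hypothetical study reporting an odds ratio of 2.0 and a variance of 0.16 on the log scale:

```python
from math import sqrt, pi, log

def log_odds_to_d(log_odds_ratio, v_log_odds_ratio):
    """Hasselblad-Hedges conversion from a log odds ratio to d."""
    d = log_odds_ratio * sqrt(3.0) / pi        # d = LogOddsRatio * sqrt(3)/pi
    v_d = v_log_odds_ratio * 3.0 / pi**2       # Vd = VLogOddsRatio * 3/pi^2
    return d, v_d

# Hypothetical study: odds ratio 2.0 with variance 0.16 on the log scale
d, v_d = log_odds_to_d(log(2.0), 0.16)
print(f"d = {d:.3f}, Vd = {v_d:.4f}")
```

The converted d and Vd can then feed the same fixed-effect or random-effects weighting as any natively continuous study.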

## Conceiving the Research Question and Developing the Study Plan

The research question is the uncertainty that the investigator wants to resolve by performing his/her study. There is no shortage of good research questions, and even as we succeed in answering some questions, we remain surrounded by others. Clinical trials, for example, established that treatments that block the synthesis of estradiol (aromatase inhibitors) reduce the risk of breast cancer in women who have had early stage cancer. But this led to new questions: How long should treatment be continued; does this treatment prevent breast cancer in patients with BRCA 1 and BRCA 2 mutations; and what is the best way to prevent the osteoporosis that is an adverse effect of these drugs? Beyond that are primary prevention questions: Are these treatments effective and safe for preventing breast cancer in healthy women?

Origins of A Research Question

For an established investigator the best research questions usually emerge from the findings and problems she has observed in her own prior studies and in those of other workers in the field. A new investigator has not yet developed this base of experience. Although a fresh perspective is sometimes useful by allowing a creative person to conceive new approaches to old problems, lack of experience is largely an impediment.

A good way to begin is to clarify the difference between a research question and a research interest. Consider this research question:

• Does participation in group counseling sessions reduce the likelihood of domestic violence among women who have recently immigrated from Central America?

This might be asked by someone whose research interest involves the efficacy of group counseling, or the prevention of domestic violence, or improving health in recent immigrants. The distinction between research questions and research interests matters because it may turn out that the specific research question cannot be transformed into a viable study plan, but the investigator can still address her research interest by asking a different question. Of course, it’s impossible to formulate a research question if you are not even sure about your research interest (beyond knowing that you’re supposed to have one). If you find yourself in this boat, you’re not alone: Many new investigators have not yet discovered a topic that interests them and is susceptible to a study plan they can design. You can begin by considering what sorts of research studies have piqued your interest when you’ve seen them in a journal. Or perhaps you were bothered by a specific patient whose treatment seemed inadequate or inappropriate: What could have been done differently that might have improved her outcome? Or one of your attending physicians told you that hypokalemia always caused profound thirst, and another said the opposite, just as dogmatically.

Mastering the Literature

It is important to master the published literature in an area of study: Scholarship is a necessary precursor to good research. A new investigator should conduct a thorough search of published literature in the areas pertinent to the research question and critically read important original papers. Carrying out a systematic review is a great next step for developing and establishing expertise in a research area, and the underlying literature review can serve as background for grant proposals and research reports. Recent advances may be known to active investigators in a particular field long before they are published. Thus, mastery of a subject entails participating in meetings and building relationships with experts in the field.

Being Alert to New Ideas and Techniques

In addition to the medical literature as a source of ideas for research questions, it is helpful to attend conferences in which new work is presented. At least as important as the formal presentations are the opportunities for informal conversations with other scientists at posters and during the breaks. A new investigator who overcomes her shyness and engages a speaker at the coffee break may find the experience richly rewarding, and occasionally she will have a new senior colleague. Even better, for a speaker known in advance to be especially relevant, it may be worthwhile to look up her recent publications and contact her in advance to arrange a meeting during the conference.

A skeptical attitude about prevailing beliefs can stimulate good research questions. For example, it was widely believed that lacerations which extend through the dermis required sutures to assure rapid healing and a satisfactory cosmetic outcome. However, Quinn et al. noted personal experience and case series evidence that wounds of moderate size repair themselves regardless of whether wound edges are approximated. They carried out a randomized trial in which all patients with hand lacerations less than 2 cm in length received tap water irrigation and a 48-hour antibiotic dressing. One group was randomly assigned to have their wounds sutured, and the other group did not receive sutures. The suture group had a more painful and time-consuming treatment in the emergency room, but blinded assessment revealed similar time to healing and similar cosmetic results. This has now become a standard approach used in clinical practice.

The application of new technologies often generates new insights and questions about familiar clinical problems, which in turn can generate new paradigms. Advances in imaging and in molecular and genetic technologies, for example, have spawned translational research studies that have led to new treatments and tests that have changed clinical medicine. Similarly, taking a new concept, technology, or finding from one field and applying it to a problem in a different field can lead to good research questions. Low bone density, for example, is a risk factor for fractures. Investigators applied bone density measurement to other outcomes and found that women with low bone density have higher rates of cognitive decline, stimulating research into factors, such as low endogenous levels of estrogen, that could lead to loss of both bone and memory.

Keeping the Imagination Roaming

Careful observation of patients has led to many descriptive studies and is a fruitful source of research questions. Teaching is also an excellent source of inspiration; ideas for studies often occur while preparing presentations or during discussions with inquisitive students. Because there is usually not enough time to develop these ideas on the spot, it is useful to keep them in a computer file or notebook for future reference.

There is a major role for creativity in the process of conceiving research questions, imagining new methods to address old questions, and playing with ideas. Some creative ideas come to mind during informal conversations with colleagues over lunch; others arise from discussing recent research or your own ideas in small groups. Many inspirations are solo affairs that strike while preparing a lecture, showering, perusing the Internet, or just sitting and thinking. Fear of criticism or seeming unusual can prematurely quash new ideas. The trick is to put an unresolved problem clearly in view and allow the mind to run freely around it. There is also a need for tenacity, returning to a troublesome problem repeatedly until a resolution is reached.

Choosing and Working with a Mentor

Nothing substitutes for experience in guiding the many judgments involved in conceiving a research question and fleshing out a study plan. Therefore, an essential strategy for a new investigator is to apprentice herself to an experienced mentor who has the time and interest to work with her regularly.

A good mentor will be available for regular meetings and informal discussions, encourage creative ideas, provide wisdom that comes from experience, help ensure protected time for research, open doors to networking and funding opportunities, encourage the development of independent work, and put the new investigator’s name first on grants and publications whenever appropriate. Sometimes it is desirable to have more than one mentor, representing different disciplines. Good relationships of this sort can also lead to tangible resources that are needed – office space, access to clinical populations, data sets and specimen banks, specialized laboratories, financial resources, and a research team.

Characteristics of a Good Research Question

• Feasible

It is best to know the practical limits and problems of studying a question early on, before wasting much time and effort along unworkable lines.

Number of subjects. Many studies do not achieve their intended purposes because they cannot enroll enough subjects. A preliminary calculation of the sample size requirements of the study early on can be quite helpful, together with an estimate of the number of subjects likely to be available for the study, the number who would be excluded or refuse to participate, and the number who would be lost to follow-up. Even careful planning often produces estimates that are overly optimistic, and the investigator should not assume that there will be enough eligible and willing subjects; it is sometimes necessary to carry out a pilot survey or chart review to be sure. If the number of subjects appears insufficient, the investigator can consider several strategies: expanding the inclusion criteria, eliminating unnecessary exclusion criteria, lengthening the time frame for enrolling subjects, acquiring additional sources of subjects, developing more precise measurement approaches, inviting colleagues to join in a multicenter study, and using a different study design.
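The preliminary sample size calculation described above can be sketched with the standard normal-approximation formula for comparing two proportions. The proportions, power, and dropout rate below are illustrative assumptions chosen for this sketch, not values from the text:

```python
import math
from statistics import NormalDist

def per_group_n(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison
    of two proportions (normal-approximation formula)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z(power)            # ~0.84 for 80% power
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical planning numbers: detect an increase from 40% to 55%
n = per_group_n(0.40, 0.55)      # 173 per group
# Inflate enrollment for an anticipated 20% loss to follow-up
n_enroll = math.ceil(n / 0.80)   # 217 per group
```

Running the calculation early, with pessimistic assumptions about refusals and dropout, gives a realistic sense of whether enough subjects will be available before much effort is invested.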

Technical expertise. The investigators must have skills, equipment, and experience needed for designing the study, recruiting the subjects, measuring the variables, and managing and analyzing the data. Consultants can help to shore up technical aspects that are unfamiliar to the investigators, but for major areas of the study it is better to have an experienced colleague steadily involved as a coinvestigator; for example, it is wise to include a statistician as a member of the research team from the beginning of the planning process. It is best to use familiar and established approaches, because the process of developing new methods and skills is time-consuming and uncertain. When a new approach is needed, such as measurement of a new biomarker, expertise in how to accomplish the innovation should be sought.

Cost in time and money. It is important to estimate the costs of each component of the project, bearing in mind that the time and money needed will generally exceed the amounts projected at the outset. If the projected costs exceed the available funds, the only options are to consider a less expensive design or to develop additional sources of funding. Early recognition of a study that is too expensive or time-consuming can lead to modification or abandonment of the plan before expending a great deal of effort.

Scope. Problems often arise when an investigator attempts to accomplish too much, making many measurements at repeated contacts with a large group of subjects in an effort to answer too many research questions. The solution is to narrow the scope of the study and focus only on the most important goals. Many scientists find it difficult to give up the opportunity to answer interesting side questions, but the reward may be a better answer to the main question at hand.

Fundability. Few investigators have the personal or institutional resources to fund their own research projects, particularly if subjects need to be enrolled and followed, or expensive measurements must be made. The most elegantly designed research proposal will not be feasible if no one will pay for it.

• Interesting

An investigator may have many motivations for pursuing a particular research question: because it will provide financial support, because it is a logical or important next step in building a career, or because getting at the truth of the matter is interesting. We like this last reason; it is one that grows as it is exercised and that provides the intensity of effort needed for overcoming the many hurdles and frustrations of the research process. However, it is wise to confirm that you are not the only one who finds a question interesting. Speak with mentors, outside experts, and representatives of potential funders such as NIH project officers before devoting substantial energy to developing a research plan or grant proposal that peers and funding agencies may consider dull.

• Novel

Good clinical research contributes new information. A study that merely reiterates what is already established is not worth the effort and cost and is unlikely to receive funding. The novelty of a proposed study can be determined by thoroughly reviewing the literature, consulting with experts who are familiar with unpublished ongoing research, and searching for abstracts of projects in your area of interest that have been funded using the NIH Research Portfolio Online Reporting Tools (RePORT) website. Reviews of studies submitted to NIH give considerable weight to whether a proposed study is innovative such that a successful result could shift paradigms of research or clinical practice through the use of new concepts, methods, or interventions. Although novelty is an important criterion, a research question need not be totally original – it can be worthwhile to ask whether a previous observation can be replicated, whether the findings in one population also apply to others, or whether a new measurement method can clarify the relationship between known risk factors and a disease. A confirmatory study is particularly useful if it avoids the weaknesses of previous studies or if the result to be confirmed was unexpected.

• Ethical

A good research question must be ethical. If the study poses unacceptable physical risks or invasion of privacy, the investigator must seek other ways to answer the question. If there is uncertainty about whether the study is ethical, it is helpful to discuss it at an early stage with a representative of the institutional review board (IRB).

• Relevant

A good way to decide about relevance is to imagine the various outcomes that are likely to occur and consider how each possibility might advance scientific knowledge, influence practice guidelines and health policy, or guide further research. NIH reviewers emphasize the significance of a proposed study: the importance of the problem, how the project will improve scientific knowledge, and how the result will change concepts, methods, or clinical services.

Developing the Research Question and Study Plan

It helps a great deal to write down the research question and a brief (one-page) outline of the study plan at an early stage (detail here http://www.tomhsiung.com/wordpress/2017/05/outline-of-a-study/). This requires some self-discipline, but it forces the investigator to clarify her ideas about the plan and to discover specific problems that need attention. The outline also provides a basis for specific suggestions from colleagues.

## Outline of a Study

May 10, 2017

This is the one-page study plan of a project carried out by Valerie Flaherman, MD, MPH, begun while she was a general pediatrics fellow at UCSF. Most beginning investigators find observational studies easier to pull off, but in this case a randomized clinical trial of modest size and scope was feasible, was the only design that could adequately address the research question, and was ultimately successful.

Title: Effect of Early Limited Formula Use on Breastfeeding

Research question:

Among term newborns who have lost >=5% of their birth weight before 36 hours of age, does feeding 10 cc of formula by syringe after each breastfeeding before the onset of mature milk production increase the likelihood of subsequent successful breastfeeding?

Significance:

1. Breast milk volume is low until mature milk production begins 2-5 days after birth.
2. Some mothers become worried if the onset of mature milk production is late and their baby loses a lot of weight, leading them to abandon breastfeeding within the first week. A strategy that increased the proportion of mothers who succeed in breastfeeding would have many health and psychosocial benefits to mother and child.
3. Observational studies have found that formula feeding in the first few days after birth is associated with decreased breastfeeding duration. Although this could be due to confounding by indication, the finding has led to WHO and CDC guidelines aimed at reducing the use of formula during the birth hospitalization.
4. However, a small amount of formula combined with breastfeeding and counseling might make the early breastfeeding experience more positive and increase the likelihood of success. A clinical trial is needed to assess possible benefits and harms of this strategy.

Study design:

Unblinded randomized controlled trial with blinded outcome ascertainment

Subjects:

• Entry criteria: Healthy term newborns 24-48 hours old who have lost >=5% of their birth weight in the first 36 hours after birth.
• Sampling design: Consecutive sample of consenting patients in two Northern California academic medical centers

Predictor variable, randomly assigned but not blinded:

• Control: Parents are taught infant soothing techniques.
• Intervention: Parents are taught to syringe-feed 10 cc of formula after each breastfeeding until the onset of mature milk production.

Outcome variables, blindly ascertained:

1. Any formula feeding at 1 week and 1, 2, and 3 months
2. Any breastfeeding at 1 week and 1, 2, and 3 months
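The random assignment in a two-arm trial like this one is often implemented with a blocked randomization list, which keeps group sizes balanced throughout enrollment. A minimal sketch, in which the block size, seed, and arm labels are illustrative assumptions rather than details from the actual study:

```python
import random

def blocked_randomization(n_blocks, block_size=4,
                          arms=("control", "intervention"), seed=None):
    """Generate a blocked randomization list for a two-arm trial.
    Each block contains an equal number of assignments to each arm,
    so group sizes never differ by more than half a block."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    schedule = []
    for _ in range(n_blocks):
        block = [arm for arm in arms for _ in range(per_arm)]
        rng.shuffle(block)   # randomize order within the block
        schedule.extend(block)
    return schedule

# Example: 3 blocks of 4 gives assignments for 12 subjects
schedule = blocked_randomization(n_blocks=3, block_size=4, seed=42)
```

In practice the list would be generated before enrollment and concealed from the enrolling clinicians, so that the next assignment cannot be predicted.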