
Pharmacokinetics – Distribution Series II – Rate of Drug Distribution

13-Nov-17

Figure 4.1 shows the plasma concentration and the typical tissue concentration profile after the administration of a drug by intravenous injection. It can be seen that during the distribution phase, the tissue concentration increases as the drug distributes to the tissue. Eventually, a type of equilibrium is reached, and following this, in the postdistribution phase, the tissue concentration falls in parallel with the plasma concentration.

Drug distribution is a two-stage process that consists of:

1. Delivery of the drug to the tissue by the blood

2. Diffusion or uptake of drug from the blood to the tissue

The overall rate of distribution is controlled by the slowest of these steps. The delivery of drug to the tissue is controlled by the specific blood flow to a given tissue. This is expressed as tissue perfusion, the volume of blood delivered per unit time (mL/min) per unit of tissue (g). Once at the tissue site, uptake or distribution from the blood is driven largely by the passive diffusion of drug across the epithelial membrane of the capillaries. Because most capillary membranes are very loose, drugs can usually diffuse from the plasma very easily. Consequently, in most cases, drug distribution is perfusion controlled. The rate of drug distribution will vary from one tissue to another, and generally, drugs will distribute fastest to the tissues that have the highest perfusion rates.

Perfusion-Controlled Drug Distribution

Drug is presented to the tissues in the arterial blood, and any uptake of drug by the tissue will result in a lower concentration of drug leaving the tissue in the venous blood. The amount of drug delivered to the tissue per unit time or rate of presentation of a drug to a tissue is given by

rate of presentation = Q * Ca

where Ca is the drug concentration in the arterial blood and Q is the blood flow to the tissue

rate drug leaves the tissue = Q * Cv

where Cv is the drug concentration in the venous blood

so, rate of uptake = Q * (Ca – Cv) (remember the O2ER in oxygen delivery?)

When drug uptake is perfusion controlled, the tissue presents no barrier to drug uptake, and the initial rate of uptake will equal the rate of presentation:

initial rate of uptake = Q * Ca

Thus, it is a first-order process. The value of Ca will change continuously as distribution proceeds throughout the body and as drug is eliminated. When the distribution phase in a tissue is complete, the concentration of drug in the tissue will be in equilibrium with the concentration leaving the tissue (venous blood). The ratio of these concentrations is expressed using the tissue blood partition coefficient (Kp):

Kp = Ct / Cv

where Ct is the tissue concentration. The value of Kp will depend on the binding and the relative affinity of a drug for the blood and tissues. Tissue binding will promote a large value of Kp, whereas extensive binding to the plasma proteins will promote a small Kp.

Once the initial distribution phase is complete, the amount of drug in the tissue (At) at any time is

At = Ct * Vt = Kp * Cv * Vt

Distribution is a first-order process, and the rate of distribution may be expressed using the first-order rate constant for distribution (Kd). The physiological determinants of the rate constant for distribution are most easily identified by considering the redistribution process, which is governed by the same physiological factors and has the same rate constant as distribution.

If the drug concentration in arterial blood suddenly became zero, the

rate of redistribution = Kd * At = Kd * (Kp * Cv * Vt) = |Q * (Ca – Cv)| (where Ca = 0) = |Q * –Cv| = Q * Cv

Thus, 

Kd = Q / (Vt * Kp), when Ca suddenly becomes zero.

The first-order rate constant for distribution is equal to tissue perfusion divided by the tissue:blood partition coefficient, and the corresponding distribution half-life is computed by dividing ln(2) (0.693) by Kd.
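As a quick numerical sketch of these relationships (the blood flow, tissue volume, and Kp values below are arbitrary illustrative numbers, not values from the text), the rate constant and half-life of perfusion-controlled distribution follow directly from Q, Vt, and Kp:

```python
import math

def distribution_rate_constant(Q, Vt, Kp):
    """First-order rate constant for distribution: Kd = Q / (Vt * Kp).

    Q  : blood flow to the tissue (mL/min)
    Vt : tissue volume (mL)
    Kp : tissue:blood partition coefficient (unitless)
    """
    return Q / (Vt * Kp)

# Hypothetical example values (not taken from the text):
Q, Vt, Kp = 300.0, 1500.0, 4.0
Kd = distribution_rate_constant(Q, Vt, Kp)
t_half = math.log(2) / Kd            # distribution half-life (min)
print(f"Kd = {Kd:.4f} per min, distribution half-life = {t_half:.1f} min")
```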

Summary

The time it takes for distribution to occur depends on tissue perfusion. Generally, drugs distribute to well-perfused tissues such as the lungs and major organs faster than they do to poorly perfused tissues such as resting muscle and skin.

The duration of the distribution phase also depends on Kp. If a drug has a high Kp value, it may take a long time to achieve equilibrium even if the tissue perfusion is relatively high. If, on the other hand, a drug has a high Kp value in a tissue with low perfusion, it will require an extended period of drug exposure to reach equilibrium.

The amount of drug in tissue at equilibrium depends on Kp and on the size of the tissue. A drug may concentrate in a tissue (high Kp), but if the tissue is physically small, the total amount of drug present in the tissue will be low. The distribution of a drug to such a tissue may not have a strong impact on the plasma concentration of the drug.

Redistribution of a drug from the tissues back to the blood is controlled by exactly the same principles. Thus, redistribution takes less time when the Kp value is small and the perfusion is high, and takes a long time when the Kp is high and the perfusion is low.

Diffusion-Controlled Drug Distribution

The epithelial junctions in some tissues, such as the brain, placenta, and testes, are very tightly knit, and the diffusion of more polar and/or large drugs may proceed slowly. As a result, drug distribution in these tissues may be diffusion controlled. In this case, drug distribution will proceed more slowly for polar drugs than for more lipophilic drugs. It must be pointed out that not all drug distribution to these sites is diffusion controlled. For example, small lipophilic drugs such as the intravenous anesthetics can easily pass membranes by the transcellular route and display perfusion-controlled distribution to the brain.

Diffusion-controlled distribution may be expressed by Fick's law

rate of uptake = Pm * SAm * (Cpu – Ctu)

where Pm is the permeability of the drug through the membrane (cm/h), SAm the surface area of the membrane (cm2), Cpu the unbound drug concentration in the plasma (mg/mL), and Ctu the unbound concentration in the tissue (mg/mL).

Initially, the drug concentration in the tissue is very low, Cpu >> Ctu, so the equation may be written

rate of uptake = Pm * SAm * Cpu

from which it can be seen that, under these circumstances, the rate of diffusion approximates a first-order process.
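A minimal sketch of this relationship (the permeability, surface area, and concentrations below are arbitrary illustrative values): because Ctu is essentially zero at early times, the computed rate scales directly with Cpu, which is the first-order behavior described above.

```python
def fick_uptake_rate(Pm, SAm, Cpu, Ctu=0.0):
    """Diffusion-controlled rate of uptake: Pm * SAm * (Cpu - Ctu).

    Pm  : membrane permeability (cm/h)
    SAm : membrane surface area (cm^2)
    Cpu : unbound drug concentration in plasma (mg/mL)
    Ctu : unbound drug concentration in tissue (mg/mL); about 0 initially
    """
    return Pm * SAm * (Cpu - Ctu)

# Hypothetical values: with Ctu ~ 0, doubling Cpu doubles the rate.
print(fick_uptake_rate(Pm=0.02, SAm=5000.0, Cpu=0.010))   # mg/h
print(fick_uptake_rate(Pm=0.02, SAm=5000.0, Cpu=0.020))   # mg/h
```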

Pharmacokinetics – Distribution Series

11-Nov-17

As a result of either direct systemic administration or absorption from an extravascular route, drug reaches the systemic circulation, where it very rapidly distributes throughout the entire volume of plasma water and is delivered to tissues around the body. Two aspects of drug distribution need to be considered: how rapidly, and to what extent, the drug in the plasma gets taken up by the tissues. A lot of information on the rate of drug distribution can be obtained by observing the pattern of the changes in the plasma concentrations in the early period following drug administration. Information about the extent of drug distribution can be obtained by considering the value of the plasma concentration once distribution is complete. Thus, the plasma concentration constitutes a "window" for obtaining information on the distribution of the bulk of the drug in the body and how it changes over time.

Extent of Drug Distribution

A drug must reach its site of action to produce an effect. Generally, this involves only a very small amount of the overall drug in the body, and access to the site of action is generally a problem only if the site is located in a specialized area or space. The second important aspect of the extent of drug distribution is the relative distribution of a drug between plasma and the rest of the body. This affects the plasma concentration of the drug and is important because: 1) as discussed above, the plasma concentration is the "window" through which we are able to "see" the drug in the body, and it is important to know how a measured plasma concentration is related to the total amount of drug in the body; 2) drug is delivered to the organs of elimination via the blood. If a drug distributes extensively from the plasma to the tissues, the drug in the plasma will constitute only a small fraction of the drug in the body. Little drug will be delivered to the organs of elimination, and this will hamper elimination. Conversely, if a drug is very limited in its ability to distribute beyond the plasma, a greater fraction of the drug in the body will be physically located in the plasma. The organs of elimination will be well supplied with drug, and this will enhance the elimination processes.

Drug distribution to the tissues is driven primarily by the passive diffusion of free, unbound drug along its concentration gradient. Consider the administration of a single intravenous dose of a drug. In the early period after administration, the concentration of drug in the plasma is much higher than that in the tissues, and there is a net movement of drug from the plasma to the tissues; this period is known as the distribution phase. Eventually, a type of equilibrium is established between the tissues and plasma, at which point the ratio of the tissue to plasma concentration remains constant. At this time the distribution phase is complete and the tissue and plasma concentrations rise and fall in parallel; this period is known as the postdistribution phase. It should be noted that after a single dose, true equilibrium between the tissues and the plasma is not achieved in the postdistribution phase because the plasma concentration falls continuously as drug is eliminated from the body. This breaks the equilibrium between the two and results in the redistribution of drug from the tissues to the plasma. Uptake and efflux transporters in certain tissues may also be involved in the distribution process and may enhance or limit a drug's distribution to specific tissues.

Physiologic Volumes

Three important physiological volumes – plasma water, extracellular fluid, and total body water – are shown in Figure 4.2. In the systemic circulation, drugs distribute throughout the volume of plasma water (about 3 L). Whether a drug distributes beyond this, including to the cellular elements of the blood, depends on the physicochemical properties of the drug and the permeability characteristics of individual membranes.

The membranes of the capillary epithelial cells are generally very loose in nature and permit the paracellular passage of even polar and/or large drug molecules. Thus, most drugs are able to distribute throughout the volume of extracellular fluid, a volume of about 15 L. However, the capillary membranes of certain tissues, notably delicate tissues such as the central nervous system, the placenta, and the testes, have much more tightly knit membranes, which may limit the access of certain drugs, particularly large and/or polar drugs.

Once in the extracellular fluid, drugs are exposed to the individual cells of tissues. The ability of drugs to penetrate the membrane of these cells is dependent on a drug's physicochemical properties. Polar drugs and large molecular mass drugs will be unable to pass cell membranes by passive diffusion. However, polar drugs may enter cells if they are substrates for specialized uptake transporters. On the other hand, efflux transporters will restrict the distribution of their substrates. Small lipophilic drugs that can easily penetrate cell membranes can potentially distribute throughout the total body water, which is around 40 L.

In summary, drugs are able to pass through most of the capillary membranes in the body and distribute into a volume approximately equal to that of the extracellular fluid (about 15 L). The ability of a drug to distribute beyond this depends primarily on its physicochemical characteristics. Small, lipophilic drug molecules should penetrate biological membranes with ease and distribute throughout the total body water (about 40 L). A drug's distribution to specific tissues may be enhanced by uptake transporters. Conversely, efflux transporters will restrict the tissue distribution of their substrates. Total body water, about 40 L, represents the maximum volume into which a drug can distribute.

Tissue Binding and Plasma Protein Binding

Given that drug distribution is driven primarily by passive diffusion, it would be reasonable to assume that once distribution has occurred, the concentration of drug would be the same throughout its distribution volume. This is rarely the case because of tissue and plasma protein binding. Drugs frequently bind in a reversible manner to sites on proteins and other macromolecules in the plasma and tissues. At this point it is important to appreciate that bound drug cannot participate in the concentration gradient that drives the distribution process. The bound drug can be considered to be sequestered or hidden in the tissue or plasma. Binding has a very important influence on a drug's distribution pattern. Consider a drug that binds extensively (90%) to the plasma proteins but does not bind to tissue macromolecules. In the plasma, 90% of the drug is bound and only 10% is free and able to diffuse to the tissues. At equilibrium, the unbound concentrations in the plasma and tissue will be the same, but the total concentration of drug in the plasma will be much higher than that in the tissues.

Plasma protein binding has the effect of limiting distribution and concentrating drug in the plasma. On the other hand, consider a drug that binds extensively to macromolecules in the tissues but does not bind to the plasma proteins. Assume that overall 90% of the drug in the tissue is bound and only 10% is free. As the distribution process occurs, a large fraction of the drug in the tissues will bind and be removed from participation in the diffusion gradient. As a result, more and more drug will distribute to the tissues. When distribution is complete, the unbound concentrations in the plasma and tissues will be the same, but the total (bound plus free) average tissue concentration will be much larger than the plasma concentration. Tissue binding essentially draws drug from the plasma and concentrates it in the tissues. Drugs often bind to both the plasma proteins and tissue macromolecules. In this case the final distribution pattern will be determined by which is the dominant process.

Assessment of the Extent of Drug Distribution

Once distribution has gone to completion, the ratio of the total tissue concentration to the total plasma concentration remains constant. The actual tissue concentration (and the ratio) will vary from tissue to tissue, depending on the relative effects of tissue and plasma protein binding. It is not possible to measure individual tissue concentrations, and it is convenient to consider an overall average tissue concentration (Ct). The ratio of Ct to Cp will vary from drug to drug.

It is important to find a way to express a drug's distribution characteristics using a number or distribution parameter that can easily be estimated clinically. The ratio discussed above (Ct/Cp) expresses distribution but cannot be measured easily. Instead, we use the ratio of the amount of drug in the body to the plasma concentration at the same time to express a drug's distribution; this ratio is the apparent volume of distribution (Vd).

It is important to appreciate that the (apparent) volume of distribution is simply a ratio that has units of volume. It is not a physiological volume and, despite its name, it is not the volume into which a drug distributes. The fact that drug A has a Vd value of 20 L does not mean that it distributes into a physical volume of 20 L (a volume greater than that of the extracellular fluid and less than that of the total body water).

The value of a drug's volume of distribution can be used to estimate the fraction of the drug in the body that is physically present in either the plasma or the tissues. The drug in the body (Ab) may be partitioned into drug in the plasma (Ap) and drug outside the plasma or in the tissues (At):

Ab = Ap + At

the fraction of the drug in the plasma,

fraction in plasma = Ap / Ab

After some algebra, we get

fraction in plasma = Vp / Vd

In a standard 70-kg adult male, Vp = 3 L:

fraction in plasma = 3 / Vd

The fraction of the drug in the body located in the tissues:

fraction in tissue = 1 – fraction in plasma = 1 – 3 / Vd

With these formulas we can estimate the fraction of drug in the plasma and in the tissues, respectively.
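A minimal sketch of these two formulas (the Vd values below are arbitrary examples, not drug-specific data); it simply evaluates 3/Vd and 1 – 3/Vd for a range of volumes of distribution:

```python
VP = 3.0   # plasma volume (L) in a standard 70-kg adult

def fraction_in_plasma(Vd, Vp=VP):
    """Fraction of the drug in the body located in the plasma: Vp / Vd."""
    return Vp / Vd

def fraction_in_tissue(Vd, Vp=VP):
    """Fraction of the drug in the body located outside the plasma: 1 - Vp / Vd."""
    return 1.0 - Vp / Vd

# Hypothetical volumes of distribution (L):
for Vd in (3.0, 15.0, 42.0, 700.0):
    print(f"Vd = {Vd:6.1f} L -> {fraction_in_plasma(Vd):6.1%} in plasma, "
          f"{fraction_in_tissue(Vd):6.1%} in tissues")
```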

Drug in the body is located in either the plasma or the tissues. The amount of drug in either of these spaces is the product of the concentration of drug and the volume of the space. And because Ab = Ap + At, we get

Cp * Vd = Cp * Vp + Ct * Vt

where Cp is the plasma concentration of the drug, Vd the volume of distribution, Vp the volume of plasma water, Ct the average tissue concentration of the drug, and Vt the overall volume of the tissues into which the drug distributes.

Because the unbound (free) drug concentration equals the total drug concentration multiplied by the fraction unbound, and the unbound drug concentrations in plasma and tissues (extracellular space) must be the same once distribution equilibrium is reached, we get

Cp * fu = Ct * fut

After some algebra, we have

Vd = Vp + Vt * fu / fut

where fu is the fraction of unbound drug in plasma and fut is the fraction of unbound drug in tissues. This final equation shows that a drug's volume of distribution is dependent on both the volume into which a drug distributes and on tissue and plasma protein binding. It also shows that increased tissue binding (fut gets smaller) or decreased plasma protein binding (fu gets larger) will result in an increase in the volume of distribution. Also, if a drug binds to neither the plasma proteins (fu = 1) nor the tissues (fut = 1), its volume of distribution will be equal to that of the volume into which the drug distributes (physiologic volume).
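A minimal sketch of this last equation (the plasma and tissue volumes and the binding fractions below are illustrative assumptions, roughly corresponding to a 3 L plasma volume and a 39 L tissue water volume): it shows how plasma protein binding shrinks Vd toward Vp, whereas tissue binding inflates Vd well beyond total body water.

```python
def volume_of_distribution(fu, fut, Vp=3.0, Vt=39.0):
    """Vd = Vp + Vt * fu / fut (all volumes in liters)."""
    return Vp + Vt * fu / fut

# No binding in plasma or tissue: Vd equals the physiologic volume (42 L here).
print(volume_of_distribution(fu=1.0, fut=1.0))    # 42.0
# Extensive plasma protein binding only: Vd shrinks toward Vp.
print(volume_of_distribution(fu=0.1, fut=1.0))    # 6.9
# Extensive tissue binding only: Vd grows far beyond total body water.
print(volume_of_distribution(fu=1.0, fut=0.1))    # 393.0
```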

Summary

  • Vd is a ratio that reflects a drug's relative distribution between the plasma and the rest of the body.
  • It is dependent on the volume into which a drug distributes and a drug's binding characteristics.
  • It is a constant for a drug under normal conditions.
  • Conditions that alter body volume may affect its value.
  • Altered tissue and/or protein binding may alter its value.
  • It provides information about a drug's distribution pattern. Large values indicate extensive distribution of a drug to the tissues.
  • It can be used to calculate the amount of drug in the body if a drug's plasma concentration is known.

Plasma Protein Binding

A very large number of therapeutic drugs bind to certain sites on the proteins in plasma to form drug-protein complexes. The binding process occurs very rapidly, it is completely reversible, and equilibrium is quickly established between the bound and unbound forms of a drug. If the unbound or free drug concentration falls due to distribution or drug elimination, bound drug dissociates rapidly to restore equilibrium. Clinically, although the total drug concentration is measured routinely, pharmacological and toxicological activity is thought to reside with the free unbound drug (Cpu). It is only this component of the drug that is thought to be able to diffuse across membranes to the drug's site of action and to interact with the receptor. Binding is usually expressed using the parameter fraction unbound (fu), and the unbound pharmacologically active component can be calculated:

Cpu = Cp * fu

The three primary plasma proteins that bind drugs are albumin, 𝛼1-acid glycoprotein (AAG), and the lipoproteins. AAG is present in lower concentration than albumin and binds primarily neutral and basic drugs. It is referred to as an acute-phase reactant protein because its concentration increases in response to a variety of unrelated stressful conditions, such as cancer, inflammation, and acute myocardial infarction. Given that the unbound concentration is the clinically important component and that it is the total concentration that is routinely measured, it is important to know how and when the unbound fraction may change for a drug.

The binding of a drug to a plasma protein can be regarded as a drug and "receptor" interaction (occupation), so the pharmacodynamic Emax model can be used to describe this interaction mathematically. After some algebraic manipulation, we get

 

fu = (Kd + Cpu) / (Kd + Cpu + PT)

where PT is the serum concentration of the plasma binding protein, Kd the equilibrium dissociation constant, and Cpu the plasma concentration of unbound (free) drug.

At low concentrations, binding increases in direct proportion to an increase in the free drug (fu remains constant as Cpu increases, where Cpu < Kd). As the free drug concentration increases further, some saturation of the proteins occurs, and proportionally less drug can bind (fu will increase as Cpu increases further). Eventually, at high drug concentrations, all the binding sites on the protein are taken and binding cannot increase further.
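A minimal sketch of the binding expression above (the Kd and PT values are arbitrary illustrative numbers in a common, unspecified concentration unit): it shows fu holding steady near Kd/(Kd + PT) while Cpu << Kd and then rising toward 1 as the protein saturates.

```python
def fraction_unbound(Cpu, Kd, PT):
    """fu = (Kd + Cpu) / (Kd + Cpu + PT), from the binding expression above."""
    return (Kd + Cpu) / (Kd + Cpu + PT)

# Hypothetical binding parameters (all concentrations in the same arbitrary unit):
Kd, PT = 10.0, 90.0
for Cpu in (0.1, 1.0, 10.0, 100.0, 1000.0):
    print(f"Cpu = {Cpu:7.1f} -> fu = {fraction_unbound(Cpu, Kd, PT):.3f}")
# While Cpu << Kd, fu stays near Kd / (Kd + PT) = 0.10; as the protein
# saturates, fu climbs toward 1.
```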

The Changes of fu

  • Affinity

The affinity of the drug for the protein is the main determinant of fu. Affinity is expressed by Kd, which is an inverse measure of affinity: as affinity increases, Kd gets smaller. Drugs with small Kd values bind extensively, whereas those with large Kd values do not bind extensively.

  • Free drug concentration

Because the therapeutic plasma concentrations of most drugs are much less than their Kd values, binding is able to increase in proportion to increases in the total concentration: fu remains constant over therapeutic plasma concentrations. There are, however, a few drugs that have therapeutic plasma concentrations that are around the range of their Kd values. These drugs, which tend to be drugs that have very high therapeutic plasma concentrations, include valproic acid and salicylates, both of which bind to albumin, and disopyramide, which binds to AAG. The binding of these drugs uses a substantial amount of protein, and as a result they display concentration-dependent binding. As the drug concentration increases, some degree of saturation is observed, and the fraction unbound gets larger.

  • Plasma binding protein concentration

As predicted by the law of mass action, changes in the protein concentration will produce changes in the degree of binding. In the case of AAG, increases in the concentration are more common. Physiological stress caused by myocardial infarction, cancer, and surgery can lead to four- to fivefold increases in the AAG concentration. Lipoprotein concentrations vary widely in the population. They can decrease as a result of diet and therapy with HMG-CoA reductase inhibitors (statins), and increase due to alcoholism and diabetes mellitus.

  • Displacement

The binding of one drug may displace a second drug from its binding site. This displacement occurs because two drugs compete for a limited number of binding sites on the protein. Not surprisingly, displacers tend to be those drugs that achieve high concentrations in the plasma, use up a lot of protein, and display concentration-dependent binding.

  • Renal and hepatic disease

The binding of drugs to albumin is often decreased in patients with severe renal disease. This appears to be the result of both decreased albumin levels and the accumulation of compounds that are normally eliminated, which may alter the affinity of drugs for albumin and/or compete for binding sites. The binding of several acidic drugs, including phenytoin and valproic acid, is reduced in severe renal disease. Plasma protein binding may also be reduced in hepatic disease.

Clinical Consequences of Changes in Plasma Protein Binding

Changes in fu as a result of altered protein concentration or displacement will result in a change in the fraction of the total drug that is unbound. Two issues need to be addressed when considering the clinical consequences of this: the potential changes in the unbound drug concentration at the site of action, and the interpretation and evaluation of the routinely measured total plasma concentrations.

When binding decreases, the pharmacologically active unbound component increases, and in theory, the response or toxicity could increase. However, the clinical consequences of altered plasma protein binding are minimized by two factors: 1) increased elimination and 2) little change in drug concentrations outside the plasma.

In many cases, only the unbound drug is accessible to the organs of elimination. This is known as restrictive elimination because elimination is restricted by protein binding and is limited to the unbound drug. For drugs that display restrictive clearance, the increase in the unbound concentration that occurs when binding decreases results in an increase in elimination of the drug. The increase in elimination is usually proportional to the increase in unbound concentration. As a result, the unbound drug concentration in the plasma eventually falls to exactly the same value as that before the change in binding. In other words, the increase in the unbound concentration is canceled out by increased elimination.

The time it takes for the unbound concentration to return to its normal level is determined by the rate of elimination of the drug (the elimination half-life). If the drug is eliminated rapidly, the unbound concentration returns to its original level quickly. If the drug is eliminated slowly, it takes a long time for the unbound concentration to return to its original level. The time it takes to return can be important for drugs that have a narrow therapeutic index.

The plasma comprises a relatively small physiological volume (3 L). Even when plasma protein binding is extensive, the fraction of the drug in the body that is located in the plasma is much less than that in the tissues. As a result, when the fraction unbound increases, the extra drug that distributes to the tissue is often very small in comparison to the amount of drug already present. This is particularly the case for drugs that have large volumes of distribution, where the majority of the drug in the body is in the tissues and only a very small fraction resides in the plasma.

Interpreting Cp

In clinical practice, drug therapy may be monitored by ensuring that plasma concentrations lie within the therapeutic range. The therapeutic range of a drug is most conveniently expressed in terms of the routinely measured total plasma concentration (Cp). But since the unbound concentration is the pharmacologically active component, the therapeutic range should more correctly be expressed in terms of this unbound concentration. Formulas have been developed for some drugs that convert a measured plasma concentration of a drug to the value that it would take if the protein concentration were normal. The formula below can be derived with some algebra.

 (when plasma drug concentration << Kd)

Linear Regression

16-Oct-17

The Regression Equation

When analyzing data, it is essential to first construct a graph of the data. A scatterplot is a graph of data from two quantitative variables of a population. In a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other variable. Each pair of observations is then plotted as a point. Note: Data from two quantitative variables of a population are called bivariate quantitative data.

To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. In general, an error, e, is the signed vertical distance from the line to a data point. To decide which of several lines fits the data best, we compute the sum of the squared errors for each. The least-squares criterion is that the line that best fits a set of data points is the one having the smallest possible sum of squared errors.

Although the least-squares criterion states the property that the regression line for a set of data points must satisfy, it does not tell us how to find that line. This task is accomplished by Formula 14.1. In preparation, we introduce some notation that will be used throughout our study of regression and correlation.

Note that although we have not used Syy in Formula 14.1, we will use it later.

For a linear regression y = b0 + b1x, y is the dependent variable and x is the independent variable. However, in the context of regression analysis, we usually call y the response variable and x the predictor variable or explanatory variable (because it is used to predict or explain the values of the response variable).
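Formula 14.1 is not reproduced here; in terms of the Sxx and Sxy notation just introduced, the least-squares line works out to b1 = Sxy / Sxx and b0 = ȳ – b1·x̄. A minimal sketch with a small made-up data set (the x and y values are invented for illustration):

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    Sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = Sxy / Sxx                 # slope
    b0 = y_bar - b1 * x_bar        # y-intercept
    return b0, b1

# Invented bivariate data:
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
b0, b1 = least_squares_line(x, y)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```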

Extrapolation

Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside the range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. To help avoid extrapolation, some researchers include the range of the observed values of the predictor variable with the regression equation.

Outliers and Influential Observations

Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. Thus, as usual, we need to identify outliers and remove them from the analysis when appropriate – for example, if we find that an outlier is a measurement or recording error.

We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is "pulled" toward such a data point without counteraction by other data points. If an influential observation is due to a measurement or recording error, or if for some other reason it clearly does not belong in the data set, it can be removed without further consideration. However, if no explanation for the influential observation is apparent, the decision whether to retain it is often difficult and calls for a judgment by the researcher.

A Warning on the Use of Linear Regression

The idea behind finding a regression line is based on the assumption that the data points are scattered about a line. Frequently, however, the data points are scattered about a curve instead of a line. One can still compute the values of b0 and b1 to obtain a regression line for these data points. The result, however, will yield an inappropriate fit by a line when in fact a curve should be used. Therefore, before finding a regression line for a set of data points, draw a scatterplot. If the data points do not appear to be scattered about a line, do not determine a regression line.

The Coefficient of Determination

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation: 1) the total variation in the observed values of the response variable and 2) the amount of variation in the observed values of the response variable that is explained by the regression.

To measure the total variation in the observed values of the response variable, we use the sum of squared deviations of the observed values of the response variable from the mean of those values. This measure of variation is called the total sum of squares, SST. Thus, SST = 𝛴(yi – ȳ)². If we divide SST by n – 1, we get the sample variance of the observed values of the response variable. So, SST really is a measure of total variation.

To measure the amount of variation in the observed values of the response variable that is explained by the regression, we first look at a particular observed value of the response variable, say, the one corresponding to the data point (xi, yi). The total variation in the observed values of the response variable is based on the deviation of each observed value from the mean value, yi – ȳ. Each such deviation can be decomposed into two parts: the deviation explained by the regression line, ŷi – ȳ, and the remaining unexplained deviation, yi – ŷi. Hence the amount of variation (squared deviation) in the observed values of the response variable that is explained by the regression is 𝛴(ŷi – ȳ)². This measure of variation is called the regression sum of squares, SSR. Thus, SSR = 𝛴(ŷi – ȳ)².

Using the total sum of squares and the regression sum of squares, we can determine the percentage of variation in the observed values of the response variable that is explained by the regression, namely, SSR / SST. This quantity is called the coefficient of determination and is denoted r². Thus, r² = SSR / SST. By the same reasoning, the deviation not explained by the regression is yi – ŷi, and the amount of variation (squared deviation) in the observed values of the response variable that is not explained by the regression is 𝛴(yi – ŷi)². This measure of variation is called the error sum of squares, SSE. Thus, SSE = 𝛴(yi – ŷi)².

In summary, check Definition 14.6

The coefficient of determination, r², is the proportion of variation in the observed values of the response variable explained by the regression. The coefficient of determination always lies between 0 and 1. A value of r² near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of r² near 1 suggests that the regression equation is quite useful for making predictions.

Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. Because of this regression identity, we can also express the coefficient of determination in terms of the total sum of squares and the error sum of squares: r² = SSR / SST = (SST – SSE) / SST = 1 – SSE / SST. This formula shows that, when expressed as a percentage, the coefficient of determination can also be interpreted as the percentage reduction in the total squared error obtained by using the regression equation instead of the mean, ȳ, to predict the observed values of the response variable.
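A minimal sketch that computes SST, SSR, and SSE for the same invented data as in the earlier regression sketch, then verifies the regression identity and both forms of r²:

```python
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)               # total variation
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the regression
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained

print(f"SST = {SST:.3f}, SSR = {SSR:.3f}, SSE = {SSE:.3f}")
print(f"regression identity: SSR + SSE = {SSR + SSE:.3f}")
print(f"r^2 = SSR/SST = {SSR / SST:.4f} = 1 - SSE/SST = {1 - SSE / SST:.4f}")
```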

Correlation and Causation

Two variables may have a high correlation without being causally related. Rather, we can infer only that the two variables have a strong tendency to increase (or decrease) simultaneously and that one variable is a good predictor of the other. Two variables may be strongly correlated because they are both associated with other variables, called lurking variables, that cause the changes in the two variables under consideration.


The Regression Model; Analysis of Residuals

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. In other words, we have the following definitions.

Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.

Note: We refer to the line y = 𝛽0 + 𝛽1x – on which the conditional means of the response variable lie – as the population regression line and to its equation as the population regression equation. Observe that 𝛽0 is the y-intercept of the population regression line and 𝛽1 is its slope. The inferential procedures in regression are robust to moderate violations of Assumptions 1-3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration do not violate any of those assumptions too badly.

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants 𝛽0, 𝛽1, and 𝜎 so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎.

Because the parameters 𝛽0, 𝛽1, and 𝜎 are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 to estimate 𝛽0 and b1 to estimate 𝛽1. We note that b0 is an unbiased estimator of 𝛽0 and that b1 is an unbiased estimator of 𝛽1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean.

The statistic used to obtain a point estimate for the common conditional standard deviation 𝜎 is called the standard error of the estimate. The standard error of the estimate is computed as se = √(SSE / (n – 2)).

Analysis of Residuals

Now we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1-3. The method for checking Assumptions 1-3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,

Residual = ei = yi – ŷi

We can show that the sum of the residuals is always 0, which, in turn, implies that e(bar) = 0. Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals (however, the exact standard deviation of the residuals is obtained by dividing by n – 1 instead of n – 2). Thus, the standard error of the estimate is sometimes called the residual standard deviation.
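A minimal sketch (same invented data as in the earlier regression sketches) that computes the residuals, confirms that they sum to essentially zero, and evaluates the standard error of the estimate with the n – 2 divisor:

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # e_i = y_i - y-hat_i
SSE = sum(e ** 2 for e in residuals)
se = math.sqrt(SSE / (n - 2))        # standard error of the estimate (n - 2 divisor)

print("residuals:", [round(e, 3) for e in residuals])
print("sum of residuals:", round(sum(residuals), 10))       # essentially 0
print(f"standard error of the estimate: se = {se:.4f}")
```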

We can analyze the residuals to decide whether Assumptions 1-3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let's consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.

In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis. In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means that the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.

Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation 𝜎. Thus a normal probability plot of the residuals should be roughly linear.

A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.

To illustrate the use of residual plots for regression diagnostics, consider the three plots in Figure 15.6. In Figure 15.6(a), the residuals are scattered about the x-axis (residual = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met. Figure 15.6(b) suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated. Figure 15.6(c) suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.


Inferences for the Slope of the Population Regression Line

Suppose that the variables x and y satisfy the assumptions for regression inferences. Then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean 𝛽0 + 𝛽1x and standard deviation 𝜎. Of particular interest is whether the slope, 𝛽1, of the population regression line equals 0. If 𝛽1 = 0, then, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution having mean 𝛽0 and standard deviation 𝜎. Because x does not appear in either of those two parameters, it is useless as a predictor of y.

Of note, although x alone may not be useful for predicting y, it may be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is not useful for predicting y, we really mean that the regression equation with x as the only predictor variable is not useful for predicting y. Conversely, although x alone may be useful for predicting y, it may not be useful in conjunction with another variable or variables. Thus, in this section, when we say that x is useful for predicting y, we really mean that the regression equation with x as the only predictor variable is useful for predicting y.

We can decide whether x is useful as a (linear) predictor of y – that is, whether the regression equation has utility – by performing the hypothesis test

We base hypothesis tests for 𝛽1 on the statistic b1. From the assumptions for regression inferences, we can show that the sampling distribution of the slope of the regression line is a normal distribution whose mean is the slope, 𝛽1, of the population regression line. More generally, we have Key Fact 15.3.

As a consequence of Key Fact 15.3, the standard variable

has the standard normal distribution. But this variable cannot be used as a basis for the required test statistic because the common conditional standard deviation, 𝜎, is unknown. We therefore replace 𝜎 with its sample estimate se, the standard error of the estimate. As you might suspect, the resulting variable has a t-distribution.

In light of Key Fact 15.4, for a hypothesis test with the null hypothesis H0: 𝛽1 = 0, we can use the variable t as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the regression t-test.

Confidence Intervals for the Slope of the Population Regression Line

Obtaining an estimate for the slope of the population regression line is worthwhile. We know that a point estimate for 𝛽1 is provided by b1. To determine a confidence-interval estimate for 𝛽1, we apply Key Fact 15.4 to obtain Procedure 15.2, called the regression t-interval procedure.
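A minimal sketch of the regression t-test and regression t-interval (invented data; SciPy is assumed to be available for the t-distribution). The standard form of the test statistic is t = b1 / (se / √Sxx) with n – 2 degrees of freedom:

```python
import math
from scipy import stats   # assumption: SciPy is available for the t-distribution

# Invented data (same values used in the earlier regression sketches):
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
b0 = y_bar - b1 * x_bar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(SSE / (n - 2))                    # standard error of the estimate

# Regression t-test of H0: beta1 = 0 against Ha: beta1 != 0
t = b1 / (se / math.sqrt(Sxx))
p_value = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed P-value
print(f"t = {t:.3f}, P-value = {p_value:.5f}")

# Regression t-interval: 95% confidence interval for beta1
t_crit = stats.t.ppf(0.975, df=n - 2)
half = t_crit * se / math.sqrt(Sxx)
print(f"95% CI for beta1: ({b1 - half:.3f}, {b1 + half:.3f})")
```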

Estimation and Prediction

In this section, we examine how a sample regression equation can be used to make two important inferences: 1) Estimate the conditional mean of the response variable corresponding to a particular value of the predictor variable; 2) predict the value of the response variable for a particular value of the predictor variable.

In light of Key Fact 15.5, if we standardize the variable ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a confidence-interval formula. Therefore, we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Recalling that 𝛽0 + 𝛽1xp is the conditional mean of the response variable corresponding to the value xp of the predictor variable, we can apply Key Fact 15.6 to derive a confidence-interval procedure for means in regression. We call that procedure the conditional mean t-interval procedure.

Prediction Intervals

A primary use of a sample regression equation is to make predictions. Prediction intervals are similar to confidence intervals. The term confidence is usually reserved for interval estimates of parameters; the term prediction is used for interval estimates of variables.

In light of Key Fact 15.7, if we standardize the variable yp – ŷp, the resulting variable has the standard normal distribution. However, because the standardized variable contains the unknown parameter 𝜎, it cannot be used as a basis for a prediction-interval formula. So we replace 𝜎 by its estimate se, the standard error of the estimate. The resulting variable has a t-distribution.

Using Key Fact 15.8, we can derive a prediction-interval procedure, called the predicted value t-interval procedure.
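Key Facts 15.5-15.8 are not reproduced here, but the resulting interval half-widths take the standard forms t·se·√(1/n + (xp – x̄)²/Sxx) for the conditional mean and t·se·√(1 + 1/n + (xp – x̄)²/Sxx) for a predicted value. A minimal sketch with invented data (SciPy assumed available); note that the prediction interval is always the wider of the two:

```python
import math
from scipy import stats   # assumption: SciPy is available for the t-distribution

# Invented data (same values as in the earlier regression sketches):
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
b0 = y_bar - b1 * x_bar
se = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

xp = 3.5                              # a value inside the observed x-range
y_hat_p = b0 + b1 * xp                # point estimate / predicted value at xp
t_crit = stats.t.ppf(0.975, df=n - 2)

half_mean = t_crit * se * math.sqrt(1 / n + (xp - x_bar) ** 2 / Sxx)
half_pred = t_crit * se * math.sqrt(1 + 1 / n + (xp - x_bar) ** 2 / Sxx)

print(f"95% CI for the conditional mean at x = {xp}: "
      f"({y_hat_p - half_mean:.2f}, {y_hat_p + half_mean:.2f})")
print(f"95% prediction interval at x = {xp}: "
      f"({y_hat_p - half_pred:.2f}, {y_hat_p + half_pred:.2f})")
```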


Inferences in Correlation

Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line. Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, 𝜌. This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, 𝜌 actually describes the strength of the linear relationship between two variables; r is only an estimate of 𝜌 obtained from sample data.

The population linear correlation coefficient of two variables x and y always lies between -1 and 1. Values of 𝜌 near -1 or 1 indicate a strong linear relationship between the variables, whereas values of 𝜌 near 0 indicate a weak linear relationship between the variables. As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, 𝜌. Consequently, we can use r as a basis for performing a hypothesis test for 𝜌.

In light of Key Fact 15.9, for a hypothesis test with the null hypothesis H0: 𝜌 = 0, we use the t-score as the test statistic and obtain the critical values or P-value from the t-table. We call this hypothesis-testing procedure the correlation t-test.
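A minimal sketch of the correlation t-test (invented data; SciPy assumed available). The standard test statistic is t = r·√(n – 2) / √(1 – r²) with n – 2 degrees of freedom:

```python
import math
from scipy import stats   # assumption: SciPy is available for the t-distribution

# Invented data (same values as in the regression sketches):
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = Sxy / math.sqrt(Sxx * Syy)        # sample linear correlation coefficient

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test statistic for H0: rho = 0
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"r = {r:.4f}, t = {t:.3f}, P-value = {p_value:.5f}")
```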

Inferences for Population Standard Deviations

05-Oct-17

Inferences for One Population Standard Deviation

Suppose that we want to obtain information about a population standard deviation. If the population is small, we can often determine 𝜎 exactly by first taking a census and then computing 𝜎 from the population data. However, if the population is large, which is usually the case, a census is generally not feasible, and we must use inferential methods to obtain the required information about 𝜎.

Logic Behind

Recall that to perform a hypothesis test with null hypothesis H0: 𝜇 = 𝜇0 for the mean, 𝜇, of a normally distributed variable, we do not use the variable x̄ as the test statistic; rather, we use the t score. Similarly, when performing a hypothesis test with null hypothesis H0: 𝜎 = 𝜎0 for the standard deviation, 𝜎, of a normally distributed variable, we do not use the variable s as the test statistic; rather, we use a modified version of that variable:

𝜒² = (n – 1)s² / 𝜎0²

This variable has a chi-square distribution with n – 1 degrees of freedom.

In light of Key Fact 11.2, for a hypothesis test with null hypothesis H0: 𝜎 = 𝜎0, we can use the variable 𝜒2 as the test statistic and obtain the critical value(s) from the 𝜒2-table. We call this hypothesis-testing procedure the one-standard-deviation 𝜒2-test.

Procedure 11.1 gives a step-by-step method for performing a one-standard-deviation 𝜒2-test by using either the critical-value approach or the P-value approach. For the P-value approach, we could use the 𝜒2-table to estimate the P-value, but doing so is awkward and tedious; thus, we recommend using statistical software.

Unlike the z-tests and t-test for one and two population means, the one-standard-deviation 𝜒2-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed or very nearly so.

Consequently, before applying Procedure 11.1, construct a normal probability plot. If the plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.1. We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for a population standard deviation. If you have doubts about the normality of the variable under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for a population standard deviation.

In addition, using Key Fact 11.2, we can also obtain a confidence-interval procedure for a population standard deviation. We call this procedure the one-standard-deviation 𝜒2-interval procedure and present it as Procedure 11.2. This procedure is also known as the 𝜒2-interval procedure for one population standard deviation. This confidence-interval procedure is often formulated in terms of variance instead of standard deviation. Like the one-standard-deviation 𝜒2-test, this procedure is not at all robust to violations of the normality assumption.
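A minimal sketch of the one-standard-deviation 𝜒2-test and 𝜒2-interval (the data and the hypothesized 𝜎0 are invented, and SciPy is assumed to be available for the chi-square distribution); in practice, the normal probability plot check described above should come first:

```python
import math
import statistics
from scipy import stats   # assumption: SciPy provides the chi-square distribution

# Invented sample, assumed (after a normal probability plot) to be roughly normal:
data = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6]
sigma0 = 0.3                            # hypothesized value in H0: sigma = sigma0

n = len(data)
s = statistics.stdev(data)              # sample standard deviation
df = n - 1
chi2 = df * s ** 2 / sigma0 ** 2        # one-standard-deviation chi-square statistic
p_value = 2 * min(stats.chi2.cdf(chi2, df), stats.chi2.sf(chi2, df))   # two-tailed
print(f"s = {s:.3f}, chi-square = {chi2:.2f}, P-value = {p_value:.4f}")

# 95% chi-square interval for sigma (often stated for the variance instead):
lower = math.sqrt(df * s ** 2 / stats.chi2.ppf(0.975, df))
upper = math.sqrt(df * s ** 2 / stats.chi2.ppf(0.025, df))
print(f"95% CI for sigma: ({lower:.3f}, {upper:.3f})")
```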


Inferences for Two Population Standard Deviations, Using Independent Samples

We now introduce hypothesis tests and confidence intervals for two population standard deviations. More precisely, we examine inferences to compare the standard deviations of one variable of two different populations. Such inferences are based on a distribution called the F-distribution. In many statistical analyses that involve the F-distribution, we also need to determine F-values having areas 0.005, 0.01, 0.025, and 0.10 to their left. Although such F-values aren't available directly from Table VIII, we can obtain them indirectly from the table by using Key Fact 11.4.

Logic Behind

To perform hypothesis tests and obtain confidence intervals for two population standard deviations, we need Key Fact 11.5, that is, the distribution of the F-statistic for comparing two population standard deviations. By definition, the F-statistic is the ratio of the two sample variances, F = s1² / s2².

In light of Key Fact 11.5, for a hypothesis test with null hypothesis H0: 𝜎1 = 𝜎2 (the population standard deviations are equal), we can use the variable F = s1² / s2² as the test statistic and obtain the critical value(s) from the F-table. We call this hypothesis-testing procedure the two-standard-deviations F-test. Procedure 11.3 gives a step-by-step method for performing a two-standard-deviations F-test by using either the critical-value approach or the P-value approach.

For the P-value approach, we could use the F-table to estimate the P-value, but doing so is awkward and tedious; thus, we recommend using statistical software.
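A minimal sketch of the two-standard-deviations F-test (the two samples are invented, and SciPy is assumed to be available for the F-distribution):

```python
import statistics
from scipy import stats   # assumption: SciPy provides the F-distribution

# Invented independent samples, each assumed roughly normal:
sample1 = [12.1, 11.8, 12.6, 12.0, 11.5, 12.9, 12.3]
sample2 = [11.9, 12.2, 12.1, 12.0, 12.3, 11.8, 12.1, 12.2]

s1, s2 = statistics.stdev(sample1), statistics.stdev(sample2)
F = s1 ** 2 / s2 ** 2                              # test statistic for H0: sigma1 = sigma2
df1, df2 = len(sample1) - 1, len(sample2) - 1
p_value = 2 * min(stats.f.cdf(F, df1, df2), stats.f.sf(F, df1, df2))   # two-tailed
print(f"F = {F:.3f} (df = {df1}, {df2}), P-value = {p_value:.4f}")
```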

Unlike the z-tests and t-tests for one and two population means, the two-standard-deviations F-test is not robust to moderate violations of the normality assumption. In fact, it is so nonrobust that many statisticians advise against its use unless there is considerable evidence that the variable under consideration is normally distributed, or very nearly so, on each population.

Consequently, before applying Procedure 11.3, construct a normal probability plot of each sample. If either plot creates any doubt about the normality of the variable under consideration, do not use Procedure 11.3.

We note that nonparametric procedures, which do not require normality, have been developed to perform inferences for comparing two population standard deviations. If you have doubts about the normality of the variable on the two populations under consideration, you can often use one of those procedures to perform a hypothesis test or find a confidence interval for two population standard deviations.

Using Key Fact 11.5, we can also obtain a confidence-interval procedure, Procedure 11.4, for the ratio of two population standard deviations. We call it the two-standard-deviations F-interval procedure. Also it is known as the F-interval procedure for two population standard deviations and the two-sample F-interval procedure. This confidence-interval procedure is often formulated in terms of variances instead of standard deviations.

To interpret confidence intervals for the ratio 𝜎1 / 𝜎2, of two population standard deviations, considering three cases is helpful.

Case 1: The endpoints of the confidence interval are both greater than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 5 to 8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 5 and 8 or, equivalently, 5𝜎2 < 𝜎1 < 8𝜎2. Thus, we can be 95% confident that 𝜎1 is somewhere between 5 and 8 times greater than 𝜎2.

Case 2: The endpoints of the confidence interval are both less than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 0.5 to 0.8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 0.5 and 0.8 or, equivalently, 0.5𝜎2 < 𝜎1 < 0.8𝜎2. Thus, noting that 1/0.5 = 2 and 1/0.8 = 1.25, we can be 95% confident that 𝜎1 is somewhere between 1.25 and 2 times smaller than 𝜎2.

Case 3: One endpoint of the confidence interval is less than 1 and the other is greater than 1.

To illustrate, suppose that a 95% confidence interval for 𝜎1 / 𝜎2 is from 0.5 to 8. Then we can be 95% confident that 𝜎1 / 𝜎2 lies somewhere between 0.5 and 8 or, equivalently, 0.5𝜎2 < 𝜎1 < 8𝜎2. Thus, we can be 95% confident that 𝜎1 is somewhere between 2 times smaller than and 8 times greater than 𝜎2.

Sampling

02-Oct-17

If the information you need is not already available from a previous study, you might acquire it by conducting a census – that is, by obtaining information for the entire population of interest. However, conducting a census may be time consuming, costly, impractical, or even impossible.

Two methods other than a census for obtaining information are sampling and experimentation. If sampling is appropriate, you must decide how to select the sample; that is, you must choose the method for obtaining a sample from the population. Because the sample will be used to draw conclusions about the entire population, it should be a representative sample – that is, it should reflect as closely as possible the relevant characteristics of the population under consideration.

For instance, using the average weight of a sample of professional football players to make an inference about the average weight of all adult males would be unreasonable. Nor would it be reasonable to estimate the median income of California residents by sampling the incomes of Beverly Hills residents.

Most modern sampling procedures involve the use of probability sampling. In probability sampling, a random device – such as tossing a coin, consulting a table of random numbers, or employing a random-number generator – is used to decide which members of the population will constitute the sample instead of leaving such decisions to human judgment.

PS: Probability sampling is based on the principle that every member of the population has a known chance of being selected; in simple random sampling, that chance is the same for every member. For example, if you had a population of 100 people, each person would have odds of 1 out of 100 of being chosen. With non-probability sampling, those odds are neither known nor equal. For example, a person might have a better chance of being chosen if they live close to the researcher or have access to a computer. Probability sampling gives you the best chance to create a sample that is truly representative of the population.

The use of probability sampling may still yield a nonrepresentative sample. However, probability sampling helps eliminate unintentional selection bias and permits the researcher to control the chance of obtaining a nonrepresentative sample. Furthermore, the use of probability sampling guarantees that the techniques of inferential statistics can be applied.

Simple Random Sampling

The inferential techniques considered most often are intended for use with only one particular sampling procedure: simple random sampling. Simple random sampling is a sampling procedure for which each possible sample of a given size is equally likely to be the one obtained. A simple random sample is a sample obtained by simple random sampling.

There are two types of simple random sampling. One is simple random sampling with replacement (SRSWR), whereby a member of the population can be selected more than once; the other is simple random sampling without replacement (SRS), whereby a member of the population can be selected at most once. Unless we specify otherwise, assume that simple random sampling is done without replacement. Tools for carrying out simple random sampling include random-number tables and random-number generators.
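A minimal sketch of both flavors of simple random sampling, using Python's random module on a toy population of 100 labeled members (the population and sample size are invented for illustration):

```python
import random

population = list(range(1, 101))   # a toy population of 100 labeled members

# Simple random sampling without replacement (SRS): each member at most once.
srs = random.sample(population, 10)

# Simple random sampling with replacement (SRSWR): a member may be selected more than once.
srswr = [random.choice(population) for _ in range(10)]

print("SRS:  ", sorted(srs))
print("SRSWR:", sorted(srswr))
```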

Systematic Random Sampling

Simple random sampling is the most natural and easily understood method of probability sampling – it corresponds to our intuitive notion of random selection by lot. However, simple random sampling does have drawbacks. For instance, it may fail to provide sufficient coverage when information about subpopulations is required and may be impractical when the members of the population are widely scattered geographically.

One method that takes less effort to implement than simple random sampling is systematic random sampling. Procedure 1.1 presents a step-by-step method for implementing systematic random sampling.

Systematic random sampling is easier to execute than simple random sampling and usually provides comparable results. The exception is the presence of some kind of cyclical pattern in the listing of the members of the population (e.g., male, female, male, female, …), a phenomenon that is relatively rare.

Cluster Sampling

Another sampling method is cluster sampling, which is particularly useful when the members of the population are widely scattered geographically. Procedure 1.2 provides a step-by-step method for implementing cluster sampling.

Many years ago, citizens' groups pressured the city council of Tempe, Arizona, to install bike paths in the city. The council members wanted to be sure that they were supported by a majority of the taxpayers, so they decided to poll the city's homeowners. Their first survey of public opinion was a questionnaire mailed out with the city's 18,000 homeowner water bills. Unfortunately, this method did not work very well. Only 19.4% of the questionnaires were returned, and a large number of those had written comments that indicated they came from avid bicyclists or from people who strongly resented bicyclists. The city council realized that the questionnaire generally had not been returned by the average homeowner.

An employee in the city's planning department had sample survey experience, so the council asked her to do a survey. She was given two assistants to help her interview 300 homeowners and 10 days to complete the project. The planner first considered taking a simple random sample of 300 homes: 100 interviews for herself and 100 for each of her two assistants. However, the city was so spread out that an interviewer visiting 100 randomly scattered homeowners would have to drive an average of 18 minutes from one interview to the next. Doing so would require approximately 30 hours of driving time for each interviewer and could delay completion of the report. The planner needed a different sampling design.

Although cluster sampling can save time and money, it does have disadvantages. Ideally, each cluster should mirror the entire population. In practice, however, members of a cluster may be more homogeneous than the members of the entire population, which can cause problems.

Stratified Sampling

Another sampling method, known as stratified sampling, is often more reliable than cluster sampling. In stratified sampling, the population is first divided into subpopulations, called strata, and then sampling is done from each stratum. Ideally, the members of each stratum should be homogeneous relative to the characteristic under consideration.

In stratified sampling, the strata are often sampled in proportion to their size, which is called proportional allocation. Procedure 1.3 presents a step-by-step method for implementing stratified (random) sampling with proportional allocation.
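Procedure 1.3 is not reproduced here; as a rough illustration (a minimal sketch with a made-up population divided into three strata, where the stratum names and sizes are invented), proportional allocation can be carried out as follows:

```python
import random

# Made-up population divided into three strata (e.g., neighborhoods of homeowners):
strata = {
    "north": list(range(0, 500)),     # 500 members
    "south": list(range(500, 800)),   # 300 members
    "east":  list(range(800, 1000)),  # 200 members
}
population_size = sum(len(members) for members in strata.values())
sample_size = 50

sample = []
for name, members in strata.items():
    # Proportional allocation: each stratum contributes in proportion to its size.
    k = round(sample_size * len(members) / population_size)
    sample.extend(random.sample(members, k))
    print(f"{name}: {k} members sampled out of {len(members)}")

print(f"total sample size: {len(sample)}")
```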

Multistage Sampling

Most large-scale surveys combine one or more of simple random sampling, systematic random sampling, cluster sampling, and stratified sampling. Such multistage sampling is used frequently by pollsters and government agencies.