Jessica Utts Home Page

In response to the paper Replication and Meta-Analysis in Parapsychology by Jessica Utts

[These papers were published in "Statistical Science," 1991, Vol. 6, No. 4.]

Comment

M. J. Bayarri and James Berger

1. INTRODUCTION

There are many fascinating issues discussed in this paper. Several concern parapsychology itself and the interpretation of statistical methodology therein. We are not experts in parapsychology, and so have only one comment concerning such matters: In Section 3 we briefly discuss the need to switch from P-values to Bayes factors in discussing evidence concerning parapsychology.

A more general issue raised in the paper is that of replication. It is quite illuminating to consider the issue of replication from a Bayesian perspective, and this is done in Section 2 of our discussion.

2. REPLICATION

Many insightful observations concerning replication are given in the article, and these spurred us to determine if they could be quantified within Bayesian reasoning. Quantification requires clear delineation of the possible purposes of replication, and at least two are obvious. The first is simple reduction of random error, achieved by obtaining more observations from the replication. The second purpose is to search for possible bias in the original experiment. We use "bias" in a loose sense here, to refer to any of the huge number of ways in which the effects being measured by the experiment can differ from the actual effects of interest. Thus a clinical trial without a placebo can suffer a placebo "bias"; a survey can suffer a "bias" due to the sampling frame being unrepresentative of the actual population; and possible sources of bias in parapsychological experiments have been extensively discussed.

Replication to Reduce Random Error

If the sole goal of replication of an experiment is to reduce random error, matters are very straightforward. Reviewing the Bayesian way of studying this issue is, however, useful and will be done through the following simple example.

EXAMPLE 1. Consider the example from Tversky and Kahneman (1982), in which an experiment results in a standardized test statistic of z1 = 2.46. (We will assume normality to keep computations trivial.) The second question is: What is the highest value of z2 in a second set of data that would be considered a failure to replicate? Two possible precise versions of this question are: Question 1: What is the probability of observing z2 for which the null hypothesis would be rejected in the replicated experiment? Question 2: What value of z2 would leave one's overall opinion about the null hypothesis unchanged?

Consider the simple case where Z1 ~ N(z1 | θ, 1) and (independently) Z2 ~ N(z2 | θ, 1), where θ is the mean and 1 is the standard deviation of the normal distribution. Note that we are considering the case in which no experimental bias is suspected and so the means for each experiment are assumed to be the same.

Suppose that it is desired to test H0: θ ≤ 0 versus H1: θ > 0, and suppose that initial prior opinion about θ can be described by the noninformative prior π(θ) = 1. We consider the one-sided testing problem with a constant prior in this section, because it is known that then the posterior probability of H0, to be denoted by P(H0 | data), equals the P-value, allowing us to avoid complications arising from differences between Bayesian and classical answers.

After observing z1 = 2.46, the posterior distribution of θ is

$$\pi(\theta \mid z_1) = N(\theta \mid 2.46,\ 1).$$

Question 1 then has the answer (using predictive Bayesian reasoning)

$$P(\text{rejecting at level } \alpha \mid z_1) = 1 - \Phi\!\left(\frac{c - 2.46}{\sqrt{2}}\right),$$

where Φ is the standard normal cdf and c is the one-sided critical value corresponding to the level, α, of the test. For instance, if α = 0.05, then this probability equals 0.7178, demonstrating that there is a quite substantial probability that the second experiment will fail to reject. If α is chosen to be the observed significance level from the first experiment, so that c = z1, then the probability that the second experiment will reject is just 1/2. This is nothing but a statement of the well-known martingale property of Bayesianism, that what you "expect" to see in the future is just what you know today. In a sense, therefore, question 1 is exposed as being uninteresting.
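As a quick numerical check of the two probabilities quoted above, the following is a minimal sketch. It assumes the predictive distribution Z2 | z1 ~ N(z1, √2) implied by the flat prior; the function name `prob_replication_rejects` is ours, not the authors'.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_replication_rejects(z1: float, alpha: float) -> float:
    """Predictive probability that the replication rejects H0 at level alpha.

    Assumption: flat prior on theta, so theta | z1 ~ N(z1, 1) and the
    predictive distribution of Z2 given z1 is N(z1, sqrt(2)).
    """
    c = STD_NORMAL.inv_cdf(1.0 - alpha)  # one-sided critical value
    return 1.0 - STD_NORMAL.cdf((c - z1) / sqrt(2.0))


z1 = 2.46
p_at_05 = prob_replication_rejects(z1, 0.05)

# Taking alpha to be the observed significance level makes c equal to z1.
alpha_observed = 1.0 - STD_NORMAL.cdf(z1)
p_at_observed = prob_replication_rejects(z1, alpha_observed)

print(f"alpha = 0.05:           {p_at_05:.4f}")
print(f"alpha = observed level: {p_at_observed:.4f}")
```

The two printed values should agree with the 0.7178 and 1/2 reported in the text, up to rounding.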

Question 2 more properly focuses on the fact that the stated goal of replication here is simply to reduce uncertainty in stated conclusions. The answer to the question follows immediately from noting that the posterior from the combined data (z1, z2) is

$$\pi(\theta \mid z_1, z_2) = N\!\left(\theta \,\Big|\, \frac{z_1 + z_2}{2},\ \frac{1}{\sqrt{2}}\right),$$

so that

$$P(H_0 \mid z_1, z_2) = \Phi\!\left(-\frac{z_1 + z_2}{\sqrt{2}}\right).$$

Setting this equal to P(H0 | z1) = Φ(−z1) and solving for z2 yields z2 = (√2 − 1) z1 = 1.02. Any value of z2 greater than this will increase the total evidence against H0, while any value smaller than 1.02 will decrease the evidence.
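The break-even value can be verified directly. The sketch below assumes the flat-prior model of Example 1, under which k unit-variance observations give the posterior N(mean of z, 1/√k); the helper name `posterior_prob_null` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_prob_null(*z_values: float) -> float:
    """P(theta <= 0 | data) under a flat prior and unit-sd normal observations.

    With k observations the posterior is N(mean(z), 1/sqrt(k)), so the
    probability of theta <= 0 is Phi(-sum(z) / sqrt(k)).
    """
    k = len(z_values)
    return STD_NORMAL.cdf(-sum(z_values) / sqrt(k))


z1 = 2.46
z2_break_even = (sqrt(2.0) - 1.0) * z1

print(f"break-even z2            = {z2_break_even:.3f}")
print(f"P(H0 | z1)               = {posterior_prob_null(z1):.5f}")
print(f"P(H0 | z1, break-even)   = {posterior_prob_null(z1, z2_break_even):.5f}")
print(f"P(H0 | z1, z2 = 2.0)     = {posterior_prob_null(z1, 2.0):.5f}")
print(f"P(H0 | z1, z2 = 0.5)     = {posterior_prob_null(z1, 0.5):.5f}")
```

A replication with z2 above the break-even value lowers the posterior probability of H0, and one below it raises that probability, even when z2 is positive.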

Replication to Detect Bias

The aspirin example dramatically raises the issue of bias detection as a motive for replication. Professor Utts observes that replication 1 gives results that are fully compatible with those of the original study, which could be interpreted as suggesting that there is no bias in the original study, while replication 2 would raise serious concerns of bias. We became very interested in the implicit suggestion that replication 2 would thus lead to less overall evidence against the null hypothesis than would replication 1, even though in isolation replication 2 was much more "significant" than was replication 1. In attempting to see if this is so, we considered the Bayesian approach to the study of bias within the framework of the aspirin example.

EXAMPLE 2. For simplicity in the aspirin example, we reduce consideration to

θ = true difference in heart attack rates between aspirin and placebo populations multiplied by 1000;
Y = difference in observed heart attack rates between aspirin and placebo groups in original study multiplied by 1000;
Xi = difference in observed heart attack rates between aspirin and placebo groups in Replication i multiplied by 1000.

We assume that the replication studies are extremely well designed and implemented, so that one is very confident that the Xi have mean θ. Using normal approximations for convenience, the data can be summarized as

X1 ~ N(x1 | θ, 4.82²),   X2 ~ N(x2 | θ, 3.63²)

with actual observations x1 = 7.704 and x2 = 13.07.

Consider now the bias issue. We assume that the original experiment is somewhat suspect in this regard, and we will model bias by defining the mean of Y to be

η = θ + β,

where β is the unknown bias. Then the data in the original experiment can be summarized by

Y ~ N(y | η, 1.54²),

with the actual observation being y = 7.707.

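Bayesian analysis requires specification of a prior distribution, π(β), for the suspected amount of bias. Of particular interest then are the posterior distribution of β, assuming replication i has been performed (and retaining the constant prior for θ used in Example 1), given by

$$\pi(\beta \mid y, x_i) \propto \pi(\beta)\, \exp\!\left\{-\frac{\left(\beta - [y - x_i]\right)^2}{2\left(1.54^2 + \sigma_i^2\right)}\right\},$$

where σi² is the variance (4.82² or 3.63²) from replication i; and the posterior probability of H0, given by

$$P(H_0 \mid y, x_i) = \int_{-\infty}^{\infty} \Phi\!\left(-\frac{\sigma_i^2\,(y - \beta) + 1.54^2\, x_i}{1.54\,\sigma_i \sqrt{1.54^2 + \sigma_i^2}}\right) \pi(\beta \mid y, x_i)\, d\beta.$$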

Recall that our goal here was to see if Bayesian analysis can reproduce the intuition that the original experiment could be trusted if replication 1 had been done, while it could not be trusted (in spite of its much larger sample size) had replication 2 been performed. Establishing this requires finding a prior distribution π(β) for which π(β | y, x1) has little effect on P(H0 | y, x1), but π(β | y, x2) has a large effect on P(H0 | y, x2). To achieve the first objective, π(β) must be tightly concentrated near zero. To achieve the second, π(β) must be such that large |y − x2|, which suggests presence of a large bias, can result in a substantial shift of posterior mass for β away from zero.

A sensible candidate for the prior density π(β) is the Cauchy (0, V) density

$$\pi(\beta) = \frac{1}{\pi V \left[1 + (\beta / V)^2\right]}.$$

Flat-tailed densities, such as this, are well known to have the property that when discordant data is observed (e.g., when |y − xi| is large), substantial mass shifts away from the prior center towards the likelihood center. It is easy to see that a normal prior for β cannot have the desired behavior.

Our first surprise in consideration of these priors was how small V needed to be chosen in order for P(H0 | y, x1) to be unaffected by the bias. For instance, even with V = 1.54/100 (recall that 1.54 was the standard deviation of Y from the original experiment), computation yields P(H0 | y, x1) = 4.3 × 10⁻⁵, compared with the P-value (and posterior probability from the original experiment assuming no bias) of 2.8 × 10⁻⁷. There is a clear lesson here: even very small suspicions of bias can drastically alter a small P-value. Note that replication 1 is very consistent with the absence of bias, and so the posterior distribution for the bias remains tightly concentrated near zero; for instance, the mean of the posterior for β is then 7.2 × 10⁻⁶, and the standard deviation is 0.25.

When we turned attention to replication 2, we found that it did not seriously change the prior perceptions of bias. Examination quickly revealed the reason; even the maximum likelihood estimate of the bias is no more than 1.5 standard deviations from zero, which is not enough to change strong prior beliefs. We, therefore, considered a third experiment, defined in Table 1. Transforming to approximate normality, as before, yields

X3 ~ N(x3 | θ, 3.48²),

with x3 = 22.72 being the actual observation. The maximum likelihood estimate of bias is now 3.95 standard deviations from zero, so there is potential for a substantial change in opinion about the bias.
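The standardized distances quoted for replications 2 and 3 can be reproduced with a few lines. This sketch assumes that the maximum likelihood estimate of the bias is y − xi, with standard deviation equal to the square root of the sum of the two squared study standard deviations; the function name is ours.

```python
from math import sqrt

Y_OBS = 7.707   # original study observation (from the text)
SD_Y = 1.54     # its standard deviation (from the text)


def standardized_bias_estimate(x_rep: float, sd_rep: float) -> float:
    """MLE of the bias, y - x_rep, in units of its own standard deviation.

    Assumption: Y and X_rep are independent normals, so y - x_rep has
    variance SD_Y**2 + sd_rep**2.
    """
    return (Y_OBS - x_rep) / sqrt(SD_Y**2 + sd_rep**2)


rep1 = standardized_bias_estimate(7.704, 4.82)
rep2 = standardized_bias_estimate(13.07, 3.63)
rep3 = standardized_bias_estimate(22.72, 3.48)

for label, value in (("replication 1", rep1), ("replication 2", rep2), ("replication 3", rep3)):
    print(f"{label}: {value:+.2f} standard deviations")
```

Replication 2 stays within 1.5 standard deviations of zero, while replication 3 sits close to 4, which is why only the latter can move a prior that is tightly concentrated at zero.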

TABLE 1
Frequency of heart attacks in replication 3

           Yes     No
Aspirin      5   2309
Placebo     54   2116

Sure enough, computation when V = 1.54/100 yields that E[β | y, x3] = −4.9 with (posterior) standard deviation equal to 6.62, which is a dramatic shift from prior opinion (that is Cauchy (0, 1.54/100)). The effect of this is to essentially ignore the original experiment in overall assessments of evidence. For instance, P(H0 | y, x3) = 3.81 × 10⁻¹¹, which is very close to P(H0 | x3) = 3.29 × 10⁻¹¹. Note that, if β were set equal to zero, the overall posterior probability of H0 (and P-value) would be 2.62 × 10⁻¹³.
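The posterior quantities for the bias can be approximated by one-dimensional numerical integration. The sketch below is our own reconstruction, not the authors' computation: it assumes a flat prior on θ, a Cauchy prior on the bias β with scale V = 1.54/100, and the normal summaries given in the text. The substitution β = V tan(u) turns the Cauchy prior into a uniform weight on u, which keeps a simple midpoint rule accurate despite the very narrow prior.

```python
from math import erfc, exp, pi, sqrt, tan

Y_OBS, SD_Y = 7.707, 1.54   # original study (values from the text)
V_SCALE = SD_Y / 100.0      # Cauchy scale, assumed to be V = 1.54/100


def std_normal_cdf(x: float) -> float:
    # erfc keeps relative accuracy for the very small tail probabilities here
    return 0.5 * erfc(-x / sqrt(2.0))


def bias_analysis(x_rep: float, sd_rep: float, n_grid: int = 200_000):
    """Return (P(H0 | y, x), E[beta | y, x], sd[beta | y, x]).

    Midpoint rule in u, where beta = V_SCALE * tan(u) and u is uniform on
    (-pi/2, pi/2) under the Cauchy prior.
    """
    tau2 = SD_Y**2 + sd_rep**2            # variance of y - x given beta
    scale = SD_Y * sd_rep * sqrt(tau2)
    step = pi / n_grid
    norm = sum_null = sum_b = sum_b2 = 0.0
    for k in range(n_grid):
        beta = V_SCALE * tan(-pi / 2.0 + (k + 0.5) * step)
        lik = exp(-((beta - (Y_OBS - x_rep)) ** 2) / (2.0 * tau2))
        if lik == 0.0:
            continue  # likelihood underflows far out in the tails
        # P(theta <= 0 | y, x, beta) under the flat prior on theta
        p_null = std_normal_cdf(
            -(sd_rep**2 * (Y_OBS - beta) + SD_Y**2 * x_rep) / scale
        )
        norm += lik
        sum_null += lik * p_null
        sum_b += lik * beta
        sum_b2 += lik * beta * beta
    mean = sum_b / norm
    return sum_null / norm, mean, sqrt(sum_b2 / norm - mean**2)


p1, mean1, sd1 = bias_analysis(7.704, 4.82)   # replication 1
p3, mean3, sd3 = bias_analysis(22.72, 3.48)   # replication 3
print(f"replication 1: P(H0) = {p1:.2e}, E[beta] = {mean1:.2e}, sd = {sd1:.2f}")
print(f"replication 3: P(H0) = {p3:.2e}, E[beta] = {mean3:.2f}, sd = {sd3:.2f}")
```

If the reconstruction of the model is right, the output should land near the figures quoted in the text: a posterior probability of order 10⁻⁵ with the bias still concentrated at zero for replication 1, and a posterior mean for the bias near −5 for replication 3.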

Thus Bayesian reasoning can reproduce the intuition that replication which indicates bias can cast considerable doubt on the original experiment, while replication which provides no evidence of bias leaves evidence from the original experiment intact. Such behavior seems only obtainable, however, with flat-tailed priors for bias (such as the Cauchy) that are very concentrated (in comparison with the experimental standard deviation) near zero.

3. P-VALUES OR BAYES FACTORS?

Parapsychology experiments usually consider testing of H0: No parapsychological effect exists. Such null hypotheses are often realistically represented as point nulls (see Berger and Delampady, 1987, for the reasons why care must be taken in such representation), in which case it is known that there is a large difference between P-values and posterior probabilities (see Berger and Delampady, 1987, for a review). The article by Jefferys (1990) dramatically illustrates this, showing that a very small P-value can actually correspond to evidence for H0 when considered from a Bayesian perspective. (This is closely related to the famous "Jeffreys" paradox.) The argument in favor of the Bayesian approach here is very strong, since it can be shown that the conflict holds for virtually any sensible prior distribution; a Bayesian answer can be wrong if the prior information turns out to be inaccurate, but a Bayesian answer that holds for all sensible priors is unassailable.

Since P-values simply cannot be viewed as meaningful in these situations, we found it of interest to reconsider the example in Section 5 from a Bayes factor perspective. We considered only analysis of the overall totals, that is, x = 122 successes out of n = 355 trials. Assuming a simple Bernoulli trial model with success probability θ, the goal is to test H0: θ = 1/4 versus H1: θ ≠ 1/4.

To determine the Bayes factor here, one must specify g(θ), the conditional prior density on H1. Consider choosing g to be uniform and symmetric, that is,

$$g_r(\theta) = \begin{cases} \dfrac{1}{2r} & \text{for } \tfrac{1}{4} - r \le \theta \le \tfrac{1}{4} + r, \\[4pt] 0 & \text{otherwise.} \end{cases}$$

Crudely, r could be considered to be the maximum change in success probability that one would expect given that ESP exists. Also, these distributions are the "extreme points" over the class of symmetric unimodal conditional densities, so answers that hold over this class are also representative of answers over a much larger class. Note that here r ≤ 0.25 (because 0 ≤ θ ≤ 1); for the given data the θ > 0.5 are essentially irrelevant, but if it were deemed important to take them into account one could use the more sophisticated binomial analysis in Berger and Delampady (1987).

For gr, the Bayes factor of H1 to H0, which is to be interpreted as the relative odds for the hypotheses provided by the data, is given by

$$B(r) = \frac{\dfrac{1}{2r} \displaystyle\int_{1/4 - r}^{1/4 + r} \theta^{122} (1 - \theta)^{233}\, d\theta}{(1/4)^{122}\, (3/4)^{233}}.$$

This is graphed in Figure 1.

The P-value for this problem was 0.00005, indicating overwhelming evidence against H0 from a classical perspective. In contrast to the situation studied by Jefferys (1990), the Bayes factor here does not completely reverse the conclusion, showing that there are very reasonable values of r for which the evidence against H0 is moderately strong, for example 100/1 or 200/1. Of course, this evidence is probably not of sufficient strength to overcome strong prior opinions against H0 (one obtains final posterior odds by multiplying prior odds by the Bayes factor). To properly assess strength of evidence, we feel that such Bayes factor computations should become standard in parapsychology.
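To see how the Bayes factor varies with r, it can be evaluated numerically. The sketch below assumes the conditional prior under H1 is uniform on [1/4 − r, 1/4 + r], so that the Bayes factor is the average over that interval of the likelihood ratio against θ = 1/4. The function names and the grid size are illustrative choices.

```python
from math import exp, log

X_SUCCESSES, N_TRIALS = 122, 355   # overall totals from the text
THETA_0 = 0.25                     # success probability under H0


def likelihood_ratio(theta: float) -> float:
    """Binomial likelihood at theta divided by the likelihood at THETA_0."""
    failures = N_TRIALS - X_SUCCESSES
    return exp(
        X_SUCCESSES * log(theta / THETA_0)
        + failures * log((1.0 - theta) / (1.0 - THETA_0))
    )


def bayes_factor(r: float, n_grid: int = 20_000) -> float:
    """Bayes factor of H1 to H0 for a uniform prior on [1/4 - r, 1/4 + r].

    (1 / 2r) times the integral of the likelihood ratio equals its average
    over the interval, approximated here with the midpoint rule.
    """
    lower = THETA_0 - r
    step = 2.0 * r / n_grid
    total = sum(likelihood_ratio(lower + (k + 0.5) * step) for k in range(n_grid))
    return total / n_grid


for r in (0.01, 0.05, 0.07, 0.10, 0.15, 0.25):
    print(f"r = {r:.2f}: Bayes factor of H1 to H0 = {bayes_factor(r):.1f}")
```

Working with the ratio of likelihoods in log space avoids underflow in the powers of θ and 1 − θ. As r shrinks toward zero the prior under H1 collapses onto 1/4 and the Bayes factor approaches 1.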

As mentioned by Professor Utts, Bayesian methods have additional potential in situations such as this, by allowing unrealistic models of iid trials to be replaced by hierarchical models reflecting differing abilities among subjects.

ACKNOWLEDGEMENTS

M. J. Bayarri's research was supported in part by the Spanish Ministry of Education and Science under DGICYT Grant BE91-038, while visiting Purdue University. James Berger's research was supported by NSF Grant DMS-89-23071.

M. J. Bayarri is Titular Professor, Department of Statistics and Operations Research, University of Valencia
Avenida Dr. Moliner 50, 46100 Burjassot, Valencia, Spain

James Berger is the Richard M. Brumfield Distinguished Professor of Statistics, Purdue University
West Lafayette, Indiana 47907


COPYRIGHT NOTICE

The contents of this document are copyright ©1991 by the Institute of Mathematical Statistics. All rights reserved.
