In response to the paper "Replication and Meta-Analysis in Parapsychology" by Jessica Utts
[These papers were published in "Statistical Science," 1991, Vol. 6, No. 4.]
Comment
Ree Dawson
This paper offers readers interested in statistical science multiple views of the controversial history of parapsychology and how statistics has contributed to its development. It first provides an account of how both design and inferential aspects of statistics have been pivotal issues in evaluating the outcomes of experiments that study psi abilities. It then emphasizes how the idea of science as replication has been key in this field in which results have not been conclusive or consistent and thus meta-analysis has been at the heart of the literature in parapsychology. The author not only reviews past debate on how to interpret repeated psi studies, but also provides very detailed information on the Honorton-Hyman argument, a nice illustration of the challenges of resolving such debate. This debate is also a good example of how statistical criticism can be part of the scientific process and lead to better experiments and, in general, better science.
The remainder of the paper addresses technical issues of meta-analysis, drawing upon recent research in parapsychology for an in-depth application. Through a series of examples, the author presents a convincing argument that power issues cannot be overlooked in successive replications and that comparison of effect sizes provides a richer alternative to the dichotomous measure inherent in the use of p-values. This is particularly relevant when the potential effect size is small and resources are limited, as seems to be the case for psi studies.
The concluding section briefly mentions Bayesian techniques. As noted by the author, Bayes (or empirical Bayes) methodology seems to make sense for research in parapsychology. This discussion examines possible Bayesian approaches to meta-analysis in this field.
BAYES MODELS FOR PARAPSYCHOLOGY
The notion of repeatability maps well into the Bayesian set-up in which experiments, viewed as a random sample from some superpopulation of experiments, are assumed to be exchangeable. When subjects can also be viewed as an approximately random sample from some population, it is appropriate to pool them across experiments. Otherwise, analyses that partially pool information according to experimental heterogeneity need to be considered. Empirical and hierarchical Bayes methods offer a flexible modeling framework for such analyses, relying on empirical or subjective sources to determine the degree of pooling. These richer methods can be particularly useful to meta-analysis of experiments in parapsychology conducted under potentially diverse conditions.
For the recent ganzfeld series, assuming them to be independent and binomially distributed as discussed in Section 5, the data can be summed (pooled) across series to estimate a common hit rate. Honorton et al. (1990) assessed the homogeneity of effects across the 11 series using a chi-square test that compares individual effect sizes to the weighted mean effect. The chi-square statistic $\chi^2_{10} = 16.25$, not statistically significant ($p = 0.093$), largely reflects the contribution of the last "special" series (contributes 9.2 units to the $\chi^2$ value), and to a lesser extent the novice series with a negative effect (contributes 2.5 units). The outlier series can be dropped from the analysis to provide a more conservative estimate of the presence of psi effects for these data (this result is reported in Section 5). For the remaining 10 series, the chi-square value $\chi^2_{9} = 7.01$ strongly favors homogeneity, although more than one-third of its value is due to the novice series (number 4 in Table 1). This pattern points to the potential usefulness of a richer model to accommodate series that may be distinct from others. For the earlier ganzfeld data analyzed by Honorton (1985b), the appeal of a Bayes or other model that recognizes the heterogeneity across studies is clear cut: $\chi^2 = 56.6$, $p = 0.0001$, where only those studies with common chance hit rate have been included (see Table 2).
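As a check on the first figure quoted above, the tail probability of the homogeneity statistic can be computed directly. The sketch below is not part of the original comment. It assumes the 11-series test has 10 degrees of freedom (one fewer than the number of series) and uses the closed-form tail of the chi-square distribution for even degrees of freedom, so only the standard library is needed. The function name `chi2_sf_even` is illustrative.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
        exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    partial_sum = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * partial_sum


# Homogeneity statistic reported for the 11 series; df = 11 - 1 is an assumption.
p_all_series = chi2_sf_even(16.25, 10)
print(round(p_all_series, 3))
```

Under that assumption the computed tail probability rounds to 0.093, in agreement with the value reported for the 11 series. The 10-series statistic would have an odd number of degrees of freedom, which this closed form does not cover.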
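TABLE 1

| Series | Series type | N Trials | Hit rate | $Y_i$ | $\sigma_i$ |
|---|---|---|---|---|---|
| 1 | Pilot | 22 | 0.36 | -0.58 | 0.44 |
| 2 | Pilot | | | | |
| 3 | Pilot | | | | |
| 4 | Novice | | | | |
| 5 | Novice | | | | |
| 6 | Novice | | | | |
| 7 | Novice | | | | |
| 8 | Novice | | | | |
| 9 | Experienced | | | | |
| 10 | Experienced | | | | |
| 11 | Experienced | | | | |
| Overall | | 355 | 0.34 | | |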
TABLE 2

| Study | N Trials | Hit rate | $Y_i$ | $\sigma_i$ |
|---|---|---|---|---|
| 1 | 32 | 0.44 | -0.24 | 0.36 |
| … | | | | |
| Overall | 722 | 0.38 | | |
Historic reliance on vote-counting approaches to determine the presence of psi effects makes it natural to consider Bayes models that focus on the ensemble of experimental effects from parapsychological studies, rather than individual estimates. Recent work in parapsychology that compares effect sizes across studies, rather than estimating separate study effects, reinforces the need to examine this type of model. Louis (1984) develops Bayes and empirical Bayes methods for problems in which estimating the ensemble of parameter values is the primary goal, for example, multiple comparisons. For the simple compound normal model, $Y_i \sim N(\theta_i, 1)$, $\theta_i \sim N(\mu, \tau^2)$, the standard Bayes estimates (posterior means)

$$\theta_i^{*} = \mu + D\,(Y_i - \mu), \qquad D = \frac{\tau^2}{1 + \tau^2},$$

where the $\theta_i$ represent experimental effects of interest, are modified approximately to

$$\tilde{\theta}_i = \mu + \sqrt{D}\,(Y_i - \mu)$$

when an ensemble loss function is assumed. The new estimates adjust the shrinkage factor $D$ so that their sample mean and variance match the posterior expectation and variance of the $\theta_i$'s. Similar results are obtained when the model is generalized to the case of unequal variances, $Y_i \sim N(\theta_i, \sigma_i^2)$.
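The effect of the ensemble adjustment can be illustrated numerically. The sketch below is not taken from Louis (1984) or from the original comment. It assumes the unit-variance compound normal model with known $\mu$ and $\tau^2$, takes the posterior mean to be $\mu + D(Y_i - \mu)$ with $D = \tau^2/(1 + \tau^2)$, and takes the ensemble estimate to be the approximate form $\mu + \sqrt{D}\,(Y_i - \mu)$. All function names and the simulated data are hypothetical.

```python
import random
import statistics


def shrinkage_factor(tau2):
    """D = tau^2 / (1 + tau^2) for the unit-variance compound normal model."""
    return tau2 / (1.0 + tau2)


def posterior_means(y, mu, tau2):
    """Component-wise Bayes estimates: shrink each Y_i toward mu by D."""
    d = shrinkage_factor(tau2)
    return [mu + d * (yi - mu) for yi in y]


def ensemble_estimates(y, mu, tau2):
    """Approximate ensemble-loss estimates: shrink by sqrt(D) instead of D."""
    root_d = shrinkage_factor(tau2) ** 0.5
    return [mu + root_d * (yi - mu) for yi in y]


def fraction_above(values, cut):
    """Fraction of values strictly above the cut point."""
    return sum(v > cut for v in values) / len(values)


def simulate(n, mu, tau2, seed=0):
    """Draw theta_i ~ N(mu, tau2) and Y_i ~ N(theta_i, 1) with a fixed seed."""
    rng = random.Random(seed)
    theta = [rng.gauss(mu, tau2 ** 0.5) for _ in range(n)]
    y = [rng.gauss(t, 1.0) for t in theta]
    return theta, y


# Hypothetical settings, chosen only for illustration.
MU, TAU2, CUT = 0.0, 1.0, 1.0
theta, y = simulate(2000, MU, TAU2, seed=1)
post = posterior_means(y, MU, TAU2)
ens = ensemble_estimates(y, MU, TAU2)

print("variance of true effects:       ", statistics.pvariance(theta))
print("variance of posterior means:    ", statistics.pvariance(post))
print("variance of ensemble estimates: ", statistics.pvariance(ens))
print("fraction of true effects > cut: ", fraction_above(theta, CUT))
print("fraction of posterior means > cut:", fraction_above(post, CUT))
print("fraction of ensemble ests > cut:  ", fraction_above(ens, CUT))
```

The posterior means have a sample variance equal to $D^2$ times that of the $Y_i$, which understates the spread of the true effects. Scaling by $\sqrt{D}$ gives a variance of $D$ times that of the $Y_i$, and since the marginal variance of $Y_i$ is $1 + \tau^2$, this is close to $\tau^2$.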
For the above model, the fraction of the ensemble estimates $\tilde{\theta}_i$ above (or below) a cut point $C$ is a consistent estimate of the fraction of $\theta_i > C$ (or $\theta_i < C$). Thus, the use of ensemble, rather than component-wise, loss can help detect when individual effects are above a specified threshold by chance. For the meta-analysis of ganzfeld experiments, the observed binomial proportions transformed to the logit (or arcsin $\sqrt{p}$) scale can be modeled in this framework. Letting $d_i$ and $m_i$ denote the number of direct hits and misses, respectively, for the $i$th experiment, and $p_i$ the corresponding population proportion of direct hits, the $Y_i$ are the observed logits

$$Y_i = \log(d_i / m_i),$$

and $\sigma_i^2$, estimated by maximum likelihood as $1/d_i + 1/m_i$, is the variance of $Y_i$ conditional on $\theta_i = \text{logit}(p_i)$. The threshold $\text{logit}(0.25) \approx -1.10$ can be used to identify the number of experiments for which the proportion of direct hits exceeds that expected by chance.
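The logit transformation and its estimated standard error are simple to compute from hit and miss counts. The following sketch uses only the formulas stated above. The example counts (16 hits, 9 misses) are an assumption derived from the series described later in this comment as having 25 trials and a hit rate of 0.64; the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def observed_logit(hits, misses):
    """Return (Y, sigma): the observed logit log(d/m) and its estimated
    standard error sqrt(1/d + 1/m)."""
    if hits <= 0 or misses <= 0:
        raise ValueError("both counts must be positive for the logit to exist")
    y = math.log(hits / misses)
    sigma = math.sqrt(1.0 / hits + 1.0 / misses)
    return y, sigma


# Chance threshold for a 1-in-4 hit rate: logit(0.25) = log(1/3).
CHANCE_LOGIT = logit(0.25)

# Assumed counts: 25 trials at a hit rate of 0.64 gives 16 hits and 9 misses.
y_example, sigma_example = observed_logit(16, 9)

print("threshold:", round(CHANCE_LOGIT, 2))
print("Y:", round(y_example, 2), "sigma:", round(sigma_example, 2))
print("above chance threshold:", y_example > CHANCE_LOGIT)
```

An experiment with zero hits or zero misses has no finite logit, which is why the sketch rejects such counts rather than applying a continuity correction that the original text does not mention.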
Table 1 shows $Y_i$ and $\sigma_i$ for the 11 ganzfeld series. All but one of the series are well above the threshold; $Y_4$ falls marginally below $-1.10$. Any shrinkage toward a common hit rate will lead to an estimate of $\theta_4$, under either component-wise or ensemble loss, that is above the threshold. The use of ensemble loss (with its consistency property) provides more convincing support that all $\theta_i > -1.10$, although posterior estimates of uncertainty are needed to fully calibrate this. For the earlier ganzfeld data in Table 2, ensemble loss can similarly be used to determine the number of studies with $\theta_i < -1.10$ and specifically whether the negative effects of studies 4 and 24 ($Y_4 = -1.21$ and $Y_{24} = -1.33$) occurred as a result of chance fluctuation.
Features of the ganzfeld data in Section 5, such as the outlier series, suggest that further elaboration of the basic Bayesian set-up may be necessary for some meta-analyses in parapsychology. Hierarchical models provide a natural framework to specify these elaborations and explore how results change with the prior specification. This type of sensitivity analysis can expose whether conclusions are closely tied to prior beliefs, as observed by Jeffreys for RNG data (see Section 7). Quantifying the influence of model components deemed to be more subjective or less certain is important to broad acceptance of results as evidence of psi performance (or lack thereof).
Consider the initial model commonly used for Bayesian analysis of discrete data:

$$Y_i \mid p_i, n_i \sim B(p_i, n_i), \qquad \theta_i \sim N(\mu, \tau^2), \qquad \theta_i = \text{logit}(p_i),$$

with noninformative priors assumed for $\mu$ and $\tau^2$ (e.g., $\log \tau$ locally uniform). The distinctiveness of the last "special" series (pilot versus formal, novice versus experienced) raises the question of whether the experimental effects follow a normal distribution. Weighted normal plots (Ryan and Dempster, 1984) can be used to graphically diagnose the adequacy of second-stage normality (see Dempster, Selwyn and Weeks, 1983, for examples with binary response and normal superpopulation).
Alternatively, if nonnormality is suspected, the model can be revised to include some sort of heavy-tailed prior to accommodate possibly outlying series or studies. West (1985) incorporates additional scale parameters, one for each component of the model (experiment), that flexibly adapt to atypical $\theta_i$ and discount their influence on posterior estimates, thus avoiding under- or over-shrinkage due to such $\theta_i$. For example, the second stage can specify the prior as a scale mixture of normals:

$$\theta_i \sim N(\mu, \tau^2 \gamma_i^{-1}), \qquad k\gamma_i \sim \chi^2_k,$$

together with a prior distribution on $\tau^{-2}$.
This approach for the prior is similar to others for maximum likelihood estimation that modify the sampling error distribution to yield estimates that are "robust" against outlying observations.
Like its maximum likelihood counterparts, in addition to the robust effect estimates $\hat{\theta}_i$, the Bayes model provides (posterior) scale estimates $\hat{\gamma}_i$ of the additional scale parameters. These can be interpreted as the weight given to the data for each $\theta_i$ in the analysis and are useful for diagnosing which model components (series or studies) are unusual and how they influence the shrinkage. When more complex groupings among the $\theta_i$ are suspected, for example, a bimodal distribution of studies from different sites or experimenters, other mixture specifications can be used to further relax the shrinkage toward a common value.
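How a scale parameter acts as a weight can be seen in a simplified setting. The sketch below is an illustration rather than West's (1985) procedure. It assumes the second stage is $\theta_i \sim N(\mu, \tau^2/\gamma_i)$ with $k\gamma_i \sim \chi^2_k$ (the usual Student-t scale mixture), and it conditions on known $\theta_i$, $\mu$ and $\tau^2$, which a full posterior analysis would not do. Under those assumptions $\gamma_i$ given $\theta_i$ has a gamma distribution with shape $(k+1)/2$ and rate $(k + z_i^2)/2$, where $z_i^2 = (\theta_i - \mu)^2/\tau^2$, so its conditional mean is $(k+1)/(k + z_i^2)$. The symbol $\gamma_i$, the function name and the numerical inputs are hypothetical.

```python
def expected_scale_weight(theta, mu, tau2, k):
    """Conditional mean of gamma_i given theta_i, mu and tau2.

    Assumes theta_i ~ N(mu, tau2 / gamma_i) and k * gamma_i ~ chi-square(k),
    which gives gamma_i | theta_i ~ Gamma(shape=(k + 1) / 2,
    rate=(k + z2) / 2) with z2 = (theta_i - mu)^2 / tau2.
    """
    if tau2 <= 0 or k <= 0:
        raise ValueError("tau2 and k must be positive")
    z2 = (theta - mu) ** 2 / tau2
    return (k + 1.0) / (k + z2)


# Hypothetical effects on the logit scale: ten typical values and one outlier.
MU, TAU2, K = -0.7, 0.1, 5
effects = [-0.9, -0.8, -0.75, -0.7, -0.7, -0.65, -0.6, -0.55, -0.5, -0.45, 0.6]

for i, theta in enumerate(effects, start=1):
    weight = expected_scale_weight(theta, MU, TAU2, K)
    print(f"component {i:2d}: theta = {theta:5.2f}, expected weight = {weight:.2f}")
```

Components close to the common mean receive weights slightly above one, while a component several prior standard deviations away receives a much smaller weight, so its prior variance is inflated and it is pulled less toward the common value.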
For the 11 ganzfeld series, the last "outlier" series, quite distinct from the others (hit rate = 0.64), is moderately precise ($N = 25$). Omitting it from the analysis causes the overall hit rate to drop from 0.344 to 0.321. The scale mixture model is a compromise between these two values (on the logit scale), discounting the influence of series 11 on the estimated posterior common hit rate used for shrinkage. The scale factor $\gamma_{11}$, an indication of how separate $\theta_{11}$ is from the other parameters, also causes the estimate $\hat{\theta}_{11}$ to be shrunk less toward the common hit rate than other, more homogeneous $\theta_i$, giving more weight to individual information for that series (see West, 1985). The heterogeneity of the earlier ganzfeld data is more pronounced, and studies are taken from a variety of sources over time. For these data, the scale factors $\gamma_i$ can be used to explore atypical studies (e.g., study 6, with hit rate = 0.90, contributes more than 25% to the $\chi^2$ value for homogeneity) and groupings among effects, as well as protect the analysis from misspecification of second-stage normality.
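The drop in the pooled hit rate when the last series is omitted can be reproduced with simple arithmetic. The counts below are assumptions inferred from the rounded figures in the text: 355 trials at an overall rate of 0.344 implies about 122 hits, and 25 trials at a rate of 0.64 implies 16 hits. The logit values are included because the text describes the scale mixture estimate as a compromise on that scale.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Assumed counts, inferred from rounded rates reported in the text.
TOTAL_TRIALS = 355
TOTAL_HITS = 122      # 0.344 * 355 is about 122.1
LAST_TRIALS = 25
LAST_HITS = 16        # 0.64 * 25

rate_all = TOTAL_HITS / TOTAL_TRIALS
rate_without_last = (TOTAL_HITS - LAST_HITS) / (TOTAL_TRIALS - LAST_TRIALS)

print("pooled, all 11 series:  ", round(rate_all, 3), "logit", round(logit(rate_all), 3))
print("pooled, series 11 out:  ", round(rate_without_last, 3), "logit", round(logit(rate_without_last), 3))
```

With these assumed counts the two pooled rates round to 0.344 and 0.321, matching the values quoted above.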
Variation among ganzfeld series or studies and the degree to which pooling or shrinking is appropriate can be investigated further by considering a range of priors for the second-stage variance $\tau^2$. If the marginal likelihood of $\tau^2$ dominates the prior specification, then results should not vary as the prior for $\tau^2$ is varied. Otherwise, it is important to identify the degree to which subjective information about interexperimental variability influences the conclusions. This sensitivity analysis is a Bayesian enrichment of the simpler test of homogeneity directed toward determining whether or not complete pooling is appropriate.
To assess how well heterogeneity among historical control groups is determined by the data, Dempster, Selwyn and Weeks (1983) propose three priors for $\tau^2$ in the logistic-normal model. The prior distributions range from strongly favoring individual estimates, $p(\tau^2)\,d\tau^2 \propto \tau^{-1}\,d\tau^2$, to the uniform reference prior $p(\tau^2)\,d\tau^2 \propto \tau^{-2}\,d\tau^2$, flat on the $\log \tau$ scale, to strongly favoring complete pooling, $p(\tau^2)\,d\tau^2 \propto \tau^{-3}\,d\tau^2$ (the latter forcing complete pooling for the compound normal model; see Morris, 1983). For their two examples, the results (estimates of linear treatment effects) are largely insensitive to variation in the prior distribution, but the number of studies in each example was large (70 and 19 studies available for pooling). For the 11 ganzfeld series, $\tau^2$ may be less well determined by the data. The posterior estimate of $\tau^2$ and its sensitivity to the prior $p(\tau^2)$ will also depend on whether individual scale parameters are incorporated into the model. Discounting the influence of the last series will both shift the marginal likelihood toward smaller values of $\tau^2$ and concentrate it more in that region.
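A prior sensitivity check of this kind can be sketched on a grid. The code below is a simplified illustration, not the procedure of Dempster, Selwyn and Weeks (1983). It replaces the logistic-normal model with a normal approximation on the logit scale, $Y_i \sim N(\theta_i, \sigma_i^2)$ and $\theta_i \sim N(\mu, \tau^2)$, integrates $\mu$ out under a flat prior, and compares priors of the form $p(\tau^2) \propto \tau^{-k}$ for $k = 1, 2, 3$. The data values are hypothetical, and the grid starts slightly above zero, a choice of this sketch, because priors with $k \ge 2$ are not integrable at zero.

```python
import math


def log_marginal_likelihood(tau2, y, sigma2):
    """Log marginal likelihood of tau^2 (up to a constant), mu integrated out.

    Marginally Y_i ~ N(mu, sigma_i^2 + tau^2); a flat prior on mu leaves
    prod(w_i)^(1/2) * sum(w_i)^(-1/2) * exp(-Q/2), with w_i = 1/(sigma_i^2 + tau^2)
    and Q the weighted sum of squares about the weighted mean.
    """
    w = [1.0 / (s2 + tau2) for s2 in sigma2]
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_hat) ** 2 for wi, yi in zip(w, y))
    return 0.5 * sum(math.log(wi) for wi in w) - 0.5 * math.log(sum(w)) - 0.5 * q


def posterior_mean_tau2(y, sigma2, k, grid):
    """Grid approximation to the posterior mean of tau^2 under p(tau^2)
    proportional to tau^(-k), i.e. (tau^2)^(-k/2), restricted to the grid."""
    log_post = [log_marginal_likelihood(t, y, sigma2) - 0.5 * k * math.log(t)
                for t in grid]
    top = max(log_post)                      # subtract the max for stability
    weights = [math.exp(lp - top) for lp in log_post]
    return sum(w * t for w, t in zip(weights, grid)) / sum(weights)


# Hypothetical logits and sampling variances for 11 series (not the real data).
Y = [-0.6, -0.9, -0.3, -1.2, -0.7, -0.5, -0.1, -0.8, -0.6, -0.4, 0.6]
SIGMA2 = [0.20, 0.15, 0.10, 0.18, 0.12, 0.16, 0.14, 0.11, 0.19, 0.13, 0.17]
GRID = [0.01 * j for j in range(1, 201)]     # tau^2 from 0.01 to 2.00

for k in (1, 2, 3):
    print(f"prior tau^-{k}: posterior mean of tau^2 =",
          round(posterior_mean_tau2(Y, SIGMA2, k, GRID), 3))
```

If the three posterior summaries are close, the data dominate the prior; if they differ substantially, conclusions about pooling depend on prior beliefs about $\tau^2$. On any fixed grid the posterior mean must decrease as $k$ increases, so the informative quantity is the size of the change, and results for $k \ge 2$ also depend on where the grid is truncated.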
The issue of objective assessment of experimental results is one that extends well beyond the field of parapsychology, and this paper provides insight into issues surrounding the analysis and interpretation of small effects from related studies. Bayes methods can contribute to such meta-analysis in two ways. They permit experimental and subjective evidence to be formally combined to determine the presence or absence of effects that are not clear cut or are controversial (e.g., psi abilities). They can also help uncover sources and degree of uncertainty in the scientific conclusions.
Ree Dawson is Senior Statistician, New England Biomedical Research Foundation,
and Statistical Consultant, RFE/RL Research Institute.
Dr. Dawson's mailing address is 177 Morrison Avenue, Somerville, Massachusetts 02144.
The contents of this document are copyright ©1991 by the Institute of Mathematical Statistics. All rights reserved.