What the Astrologically Curious Should Know About P-values
By Robert Currey
An astrological editor and researcher argues that the p-value is the one statistic that measures how confidently chance can be ruled out when weighing a claim against the evidence.
A few years ago, in those heady days when we could jet around the world with ease, I ended up at a conference in Florida. A researcher presented an impressive-looking graph supporting an astrological claim. I asked why there was no p-value. The response came as a shock: “P-values are overdone, and no one uses them these days.”
P-values are based on probabilities and, on their own, can be a problem (more on that later). But rejecting them outright is not the solution. At Correlation, the Astrological Association Journal of Research in Science, we publish both qualitative and quantitative research. For the latter, the p-value is critical: it is the one statistic that quantifies how unlikely a result would be if chance alone were at work, and therefore how seriously the evidence for a claim should be taken.
The p-value is the probability of obtaining a result at least as extreme as the one observed if chance alone were at work (the null hypothesis). In the social sciences, a p-value of .05 or below is generally taken as the threshold for calling a result significant, that is, unlikely to be a fluke. At this level (p = .05), a result this extreme would be expected by chance alone only about once in 20 trials, so a significant finding is conventionally read as evidence that something more than chance, such as a meaningful correlation with an astrological claim, may be at work.
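As a minimal sketch of what this means in practice, here is a hypothetical guessing test (not from the article): the p-value is the probability, under chance alone, of a score at least as extreme as the one observed.

```python
# A one-tailed exact binomial p-value: the chance of scoring at least
# this well if only luck (the null hypothesis) is at work.
from math import comb

def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """P(X >= successes) when each trial succeeds with probability `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical example: 60 correct guesses out of 100 at 50:50 odds.
p = binomial_p_value(60, 100)
print(f"p = {p:.4f}")  # below the conventional .05 threshold
```

The numbers are invented for illustration; the point is only that a p-value is a tail probability computed under the assumption of chance.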
The p-value represents the best protection against random artifacts that are inevitable given the high combinatorial complexity of astrology. A p-value answers key questions: Can the apparent coincidence between the results and astrological claims be dismissed as chance? Or, if the result is close to significance, could a larger sample add sufficient statistical power to confirm or reject the null hypothesis? Given the controversial nature of astrology, this measure will be a critic’s first checkpoint on the list.
For astrological research, the p-value remains our most important starting metric.
More importantly, statistical significance is the gateway and not the destination. As research into astrology advances and results with significant correlations accumulate, we can start to analyze the size of its apparent effects. A p-value does not answer the question “Is this significant correlation meaningful and strong?” On a practical level, a consultant astrologer needs to know whether this correlation is likely to manifest in personal consultations, or whether it is only evident under certain conditions, such as in combination with another reinforcing chart feature. For this we need the Effect Size (ES).
The Effect Size is a quantitative measure of the strength of a phenomenon. In astrology, this will mainly be the difference between the observed group and the control group or the expected values. The key metrics (effect size, sample size, and p-value) are all interrelated. The problem with evaluating a study, or comparing different studies, by p-value alone is that a large sample with a weak effect can yield the same p-value as a small sample with a strong effect. Astrology studies typically pair a small sample with a small effect size. The ES enables comparisons among studies of different sizes.
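A small sketch, with numbers invented for the purpose, shows why p-values alone cannot compare studies: two very different effects can land on exactly the same p-value once sample size is factored in.

```python
# Two hypothetical studies tested against a 50% chance rate with a
# one-tailed z-test (normal approximation to the binomial).
from math import sqrt, erfc

def one_tailed_p(hits: int, n: int, chance: float = 0.5) -> float:
    """Approximate P-value for observing a hit rate above chance."""
    observed = hits / n
    se = sqrt(chance * (1 - chance) / n)      # standard error under the null
    z = (observed - chance) / se
    return 0.5 * erfc(z / sqrt(2))            # upper-tail normal probability

# Large sample, weak effect:  51% hits in 10,000 trials (+1 point over chance)
# Small sample, strong effect: 60% hits in 100 trials  (+10 points over chance)
print(one_tailed_p(5100, 10_000))  # ≈ .023
print(one_tailed_p(60, 100))       # ≈ .023 — same p, ten times the effect
```

Both studies are “significant” at the same level, yet only the second describes an effect large enough to matter in an individual chart reading; that is the gap the Effect Size fills.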
While many astrology researchers are adept at p-value calculations, the Effect Size presents a challenge. The methods of calculating ES vary with different scales, and online documentation does not address some of the unique tests that apply to our field. Nevertheless, the metrics that accompany p-values (such as ES) not only make the findings more informative, they also help to verify the methods.
At the moment, science is entangled in a “reproducibility crisis.”
Stanford professor John Ioannidis (2005) argued, on the basis of simulations, that most published research findings are false: most are either not replicated or cannot be replicated. This predicament is most prevalent in medicine and in the social sciences, notably psychology. Science students, lecturers, and funded researchers are often under pressure to demonstrate statistical significance; failure to reach this artificial gold standard diminishes their academic prospects and funding.
A significance threshold (alpha) of .05 means that 1 in 20 tests is likely to be significant by chance alone. This leaves plenty of scope for tweaking results to obtain significance, mainly through what is known as p-hacking: re-analyzing data in many different ways until it yields a targeted result. Astrology is vulnerable to this because it has many variables and many techniques. For example, p-hacking can occur through the mixed use of multiple celestial points (such as asteroids, minor planets, nodes and fixed stars) or through inventing new systems. To counter it, tests that do not follow the principle of parsimony (the simplest explanation is usually best), or that do not assess established claims, need greater scrutiny.
On the other hand, astrology researchers are not compromised by academic or commercial pressure. No one has an endless supply of fruit flies to sample until the results fit. Even in the era of Big Data, large samples of homogeneous groups with full birth data are increasingly rare under data protection laws.
Most p-hacking results in false positives, known as Type I errors, but sometimes samples can be manipulated in ways that create false negatives, known as Type II errors. For decades, p-hacking has been used by some critics of astrology to debunk experiments that show support for astrology when all rational criticism has failed. In what is also known as “data butchery,” samples are sliced and diced into units small enough to push the p-value into statistical insignificance.
If done unintentionally, it is a poor use of statistics since it goes against the first goal of research, which is to generate measurable and testable data. If it is done intentionally, then any attempt to ‘divide and discredit’ is an unethical cover-up.
The Carlson Experiment (1985) is still rated as the most famous test that falsifies astrology.
Yet, there is no legitimate explanation as to why Carlson’s results from one test were split into three smaller samples using results from a different test. In reviewing the test, I accounted for this as a sampling error. But given that there was a four-year gap between the experiment and publication it is hard to rule out the possibility that the significant results (for astrology) were deliberately disguised by p-hacking.
In Dean’s critical studies of Extraversion and Neuroticism (1981-86), the original sample of 1,198 participants was reduced to a set of 288 subjects (24 percent) with extreme personality scores. These were sub-divided into eight blocks of 36 subjects. It was only when the small samples were re-combined that a pattern correlating with the four astrological elements could be measured to a significant level.
To illustrate how sample size affects the significance of an outcome, consider a simple example. When you toss a coin, you have a 50:50 chance of guessing heads or tails correctly. If you make ten successive tosses and guess correctly seven times, you might think you are doing very well, or even have some super guessing ability, because you scored above 50%. In fact, with only 10 tosses, there is about a one in six chance of guessing at least seven correctly purely by chance, which is not statistically significant. However, if you increase the number of tosses to 100 and guess 70 correctly (still 70 percent), the odds of scoring at least that well by chance are roughly one in 25,000. Such a remote possibility cannot reasonably be put down to chance.
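The coin-toss figures above can be checked exactly with a short binomial-tail calculation (standard library only):

```python
# Exact probability of guessing at least k of n fair 50:50 tosses.
from math import comb

def p_at_least(k: int, n: int) -> float:
    """P(at least k correct guesses in n tosses of a fair coin)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(1 / p_at_least(7, 10))    # ≈ 5.8: about a 1 in 6 chance
print(1 / p_at_least(70, 100))  # roughly 1 in 25,000
```

The same 70% hit rate goes from unremarkable to overwhelming purely because the sample grew tenfold.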
If there are no obvious flaws in this highly significant result, a critic may attempt to debunk it by cutting down the number of coin tosses (the sample size). This is done by dividing the results into many smaller samples and/or by eliminating most of the coin tosses using incidental criteria.
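As a sketch of how such slicing works (using the hypothetical coin-toss figures above, not any real study data): divide the 70-correct-out-of-100 result into ten blocks of ten tosses, seven correct in each. No single block is significant on its own, even though the combined result overwhelmingly is.

```python
# 'Data butchery' in miniature: a highly significant whole sample
# versus the same data sliced into ten insignificant blocks.
from math import comb

def p_at_least(k: int, n: int) -> float:
    """P(at least k correct guesses in n tosses of a fair coin)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

whole = p_at_least(70, 100)  # combined sample
block = p_at_least(7, 10)    # each sliced block of ten
print(f"whole sample: p = {whole:.5f} (significant)")
print(f"each block:   p = {block:.2f} (not significant)")
```

Reporting only the blocks would let a reviewer truthfully claim that no significant result was found anywhere.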
Nowadays, data butchering is just as rife.
In Tests of Astrology, the authors (Geoffrey Dean et al.) reviewed Paul Westran’s 2005 study of natal and progressed aspects within the charts of 1,300 couples. What the authors do not report is the staggering p-value of Westran’s key finding, which was based on seven major and minor aspects between the Sun and Venus in the compared natal and progressed birth charts of these couples. The Sun and Venus formed positive or challenging aspects (angles) far more often than expected by chance at the start or end of romantic relationships or marriages. The odds against this result being random were 244 billion to one.
Even broken down into 28 smaller samples, the statistical result was still significant. Without pausing to acknowledge the remarkable p-value from their first reductionist test, the undaunted critical reviewers sliced the sample up further. First, they removed almost two-thirds of the couples. The remaining 447 pairs were then sub-divided into 56 tiny samples for no valid reason. Instead of simplifying the data, the reviewers then added another 56 tiny samples by including Sun-Sun and Venus-Venus interactions, which were not part of Westran’s hypothesis. Not surprisingly, with the data divided into 112 small samples (some with frequencies as low as 3), the reviewers concluded that the previous emphasis on 0-degree, 120-degree and 180-degree aspects between the couples’ progressed and natal Sun and Venus placements “disappeared.”
A follow-up peer-reviewed study was published recently in Correlation. By replicating his results in the second study, Westran confirmed that the conclusion reached by Dean and his associates was misleading: a clever ruse that broke faith with a researcher who had trusted and cooperated with the authors. But the desperation of this attempted debunking confirms that the research astrologer has come up with a compelling result.
Westran was methodical in collecting data, and all couples included at least one notable partner in the public domain. Anyone can check birth details with biographies published online. In decisions about inclusion or exclusion (for example, due to uncertainty of the birth data), Westran appears to follow consistent and logical rules. Overall, we are impressed by his diligence, transparency, and authenticity.
Editor’s Note: This article first appeared in Correlation magazine and is republished here with the publication’s permission.