Admittedly not a statistician, but I think the article is missing the point. The reason people circle p-values is that nobody actually cares about the thing a p-value measures; what they care about is whether the null hypothesis is true or some other hypothesis is true. You can wave your hands about how, when you said a result was significant, you were really making a technical statement about a hypothetical world where the null hypothesis is factually true, so it's unfair to circle your p-value because that technical statement still holds. That's not a good argument against p-value circling; it merely demonstrates that the technical definition of a p-value is not relevant to the real world.
The fact remains that for things which are claimed to be true but later turn out not to be, the p-values reported in the paper are very often near the significance threshold. Not so much for things which are obviously and strongly true. This is direct evidence of something we already know: nobody cares about p-values per se, they only use them to communicate that something is true or false in the real world, and the technical claim of "well, maybe x or y is true, but when I said p=0.049 I was only talking about a hypothetical world where x is true, and my statement about that world still holds" is no solace.
This is an interesting post but the author’s usage of Lindley’s paradox seems to be unrelated to the Lindley’s paradox I’m familiar with:
> If we raise the power even further, we get to “Lindley’s paradox”, the fact that p-values in this bin can be less likely than they are under the null.
Lindley’s paradox as I know it (and as described by Wikipedia [1]) is about the potential for arbitrarily large disagreements between frequentist and Bayesian analyses of the same data. In particular, you can have an arbitrarily small p-value (p < epsilon) from the frequentist analysis while at the same time having arbitrarily large posterior probabilities for the null hypothesis model (P(M_0|X) > 1-epsilon) from the Bayesian analysis of the same data, without any particularly funky priors or anything like that.
I don’t see any relationship to the phenomenon given the name of Lindley’s paradox in the blog post.
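The classic version can be reproduced in a few lines. This is a sketch using the figures from the Wikipedia article cited above (49,581 boys out of 98,451 births, H0: theta = 0.5, equal prior odds, uniform prior on theta under H1); nothing here comes from the blog post itself:

```python
import math

# Assumed figures from Wikipedia's running example for Lindley's paradox:
# 49,581 boys out of 98,451 births; H0: theta = 0.5 exactly.
n, x = 98451, 49581

def norm_cdf(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Frequentist side: two-sided p-value via the normal approximation.
z = (x - n / 2) / math.sqrt(n / 4)
p_value = 2 * (1 - norm_cdf(z))

# Bayesian side: P(H0) = P(H1) = 0.5, theta ~ Uniform(0,1) under H1.
# Marginal likelihood under H0 (point mass at 0.5), in log space:
log_m0 = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
          + n * math.log(0.5))
# Under H1: integral of C(n,x) theta^x (1-theta)^(n-x) dtheta = 1/(n+1).
log_m1 = -math.log(n + 1)

posterior_h0 = 1 / (1 + math.exp(log_m1 - log_m0))

print(f"p-value ~ {p_value:.4f}")            # small enough to reject at 5%
print(f"P(H0 | data) ~ {posterior_h0:.3f}")  # yet the posterior favours H0
```

Same data, frequentist rejection at the 5% level alongside a posterior of roughly 0.95 for the null, with no exotic priors involved.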
Ultimately I think the paradox comes from mixing two paradigms that aren't really designed to be mixed.
That said, you can give a Bayesian argument for p-circling provided you have a prior on the power of the test. The details are almost impossible to work out except by case-by-case calculation because, unless I'm mistaken, the shape of the p-value distribution when the null hypothesis does not hold is very ill-defined.
However, it's quite possible to give examples where, intuitively, a p-value just below 0.05 would be highly suspicious. You just need to mix high-powered tests with unclear results. Say, for example, you're testing the existence of gravity with various objects and you get p = 0.04 against the null hypothesis that objects simply stay in the air indefinitely.
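A rough sketch of that intuition (the assumed mean mu = 5 for the test statistic under the alternative is just a stand-in for a very high-powered test, not anyone's actual analysis): under the null, p-values are uniform, so landing in [0.04, 0.05] has probability exactly 0.01; under a high-powered alternative it is far rarer.

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2))

# One-sided z-test. z values whose one-sided p-values are 0.05 and 0.04:
z_lo, z_hi = 1.6449, 1.7507
mu = 5.0  # assumed mean of the z statistic under the alternative (~99.96% power)

# Under H0 the p-value is uniform on [0,1]:
prob_null = 0.05 - 0.04
# Under the alternative, p in [0.04, 0.05] means z in [z_lo, z_hi]:
prob_alt = norm_cdf(z_hi - mu) - norm_cdf(z_lo - mu)

print(f"P(0.04 <= p <= 0.05 | H0) = {prob_null:.4f}")
print(f"P(0.04 <= p <= 0.05 | H1) = {prob_alt:.5f}")
```

With power this high, a p-value just under 0.05 is dozens of times more probable under the null than under the alternative, which is exactly why it should look suspicious.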
> One could specify a smallest effect size of interest and compare the plausibility of seeing the reported p-value under that distribution compared to the null distribution. Maier and Lakens (2022) suggest you could do this exercise when planning a test in order to justify your choice of alpha-level
Huh, I’d never thought to do that before. You pretty much have to choose a smallest effect size of interest to do a power analysis in the first place, to figure out how many samples to collect, so this is a neat next step to then base the significance level on.
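For reference, that first step is the standard sample-size calculation. A sketch with assumed inputs (smallest standardized effect of interest d = 0.3, two-sided alpha = 0.05, power = 0.80, two-sample z approximation):

```python
import math

# Assumed design inputs for a two-sample comparison of means:
z_alpha = 1.95996  # Phi^-1(0.975), two-sided alpha = 0.05
z_beta = 0.84162   # Phi^-1(0.80), i.e. 80% power
d = 0.3            # smallest standardized effect size of interest

# Normal-approximation formula: n per group = 2 * ((z_alpha + z_beta) / d)^2
n_per_group = math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)
print(n_per_group)  # 175
```

The d you plug in here is exactly the smallest effect size of interest the quoted suggestion would then reuse to justify the alpha level.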
In a perfect world everybody would be putting careful thought into their desired (acceptable) type I and type II error rates as part of the experimental design process before they ever collected any data.
Given rampant incentive misalignments (the goal in academic research is often to publish something as much as—or more than—to discover truth), having fixed significance levels as standards across whole fields may be superior in practice.
The real problem is that you very often don't have any idea about what your data are going to look like before you collect them; type 1/2 errors depend a lot on how big the sources of variance in your data are. Even a really simple case -- e.g. do students randomly assigned to AM vs PM sessions of a class score better on exams? -- has a lot of unknown parameters: variance of exam scores, variance in baseline student ability, variance of rate of change in score across the semester, can you approximate scores as gaussian or do you need beta, ordinal, or some other model, etc.
Usually you have to go collect data first, then analyze it, then (in an ideal world where science is well-incentivized) replicate your own analysis in a second wave of data collection doing everything exactly the same. Psychology has actually gotten to a point where this is mostly how it works; many other fields have not.
I read the page on Lindley's paradox, and it's astonishing bullshit. It's well known that with sufficiently insane priors you can come up with stupid conclusions. The page asserts that a Bayesian would accept as reasonable priors that it's equally likely that the probability of a child being born male is precisely 0.5 as it is that it has some other value, and also that if it has some other value, all values in the interval from zero to one are equally likely. But nobody on God's green earth would accept those as reasonable values, least of all a Bayesian. A Bayesian would say there's zero chance of it being precisely 0.5, but it is almost certainly really close to 0.5, just like a normal human being would.
A few points because I actually think Lindley’s paradox is really important and underappreciated.
(1) You can get the same effect with a prior distribution concentrated around a point instead of a point prior. The null hypothesis prior being a point prior is not what causes Lindley’s paradox.
(2) Point priors aren’t intrinsically nonsensical. I suspect that you might accept a point prior for an ESP effect, for example (maybe not—I know one prominent statistician who believes ESP is real).
(3) The prior probability assigned to each of the two models also doesn’t really matter; Lindley’s paradox arises from the marginal likelihoods (which depend on the priors for parameters within each model but not the prior probability of each model).
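Point (1) can be checked numerically. A sketch with assumed numbers (the birth-ratio data from the Wikipedia article, a "near-null" model theta ~ N(0.5, 0.005^2) instead of a point null, a vague Uniform(0,1) alternative, and a normal approximation to the likelihood):

```python
import math

# Assumed data: 49,581 boys out of 98,451 births (Wikipedia's example).
n, x = 98451, 49581
phat = x / n
s2 = phat * (1 - phat) / n  # sampling variance of phat

def norm_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Near-null model: theta ~ N(0.5, tau^2) with a small but nonzero tau,
# i.e. a prior concentrated around a point rather than a point mass.
tau = 0.005
# Normal approximation: phat | theta ~ N(theta, s2), so marginally
# phat ~ N(0.5, s2 + tau^2) under the near-null model; the vague
# model's marginal density of phat is ~1 on (0,1).
m_near_null = norm_pdf(phat, 0.5, s2 + tau ** 2)
m_vague = 1.0

bayes_factor = m_near_null / m_vague
print(f"Bayes factor for the near-null model ~ {bayes_factor:.0f}")
```

The marginal likelihoods still favor the near-null model by a large factor on the same data that give p ~ 0.02, so the paradox does not hinge on the null prior being an exact point mass.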
[1] https://en.wikipedia.org/wiki/Lindley%27s_paradox