I am a founder of a company managing 500k endpoints of different kinds of tracking hardware. It is funny how something that happens less often than once a month or year, like a bogus sample of data or a lost event, gains significance when you have a client with just 200 endpoints who is actually interested in the service.
They just find the mistakes: people come saying that the report for this guy was missing this and that, the commissions are not precise, etc. Or they just get a bogus alert sometimes.
We could just shrug, but this erodes trust in the whole service and the system itself. So we worked a lot to never have these. We have about 25 people in customer service, and I can tell you it could easily have been 50 people otherwise.
The interesting part is how we had to kind of break people's common sense about what is a significant problem and what is not. The "this almost never happens, I can assure you" from new colleagues was a good source of laughter after some time.
> "I've become increasingly convinced that the idea of averaging is one of the biggest obstacles to understanding things... it contains the insidious trap of feeling/sounding "rigorous" and "quantitative" while making huge assumptions that are extremely inappropriate for most real world situations."
Semi-related, I found UniverseHacker's take[0] on the myth of averaging apt with regard to leaning too heavily on sample means. Moving beyond p-values and inter/intra-group averages, there's fortunately a world of JASP[1].
[0]: https://news.ycombinator.com/item?id=41631448
[1]: https://jasp-stats.org/2023/05/30/jasp-0-17-2-blog/
- DON'T is very clear and specific. Don't say "Stat-Sig", don't conclude causal effect, don't conclude anything based on p>0.05.
- DO is very vague and unclear. Do be thoughtful, do accept uncertainty, do consider all relevant information.
Obviously, thoughtful consideration of all available information is ideal. But until I get another heuristic for "should I dig into this more?" - I'm just gonna live with my 5-10% FPR, thank you very much.
Why do you need a heuristic? In what areas are you doing research where you don't have any other intuition or domain knowledge to draw on?
And if you don't have that background, contextual knowledge, are you the right person to be doing the work? Are you asking the right questions?
bioinformatician here. nobody has intuition or domain knowledge on all ~20,000 protein-coding genes in the human body. That's just not a thing. When routinely comparing what a treatment does, we actually do get 20,000 p-values. Feed that into FDR correction, filter for p < 0.01. Now I have maybe 200 genes. Then we can start applying domain knowledge. If you start trying to apply domain knowledge at the beginning, you're actually going to artificially constrict what is biologically possible. Your domain knowledge might say, well, there's no reason an olfactory gene should be involved in cancer, so I will exclude these (etc. etc.). You would be instantly wrong. People discovered that macrophages (which play a large role in cancer) can express olfactory receptors. So when I had olfactory receptors coming up in a recent analysis... the p-values were onto something and I had to expand my domain knowledge to understand. This is very common. I ask for validation of targets in tissue --> then you see proof borne out that the p-value thresholding business WORKS.
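Roughly what that screen-then-interpret loop looks like in code, with simulated expression values standing in for a real experiment (the Benjamini-Hochberg step is written out by hand so the thresholding is visible; all numbers are invented):

```python
# Screen ~20,000 genes first, then hand the FDR-controlled shortlist to
# domain experts. Everything here is simulated, not real data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_per_group = 20_000, 8

control = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated[:300] += 2.0          # pretend 300 genes truly respond to treatment

pvals = ttest_ind(treated, control, axis=1).pvalue

def benjamini_hochberg(p, q=0.01):
    """Boolean mask of discoveries with the FDR controlled at level q."""
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    keep = np.zeros(m, dtype=bool)
    keep[order[:k]] = True
    return keep

shortlist = benjamini_hochberg(pvals, q=0.01)
print(f"{shortlist.sum()} genes survive FDR < 0.01 and go on to domain review")
```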
That's the domain knowledge. p-values are useful, not the fixed cut-off. You know that in your research field p < 0.01 has importance.
> You know that in your research field p < 0.01 has importance.
A p-value does not measure "importance" (or relevance), and its meaning is not dependent on the research field or domain knowledge: it mostly just depends on effect size and number of replicates (and, in this case, due to the need to apply multiple comparison correction for effective FDR control, it depends on the number of things you are testing).
If you take any fixed effect size (no matter how small/non-important or large/important, as long as it is nonzero), you can make the p-value be arbitrarily small by just taking a sufficiently high number of samples (i.e., replicates). Thus, the p-value does not measure effect importance, it (roughly) measures whether you have enough information to be able to confidently claim that the effect is not exactly zero.
Example: you have a drug that reduces people's body weight by 0.00001% (clearly, an irrelevant/non-important effect, according to my domain knowledge of "people's expectations when they take a weight loss drug"); still, if you collect enough samples (i.e., take the weight of enough people who took the drug and of people who took a placebo, before and after), you can get a p-value as low as you want (0.05, 0.01, 0.001, etc.), mathematically speaking (i.e., as long as you can take an arbitrarily high number of samples). Thus, the p-value clearly can't be measuring the importance of the effect, if you can make it arbitrarily low by just having more measurements (assuming a fixed effect size/importance).
What is research field (or domain knowledge) dependent is the "relevance" of the effect (i.e., the effect size), which is what people should be focusing on anyway ("how big is the effect and how certain am I about its scale?"), rather than p-values (a statement about a hypothetical universe in which we assume the null to be true).
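A quick simulation of that point, assuming a fixed 0.01 mean shift on a unit-variance outcome (all numbers invented):

```python
# The true effect is fixed and tiny; only the sample size changes.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_effect = 0.01            # fixed, clinically irrelevant mean shift

for n in (100, 10_000, 1_000_000):
    placebo = rng.normal(0.0, 1.0, n)
    drug = rng.normal(true_effect, 1.0, n)
    print(f"n per arm = {n:>9,}   p = {ttest_ind(drug, placebo).pvalue:.3g}")
# The estimated effect hovers around 0.01 in every run; the p-value, by
# contrast, eventually drops below any fixed threshold once n is large enough.
```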
Yeah, it seems like those bullet points have the problem that they don't really contain actionable information.
Here's the way I'd put things - correlation by itself doesn't establish causation at all. You need correlation plus a plausible model of the world to have a chance.
Now science, at its best, involves building up these plausible models, so a scientist creates an extra little piece of the puzzle and has to be careful that the piece is also a plausible fit.
The problem you hit is that the ruthless sink-or-swim atmosphere, previous bad science and fields that have little merit make it easy to be in the "just correlation" category. And whether you're doing a p test or something else doesn't matter.
A way to put it is that a scientist has to care about the truth in order to put together all the pieces of models and data in their field.
So the problem is ultimately institutional.
I think the main issue with "p < 0.05" can be summed up as people using it to "prove" a phenomenon exists rather than as a pragmatic cutoff to screen for interesting things to investigate further.
Not publishing results with p >= 0.05 is the reason p-values aren't that useful. This is how you get the replication crisis in psychology.
The p-value cutoff of 0.05 just means "an effect this large, or larger, should happen by chance 1 time out of 20". So if 19 failed experiments don't publish and the 1 successful one does, all you've got are spurious results. But you have no way to know that, because you don't see the 19 failed experiments.
This is the unresolved methodological problem in empirical sciences that deal with weak effects.
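Here's a toy version of that file-drawer effect, assuming a true effect of exactly zero and 20 labs running the same experiment (numbers are made up):

```python
# 20 labs test the same nonexistent effect; only "significant" results
# get written up, so the published record is pure noise.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n_labs, n_per_arm = 20, 30

published = []
for lab in range(n_labs):
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(0.0, 1.0, n_per_arm)   # true effect is exactly zero
    p = ttest_ind(treated, control).pvalue
    if p < 0.05:                                # the only ones that "count"
        published.append((lab, round(p, 3)))

print(f"{len(published)} of {n_labs} null experiments cleared p < 0.05:")
print(published)   # on average about 1 in 20; readers never see the rest
```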
> "an effect this large, or larger, should happen by chance 1 time out of 20"
More like "an effect this large, or larger, should happen by chance 1 time out of 20 in the hypothetical universe where we already know that the true size of the effect is zero".
Part of the problem with p-values is that most people can't even parse what they mean (not saying it's your case). P-values are never a statement about probabilities in the real world, but always a statement about probabilities in a hypothetical world where we assume all effects are zero.
"Effect sizes", on the other hand, are more directly meaningful and more likely to be correctly interpreted by people on general, particularly if they have the relevant domain knowledge.
(Otherwise, I 100% agree with the rest of your comment.)
Publishing only significant results is a terrible idea in the first place. Publishing should be based on how interesting the design of the experiment was, not how interesting the result was.
p-value doesn't measure interestingness directly of course, but I think people generally find nonsignificant results uninteresting because they think the result is not difficult to explain by the definitionally-uninteresting "null hypothesis".
My point was basically that the reputation / career / etc. of the experimenter should be mostly independent of the study results. Otherwise you get bad incentives. Obviously we have limited ability to do this in practice, but at least we could fix the way journals decide what to publish.
Both of those statements are false. Everything has a result. And the p-value is very literally a quantified measure of how interesting a result was. That's the only thing it purports to measure.
"Woman gives birth to fish" is interesting because it has a p-value of zero: under the null hypothesis ("no supernatural effects"), a woman can never give birth to a fish.
I ate cheese yesterday and a celebrity died today: P >> 0.05. There is no result and you can't say anything about whether my cheese eating causes or prevents celebrity deaths. You confuse hypothesis testing with P-values.
The result is "a celebrity died today". This result is uninteresting because, according to you, celebrities die much more often than one per twenty days.
I suggest reading your comments before you post them.
In THEORY yes, but in practice, there are not a ton of journals, I think, that will actually publish well-done research that does not come to some interesting conclusion and find some p < .05. So....
Plenty of journals do, just mostly in fields that don't emphasize P-values. Chemistry and materials science tend to focus on the raw data in the form of having the instrument output included, and an interpretation in the results section.
The peaks in your spectra, the calculation results, or the microscopy image either support your findings or they don't, so P-values don't get as much mileage. I can't remember the last time I saw a P-value in one of those papers.
This does create a problem similar to publishing null-result P-values, however: if a reaction or method doesn't work out, journals don't want it because it's not exciting. So much money is likely being wasted independently duplicating failed reactions over and over because it just never gets published.
This is a damning criticism of the most common techniques used by many scientists. The answer isn't to shrug and keep doing the same thing as before.
The 0.05 threshold is indeed arbitrary, but the scientific method is sound.
A good researcher describes their study, shows their data and lays out their own conclusions. There is just no need (nor possibility) of a predefined recipe to reduce the study result to a "yes" or a "no".
Research is about increasing knowledge; marketing is about labelling.
Agreed. A single published paper is not science, a tree data structure of published papers that all build off of each other is science.
Right, the decisions made by future researchers about what to base their work on are the real evaluation, hence citation counts as a core metric. It's easy to claim positive results by making your null hypothesis dogshit (and choice of "p" is easily the least inspired way to sabotage a null hypothesis), but researchers learn this game early and tend to not gamble their time following papers where they suspect this is what's going on. The whole thing kinda works, in a world where the alternatives don't work at all.
Sounds good but is that true? A single unreplicated paper could be science couldn't it? Science is a framework within which there are many things, including theories, mistakes, false negatives, replication failures, etc... Science progresses due to quantity more than quality, it is brute force in some sense that way, but it is more a journey than a destination. You "do" science moreso than you "have" science.
A single brick on the ground, all by itself, is not a wall.
But if you take a lot of bricks and arrange them appropriately, then every single one of those bricks is wall.
In other words, just like the article points out down in the "dos" section, it depends on how you're treating that single unreplicated paper. Are you cherry-picking it, looking at it in isolation, and treating it as if it were definitive all by itself? Or are you considering it within a broader context of prior and related work, and thinking carefully about the strengths, limitations, and possible lacunae of the work it represents?
Only scientists care about doing science. Most people are not scientists. Even scientists are not scientists in every field. We as the general population (including scientists in a different field) however care about science because of the results. The results of science are modern health care, engineering (bridges that don't collapse...), and many other such things that we get because we "have" science.
I think you and the OP are agreeing with each other. The issue with a "single unreplicated paper" is exactly the issue you bring up with science as a journey. It's possible that this paper has found a genuine finding or that it is nonsense (people can find isolated published papers supporting almost anything they want even if they don't reflect the scientific consensus), but if no other researchers are even bothering to replicate the findings in it, it hasn't joined the journey.
If a new paper with an outrageous claim pops up, people are automatically suspicious. Until it’s been reproduced by a few labs, it’s just “interesting”.
Then, once it’s been validated and new science is built off of it, it really is accepted as foundational.
As a scientist, I don’t think there is any specific scientific method or protocol - other than something really general like “think of all the ways people have been deceived in the past and carefully avoid them.” Almost no modern research follows anything like the “scientific method” I was taught in public school.
The way I do research is roughly Bayesian- I try to see what the aggregate of published experiments, anecdotes, intuition, etc. suggests are likely explanations for a phenomenon. Then I try to identify what realistic experiment is likely to provide the most evidence distinguishing between the top possibilities. There are usually many theories or hypotheses in play, and none are ever formally confirmed or rejected- only seen as more or less likely in the light of new evidence.
> A good researcher describes their study, shows their data and lays their own conclusions.
Tangent: I think that this attitude of scientific study can be applied to journalism to create a mode of articles between "neutral" reports and editorials. In the in-between mode, journalists can and should present their evidence without sharing their own conclusions, and then they should present their first-order conclusions (e.g. what the author personally thinks that this data says about reality) in the same article even if their conclusions are opinionated, but should refrain from second-order opinions (e.g. about what the audience should feel or do).
> The 0.05 threshold is indeed arbitrary, but the scientific method is sound.
I guess it depends on what you're referring to as the "scientific method". As the article indicates, a whole lot of uses of p-values in the field - including in many scientific papers - actually invoke statistics in invalid or fallacious ways.
No quotes needed, scientific method is well defined: https://en.wikipedia.org/wiki/Scientific_method
> The scientific method is sound != every experiment that claims to use the scientific method is sound
Sure, which is why I asked OP to define what they meant by "scientific method". The statement doesn't mean a whole lot if we're defining "scientific method" in a way that excludes 99% of scientific work that's actually produced.
People need tools to filter results. Using a somewhat arbitrary cutoff for what to work with is actually fine because people need to make decisions. Further, papers that report false positives do not tend to lead to huge branches of successful science because over time the findings do not replicate.
But I am curious about something else. I am not a statistical mechanics person, but my understanding of information theory is that something actually refined emerges with a threshold (assuming it operates on SOME real signal) and the energy required to provide that threshold is important to allow "lower entropy" systems to emerge. Isn't this the whole principle behind Maxwell's Demon? That if you could open a little door between two equal temperature gas canisters you could perfectly separate the faster and slower gas molecules and paradoxically increase the temperature difference? But to only open the door for fast molecules (thresholding them) the little door would require energy (so it is no free lunch)? And that effectively acts as a threshold on the continuous distributions? I guess what I am asking is that isn't there a fundamental importance to thresholds in generating information? Isn't that how neurons work? Isn't that how AI models work?
Skimming through the other articles in that special issue, I thought the most practical advice was in Abandon Statistical Significance[1], which discusses the practicalities of not treating p-values as a threshold.
[1] https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1...
In case you want your own PDF of the paper.
Questions like how many samples to collect, what methods to fit the uncertain data etc.
https://allendowney.github.io/ModSimPy/
Even if the p < .05 threshold is removed, we’re still left with the important threshold of getting published or not getting published. Judging a paper by the quality of its statistical method won’t be enough to sort the publishable from the unpublishable in top journals. As a result, low p-values will continue to be favored and p-hacking will continue. To me it seems like a consequence of scarcity of academic positions and the tenure system.
Seems to me a more reasonable threshold would be at most 10^-9 probability of the conclusion being false (in the Bayesian sense of course), with the prior chosen by the editor and reviewers (and the models being selected upon also being agreeable to them).
As someone who has studied genetics on my own for the last twenty years I am very glad to read this editorial.
For example, take a population of 100 people, and let us say one of them has gene changes in their Fatty Acid Desaturase genes (FADS1 and FADS2) that change how important long-chain omega-3 fatty acids (like those from fish) are for them. This happens more often in people from indigenous arctic populations.
https://www.sciencedirect.com/science/article/pii/S000291652...
So the researcher tests whether omega-3 affects cardiovascular outcomes in these hundred people by adding a lot more fish oil to the diet of these 100 people. Since only one of them really needs it, the P value will be insignificant and everyone will say fish oil does nothing. Yet for that one person it was literally everything.
This is talked about only quietly in research, but I think the wider population needs to understand this to know how useless p < 0.05 is when testing nutritional effects in genetically diverse populations.
Interpreting Clinical Trials With Omega-3 Supplements in the Context of Ancestry and FADS Genetic Variation https://www.frontiersin.org/journals/nutrition/articles/10.3...
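A rough simulation of that setup, assuming a single responder out of 100 with a large benefit and no effect for everyone else (purely illustrative numbers):

```python
# One FADS-variant carrier benefits enormously; the other 99 don't.
# A group-level test on the change in some risk marker sees almost nothing.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n = 100

placebo = rng.normal(0.0, 1.0, n)       # change in risk marker, placebo arm
fish_oil = rng.normal(0.0, 1.0, n)      # fish-oil arm, no effect for 99 people
fish_oil[0] -= 4.0                      # the single person who truly responds

p = ttest_ind(fish_oil, placebo).pvalue
print(f"group-level p = {p:.2f}")                    # typically > 0.05
print(f"mean change in fish-oil arm  = {fish_oil.mean():+.2f}")
print(f"change for the one responder = {fish_oil[0]:+.2f}")
```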
Isn't that just a bad study? You have confounding factors - such as ethnicity - that weren't controlled/considered/eliminated.
I do get what you're saying, if you miss something in the study that is important, but I don't see how this is a case to drop the value of statistical significance?
In medicine, it is essentially impossible to control for all possible factors. Case in point, ethnicity is not biologically realized either; it's a social tool we use to characterize broad swathes of phenotypic [and sociocultural] differences that are more likely (but not guaranteed) to occur in certain populations. But the example provided of indigenous arctic people is itself imprecise. You can't control for that, not without genetic testing - and even then, that presupposes we've characterized the confounding factors genetically, and that the confounding factors are indeed genetic in origin at all.
Put another way, the population is simply too variable to attempt to eliminate all confounding factors. We can, at best, eliminate some of the ones we know about, and acknowledge the ones we can't.
What does this mean? Is it contrary to what OP is saying above?
Not exactly. What I mean to say is this: We know there are certain phenotypes that predominantly appear in certain populations, in broad strokes. But while we do have lists of correlates, we don't have good definitions for what an "ethnicity" is biologically, and there is very good reason to believe no satisfactory definition exists.
To use OP's example, we know that the gene mentioned is frequently found in the Inuit population. But if an Inuk does not have that gene, it does not somehow make them less Inuit. We can't quantify percentage Inuitness, and doing so is logically unsound. This is because the term "Inuit" doesn't mean its biological correlates. It simply has biological correlates.
To use an example of a personal friend, slightly anonymized: My friend is an Ashkenazi Jew. There is absolutely no uncertainty about this; Jewishness is matrilineal, and their mother was an Ashkenazi Jew, and her mother before her, going back over eight documented generations of family history. But alas - their grandfather was infertile, a fact that was posthumously revealed. Their maternal grandmother had a sperm donor. The sperm donor was not an Ashkenazi Jew. Consequently, can said friend be said to be "only 75% Jewish," having missed the "necessary" genetic correlates? Of course not. By simple matrilineage they are fully an Ashkenazi Jew.
Why are these terms used in medicine, then? Because, put simply, it's the best we can do. Genetic profiling is a useful tool under some limited circumstances, and asking medical subjects their ethnicity is often useful in determining medical correlates. But there is nothing in the gene that says "I am Inuk, I am Ashkenazi," because these ideas are social first, not genetic first.
I don't disagree with this, but it is very much not consistent with "ethnicity is not biologically realized", which suffers from the same logical error but in the other direction.
I often wonder how many entrenched culture battles could be ~resolved (at least objectively) by fixing people's cognitive variable types.
In the spirit of randomization and simulation, every culture war debate should be repeated at least 200 times, each with randomly assigned definitions of “justice” and “freedom” drawn from an introductory philosophy textbook. Eating meat is wrong, p = 12/200.
As a layman who doesn't work with medical studies it always struck me that one of the bits of data that isn't (normally) collected along with everything else is genetic samples of all participants. It should be stored alongside everything else so that if the day comes when genetic testing becomes cheap enough it can be used to provide vastly greater insight into the study's results.
Even something as simple as a few strands of hair sealed in a plastic bag in a filing cabinet somewhere would be better than nothing at all.
That throws out anonymity. I don't see this getting approved, or people signing up for such studies, apart from those who don't care that their genetic data gets collected and stored.
Even if there is no name saved with the genetic sample, the bar for identification is low. The genes are even more identifying than a name after all. Worse, it contains deep information about the person.
I was trawling studies for some issues of my own and sort of independently discovered this many years ago. It's very easy for an intervention to be life-saving for 5%, pretty good for 10%, neutral for 84%, and to have some horrible effect for 1%, and that tends to average out to some combination of "not much effect", "not statistically significant", and, depending on that 1%, possibly "dangerous to everyone". (Although with the way studies are run, there's a certain baseline of "it's super dangerous" you should expect, because studies tend to run on the assumption that everything bad that happened during them was the study's fault, even though that's obviously not true. With small sample sizes this cannot be effectively "controlled away".) We need some measure that can capture this outcome and not just neuter it away, because I also found there were multiple interventions that would have this pattern of outcome. Yet they would all be individually averaged away and the "official science consensus" was basically "yup, none of these treatments 'work'", resulting in what could be a quite effective treatment plan for some percentage of the population being essentially defeated in detail [1].
[1]: https://en.wikipedia.org/wiki/Defeat_in_detail
What do you mean? They all "work". None of them work for everyone, but that doesn't mean they don't work at all. As the case I was looking at revolved around nutritional deficiencies (brought on by celiac in my case) and their effects on the heart, it is also the case that the downside of the 4 separate interventions if it was wrong was basically nil, as were the costs. What about trying a simple nutritional supplement before we slam someone on beta blockers or some other heavy-duty pharmaceutical? I'm not against the latter on principle or anything, but if there's something simpler that has effectively no downsides (or very, very well-known ones in the cases of things like vitamin K or iron), let's try those first.
I think we've lost a great deal more to this weakness in the "official" scientific study methodology than anyone realizes. On the one hand, p-hacking allows us to "see" things where they don't exist and on the other this massive, massive overuse of "averaging" allows us to blur away real, useful effects if they are only massively helpful for some people but not everybody.
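For illustration, a quick sketch of that kind of mixture (5% large benefit, 10% modest, 84% nothing, 1% harm, all numbers invented), showing how the single group average blurs it:

```python
# Heterogeneous response: the subgroups are real, the single average blurs them.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
n = 1_000

effect_map = {"big_help": -3.0, "some_help": -1.0, "neutral": 0.0, "harm": +2.0}
groups = rng.choice(list(effect_map), size=n, p=[0.05, 0.10, 0.84, 0.01])
true_effect = np.array([effect_map[g] for g in groups])

control = rng.normal(0.0, 1.0, n)
treated = true_effect + rng.normal(0.0, 1.0, n)

print(f"overall mean effect = {treated.mean() - control.mean():+.2f}   "
      f"p = {ttest_ind(treated, control).pvalue:.3g}")
for g in effect_map:
    sel = groups == g
    print(f"  {g:>9}: n = {sel.sum():4d}, mean outcome = {treated[sel].mean():+.2f}")
# Whichever side of 0.05 the overall test lands on, the single average (~ -0.2)
# says nothing about the 5% who get -3 or the 1% who get +2.
```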
Last I heard, 5 sigma was the standard for genetic studies now. p<0.05 is 1.96 sigma, 5 sigma would be p < 0.0000006.
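For reference, the sigma-to-p conversion under a two-sided normal approximation:

```python
# Two-sided tail probabilities for the sigma levels mentioned above.
from scipy.stats import norm

for sigma in (1.96, 5.0):
    print(f"{sigma} sigma -> p = {2 * norm.sf(sigma):.2g}")
# 1.96 sigma -> p = 0.05,  5 sigma -> p = 5.7e-07
```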
But even though I'm not happy with NHST (the testing paradigm you describe), in that paradigm it is a valid conclusion for the group the hypothesis was tested on. It has been known for a long, long time that you can't find small, individual effects when testing a group. You need to travel a much harder path for those.
> So the researcher tests whether omega-3 affects cardiovascular outcomes in these hundred people by adding a lot more fish oil to the diet of these 100 people. Since only one of them really needs it, the P value will be insignificant and everyone will say fish oil does nothing. Yet for that one person it was literally everything.
But... that's not a problem with the use of the p-value, because that's (quite probably) a correct conclusion about the target (unrestricted) population addressed by the study as a whole.
That's a problem with not publishing complete observations, or not reading beyond headline conclusions to come up with future research avenues. That effects which are not significant in a broad population may be significant in a narrow subset (and vice versa) are well-known truths (they are the opposites of the fallacies of division and composition, respectively.)
Yes, but this is not usually done.
The real underlying problem is that in your case, genetic variants are not accounted for. As soon as you include these crucial moderating covariates, it's absolutely possible to find true effects even for (rather) small samples (one out of a hundred is really too few for any reasonable design unless it's longitudinal).
Anything in health sciences has millions of variants not accounted for, that also interact between themselves so you'd need to account for every combination of them.
And it's usually discouraged by regulators because it can lead to p-hacking. I.e., with a good enough choice of control I can get anything down to 5%
The fundamental problem is the lack of embrace of causal inference techniques - i.e., the choice of covariates/confounders is in itself a scientific problem that needs to be handled with love.
A cool, interesting, horrible problem to have :)
It is also not easy if you have many potential covariates! Because statistically, you want a complete (explaining all effects) but parsimonious (using as few predictors as possible) model. Yet you by definition don't know the true underlying causal structure. So one needs to guess which covariates are useful. There are also no statistical tools that can, given your data, tell you whether the model sufficiently explains the causal phenomenon, because statistics cannot tell you about potentially missing confounders.
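A minimal sketch of what including the moderating covariate can buy you, assuming a randomized treatment and a known carrier genotype; the column names and effect sizes are invented, and it leans on statsmodels' formula interface:

```python
# Naive model vs. model with a treatment-by-genotype interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1_000

carrier = rng.binomial(1, 0.05, n)        # 5% carry the relevant variant
treatment = rng.binomial(1, 0.5, n)       # randomized 50/50
# Outcome improves (goes down) only for treated carriers.
outcome = rng.normal(0.0, 1.0, n) - 1.5 * treatment * carrier
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "carrier": carrier})

naive = smf.ols("outcome ~ treatment", data=df).fit()
moderated = smf.ols("outcome ~ treatment * carrier", data=df).fit()

print("naive treatment effect:",
      round(naive.params["treatment"], 2),               # diluted: ~ -1.5 * 5%
      " p =", round(naive.pvalues["treatment"], 3))
print("treatment x carrier effect:",
      round(moderated.params["treatment:carrier"], 2),   # close to the true -1.5
      " p =", round(moderated.pvalues["treatment:carrier"], 4))
```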
From a methods perspective, wouldn't this be more of a statistical power issue (too small of sample size) than a random effect issue? Granted, we do a terrible job discussing statistical power.
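For what it's worth, a crude power check by simulation (effect size, alpha, and sample sizes here are arbitrary choices):

```python
# Power = how often a two-sample t-test at alpha detects a given true effect.
import numpy as np
from scipy.stats import ttest_ind

def power(effect, n_per_arm, alpha=0.05, sims=2_000, seed=6):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n_per_arm)
        b = rng.normal(effect, 1.0, n_per_arm)
        hits += ttest_ind(b, a).pvalue < alpha
    return hits / sims

for n in (20, 80, 320):
    print(f"effect = 0.3 SD, n per arm = {n:>3}  ->  power ~ {power(0.3, n):.2f}")
# A 0.3-SD effect is mostly missed at n = 20 and reliably found at n = 320.
```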
Watching from the sidelines, I’ve always wondered why everything in the life sciences seems to assume unimodal distributions (that is, typically a normal bell curve).
Multimodal distributions are everywhere, and we are losing key insights by ignoring this. A classic example is the difference in response between men and women to a novel pharmaceutical.
It’s certainly not the case that scientists are not aware of this fact, but there seems to be a strong bias to arrange studies to fit into normal distributions by, for example, being selective about the sample population (test only on men, to avoid complicating variables). That makes pragmatic sense, but I wonder if it perpetuates an implicit bias for ignoring complexity.
For example, checking account balances are far from a normal distribution!
It’s because statistical tests are based on the distribution of the statistic, not the data itself. If the central limit holds, this distribution will be a bell curve as you say
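A small illustration of that point: the raw data below is sharply bimodal, yet the sampling distribution of its mean comes out looking normal.

```python
# The data is a 50/50 mixture of two well-separated normals (think: two
# response subgroups), but the distribution of the sample MEAN is bell-shaped.
import numpy as np

rng = np.random.default_rng(7)

def bimodal(size):
    group = rng.binomial(1, 0.5, size)
    return rng.normal(np.where(group == 1, 4.0, -4.0), 1.0, size)

means = np.array([bimodal(50).mean() for _ in range(10_000)])
counts, _ = np.histogram(means, bins=11)
print(counts)   # roughly symmetric, single-peaked around 0: the CLT at work
```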
This effect would be even more pronounced in a larger sample size. Consider how small a fraction indigenous arctic populations are of the population as a whole. In other words, larger sample sizes would be even worse off in this particular occasion.
But it is more complicated. I have Sami Heritage but it goes back to my great great grandparents. I did not know this until I started digging deeply into my ancestry, but I carry many of the polymorphisms from these people.
So although I look like a typical European Caucasian, my genetics are very untypical of that population. And this also explains my family history of heart diseases and mood disorders which are also non-typical of European Caucasians.
I agree, but then all these cheaper, easier studies are useless.
The best way to talk about this is IMO effect heterogeneity. Underlying that you have the causal DAG to consider, but that's (a) a lot of effort and (b) epistemologically difficult!
Even if you did a study with the whole planet there would be no statistical significance, since the genetic variation in FADS genes is still in the minority. (The majority of the world is warm and this is a cold weather/diet adaptation.)
In most African populations this polymorphism does not exist at all. And even in Europeans it is only about 12% of the population.
https://en.wikipedia.org/wiki/Simpson%27s_paradox
We're all just fancy monkeys with lightning rocks, it's fine dude
Confidence intervals are incredibly potent, both in describing the extent to which a phenomenon is happening and, critically, in being simple for researchers to create and for lay persons to consume.
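For example, an approximate 95% CI for a difference in means is only a few lines and reports the effect in the outcome's own units (data simulated):

```python
# Effect estimate plus an approximate 95% confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
control = rng.normal(0.0, 1.0, 200)
treated = rng.normal(0.4, 1.0, 200)      # true effect: +0.4 units

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
half_width = stats.norm.ppf(0.975) * se  # normal approximation; fine at this n

print(f"effect = {diff:+.2f} units, 95% CI [{diff - half_width:+.2f}, {diff + half_width:+.2f}]")
```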
"Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p < 0.05).
Don’t believe that an association or effect exists just because it was statistically significant.
Don’t believe that an association or effect is absent just because it was not statistically significant.
Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)."
Hopefully this can help address the replication crisis[0] in (social) science.
[0]: https://en.wikipedia.org/wiki/Replication_crisis
I think NHST is kind of overstated as a cause of the replication crisis.
People do routinely misuse and misinterpret p-values — the worst of it I've seen is actually in the biomedical and biological sciences, but I'm not sure that matters. Attending to the appropriate use of them, as well as alternatives, is warranted.
However, even if everyone started focusing on, say, Bayesian credibility intervals I don't think it would change much. There would still be some criterion people would adopt in terms of what decision threshold to use about how to interpret a result, and it would end up looking like p-values. People would abuse that in the same ways.
Although this paper is well-intended and offers actionable, reasonable advice, it suffers from some of the same problems I think are typical of this area. It tends to assume your data is fixed, and the question is how to interpret your modeling and results. But in the broader scientific context, that data isn't a fixed quantity ideally: it's collected by someone, and there's a broader question of "why this N, why this design", and so forth. So yes, ps are arbitrary, but they're not necessarily arbitrary relative to your study design, in the sense that if p < 0.05 is the standard the field has adopted, and you have a p = 0.053, the onus is on you to increase your N or choose a more powerful or more convincing design to demonstrate something at whatever threshold the field has settled on.
I'm not trying to argue for p-values per se, necessarily; science is much more than p-values or even statistics, and I think the broader problem lies with vocational incentives and things like that. But I do think at some level people will often, if not usually, want some categorical decision criterion to decide "this is a real effect not equal to null", and that decision criterion will always produce questionable behavior around it.
It's uncommon in science in general to be in a situation where the question of interest is genuinely to estimate a parameter with precision per se. There are cases of this, like in physics for example, but I think usually in other fields that's not the case. Many (most?) fields just don't have the precision of prediction of the physical sciences, to the point where differences of a parameter value from some nonzero theoretical one make a difference. Usually the hypothesis is of a nonzero effect, or of some difference from an alternative; moreover, even when there is some interest in estimating a parameter value, there's often (like in physics) some implicit desire to test whether or not the value deviates significantly from a theoretical one, so you're back to a categorical decision threshold.
> Hopefully this can help address the replication crisis[0] in (social) science.
I think it isn't just p-hacking.
I've participated in a bunch of psychology studies (questionnaires) for university and I've frequently had situations where my answer to some question didn't fit into the possible answer choices at all. So I'd sometimes just choose whatever seemed the least wrong answer out of frustration.
It often felt like the study author's own beliefs and biases strongly influence how studies are designed and that might be the bigger issue. It made me feel pretty disillusioned with that field, I frankly find it weird they call it a science. Although that is of course just based on the few studies I've seen.
> the study author's own beliefs and biases strongly influence how studies are designed
While studies should try to be as "objective" as possible, it isn't clear how this can be avoided. How can the design of a study not depend on the author's beliefs? After all, the study is usually designed to test some hypothesis (that the author has based on their prior knowledge) or measure some effect (that the author thinks exists).
If you can't do science, don't call it science.
Which is a great idea if we ignore all other issues in academia, e.g. pressure to publish etc. Taking such a hard-line stance I fear will just yield much less science being done.
This isn't obviously a bad thing, in the context of a belief that most results are misleading or wrong.
But surely let's have a "hard-line stance" on not drowning in BS?
We live in a money-dependent world. We cannot go without it.
There is a difference between a belief and an idea. I might have an idea about what causes some bug in my code, but it isn't a belief. I'm not trying to defend it, but to research it. Though I have met people who do hold beliefs about why code is broken. They refuse to consider the larger body of evidence and will cherry-pick what we know about an incident to back their own conclusions.
Can we recognize the beliefs we have that bias our work and then take action to eliminate those biases? I think that is possible when we aren't studying humans, but beliefs we have about humans are on a much deeper level and psychology largely doesn't have the rigor to account for them.
It means more than that to some people, and it shouldn't.
Psychology is IMO in the state alchemy was before chemistry. And there's no guarantee it will evolve beyond that. Not unless we can fully simulate the mind.
"It is difficult to get a man to understand something, when his salary depends on his not understanding it."
I am sure there are plenty of people who misunderstand or misinterpret statistics. But in my experience these are mostly consumers. The people who produce "science" know damn well what they are doing.
This is not a scientific problem. This is a people problem.
I haven't found this to be true at all. In fact, I'd say the majority of studies I read - even from prestigious journals - are fraught with bad statistics. I have no idea how some of these studies were even allowed to be published. Some fields are worse than others, but it's still a huge problem pretty much across the board.
People conduct science, and a lot of those people don't understand statistics that well. This quote from nearly 100 years ago still rings true in my experience:
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
> I am sure there are plenty of people who misunderstand or misinterpret statistics. But in my experience these are mostly consumers. The people who produce "science" know damn well what they are doing.
As a statistician, I could not disagree more. I would venture to say that most uses of statistics by scientists that I see are fallacious in some way. It doesn't always invalidate the results, but that doesn't change the fact that it is built on a fallacy nonetheless.
In general, most scientists actually have an extremely poor grasp of statistics. Most fields require little more than a single introductory course in statistics with calculus (the same one required for pre-med students), and the rest they learn in an ad-hoc manner - often incorrectly.