Studies prove - often enough nothing at all
Reports on scientific studies that produce surprising results (or refute them) readily attract interest. For readers (especially laypeople, but unfortunately also some experts) it is hard to tell whether a study's result is actually correct. Correct means that the result can be repeated (replicated) by other studies and that the conclusions drawn from it hold.
If you pay attention to certain criteria, you can better assess the significance and robustness of a study.
However, this only helps to a limited extent. More than half of all results of scientific studies are simply wrong.1
Knowledge of some basic statistical measures helps to assess the significance of a study. At the very least, it helps to form a realistic picture: a single study no more makes a summer than a single swallow does, and it should at most be an occasion for other authors to check and question it in further studies.
Only when several authors have arrived at comparable results in different studies (replication of results) is there a reasonable certainty that a finding is correct.
A clean scientific methodology would be to
- register the planned study,
- first define a thesis,
- then collect the data for it in a second step,
- evaluate the data in a third step,
- and finally publish whether the thesis is confirmed or not.
This methodology can be subverted in many places.
- 1. Thesis (re)formulation after data collection
- 2. Data collection error
- 3. Data evaluation error
- 4. Publication of only favorable results
- 5. Interpretation errors by readers
1. Thesis (re)formulation after data collection
Sometimes the thesis is formulated after the data has been collected and analyzed. This happens especially when the original thesis has not proven true.
The criticism of this procedure could be countered by saying that it is pure chance which thesis a scientist happens to hold before collecting the data. As a thought experiment, imagine a large number of research teams all doing the same data collection and, depending on the research team, starting from randomly generated different theses. Some find their thesis confirmed, others find it disproved. Does it change anything in terms of truth if the team that had the correct thesis ends up presenting the result?
The answer of the statisticians is: yes, it changes something. Because a thesis is not an arbitrarily interchangeable view.
In our thought experiment, many theses would be considered refuted if all research teams had formulated their theses beforehand. The fact that the one research team that had the correct thesis is among them then has a different weight.
Regardless, this thought experiment demonstrates the importance of replication studies.
Research results should not be considered robust until they have been replicated many times. New and unexpected results may be more entertaining - but in terms of reality, they are about as helpful as the articles in some “newspapers” that are read daily for their surprise and unexpectedness rather than for their factual information content. Man bites dog attracts more attention than dog bites man. But which is closer to reality?
Nothing against earning money with it, be it as a journalist or as a researcher. One should only clarify what one is selling. Calling entertainment reality reports is a deception that not all readers are capable of seeing through.
Science would be well advised to separate entertainment and realism more cleanly.
Hiding results that have not yet been replicated would help avoid many errors.
2. Data collection error
2.1. Sample size too small or too large (n)
2.1.1. Sample size too small
The problem with many studies is a sample size (n) that is too small.
Studies with 10, 15, or 20 subjects are common.
Nobel laureate in economics Daniel Kahneman2 points out that studies with too small a sample cannot make any statement about the thesis under investigation.
If the sample size (n) is too small, the influence of chance outweighs that of the data. The result of such a study no longer says anything about whether the hypothesis under investigation is true or false; it is no more than a random result.
The fact is that most scientists (including Kahneman himself for a time, as he noted), when intuitively determining the required sample size (n), set a sample that is clearly too small.
On the other hand, a small sample (e.g., 20) is not always harmful, but can be quite useful. The prerequisite is that the groups are sufficiently matched and certain biases are controlled. However, the results found always require replication.
Overly large samples also have disadvantages. They can make very small, intrinsically meaningless differences appear significant. If the result is then evaluated only in terms of significance, without assessing the strength of the factors found, this can be just as misleading as a sample size that is too small.
It is therefore important to determine the optimal sample size (e.g. using G*Power).
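As a rough illustration of what such a sample size calculation does, the following sketch uses the common normal approximation for comparing two groups. It is not what G*Power computes internally (which uses exact t distributions and gives slightly larger numbers). The function name `n_per_group` and the effect sizes 0.2, 0.5 and 0.8 (Cohen's conventional "small", "medium" and "large") are assumptions chosen for illustration and do not come from the text.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)            # about 0.84 for 80 % power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# Hypothetical effect sizes (standardized mean differences).
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

Under these assumptions, even a medium effect needs roughly 60 subjects per group, which puts studies with 10 to 20 subjects in perspective.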
Estimate for yourself:
Given a die (6 possibilities), how many rolls are required to say with 95% confidence that there is at least one 6 among them?
This requires a sample of n = 17 rolls.
How large must the sample size n be in order to predict the approval rating for a party to within 1% in an election poll?
This requires a sample size of n = 2167 voters.3
A third example:
A coin toss has only two outcomes, heads or tails. How many tosses are necessary to be able to say with 95% certainty (which is what most scientific studies aim for) that the split between heads and tails is at most 49:51 (which is much less precise than an exact 50:50 split)?
This requires a sample of n = 9604 coin tosses. And this although there are only 2 possibilities: heads and tails.4
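The three estimates above can be checked with a few lines of arithmetic. The sketch below assumes the usual textbook formulas: for the die, the smallest n with 1 - (5/6)^n ≥ 0.95; for the coin and the poll, the normal-approximation margin of error z · sqrt(p(1-p)/n) with z = 1.96. The function names are made up for illustration. The text does not say which vote share the poll figure assumes; a share of about 6% is an assumption made here because it happens to reproduce n = 2167.

```python
import math
from fractions import Fraction

def rolls_for_at_least_one_six(confidence=0.95):
    """Smallest n with P(at least one six in n rolls) >= confidence."""
    n = 1
    while 1 - (5 / 6) ** n < confidence:
        n += 1
    return n

def sample_size_for_margin(p, margin, z="1.96"):
    """Smallest n so that z * sqrt(p * (1 - p) / n) <= margin.

    Arguments are passed as strings and handled as exact fractions so that
    floating-point noise cannot push the result to the next integer.
    """
    p, margin, z = Fraction(p), Fraction(margin), Fraction(z)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(rolls_for_at_least_one_six())            # die: 17
print(sample_size_for_margin("0.5", "0.01"))   # coin, 50 % +/- 1 point: 9604
print(sample_size_for_margin("0.06", "0.01"))  # poll, assumed 6 % share: 2167
```

The required sample is largest at p = 0.5 and shrinks for shares far from 50%, which is why the poll can get by with fewer respondents than the coin example.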
A study with 20 or 30 subjects therefore has only very limited significance and must therefore be viewed with considerable caution.
Rule of thumb: A single study in the psychological or medical field with fewer than 50 subjects (n < 50) should not be relied on until further studies confirm the results.
Studies can be very easily manipulated by random results.5
For this topic, we would appreciate expert input from accomplished statisticians who could explain what samples are required in typical investigations in psychological / neuro(physio)logical questions (such as those on ADHD) to obtain a reasonably reliable statement.
2.1.2. Sample size too large
A sample size that is too large carries the risk that (statistically) significant results are found whose effect size is nevertheless vanishingly small. If the significance found is then not put in relation to its practical relevance, readers who do not analyze the results in detail get a false impression of relevance, which in the end amounts to considerable misinformation.
Example of significance versus relevance
Two car manufacturers offer their cars in 3 colors: white, black, red. Buyers choose
from manufacturer A: 41% black, 40% white, 19% red,
from manufacturer B: 41% black, 31% white, 28% red.
Research at both manufacturers will find that black is the customers’ favorite color. For manufacturer A, a much larger sample is needed before this result becomes statistically significant (that is, before it passes the conventional 95% threshold for ruling out coincidence).
Nevertheless, the result that buyers significantly prefer black says hardly anything for A, but a great deal for B.
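To get a feeling for how different the two samples have to be, here is a minimal sketch. It assumes a simple normal-approximation test of "black share minus white share is greater than zero" within one sample of buyers and uses the multinomial variance of that difference. The function name `buyers_needed` and the choice of test are illustrative assumptions, not taken from the text.

```python
import math

Z_95 = 1.96  # two-sided 95 % critical value of the standard normal distribution

def buyers_needed(share_first, share_second, z=Z_95):
    """Approximate sample size until the lead of the first colour is significant.

    Variance of the difference of two shares from one multinomial sample:
    (p1 + p2 - (p1 - p2)**2) / n.
    """
    diff = share_first - share_second
    variance_term = share_first + share_second - diff ** 2
    return math.ceil(z ** 2 * variance_term / diff ** 2)

print("A:", buyers_needed(0.41, 0.40))  # black vs. white at manufacturer A
print("B:", buyers_needed(0.41, 0.31))  # black vs. white at manufacturer B
```

Under these assumptions manufacturer A needs on the order of 31,000 buyers, manufacturer B fewer than 300. With a large enough sample both results are "significant", but only the ten-point lead at B is of any practical relevance.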
2.2. Data collection until the result fits
In a survey of 2,000 scientists, more than half admitted to first testing the significance of results in their own scientific investigations and then deciding whether to collect further data. Data collection then continues until a positive result emerges. 40% of the survey participants had produced and published selective studies in this way. Moreover, most considered this to be correct.67
To clarify the thinking error in this approach:
Again and again, people believe that they can win at roulette simply by doubling their bets over and over again when betting on red or black.
That this does not work for roulette is evident to common sense from the fact that casinos still exist today. If this system worked, all casinos would have gone bankrupt long ago.
Mathematically, this does not work because roulette has the zero, on which the bank wins. This small probability of 1 in 37 is enough to make the probability of being able to keep doubling until your color comes up smaller than the probability of the bank winning.8
If there were no bank in roulette, doubling down on red and black would be a sure way to win (and all casinos would be broke).
Since there is no bank in science, continuing to collect data until a data set happens to confirm the hypothesis is simply a matter of diligence and perseverance, not of whether the hypothesis is correct.
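The effect of collecting data until the result fits can be made visible with a small simulation. The sketch below generates pure noise (so every "significant" result is a false positive) and compares a fixed sample of 100 with a procedure that tests after every 10 observations and stops at the first p < 0.05. All settings (batch size, maximum n, number of simulated experiments, the seed, a z-test with known standard deviation) are assumptions chosen for illustration.

```python
import math
import random

def p_value_mean_zero(values):
    """Two-sided z-test p-value for 'the true mean is 0' (known sigma = 1)."""
    z = sum(values) / math.sqrt(len(values))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, experiments=2000, batch=10, max_n=100, seed=1):
    """Share of null experiments (no real effect) that end with p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        data = []
        while len(data) < max_n:
            data.extend(rng.gauss(0, 1) for _ in range(batch))
            # With peeking: test after every batch, stop at the first "success".
            if peek and p_value_mean_zero(data) < 0.05:
                break
        if p_value_mean_zero(data) < 0.05:
            hits += 1
    return hits / experiments

print("fixed n:     ", false_positive_rate(peek=False))
print("with peeking:", false_positive_rate(peek=True))
```

With a fixed sample the false-positive rate stays near the nominal 5%. With repeated peeking it comes out several times higher, although no effect exists in the simulated data.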
3. Data evaluation error
3.1. Thesis supporting selected data evaluation criteria
Further biases in study results follow from the subjective selection of data evaluation criteria. Silberzahn and Uhlmann9 had 29 groups of scientists examine an identical large data set (n > 2000). As expected (according to the Gaussian distribution curve), the results of most groups essentially agreed, while the results of individual groups deviated considerably.
More important, however, was the realization of which factors led to these deviations: on the one hand, the choice of the statistical models used (cluster analysis, logistic regression or linear models) and, on the other hand, primarily the choices made with regard to how the data sets were evaluated. Decisions, in other words, that a reader of a study result cannot even perceive. This is not about a deliberate distortion of the results by the scientists, but about massive influences on the results that do not originate from the factual question itself.
Silberzahn and Uhlmann9 conclude that a single investigation, even with a high sample size (n, see above), does not allow a reliable statement on whether the investigation result is correct.
Only the summary of several studies on the same subject with the same or different data sets provides certainty regarding the accuracy of the results.
Conclusion: A study with a not too small number of subjects by a renowned research group with the cleanest data transparency is a good indication. However, before trusting the result, one should wait and see whether the observation is confirmed by further investigations (replication).
However, even a high number of studies does not protect against data bias in certain areas. Grawe10 describes very vividly how studies on the treatment of depression are distorted by economic or other interests.
Pharmaceutical manufacturers financed 28 of the 29 studies on medication for depression analyzed by Grawe, and all of the 48 studies on psychological treatment methods examined further were financed by public agencies. None of the pharmacological studies had a catamnesis (long-term outcome assessment), whereas 30 of the 48 studies on psychological treatment did.
Depression very often shows spontaneous remission (symptoms disappear without intervention) within 10 weeks.
The pharmacological studies primarily used the HAMD, MADRS, and CGI to measure success, all of which focus on externally assessable symptoms. These are those symptoms that go away particularly well in spontaneous remission.
The control groups of the pharmacological studies showed an average effect size of 1.82 on the MADRS, while the medications showed an effect size of 1.88. This means that, measured on the MADRS, the symptom improvement due to the drugs exceeded the spontaneous improvement in the untreated control subjects by an effect size of only 0.06.
The studies of the psychological treatment methods primarily used the BDI and self-report measures, which show a significantly weaker effect size for spontaneous remission. Here, the (untreated) control groups showed an effect size of 0.97. Cognitive therapy had an effect size of 1.33 (0.36 higher than the control groups) and cognitive behavioral therapy one of 1.54 (0.57 higher). Interpersonal therapy had a net effect size of 0.50, present-oriented psychodynamic brief therapies 0.79, and couples therapies 0.96.
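The step from the reported effect sizes to the net effect is a simple subtraction of the control group's improvement. The short sketch below only restates the figures quoted from Grawe above; the dictionary layout and the function name `net_effect` are illustrative.

```python
# Effect sizes as quoted in the text: (treated group, untreated control group).
reported = {
    "medication (MADRS)": (1.88, 1.82),
    "cognitive therapy": (1.33, 0.97),
    "cognitive behavioral therapy": (1.54, 0.97),
}

def net_effect(treated, control):
    """Improvement beyond what the untreated control group shows anyway."""
    return round(treated - control, 2)

for name, (treated, control) in reported.items():
    print(f"{name}: {net_effect(treated, control):.2f}")
```

The comparison shows how strongly the apparent success depends on the measuring instrument: a scale on which untreated patients already improve by 1.82 leaves little room for a net effect.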
And yet, according to Grawe, even among patients receiving psychological treatment, only 13 to 14 % are permanently freed from their depression. 25% fundamentally reject disorder-oriented treatment, and another 13% to 25% discontinue ongoing therapy. Of the remaining 64%, half achieve clinically significant improvement. Of the 32% who were successfully treated in the short term, nearly two-thirds relapse within 2 years.10
In fairness to the pharmaceutical industry, it should be noted first that Grawe tends to represent the psychotherapeutic camp, and second that antidepressants do have a helpful effect, even if not nearly as strong as the studies on them would like to convey.
Nevertheless, the illustration shows how the data evaluation can be manipulated very much in the desired direction by the selection of suitable measuring instruments. And this does not only concern the pharmaceutical industry either. The studies on psychological treatment methods have also used the evaluation standards that are more favorable to them.
In our view, drug treatment for depression is very different from that for ADHD because ADHD is a lifelong disorder and the effect size of ADHD drugs can be determined a priori only for the time period in which they are taken.
3.2. Data analysis until the result fits (Torture your data until they confess)
Another method that affects the robustness of results is when, contrary to clean scientific methodology, the collected data are analyzed (with different methods) until they confirm the thesis under some aspect.
As a rule, the method of data evaluation is not already determined with the definition of the thesis. This leeway is sometimes inappropriately exploited.
The publication itself regularly does not describe the data analysis methods that were previously tried and rejected.
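Why trying out several analyses is a problem can be shown with one line of probability. If there is no real effect and each analysis has a 5% chance of a false positive, the chance that at least one of k analyses "confirms" the thesis is 1 - 0.95^k. The sketch assumes independent analyses, which is a simplification (analyses of the same data are usually correlated, so the real inflation is smaller but still substantial). The function name is made up for illustration.

```python
def chance_of_false_positive(analyses, alpha=0.05):
    """P(at least one p < alpha) when no real effect exists.

    Assumes the analyses are statistically independent, which is a
    simplification: analyses of the same data set are usually correlated.
    """
    return 1 - (1 - alpha) ** analyses

for k in (1, 5, 10, 20):
    print(f"{k:2d} analyses: {chance_of_false_positive(k):.0%}")
```

Under this assumption, ten attempts already give roughly a 40% chance of a publishable but spurious result, and the reader of the publication sees only the one analysis that worked.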
3.3. Incorrect application of statistical methods
In 2016, it was found that the three most common evaluation programs for fMRI scans, when used incorrectly, produced false-positive rates that were inflated by up to 13 times (up to 70% instead of the correct maximum of 5%).11 This calls into question the results of about 40 000 studies in which fMRI was used.
Affected are mainly recent studies on emotions and thought processes, in which data of several subjects are combined.12 If the statistical tools are used correctly, these errors do not occur. However, many scientists do not work carefully enough here.
A different analysis method for fMRI data, although much more computationally intensive, would avoid the potential errors.11
3.4. Measurement error
Another error in fMRI evaluations resulted from the fact that many thousands of studies drew inferences about amygdala activity, when in fact the measurements said nothing about the amygdala but about blood flow in a nearby vein.1314
3.5. Excel error
Scientists report data corruption caused by incorrect use of Excel. Up to 20 % of Excel files containing genetic data are corrupted by such errors.15
In the summer of 2020, it became known that quite a few genes were to be renamed because their names often led to evaluation errors in Excel, which interpreted them as dates. The error would not occur if scientists consistently formatted the gene name fields as text (which would be very simple to do).
The fact that such easy-to-fix errors nevertheless occur so frequently that genes are being renamed is a strong indication that Excel is frequently used incorrectly, even in the simplest respects.
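A simple safeguard is to screen a gene list for names that a spreadsheet might read as dates before the file is ever opened in one. The sketch below is a heuristic only: the regular expression (month name or abbreviation followed by one or two digits) and the example symbols are assumptions for illustration and do not model Excel's actual conversion rules.

```python
import re

# Month abbreviations or names followed by digits, e.g. "SEPT2" or "MARCH1".
# Illustrative heuristic, not a complete model of spreadsheet date parsing.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)

def date_like_symbols(symbols):
    """Return the symbols a spreadsheet might reinterpret as dates."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]

genes = ["SEPT2", "MARCH1", "DEC1", "TP53", "BRCA1"]  # example input
print(date_like_symbols(genes))  # ['SEPT2', 'MARCH1', 'DEC1']
```

Reading such files with a tool that keeps every field as a string (Python's `csv` module does this by default) avoids the conversion altogether.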
4. Publication of only favorable results
Unfortunately, a fairly common method of manipulation is to conduct a large number of studies, publishing only those that show results that are agreeable to the funder or authors.
Since studies can show a certain range of different results, even if the facts are certain, there is a statistical scatter of results that roughly corresponds to a Gaussian distribution curve. Most results are close to the actual facts. The further the results deviate from this, the less frequently they will occur.
Figuratively described, this roughly corresponds to a pile of sand created by grains of sand falling from above at a precise point. The pile of sand indicates at its highest point where the grains of sand fall down on it.
But even a little wind can falsify the result. Gusty wind even more so. And so there are many factors that can influence a result.
The method of publishing only favorable results requires high resources (money, time). Only market participants with correspondingly high (usually economic) interests can afford this method.
Registering a study before it is conducted helps prevent such manipulation.
5. Interpretation errors by readers
Another source of error arises from the fact that the study results are misinterpreted by (even expert) readers.
5.1. The false positive trap
A good test has high sensitivity and high specificity.
Sensitivity is the quality of the correct-positive prediction: how many of the test targets actually present (infections, cancer cases, ADHD) are detected?
Specificity is the quality of the correct-negative prediction: how many of the absent test targets are correctly identified as absent?
If a test procedure has a sensitivity and a specificity of 95 percent each (e.g., common rapid scarlet fever tests) and the base rate (the actual rate of affected or infected persons) is 0.5 %, this means: out of 20000 persons tested, 100 are actually affected (assumed base rate of scarlet fever). Of these, 95 are correctly identified, but 5 are not. At the same time, 995 of the 19900 unaffected persons are falsely diagnosed as positive.16
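The arithmetic behind this example can be written out in a few lines. The sketch below reproduces the numbers given above and adds the quantity that matters to the person tested: the share of positive results that are actually correct (the positive predictive value). The function name `screening_outcome` is made up for illustration.

```python
def screening_outcome(tested, base_rate, sensitivity, specificity):
    """Return (true positives, false negatives, false positives, PPV)."""
    affected = round(tested * base_rate)
    unaffected = tested - affected
    true_pos = round(affected * sensitivity)
    false_neg = affected - true_pos
    false_pos = round(unaffected * (1 - specificity))
    # Positive predictive value: share of positive results that are correct.
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_neg, false_pos, ppv

tp, fn, fp, ppv = screening_outcome(20000, 0.005, 0.95, 0.95)
print(tp, fn, fp)     # 95 5 995
print(f"{ppv:.1%}")   # 8.7%
```

Despite 95% sensitivity and specificity, fewer than one in ten positive results is correct here, because the unaffected group is so much larger than the affected one.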
Similar results are found for breast cancer diagnoses, which is why mammography screenings are under considerable criticism, as the number of women who have their breasts removed unnecessarily due to a false positive result is many times higher than the number of women who do so due to a true positive result.
Even many gynecologists who diagnose breast cancer succumb (or succumbed until intensive education on this in recent years) to misconceptions in the evaluation for this reason.
5.2. The p-value misunderstanding
The p-value (from probability) indicates how well a measured result can also be explained by chance. However, the p-value says nothing about the actually interesting question: “Is the hypothesis correct?”1718
It is therefore wrong to assume that a low p-value of less than 5%, i.e. less than 0.05, says anything about the certainty with which the hypothesis is correct. It merely indicates the probability with which a result at least as extreme would be obtained if in reality not the test hypothesis but its opposite, the so-called null hypothesis, were true.19 This is not a statement about the correctness of the hypothesis.
The p-value says nothing about
- How correct or reliable a scientific test result is
- How reliably a result can be repeated
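How far a significant result is from "the hypothesis is 95% certain" depends on quantities the p-value does not contain, above all on how plausible the hypothesis was beforehand. The sketch below applies Bayes' rule under assumed values: a statistical power of 80% and prior probabilities of 50%, 10% and 1%. These numbers and the function name are illustrative assumptions, not figures from the text.

```python
def share_of_true_findings(prior, power, alpha=0.05):
    """P(hypothesis is true | result is significant), by Bayes' rule."""
    true_positives = prior * power          # true hypotheses reaching p < alpha
    false_positives = (1 - prior) * alpha   # false hypotheses reaching p < alpha
    return true_positives / (true_positives + false_positives)

# Hypothetical prior probabilities that the tested hypothesis is true.
for prior in (0.5, 0.1, 0.01):
    share = share_of_true_findings(prior, power=0.8)
    print(f"prior {prior:.0%}: {share:.0%} of significant results are true")
```

With the same threshold of 0.05, the share of significant results that are actually true ranges from about 94% to about 14% under these assumptions. Surprising hypotheses, which start with a low prior, are exactly the ones where a significant result deserves the least trust.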
In addition, in certain constellations, study results that are in themselves quite clear receive a miserable p-value, which leads to absurd interpretations that extend to the opposite of the study results.19
Many scientists advocate abolishing the p-value, while others want to make the threshold below which a study result counts as significant considerably stricter (currently p < 0.05, i.e. the 95% level).
It is probably most useful to ensure that a result has been confirmed not only by one, but by as many studies as possible, each with a high n-number and a solid p-value. Even with an optimal p-value, a single study is no proof for the investigated thesis.
Kahneman, Daniel (2011): Schnelles Denken, langsames Denken. Siedler, pp. 139, 142 ff.; an extremely worthwhile book
Yong (2013): SZIENTOMETRIE: Jede Menge Murks. Viele wissenschaftliche Studien lassen sich nicht reproduzieren. Das wirft Fragen zum Forschungsbetrieb auf – und zur Veröffentlichungspraxis von Fachzeitschriften. Spektrum
Boubela, Kalcher, Huf, Seidel, Derntl, Pezawas, Našel, Moser (2015): fMRI measurements of amygdala activation are confounded by stimulus correlated signal fluctuation in nearby veins draining distant brain regions. Scientific Reports 5, Article number: 10499. doi:10.1038/srep10499
Nuzzo (2014): UMSTRITTENE STATISTIK – Wenn Forscher durch den Signifikanztest fallen. Grobe Fehler in Statistik: Der “p-Wert” gilt als Goldstandard, doch er führt in die Irre. Er schadet damit seit Jahren der Wissenschaft. Spektrum