
Studies prove - often enough nothing at all

Reports on scientific studies that produce (or refute) surprising results are read with interest. For readers (especially laypeople, but unfortunately also for some experts), it is hard to tell whether the results of a study are correct. Correct here means that the research result can be repeated (replicated) in other studies and that the conclusions drawn from it hold up.

More than half of all the results of scientific studies are simply wrong.1

If you pay attention to certain criteria, you can better assess the significance and reliability of a study.
Knowing some statistical basics helps one understand that a single study no more makes a summer than a single swallow does; at best, it is a reason for other authors to check and question the finding in further studies.
Only when several authors have come to comparable results in different studies (replication of results) is there reasonable certainty that a finding is correct.

A clean scientific methodology would be to:

  • Register a planned study in advance
  • First define a hypothesis (thesis)
  • Then, in a second step, collect the data for it
  • Evaluate the data in a third step
  • And then publish whether the thesis is confirmed or not
  • Test active ingredients double-blind (neither subject nor assessor knows who is receiving the active ingredient and who is receiving a placebo)
    • In open-label studies, by contrast, subject and assessor know who will receive the active ingredient and when

This methodology can be undermined in many places. Here are some common sources of error.

1. (Re)formulation of theses after data collection

Sometimes the thesis is only formulated after the data has been collected and analyzed. This happens in particular when the original thesis has not proven to be true.

The criticism of this approach could be countered with the argument that it is pure chance which thesis a scientist happens to hold before collecting the data. As a thought experiment, imagine a large number of research teams who all collect the same data but who - depending on the team - start from different, randomly generated hypotheses. Some find their thesis confirmed, others find it refuted. Does it change anything about the truth if, in the end, the team that happened to have the correct thesis presents the result?
The statisticians' answer is: yes, it changes something, because a thesis is not an arbitrarily interchangeable point of view.
In our thought experiment, many theses would be considered disproved if all research teams had formulated their theses beforehand. The fact that one of the research teams happened to have the correct thesis would then carry a different weight.

Irrespective of this, this thought experiment shows the importance of replication studies.

Research results should only be considered reliable once they have been replicated several times. New and unexpected results may be more entertaining - but in terms of describing reality, they are about as helpful as the articles in some "newspapers" that are read daily for their surprise value rather than their factual information content. "Man bites dog" attracts more attention than "dog bites man". But which is closer to reality?
There is nothing wrong with earning money this way, whether as a journalist or a researcher. You should just make it clear what you are selling. Presenting entertainment as factual reporting is a deception that not all readers are able to see through.

Science would be well advised to separate entertainment and knowledge about reality more clearly.
Holding back results that have not yet been replicated would help to avoid many errors.

2. Errors in data collection

2.1. Sample size too small or too large (n)

2.1.1. Sample size too small (too little power)

The problem with many studies is that the sample size (n) is too small, which leads to insufficient power.
Studies with 10, 15 or 20 test subjects are common.

In his highly readable book, Nobel Prize-winning economist Daniel Kahneman2 points out that studies with samples that are too small cannot make any statement about the thesis under investigation.
If the sample size (n) is too small, chance has a greater influence than the data itself. The result of a study with too small a sample therefore says nothing about whether the hypothesis under investigation is true or false - it is little more than a random outcome.
The fact is that most scientists (including, for a while, Kahneman himself, as he notes) who determine the required sample size (n) intuitively end up with a sample that is clearly too small.

Studies need high statistical significance AND high power to be meaningful. The current rather one-sided fixation on high statistical significance leads to non-reproducible and therefore probably false results. A large-scale study from 2015 was unable to replicate around 2/3 of the studies examined. This led to the so-called replication crisis.

On the other hand, a small sample (e.g. 20) is not always harmful, but can be quite useful. The prerequisite is that the groups are sufficiently matched and certain biases are controlled. However, the results found always require replication.

Samples that are too large also have disadvantages. They can make very small, practically irrelevant differences appear statistically significant. If the result is then evaluated only in terms of significance, without assessing the strength of the effects identified, this can be just as misleading as a sample size that is too small.

It is therefore important that the optimal sample size is determined beforehand (e.g. using G*Power).
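As a minimal sketch of such an a-priori power analysis (here in Python with statsmodels instead of G*Power; the effect size, alpha and power below are assumed example values, not recommendations):

```python
# A-priori sample size estimate for a two-group t-test, analogous to an
# a-priori power analysis in G*Power. All parameter values are assumptions.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,        # assumed medium effect (Cohen's d)
    alpha=0.05,             # significance level
    power=0.80,             # desired statistical power
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64
```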

Guess for yourself:

How many rolls of a die are required to be able to say, with the 95% certainty (p = 0.05) aimed for in most scientific studies, that at least one 6 appears among the rolls?

Solution

This requires a sample of n = 17 rolls.
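The number can be checked with a few lines of Python: the chance of seeing no 6 in n rolls is (5/6)^n, and we look for the smallest n at which this drops below 5% (a sketch of the calculation behind the stated solution):

```python
# Smallest number of die rolls n such that the probability of at least
# one 6 reaches 95%, i.e. the first n with (5/6)**n <= 0.05.
n = 1
while (5 / 6) ** n > 0.05:
    n += 1
print(n)  # 17
```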

Another example:

How large must the sample size n be in order to predict the approval rating for a party in an election poll to within 1% (p = 0.01)?

Solution

This requires a sample size of n = 2,167 voters.3

A third example:

Coin tosses know only heads or tails. How many coin tosses are necessary to be able to say with 95% certainty (p = 0.05) that the distribution between heads and tails lies within 49:51 (which is far less precise than an exact 50:50 split)?

Solution

This requires a random sample of n = 9,604 coin tosses. And this despite the fact that there are only 2 possibilities: heads and tails.4
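The figure of 9,604 follows from the standard sample size formula for estimating a proportion, n = z²·p·(1−p)/e², with p = 0.5, a 95% confidence level (z ≈ 1.96) and a margin of error of e = 0.01 (a sketch of that calculation, not the cited source's own derivation):

```python
# Sample size needed to estimate a proportion of 0.5 (fair coin) to within
# +/- 1 percentage point at 95% confidence: n = z^2 * p * (1 - p) / e^2
import math

z = 1.959964  # 95% two-sided quantile of the standard normal distribution
p = 0.5       # assumed proportion (fair coin, worst case)
e = 0.01      # desired margin of error (1 percentage point)

n = z**2 * p * (1 - p) / e**2
print(math.ceil(n))  # 9604
```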

A study with 20 or 30 test subjects therefore has only very limited informative value and must be viewed with considerable caution.

Rule of thumb: A single study in the psychological or medical field with fewer than 50 test subjects (n < 50) should only be taken into account if further studies confirm the results.

Study results can very easily be distorted by chance findings.5

For this topic, we would appreciate an expert addition from experienced statisticians who could explain which samples are required for typical studies in psychological / neuro(physio)logical questions (such as those on ADHD) in order to obtain an appropriately reliable statement.

2.1.2. Sample size too large

If the sample size is too large, there is a risk of finding statistically significant results whose effect size is negligible (low impact/relevance). If the significance found is not then put in relation to this relevance, readers who do not analyze the results in detail are misled about their importance, which ultimately amounts to considerable misinformation.
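A minimal illustration of this effect with assumed numbers (not data from any real study): a difference of half a percentage point between two groups is practically irrelevant, yet with a million observations per group it becomes highly "significant".

```python
# Two-proportion z-test: a practically irrelevant difference (50.0% vs. 50.5%)
# becomes highly "significant" once the sample is large enough.
from math import sqrt
from scipy.stats import norm

n1 = n2 = 1_000_000
x1, x2 = int(n1 * 0.500), int(n2 * 0.505)   # tiny difference in proportions

p_pool = (x1 + x2) / (n1 + n2)                         # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))   # standard error
z = (x2 / n2 - x1 / n1) / se
p_value = 2 * norm.sf(abs(z))                          # two-sided p-value

print(f"z = {z:.1f}, p = {p_value:.1e}")  # p far below 0.05, effect negligible
```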

Example of significance versus relevance

Two car manufacturers offer their vehicles in 3 colors: white, black and red. The cars purchased are

from manufacturer A: 41% black, 40% white and 19% red,
from manufacturer B: 41% black, 31% white and 28% red.

Studies by both manufacturers will show that black is the customers' favorite color. For manufacturer A, however, the sample must be considerably larger before this result becomes statistically significant (i.e. before it is at least 95% certain that the result is not a coincidence).

Nevertheless, the finding that buyers significantly prefer black says almost nothing in A's case, because black is barely ahead of white, but quite a lot in B's case, because of its clear distance from white.

2.2. Data collection until the result fits

In a survey of 2,000 scientists, more than half admitted that they first check the significance of the results of their own scientific studies and then decide whether to collect further data. Data is then collected until a positive result emerges. 40% of the survey participants had produced and published selective studies in this way. Most of them also considered this to be correct.67
If, after 20 observations that showed a non-significant result, a further 10 observations are collected, the false positive rate rises by 50% (7.7% instead of 5%), even though these results are reported with a statistical significance of p < 0.05.8
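The effect described by Simmons et al. can be reproduced with a small simulation (a simplified sketch: both groups are drawn from the same distribution, so the true effect is zero and every "significant" finding is a false positive):

```python
# Optional stopping ("collect more data until it fits"): test at n = 20 per
# group, and if the result is not significant, add 10 more per group and
# retest. The null hypothesis is true throughout, so every hit is false.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, false_pos = 20_000, 0

for _ in range(n_sims):
    a, b = rng.normal(size=20), rng.normal(size=20)
    p = ttest_ind(a, b).pvalue
    if p >= 0.05:  # not significant -> collect 10 more observations per group
        a = np.concatenate([a, rng.normal(size=10)])
        b = np.concatenate([b, rng.normal(size=10)])
        p = ttest_ind(a, b).pvalue
    if p < 0.05:
        false_pos += 1

print(false_pos / n_sims)  # roughly 0.07-0.08 instead of the nominal 0.05
```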

To illustrate the flaw in this approach:
Time and time again, people believe that they can win at roulette simply by doubling their stake again and again when betting on red or black.
The fact that this does not work with roulette is proven to common sense by the fact that there are still casinos today. If this system worked, all casinos would have gone out of business long ago.
Mathematically speaking, this does not work because roulette has the zero, which belongs to the house. This small probability of 1 in 37 is enough to make the chance of completing a series of doublings until your color finally comes up smaller than the chance that the house wins first.9
If roulette had no such house edge, doubling down on red and black would - at least with unlimited capital and no table limit - be a sure way to win (and all casinos would be out of business).

Since science has no such "house edge", continuing to collect data until, at some point, a data set happens to turn out in a way that confirms the hypothesis is merely a question of diligence and perseverance, not of whether the hypothesis put forward is actually correct.

2.3. Sample selection

Ideally, test subjects would be selected on a truly random basis. This is often difficult to implement in reality.
If the sample is influenced by selection factors, this distorts the result.
Example: A survey among students in which participation is voluntary leads to an increased proportion of participants who are interested in the topic - often because the topic affects them personally. In a survey on ADHD, this leads to an increased proportion of people with ADHD among the participants.
Irrespective of this, limiting the subject pool to students is itself already a restriction.

3. Errors in data evaluation

3.1. Data evaluation criteria selected to support the thesis

Further distortions of research results arise from the subjective selection of data evaluation criteria. Silberzahn and Uhlmann10 had 29 groups of scientists examine an identical large data set (n = 2,000). As expected (following a Gaussian distribution), the results of most groups were essentially consistent, while the results of individual groups differed considerably.
More important, however, was the insight into which factors led to these deviations: on the one hand, the choice of statistical models used (cluster analysis, logistic regression or linear models), and on the other hand - primarily - the decisions made about how the data sets were prepared for evaluation. In other words, decisions that a reader of a research result cannot even perceive. This is not about deliberate distortion of results by the scientists, but about massive influences on the results that do not originate from the factual question itself.

Silberzahn and Uhlmann10 conclude from this that a single study, even with a large sample size (n, see above), does not allow a reliable statement to be made as to whether the study result is correct.
Only the summary of several studies on the same topic with the same or different data sets (by different groups of authors with different mathematical calculation models) provides certainty regarding the accuracy of the results.

Conclusion: A study with a not-too-small number of test subjects, by a renowned research group, with clean data transparency is a good indication. However, before trusting the result, one should wait to see whether the observation is confirmed by further studies (replication).

However, even a high number of studies does not protect against data bias in certain areas. Grawe11 vividly describes how studies on the treatment of depression are distorted by economic or other interests.

Pharmaceutical manufacturers financed 28 of the 29 studies on medication for depression analyzed by Grawe; the 48 further studies examined, on psychological treatment methods, were all financed by public bodies. None of the pharmacological studies included a catamnesis (long-term follow-up), whereas 30 of the 48 studies on psychological treatment did.

Depression very often shows spontaneous remission within 10 weeks (the symptoms disappear without intervention).

The pharmacological studies primarily used the HAMD, MADRS and CGI to measure success, all of which are based on externally assessable symptoms. These are precisely the symptoms that recede particularly well in spontaneous remission.
The control groups in the pharmacological trials showed an average effect size of 1.82 according to the MADRS, while the medication showed an effect size of 1.88. In other words, the symptom improvement due to the medication, measured by the MADRS, was only 0.06 greater than the spontaneous symptom improvement in control subjects who received no treatment.

The studies of the psychological treatment methods primarily used the BDI and self-report measures, which show a considerably weaker effect size in spontaneous remission. The (untreated) control groups of the psychological treatment methods showed an effect size of 0.97. Cognitive therapy had an effect size of 1.33 (0.36 higher than the control group), cognitive behavioral therapy 1.54 (0.57 higher); interpersonal therapy had a net effect size of 0.50, present-oriented psychodynamic brief therapies 0.79 and couple therapies 0.96.

And yet, according to Grawe, only 13% to 14% of psychologically treated patients are permanently free of their depression. 25% reject disorder-oriented treatment in principle, and a further 13% to 25% discontinue ongoing therapy. Of the remaining 64%, half achieve a clinically significant improvement. Of the 32% who were successfully treated in the short term, almost two thirds suffer a relapse within two years.12

In fairness to the pharmaceutical industry, it should be noted, firstly, that Grawe is more of a representative of the psychotherapeutic camp and, secondly, that antidepressants do have a helpful effect - albeit nowhere near as strong as the studies would suggest.
Nevertheless, the account shows how data analysis can be steered in the desired direction by selecting suitable measurement instruments. And this does not only apply to the pharmaceutical industry. The studies on psychological treatment methods also used evaluation standards that are more favorable to them.

In our opinion, the drug treatment of depression differs considerably from that of ADHD, since ADHD is a lifelong disorder and the effect size of ADHD medication is, from the outset, determined only for the period in which it is taken.

3.2. Data analysis until the result fits (Torture your data until they confess, p-value hacking)

Another method that impairs the reliability of results is when, contrary to proper scientific methodology, the data collected is analyzed (using different methods) until it confirms the thesis in some aspect.

As a rule, the method of data analysis is not fixed when the thesis is defined. This leeway is sometimes used inappropriately.
The data analysis methods that were tried and discarded along the way are usually not described in the publication itself.

By adding further parameters to the analysis, the probability that a result can also be explained by chance increases dramatically. One study showed that in an evaluation in which 3 additional parameters (e.g. gender) were added to the original question of how two parameters correlate with each other, the rate of falsely confirmed hypotheses rose to more than 60% instead of the 5% that a p-value of 0.05 would suggest.8
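The purely combinatorial part of this inflation can be made visible directly: if k independent analyses are each tested at α = 0.05, the chance that at least one of them comes out "significant" by chance alone is 1 − 0.95^k (a simplification; the figure of over 60% reported by Simmons et al. also involves correlated, non-independent analysis choices):

```python
# Family-wise false positive rate for k independent tests at alpha = 0.05:
# P(at least one chance "hit") = 1 - (1 - alpha)**k.
# Simplified illustration; real analysis choices are rarely independent.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    print(k, round(1 - (1 - alpha) ** k, 3))
# 1 0.05 | 3 0.143 | 5 0.226 | 10 0.401 | 20 0.642
```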

3.3. Incorrect application of statistical methods

In 2016, it was found that the three most common evaluation programs for fMRI images produced false-positive rates that were up to 13 times too high (up to 70% instead of the nominal 5%) when used incorrectly.13 This calls into question the results of around 40,000 examinations in which fMRI was used.
Primarily affected are more recent studies on emotions and thought processes in which data from several test subjects are combined.14 These errors do not occur if the statistical tools are used correctly. However, many scientists do not work carefully enough here.

A different analysis method for fMRI data, although much more computationally intensive, would avoid the possible errors.13

3.4. Measurement error

Another error in fMRI evaluations resulted from the fact that many thousands of studies drew conclusions about the activity of the amygdala - while in reality the measurements said nothing about the amygdala but rather reflected the blood flow in a nearby vein.1516

3.5. Excel error

Scientists report data being corrupted by incorrect use of Excel. Up to 20% of Excel files containing genetic data are distorted by Excel errors.17

In summer 2020, it became known that a number of genes were being renamed because their names were repeatedly misinterpreted by Excel as dates, which led to evaluation errors. The error would not occur if scientists consistently formatted the gene name fields as text (which would be very easy to do).
The fact that errors that are so easy to correct nevertheless occur so frequently that genes are renamed is a strong indication that Excel is often used incorrectly, even with regard to the simplest operations.
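A minimal sketch of a sanity check for this problem (the file name and column name are hypothetical): scanning an exported gene table for entries that have already been silently converted into dates.

```python
# Check a gene table exported from Excel for symbols that were silently
# converted into dates (e.g. SEPT2 -> "2-Sep", MARCH1 -> "1-Mar").
# File name and column name are hypothetical placeholders.
import csv
import re

DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

with open("gene_table.csv", newline="") as f:
    for row in csv.DictReader(f):
        symbol = row["gene_symbol"]
        if DATE_LIKE.match(symbol):
            print("probably mangled by Excel:", symbol)
```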

4. Publication of only favorable results

An unfortunately quite common method of manipulation is to conduct a large number of studies, of which only those are published that show results that are acceptable to the funder or the authors.

As studies can show a certain range of different results, there is a statistical scattering of results that roughly follows a Gaussian distribution, even when the underlying facts are fixed. Most results lie close to the actual situation; the further results deviate from it, the less frequently they occur.
Described figuratively, this corresponds roughly to a pile of sand that is created by grains of sand falling from above at a precise point. At its highest point, the pile of sand indicates where the grains of sand fall onto it.
But even a little wind can distort the result. Gusty winds even more so. And so there are many factors that can influence a result.

The method of publishing only acceptable results requires high resources (money, time). Only market participants with correspondingly high (usually economic) interests can afford this method.

Registering a study before it is conducted helps to prevent such manipulations.

A similar case is that studies that do not find a statistically significant result often remain unpublished. This leads to a distorted picture of the overall state of knowledge, as the studies that did find a statistically significant result on the question at hand then stand uncontradicted.

5. Interpretation errors by readers

Another source of error arises from the fact that the test results are misinterpreted by readers (including experts).

5.1. The false positive trap

A good test has a high sensitivity and a high specificity.

Sensitivity is the quality of the true-positive prediction: how many actually present test targets (infections, cancer cases, ADHD) are recognized as such?

Specificity is the quality of the true-negative prediction: how many non-affected cases are recognized as non-affected?

If a test procedure has a sensitivity and a specificity of 95% each (e.g. standard rapid scarlet fever tests) and the base rate (the actual rate of affected or infected persons) is 0.5%, then out of 20,000 people tested, 95 of the 100 actually affected persons are correctly identified - but 5 are not. At the same time, 995 people who are not affected are incorrectly diagnosed as positive.18
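The numbers in this example can be retraced with a short calculation (using the values assumed in the text: sensitivity and specificity of 95% each, a base rate of 0.5%, and 20,000 people tested). The decisive quantity is the positive predictive value: the probability that a positive test result is actually correct.

```python
# False positive trap: with a low base rate, most positive results are wrong
# even for a good test. Values taken from the example in the text.
n = 20_000
base_rate = 0.005
sensitivity = specificity = 0.95

affected = n * base_rate                      # 100 actually affected persons
true_pos = affected * sensitivity             # 95 correctly detected
false_neg = affected - true_pos               # 5 missed
not_affected = n - affected                   # 19,900 unaffected persons
false_pos = not_affected * (1 - specificity)  # 995 false alarms

ppv = true_pos / (true_pos + false_pos)  # chance that a positive result is correct
print(round(ppv, 3))                     # 0.087 -> less than 9%
```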

The results are similar for breast cancer diagnoses, which is why mammography screenings are subject to considerable criticism, as the number of women who have their breasts removed unnecessarily due to a false positive result is many times higher than the number of women who have their breasts removed due to a true positive result.

Even many gynecologists who diagnose breast cancer succumb (or succumbed until intensive education on the subject in recent years) to misjudgments in their assessments for this reason.

5.2. The p-value misunderstanding

The p-value (from "probability") indicates how well a measured result can also be explained by chance. However, the p-value says nothing about the actually interesting question: "Is the hypothesis correct?"1920

It is therefore wrong to assume that a low p-value (less than 5%, i.e. less than 0.05) says anything about the certainty with which the hypothesis is correct. It merely indicates the probability of obtaining the observed test result (or a more extreme one) if in reality not the test hypothesis but its opposite, the so-called null hypothesis, were true.21 That, however, is not a statement about the correctness of the hypothesis.

The p-value says nothing about

  • how correct or reliable a scientific test result is
    or
  • how reliably a result can be repeated

In addition, in certain constellations quite unambiguous study results receive a poor p-value, which leads to absurd interpretations that can amount to the opposite of what the data show.21

Many scientists are in favor of abolishing the p-value; others want to make the threshold below which a test result is considered significant considerably stricter (currently p < 0.05, i.e. 95% certainty), for example p < 0.01 (99%) or p < 0.005 (99.5%).

From an observer's point of view, it currently makes the most sense to check whether a result has been confirmed not just by one but by as many studies as possible, each with a high sample size (power) and a solid p-value (statistical significance). Even with an optimal p-value, a single study cannot rule out that the hypothesis was confirmed merely by chance.
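A small simulation makes this concrete (a sketch: every simulated "study" below compares two groups drawn from the same distribution, so the null hypothesis is always true): about 5% of these studies nevertheless come out "significant", which is exactly why a single study with p < 0.05 can still be a chance finding.

```python
# Even when the null hypothesis is true in every "study", about 5% of them
# still produce p < 0.05 - purely by chance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_studies = 10_000
significant = sum(
    ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue < 0.05
    for _ in range(n_studies)
)
print(significant / n_studies)  # roughly 0.05
```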

5.3. Correlation is not causality

Correlation means that there is a statistical relationship between data.
Positive correlation means that both values rise or fall together.
Negative correlation means that value 1 rises while value 2 falls, or vice versa.
Correlation does not mean that one value is causal for the other.

Causality means that one factor is a cause of another factor, i.e. that one factor has an effect on the other factor.

Examples:

  • That maternal smoking during pregnancy correlates with increased ADHD in the children does not yet say how much of the smoking is causal for the ADHD, since ADHD correlates with increased smoking. Since ADHD is strongly heritable, mothers with ADHD are more likely to smoke and more likely to have children with ADHD, whether they smoke or not. Nevertheless, smoking is likely to have a causal influence on the child’s ADHD - just probably not as strong as the correlation makes it seem.
    To separate the influence of genetics from the influence of smoking, one could now study mothers with ADHD who a. smoked or b. did not smoke during pregnancy, or mothers without ADHD who a. smoked or b. did not smoke during pregnancy.
  • ADHD in children correlates with low household socioeconomic status. This does not mean that low household income causally causes ADHD, because ADHD (of the parents) correlates with lower education, and lower education correlates with lower income. Low household income is therefore also likely to be a consequence of parental ADHD.
  • Only-child status correlates with increased ADHD. Nevertheless, ADHD probably does not result from having no siblings, but rather from birth complications: birth complications correlate with increased ADHD, and a woman's first birth carries a higher risk of birth complications (some physicians even consider first-time mothers high-risk patients). Since only children are by definition first-born, they are more often affected by birth complications.

A nice collection of absurd correlations that contain no causality whatsoever can be found at Tyler Vigen: spurious correlations.
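A minimal simulation of such a spurious correlation with purely synthetic data (all variables and values are assumptions for illustration): a confounder Z influences both X and Y, X has no effect on Y, and yet X and Y correlate clearly.

```python
# Correlation without causation via a confounder: Z drives both X and Y,
# X has no causal effect on Y, yet X and Y correlate. Synthetic example.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
z = rng.normal(size=n)       # confounder (a shared cause)
x = z + rng.normal(size=n)   # X depends on Z only
y = z + rng.normal(size=n)   # Y depends on Z only, not on X

print(round(np.corrcoef(x, y)[0, 1], 2))  # about 0.5, although X does not cause Y
```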

6. ADHD studies without ADHD diagnosis

In recent years, there has unfortunately been an increasing bad habit of using the ASRS alone to "diagnose" "ADHD" in studies.
The ASRS is a screening instrument comprising only 6 questions that is used to screen for ADHD. Even though the ASRS has surprisingly good sensitivity and specificity values, it is not an adequate means of diagnosing ADHD. It would be malpractice to make a diagnosis or even prescribe medication solely on the basis of an ASRS result.
All the less should scientific “findings” be produced on this basis.
Nevertheless, we did not exclude studies conducted in this way from ADxS because they can provide additional insights when viewed in conjunction with other studies.

It is also problematic that in only 35% of all RCTs on ADHD the test subjects are diagnosed by a physician or psychologist.22

7. Rules for good science

To address the problem of false-positive publications, Simmons et al.8 propose:

  1. Requirements for authors
    1.1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article
    1.2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification
    1.3. Authors must list all variables collected in a study
    1.4. Authors must report all experimental conditions, including failed manipulations
    1.5. If observations are eliminated, authors must also report what the statistical results would be if those observations were included
    1.6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate
  2. Guidelines for reviewers
    2.1. Reviewers should ensure that authors follow the requirements
    2.2. Reviewers should be more tolerant of imperfections in results
    2.3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions
    2.4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication

  1. Ioannidis (2005): Why Most Published Research Findings Are False. PLoS Medicine 2, e124

  2. Kahneman, Daniel (2011): Schnelles Denken, langsames Denken. Siedler, pp. 139, 142 ff.

  3. https://de.wikipedia.org/wiki/Zufallsstichprobe#Stichprobenumfang

  4. http://www.math.uni-sb.de/ag/wittstock/lehre/SS03/wth/ewth_12.loesung.pdf

  5. http://www.spiegel.de/wissenschaft/mensch/umfragen-in-deutschland-wir-wuerfeln-uns-eine-studie-a-1052493.html

  6. John, Loewenstein, Prelec (2012): Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling. Psychological Science 23, pp. 524-532

  7. Yong (2013): SZIENTOMETRIE: Jede Menge Murks. Viele wissenschaftliche Studien lassen sich nicht reproduzieren. Das wirft Fragen zum Forschungsbetrieb auf – und zur Veröffentlichungspraxis von Fachzeitschriften. Spektrum

  8. Simmons JP, Nelson LD, Simonsohn U (2011): False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011 Nov;22(11):1359-66. doi: 10.1177/0956797611417632. PMID: 22006061.

  9. In more detail: http://www.casinozocker.com/roulette/wahrscheinlichkeiten-und-mathematik/

  10. Silberzahn, Uhlmann (2015): Crowdsourced research: Many hands make tight work. Nature 526, pp. 189-191

  11. Grawe (2004): Neuropsychotherapie, pp. 216-230

  12. Grawe (2004): Neuropsychotherapie, pp. 216-230

  13. Eklund, Nichols, Knutsson (2016): Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, n = 499

  14. Charisius: Trugbilder im Hirnscan, Süddeutsche Zeitung, 06.07.2016, page 16

  15. Schönberger (2015): Neuroforschung: Ein Fehler stellt Tausende Gehirnstudien infrage; Profil

  16. Boubela, Kalcher, Huf, Seidel, Derntl, Pezawas, Našel, Moser (2015): fMRI measurements of amygdala activation are confounded by stimulus correlated signal fluctuation in nearby veins draining distant brain regions; Scientific Reports 5, Article number: 10499 (2015) doi:10.1038/srep10499

  17. Förster (2016): Verfälschte Tabellen: Excel bereitet Genforschern Probleme

  18. Christensen, Christensen (2016): Tücken der Statistik: Denken Sie immer falsch positiv! Spiegel Online

  19. Nuzzo (2014): UMSTRITTENE STATISTIK – Wenn Forscher durch den Signifikanztest fallen. Grobe Fehler in Statistik: Der “p-Wert” gilt als Goldstandard, doch er führt in die Irre. Er schadet damit seit Jahren der Wissenschaft. Spektrum.

  20. Honey (2016): Eine signifikante Geschichte; Spektrum

  21. Amrhein (2017): Das magische P, Süddeutsche Zeitung, Wissenschaft, print edition 23./24.09.17, page 37

  22. Studart I, Henriksen MG, Nordgaard J (2025): Diagnosing ADHD in adults in randomized controlled studies: a scoping review. Eur Psychiatry. 2025 Apr 14;68(1):e64. doi: 10.1192/j.eurpsy.2025.2447. PMID: 40226998; PMCID: PMC12188335.

This page was last updated on 18.11.2025.