Why scientific studies are often inconclusive

Reports on scientific studies that produce (or refute) surprising results are read with interest. For readers (especially laypeople, but unfortunately also some experts), it is hardly possible to tell whether the results of a study are correct. Correct here means that the research result can be repeated (replicated) by other studies and that the conclusions drawn from it are sound.

More than half of all the results of scientific studies are simply wrong.¹

If you pay attention to certain criteria, you can better assess the significance and reliability of a study.
Knowing some statistical basics helps to understand that one study no more proves a finding than one swallow makes a summer; at best, it is a reason for other authors to check and question the result in further studies.
Only when several authors have come to comparable results in different studies (replication of results) is there a certain degree of certainty that a finding is correct.

A clean scientific methodology would look like this:

- Register the planned study
- First, define a thesis
- Second, collect the data needed to test it
- Third, evaluate the data
- Finally, publish whether the thesis was confirmed or not
- For active ingredients: test double-blind (neither the subject nor the assessor knows who is receiving the active ingredient and who is receiving a placebo)
  - In open-label studies, by contrast, both the subject and the evaluator know who receives the active ingredient and when

This methodology can be undermined in many places. Here are some common sources of error.

1. (Re)formulation of theses after data collection
2. Errors in data collection
- 2.1. Sample size too small or too large (n)
  - 2.1.1. Sample size too small
  - 2.1.2. Sample size too large
- 2.2. Data collection until the result fits
3. Errors in data evaluation
4. Publication of only favorable results
5. Interpretation errors by readers
- 5.1. The false positive trap
- 5.2. The p-value misunderstanding

1. (Re)formulation of theses after data collection

Sometimes the thesis is only formulated after the data has been collected and analyzed. This is particularly the case when the original thesis has not proven to be true.

This approach could be defended with the argument that it is pure chance which thesis a scientist happens to have in mind before collecting the data. As a thought experiment, imagine a large number of research teams who all collect the same data and who each start from a different hypothesis generated by chance. Some find their thesis confirmed, others find it refuted. Does it change the truth if only the team that had the correct thesis presents its result at the end?
The statisticians’ answer is: yes, it changes something. Because a thesis is not an arbitrarily interchangeable view.
In our thought experiment, many theses would be considered disproved if all research teams had formulated their theses beforehand. The fact that one of the research teams had the correct thesis would then carry a different weight.
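Why the number of hypotheses matters can be illustrated with a small calculation. The sketch below rests on assumptions that are not part of the text: every team tests a hypothesis that is actually false, the tests are independent of one another, and each uses a significance level of 5%. The function name is purely illustrative.

```python
def chance_of_spurious_confirmation(num_teams: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_teams false hypotheses is
    'confirmed' purely by chance (assumes independent tests at level alpha)."""
    return 1.0 - (1.0 - alpha) ** num_teams


for k in (1, 5, 20, 100):
    print(f"{k:>3} teams: {chance_of_spurious_confirmation(k):.0%}")
```

Under these assumptions, with 20 teams it is already more likely than not that at least one of them reports a "confirmed" thesis, even though none of the theses is true.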

Irrespective of this, this thought experiment shows the importance of replication studies.

Research results should only be considered reliable once they have been replicated several times. New and unexpected results may be more entertaining - but as a guide to reality, they are about as helpful as the articles in some “newspapers”, which are read daily for their surprise value and not for their factual content. “Man bites dog” attracts more attention than “dog bites man”. But which is closer to reality?
There is nothing wrong with earning money this way, whether as a journalist or a researcher. You should just make it clear what you are selling. Presenting entertainment as a report on reality is a deception that not all readers are able to see through.

Science would be well advised to separate entertainment and knowledge of reality more clearly.
Holding back results that have not yet been replicated would help to avoid many errors.

2. Errors in data collection

2.1. Sample size too small or too large (n)

2.1.1. Sample size too small

The problem with many studies is that the sample size (n) is too small.
Studies with 10, 15 or 20 test subjects are common.

The Nobel laureate in economics Daniel Kahneman² points out that studies with samples that are too small cannot make any statement about the thesis under investigation.
If the sample size (n) is too small, the influence of chance outweighs that of the actual facts. The result of such a study no longer says anything about whether the hypothesis investigated is true or false - it is nothing more than a random result.

The fact is that most scientists (including Kahneman himself for a while, as he noted) choose a sample that is clearly too small when they determine the required sample size (n) intuitively.

On the other hand, a small sample (e.g. 20) is not always harmful, but can be quite useful. The prerequisite is that the groups are sufficiently matched and certain biases are controlled. However, the results found always require replication.

Samples that are too large also have disadvantages. They can make very small, practically irrelevant differences appear statistically significant. If the result is then evaluated only in terms of significance, without evaluating the strength of the effects identified, this can be just as misleading as a sample size that is too small.

It is therefore important that the optimum sample size is determined (e.g. using G*Power).

Estimate for yourself:

How many rolls of a die are required to say with 95% certainty that there is only one 6 on the die (6 possibilities)?

Solution

This requires a sample of n = 17 rolls.
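The text does not say how this number was derived. One calculation that arrives at the same value, offered here only as an assumption, is the smallest number of rolls after which at least one 6 has appeared with a probability of at least 95%. The function name and parameters are illustrative.

```python
import math


def rolls_needed(p_hit: float = 1 / 6, certainty: float = 0.95) -> int:
    """Smallest n for which P(at least one hit in n rolls) >= certainty."""
    # P(no hit in n rolls) = (1 - p_hit) ** n must fall to 1 - certainty or below.
    return math.ceil(math.log(1 - certainty) / math.log(1 - p_hit))


print(rolls_needed())  # 17
```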

Another example:

How large must the sample size n be in order to predict the approval rating for a party in an election poll to within 1%?

Solution

This requires a sample size of n = 2167 voters.³

A third example:

A coin toss has only two outcomes: heads or tails. How many coin tosses are necessary to be able to say with a certainty of 95% (which is the goal of most scientific studies) that the split between heads and tails is no more uneven than 49:51 (which is much less precise than an exact 50:50 distribution)?

Solution

This requires a random sample of n = 9604 coin tosses. And this despite the fact that there are only 2 possibilities: heads and tails.⁴
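The figure of 9604 matches the common normal-approximation formula for estimating a proportion, n = z² · p · (1 - p) / margin², with p = 0.5 and a margin of one percentage point. The following sketch assumes that this formula is what underlies the number; the function names are illustrative. It also shows the reverse direction: how wide the margin of error is for small samples.

```python
import math
from statistics import NormalDist


def required_sample_size(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion p to within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


def margin_of_error(n: int, confidence: float = 0.95, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion with n observations."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


print(required_sample_size(0.01))  # 9604
for n in (20, 30, 50, 1000):
    print(f"n = {n:>4}: +/- {margin_of_error(n):.1%}")
```

With 20 observations the margin is roughly plus or minus 22 percentage points, which is why results from very small samples say so little.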

A study with 20 or 30 test subjects therefore has only very limited informative value and must be viewed with considerable caution.

Rule of thumb: A single study in the psychological or medical field with fewer than 50 test subjects (n < 50) should only be taken into account if further studies confirm the results.

Studies can very easily be manipulated by random results.⁵

For this topic, we would appreciate an expert addition from experienced statisticians who could explain which samples are required for typical studies in psychological / neuro(physio)logical questions (such as those on ADHD) in order to obtain an appropriately reliable statement.

2.1.2. Sample size too large

If the sample size is too large, there is a risk that (statistically) significant results will be found whose effect size, and thus practical relevance, is negligible. If the significance found is then not put in relation to the relevance, readers who do not analyze the results in detail will be misled about the relevance, which ultimately leads to considerable misinformation.

Example of significance versus relevance

Two car manufacturers offer their vehicles in 3 colors: white, black and red. The cars purchased

from manufacturer A are 41% black, 40% white and 19% red,
from manufacturer B are 41% black, 31% white and 28% red.

Studies of both manufacturers will show that black is the customers’ favorite color. For manufacturer A, the sample must be much larger before the lead of black becomes statistically significant (i.e. before a lead of this size would arise by chance less than 5% of the time if buyers actually had no preference).

Nevertheless, the result that buyers significantly prefer black says very little for A, because black’s share is so close to white’s, but a great deal for B, because the gap to white is large.
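How different the required samples are can be estimated roughly. The sketch below is an illustration under assumptions that are not in the text: a two-sided large-sample z-test at the 5% level for the difference between two shares from the same sample, with the shares above taken as the true values. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def buyers_needed(share_first: float, share_second: float, confidence: float = 0.95) -> int:
    """Approximate number of buyers needed before the lead of the first color
    over the second becomes statistically significant."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = share_first - share_second
    # n * Var(p1_hat - p2_hat) for two shares of the same multinomial sample
    variance = share_first + share_second - diff ** 2
    return math.ceil(z * z * variance / diff ** 2)


print("Manufacturer A:", buyers_needed(0.41, 0.40))
print("Manufacturer B:", buyers_needed(0.41, 0.31))
```

Under these assumptions, manufacturer A needs a sample in the order of 31,000 buyers, manufacturer B fewer than 300. Both results would then be "significant", although the preference for black is of practical relevance only for B.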

2.2. Data collection until the result fits

In a survey of 2,000 scientists, more than half admitted that they first check the significance of the results of their own scientific studies and then decide whether to collect further data. Data is then collected until a positive result emerges. 40% of the survey participants had produced and published selective studies in this way. Most of them also considered this to be correct.⁶⁷

To illustrate the flaw in this approach:
Time and time again, people believe that they can win at roulette simply by doubling their stake again and again when betting on red or black.
Common sense already shows that this does not work: casinos still exist today. If this system worked, all casinos would have gone out of business long ago.
Mathematically speaking, it does not work because roulette has the zero, on which the bank wins. This small probability of 1 in 37 is enough to make the probability of being able to play a series of doublings until your color comes up lower than the probability of the bank winning.⁸
If there were no bank in roulette, doubling down on red and black would be a sure way to win (and all casinos would be out of business).

Since there is no bank in science, continuing to collect data until, at some point, a data set happens by chance to confirm the hypothesis is merely a question of diligence and perseverance and not a question of whether the hypothesis is correct.
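The effect of this practice can be made visible with a small simulation. It is only a sketch under stated assumptions: the hypothesis is actually false (the data are pure noise with mean 0 and standard deviation 1), significance is checked with a simple z-test at the 5% level, and the researcher either looks once after 100 observations or looks after every 10 observations and stops as soon as the result is significant. All names and parameter values are illustrative.

```python
import math
import random


def is_significant(values: list[float]) -> bool:
    """Two-sided z-test of 'mean = 0' at the 5% level (known standard deviation 1)."""
    z = sum(values) / math.sqrt(len(values))
    return abs(z) > 1.96


def false_positive_rate(peek: bool, runs: int = 2000, batch: int = 10,
                        max_n: int = 100, seed: int = 1) -> float:
    """Share of simulated studies that 'confirm' an effect that does not exist."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        data: list[float] = []
        significant = False
        while len(data) < max_n:
            # collect another batch of pure noise (the null hypothesis is true)
            data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
            if peek and is_significant(data):
                significant = True  # stop collecting as soon as the result "fits"
                break
        if not peek:
            significant = is_significant(data)  # a single test at the planned n
        hits += significant
    return hits / runs


print(f"single test at n = 100:  {false_positive_rate(peek=False):.1%}")
print(f"test after every batch:  {false_positive_rate(peek=True):.1%}")
```

With a single planned test, the rate of false confirmations stays near the nominal 5%. With repeated checking it is several times higher, and it keeps growing the longer data collection is allowed to continue.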

3. Errors in data evaluation

3.1. Data evaluation criteria selected to support the thesis

Further distortions of research results arise from the subjective selection of data evaluation criteria. Silberzahn and Uhlmann⁹ had 29 groups of scientists examine an identical large data set (n > 2000). As expected (according to the Gaussian distribution curve), the results of most groups were essentially consistent, while the results of individual groups differed considerably.
More important, however, was the realization of which factors led to these deviations in results: on the one hand, it was the choice of mathematical statistical models that were used (cluster analysis, logistic regression or linear models) and, on the other hand, primarily the decisions made with regard to the evaluation technique of the data sets. In other words, decisions that a reader of a research result cannot even perceive. It is not a question of a deliberate distortion of the results by the scientists, but of massive influences on the results that do not originate from the factual question itself.

Silberzahn and Uhlmann⁹ conclude from this that a single study, even with a large sample size (n, see above), does not allow a reliable statement to be made as to whether the study result is correct.

Only the summary of several studies on the same topic with the same or different data sets provides certainty regarding the accuracy of the results.

Conclusion: A study with a not too small number of test subjects by a renowned research group with the cleanest data transparency is a good indication. However, before trusting the result, one should wait and see whether the observation is confirmed by further studies (replication).

However, even a high number of studies does not protect against data bias in certain areas. Grawe¹⁰ vividly describes how studies on the treatment of depression are distorted by economic or other interests.

Pharmaceutical manufacturers financed 28 of the 29 studies on medication for depression analyzed by Grawe; all of the 48 further studies on psychological treatment methods that were examined were financed by public bodies. None of the pharmacological studies included a catamnesis (long-term follow-up of treatment success), whereas 30 of the 48 studies on psychological treatment did.

Depression very often shows spontaneous remission within 10 weeks (the symptoms disappear without intervention).

The pharmacological studies primarily used the HAMD, MADRS and CGI to measure success, all of which are based on externally assessable symptoms. These are the symptoms that go away particularly well in spontaneous remission.
The control groups in the pharmacological studies showed an average effect size of 1.82 on the MADRS, while the medication showed an effect size of 1.88. This means that the symptom improvement due to the medication, measured on the MADRS, was only 0.06 greater than the spontaneous symptom improvement in the control subjects who received no treatment.

The studies of the psychological treatment methods primarily used the BDI and self-report measures, which show a significantly weaker effect size in spontaneous remission. The (non-treated) control groups of the psychological treatment methods showed an effect size of 0.97. Cognitive therapy had an effect size of 1.33, which is 0.36 higher than the control group, and cognitive behavioral therapy an effect size of 1.54, which is 0.57 higher. Interpersonal therapy had a net effect size of 0.50, present-oriented psychodynamic brief therapies a net effect size of 0.79 and couple therapies a net effect size of 0.96.

And yet, according to Grawe, only 13% to 14% of psychologically treated patients are permanently free of their depression. 25 % reject disorder-oriented treatment in principle, and a further 13 % to 25 % discontinue ongoing therapy. Of the remaining 64%, half achieve a clinically significant improvement. Of the 32% who were successfully treated in the short term, almost two thirds suffer a relapse within 2 years.¹⁰

In fairness to the pharmaceutical industry, it must be said, firstly, that Grawe is more of a representative of the psychotherapeutic camp and, secondly, that antidepressants do have a helpful effect - albeit nowhere near as strong as the studies would like to convey.
Nevertheless, the example shows how data analysis can be steered in the desired direction by selecting suitable measurement instruments. And this does not only apply to the pharmaceutical industry. The studies on psychological treatment methods have also used evaluation standards that are more favorable to them.

In our opinion, the treatment of depression with medication is very different from that of ADHD, as ADHD is a lifelong disorder and the effect size of ADHD medication can only be determined for the period of time it is taken.

3.2. Data analysis until the result fits (Torture your data until they confess)

Another method that impairs the reliability of results is when, contrary to proper scientific methodology, the data collected is analyzed (using different methods) until it confirms the thesis in some aspect.

As a rule, the method of data analysis is not already determined with the definition of the thesis. This leeway is sometimes used inappropriately.
In the publication itself, the previously tried and rejected data analysis methods are usually not described.

3.3. Incorrect application of statistical methods

In 2016, it was found that the three most common evaluation programs for fMRI images, when used incorrectly, produced false-positive rates that were up to 13 times too high (up to 70% instead of the correct maximum of 5%).¹¹ This calls into question the results of around 40,000 studies in which fMRI was used.
Primarily affected are more recent studies on emotions and thought processes in which data from several test subjects are combined.¹² These errors do not occur if the statistical tools are used correctly. However, many scientists do not work carefully enough here.

A different analysis method for fMRI data, although much more computationally intensive, would avoid the possible errors.¹¹

3.4. Measurement error

Another error in fMRI evaluations arose because many thousands of studies drew conclusions about the activity of the amygdala - while in reality the measurements said nothing about the amygdala, but about the blood flow in a nearby vein.¹³¹⁴

3.5. Excel error

Scientists report data corruption due to incorrect use of Excel. Up to 20% of Excel files on genetic data contain errors introduced by Excel.¹⁵

In summer 2020, it became known that a number of genes were being renamed because their names frequently led to evaluation errors in Excel, which interpreted them as dates. The error would not occur if the scientists consistently formatted the name fields of the genes as text (which would be very easy to do).
The fact that errors that are so easy to avoid nevertheless occur so frequently that genes are renamed is a strong indication that Excel is often used incorrectly, even with regard to the simplest operations.

4. Publication of only favorable results

An unfortunately quite common method of manipulation is to conduct a large number of studies, of which only those are published that show results that are acceptable to the funder or the authors.

As studies can show a certain range of different results, there is a statistical scattering of results, which roughly corresponds to a Gaussian distribution curve, even if the facts are certain. Most results are close to the actual situation. The further the results deviate from this, the less frequently they will occur.
Described figuratively, this corresponds roughly to a pile of sand created by grains of sand falling from above at a precise point. At its highest point, the pile of sand indicates where the grains of sand fall onto it.
But even a little wind can distort the result. Gusty winds even more so. And so there are many factors that can influence a result.

The method of publishing only acceptable results requires high resources (money, time). Only market participants with correspondingly high (usually economic) interests can afford this method.
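How strongly selective publication can distort the overall picture can be sketched with a simulation. All numbers in it are assumptions chosen purely for illustration: a small true effect of 0.1, a standard error of 0.2 per study, 2000 studies, and a funder who publishes only those studies that show a significantly positive result at the 5% level.

```python
import random
from statistics import mean


def simulate_selective_publication(true_effect: float = 0.1, standard_error: float = 0.2,
                                   studies: int = 2000, seed: int = 7) -> tuple[float, float, int]:
    """Return (mean of all estimates, mean of published estimates, number published)."""
    rng = random.Random(seed)
    # every study estimates the same true effect, scattered by chance
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(studies)]
    # only "favorable" studies are published: significantly positive results
    published = [e for e in estimates if e / standard_error > 1.96]
    return mean(estimates), mean(published), len(published)


all_mean, published_mean, count = simulate_selective_publication()
print(f"mean effect over all studies:       {all_mean:.2f}")
print(f"mean effect over published studies: {published_mean:.2f} ({count} studies)")
```

The mean over all studies lies close to the true effect, in line with the pile-of-sand picture above. The published studies alone, however, suggest an effect several times larger than the one that actually exists.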

Registering a study before it is conducted helps to prevent such manipulations.

5. Interpretation errors by readers

Another source of error arises from the fact that the test results are misinterpreted by readers (including experts).

5.1. The false positive trap

A good test has a high sensitivity and a high specificity.

Sensitivity is the quality of the true-positive result: how many of the cases actually present (infections, cancer cases, ADHD) are recognized?

Specificity is the quality of the true-negative result: how many of the people who do not have the condition are recognized as not having it?

If a test procedure has a sensitivity and a specificity of 95% each (e.g. standard scarlet fever rapid tests) and the base rate (the actual proportion of affected or infected persons) is 0.5%, then out of 20,000 people tested, 100 are actually affected (assumed base rate of scarlet fever). Of these, 95 are correctly identified - but 5 are not. At the same time, 995 of the 19,900 people who are not affected are incorrectly diagnosed as positive.¹⁶ A positive result is therefore correct in only 95 of 1,090 cases, i.e. in fewer than 9%.
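The arithmetic behind this example can be retraced in a few lines. The sketch uses the figures from the text (20,000 people tested, base rate 0.5%, sensitivity and specificity 95% each); the function and key names are illustrative.

```python
def screening_outcome(population: int, base_rate: float,
                      sensitivity: float, specificity: float) -> dict[str, float]:
    """Split a tested population into true/false positives and negatives."""
    affected = population * base_rate
    unaffected = population - affected
    true_positives = affected * sensitivity
    false_negatives = affected - true_positives
    false_positives = unaffected * (1 - specificity)
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        # share of positive results that are actually correct
        "positive_predictive_value": true_positives / (true_positives + false_positives),
    }


result = screening_outcome(20_000, 0.005, 0.95, 0.95)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The last value, the positive predictive value, is the share of positive test results that are actually correct. With a base rate this low, it is below 9% even though the test itself is right in 95% of cases.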

The results are similar for breast cancer diagnoses, which is why mammography screenings are subject to considerable criticism, as the number of women who have their breasts removed unnecessarily due to a false positive result is many times higher than the number of women who have their breasts removed due to a true positive result.

Even many gynecologists who diagnose breast cancer succumb (or succumbed until intensive education on the subject in recent years) to misjudgments in their assessments for this reason.

5.2. The p-value misunderstanding

The p-value (from probability) indicates whether a measured result can also be explained by chance. However, the p-value says nothing about the actually interesting question “Is the hypothesis correct”.¹⁷¹⁸

It is therefore wrong to assume that a low p-value of less than 5%, i.e. less than 0.05, says anything about how certain it is that the hypothesis is correct. It merely indicates the probability with which the test result (or a more extreme one) would be obtained if in reality not the test hypothesis but its opposite, the so-called null hypothesis, were true.¹⁹ This is not a statement about the correctness of the hypothesis.
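A small example shows what a p-value actually computes. The scenario is an assumption chosen for illustration and does not come from the text: a coin is tossed 100 times and shows heads 60 times. The null hypothesis is that the coin is fair; the function name is hypothetical.

```python
from math import comb


def two_sided_p_value(heads: int, tosses: int) -> float:
    """Exact binomial test against a fair coin (null hypothesis: p = 0.5)."""
    distance = abs(heads - tosses / 2)
    # count all outcomes at least as far from the expected value as the observed one
    extreme = sum(comb(tosses, k) for k in range(tosses + 1)
                  if abs(k - tosses / 2) >= distance)
    return extreme / 2 ** tosses


print(f"p = {two_sided_p_value(60, 100):.3f}")
```

The calculation uses only the null hypothesis (a fair coin). It yields the probability of a result at least this uneven if the coin is fair, here just under 6%. The probability that the coin really is biased does not appear anywhere in it.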

The p-value says nothing about

- how correct or reliable a scientific test result is, or
- how reliably a result can be repeated.

In addition, in certain constellations, quite unambiguous study results are given a miserable p-value, which leads to absurd interpretations that go as far as the opposite of the study results.¹⁹

Many scientists are in favor of abolishing the p-value, while others want to make the threshold below which a test result is considered significant considerably stricter (currently p < 0.05).

It is probably most sensible to ensure that a result has been confirmed not just by one, but by as many studies as possible, each with a high n-number and a solid p-value. Even with an optimal p-value, a single study is not proof of the hypothesis under investigation.
