

By Hindemburg Melão Jr.


On March 20, 2019, the journal Nature published an article titled “Scientists rise up against statistical significance,” subtitled “Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.”

I started reading the article and was shocked by the number of scientists who are apparently not interpreting the results of statistical tests correctly.

I also found it surprising that the referees and editors of one of the most reputable scientific journals approved this publication, in which the authors imply that subjective, pseudoscientific evaluations are superior to objective, rigorous methods, when what is actually happening is a misuse of statistical tools, with inadequate interpretation of the results obtained. Trading scientific methods for subjective opinions would be a throwback to the Middle Ages.

I believe that the authors of the article, as well as its signatories, are well-meaning people who are sincerely concerned about the inconsistency they see between the results of their statistical tests and what their intuition suggests, but they do not realize that the problem lies not in the statistical tools they use, but in how they use them and, above all, in how they interpret the results.

Very briefly: these “scientists” are not considering the asymmetry between the probabilities in the test result and/or the uncertainties in those values.

Suppose a study compares the effectiveness of a placebo with that of an active ingredient, and the results show no difference at a significance level of 0.05. This means that there is less than a 95% probability that the active ingredient is more effective than the placebo. Such a result may or may not be inconclusive, depending on other factors.

If the probability is less than 95%, the test may have shown that the active ingredient has a 94% probability of being more efficient than the placebo; since the criterion is 95%, the result did not meet the previously established threshold, but this obviously does not mean that the placebo and the active ingredient are equally efficient. On the contrary, the results may have indicated that the probability that the active ingredient is more efficient is 94%, versus 6% for the placebo.

To deal with these situations, researchers need to decide whether to increase the sample, so that the test becomes more sensitive, or to relax the criterion by using a higher degree of significance, such as 0.1.
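As a rough illustration of this trade-off (the numbers below are invented for the sketch, not taken from any study), the same observed gap in recovery rates, 60% versus 50%, only crosses the usual one-sided 5% cut-off (z ≈ 1.645) once each group is large enough:

```python
import math

# Illustrative sketch: same observed effect, increasing sample sizes.
p1, p2 = 0.60, 0.50            # observed recovery rates (assumed numbers)
p_pool = (p1 + p2) / 2         # pooled rate under "no difference"
zs = {}
for n in (25, 100, 400):       # patients per group
    # standard error of the difference between two proportions
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    zs[n] = (p1 - p2) / se
    print(n, round(zs[n], 2))  # prints: 25 0.71, then 100 1.42, then 400 2.84
```

Only the largest sample clears the 1.645 threshold; with the smaller samples, the very same observed difference remains inconclusive, which is exactly the choice the researcher faces: enlarge the sample or relax the cut-off.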

If the purpose of the study is to choose which of two procedures is more suitable, a significance level of 0.5 could be used as the cut-off. But this raises a problem: how reliable, statistically, are the results obtained? The higher the degree of significance, the greater the sensitivity of the test for detecting small differences in very small samples; on the other hand, the greater the uncertainty in the result.
Let's look at two examples:

Case 1:

A double-blind trial uses data collected on 12 patients: 6 received a placebo and the other 6 received a treatment being investigated for whether it produces better results than the placebo. At the end of the study, 5 patients who received the drug recovered within 30 days, while 3 who received the placebo recovered within the same 30-day interval.

Leaving aside subjective questions about what it means to “have recovered,” and the particularities of the patients (age, general health, eating habits, lifestyle, etc.), and assuming that the 6 patients in one group are approximately equivalent to the 6 in the other, random factors still carry a very large weight in such small groups, and the 5-to-3 advantage cannot be regarded as sufficient evidence that the treatment is better than the placebo. If the same study were repeated with two other groups of 6 people, equivalent to the 12 in the first study, there would be a very high risk of a completely different result, something like 1-to-5 instead of 5-to-3.

In this context, a significance level of 0.05 would correctly show that the 5-to-3 result is inconclusive. If a significance level of 0.5 were used, the interpretation would be that the treatment was better than the placebo, but such a result evidently could not be regarded as satisfactory evidence, because the samples are very small and the results can be heavily distorted by random fluctuations. It would therefore be necessary to increase the sample to at least a few dozen people and verify whether the treatment remains superior to the placebo.
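The 5-to-3 outcome can be checked directly with a one-sided Fisher exact test, computed here from the hypergeometric distribution (a sketch; the function name is ours, and only the standard library is used):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    probability of seeing a or more treatment recoveries by chance alone."""
    n = a + b + c + d          # total patients (12)
    K = a + c                  # total recoveries (8)
    row = a + b                # treatment group size (6)
    p = 0.0
    for k in range(a, min(K, row) + 1):
        # hypergeometric probability of exactly k recoveries in the treatment group
        p += comb(K, k) * comb(n - K, row - k) / comb(n, row)
    return p

# 5 of 6 recovered on treatment vs 3 of 6 on placebo
p = fisher_one_sided(5, 1, 3, 3)
print(round(p, 3))  # 0.273
```

A one-sided p-value of about 0.27 is nowhere near 0.05, which agrees with the article's point that a 5-to-3 split in groups of 6 is far from conclusive.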

Case 2:

Now let's look at a different example: a study compares the math scores of afternoon students with those of morning students. The study includes 700 afternoon students and 750 morning students; the grade averages turn out to be 6.43 in the morning and 6.22 in the afternoon. At a significance level of 0.05, there is again no statistically significant difference; that is, the probability that the morning students' grades are higher than those of the afternoon students is less than 95%. Intuitively, however, and unlike Case 1, the morning results seem consistently better, and samples of hundreds of students seem sufficient for the observed difference to be significant.
Let's say that in Cases 1 and 2 the significance level turned out to be the same, 0.12; that is, the probability that the treatment is better than the placebo is 88%, and the probability that the morning students are more skilled in mathematics is also 88%. So what is the difference between Cases 1 and 2?

The answer is quite simple: with larger samples, the uncertainty in the significance level is smaller. In the first case, let's say the significance level is 0.12 ± 0.45, while in the second case it is 0.120 ± 0.038. In both cases it is 0.12, but since the samples are about 120 times larger in the second case, the uncertainty should be about 11 times smaller (the square root of 120). In the first case, the uncertainty is even greater than the measured variable itself, which makes the result of the study almost useless.
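The scaling claim is easy to check numerically (a quick sketch using the sample sizes from the two cases):

```python
import math

# Case 2's sample (700 + 750 = 1450) is about 120 times Case 1's sample of 12,
# so the uncertainty should shrink by roughly sqrt(120), i.e. about 11 times.
n1, n2 = 12, 700 + 750
ratio = n2 / n1
shrink = math.sqrt(ratio)
print(round(ratio), round(shrink, 1))   # 121 11.0
print(round(0.45 / shrink, 3))          # 0.041, close to the stated 0.038
```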

Before proceeding, a small addition is necessary: since statistical significance is a variable that can only assume values between 0 and 1, this variable would need to be placed on an appropriate scale before calculating the uncertainty, to avoid absurd results such as a probability greater than 0 that the significance is greater than 1 or negative. But since this topic is covered in other articles (see the article on Hans Rosling, the lecture on the Sharpe ratio, etc.), we will keep the simple notations 0.12 ± 0.45 and 0.120 ± 0.038 here, as this does not interfere with the argument for this specific problem.

Therefore, when a study is based on sufficiently large samples, not only is the test more sensitive at detecting differences, but the uncertainty in that sensitivity is also smaller, which allows the researcher to use higher levels of significance without compromising the quality of the study.

In the second case, instead of a 0.05 significance level, we could use 0.15 or 0.20, and we could say that the results suggest a 98% probability that there is more than an 84% probability that the morning students get better grades than the afternoon students.
In the first case, there is a 98% probability that there is more than a 43% probability that the treatment is better than the placebo.

This is because the uncertainty in the first case is 0.45, and (1-0.12)-0.45=0.43. If the uncertainty represents a confidence interval of 2 standard deviations, then there is a 98% probability that the correct significance level value is less than 0.57. In the second case the uncertainty is 0.038, so (1-0.12)-0.038=0.842, so there is a 98% probability that the correct significance level value is less than 0.158.  
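The arithmetic above can be verified in a few lines:

```python
level = 0.12
u1, u2 = 0.45, 0.038   # stated 2-standard-deviation uncertainties, Cases 1 and 2

# lower bounds on "probability that the treatment / morning group is better"
print(round((1 - level) - u1, 3))   # 0.43
print(round((1 - level) - u2, 3))   # 0.842

# equivalently, 98% upper bounds on the true significance level
print(round(level + u1, 3))         # 0.57
print(round(level + u2, 3))         # 0.158
```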

Thus, the difference between the two situations, and between the interpretations of the two results, is quite clear: in one case it is necessary to increase the samples, while in the other it is possible to choose between increasing the samples or increasing the degree of significance.
If the samples are too small, there is no way to make reasonably safe inferences and there is no alternative but to increase the samples. If the samples are sufficiently numerous for the uncertainty in the level of significance to be acceptable, then one can relax the rigor of the cut-off criterion, without compromising the quality of the result.
The problem is not inherent to the statistical tool used. If the tool is used correctly, and the results are interpreted, described and presented as they should be, then the problem cited by the scientists in the Nature article disappears.

It is also interesting to briefly discuss the difference between “degree of significance” and “level of significance.” Some authors treat the two terms as if they had the same meaning. Others treat “significance level” as the probability that a measure falls within a certain range; it can also serve to determine whether a measure is above a certain value, below a certain value, or outside a certain range.

The “degree of significance” is distinguished from the “level of significance” in that it is defined a priori and used as the criterion for deciding on a given hypothesis. For example: you want to know whether there is a gender difference in essay grades at the universities of a city, and a degree of significance of 0.05 is adopted as the criterion. When the students' grades are checked, it turns out that one of the groups scores higher at a significance level of 0.036; since 0.036 is less than 0.05 (the degree of significance), it is concluded that there is a statistically significant difference at degree 0.05.
If the degree of significance chosen a priori had been 0.01, the conclusion would be that there is no statistically significant difference at degree 0.01.
It is important to note this flexibility in choosing the degree of significance before carrying out the experiment. Usually 0.05 is used, but the choice is entirely flexible.
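The decision rule being described reduces to a single comparison; a minimal sketch with the example's numbers (the function name is ours):

```python
def is_significant(level_found, degree=0.05):
    """Compare the a-posteriori significance level with the a-priori
    degree of significance chosen as the cut-off criterion."""
    return level_found < degree

level_found = 0.036   # level obtained in the essay-grade example
print(is_significant(level_found, 0.05))  # True  -> significant at degree 0.05
print(is_significant(level_found, 0.01))  # False -> not significant at degree 0.01
```

The same observed level of 0.036 yields opposite verdicts depending only on the cut-off chosen beforehand, which is the article's point about the arbitrariness of the binary decision.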

Therefore, it is much more appropriate to use the significance level, which is the value obtained a posteriori (0.036 in the example), and to interpret the result in light of the uncertainties in the values found. Besides being more informative, this procedure is more realistic, because it does not frame the result of the study as a “yes or no.” No study can be viewed as providing a “yes” or “no” answer; what studies show are the probabilities that the answer is yes and that it is no.

One can never be sure that a given treatment will produce better results than a placebo, or that an eclipse will occur at exactly a certain date and time, or that the average height of a group of men will be greater than the average height of a group of women. The best that can be known are the probabilities that each outcome will occur, or the probability that the variation in outcomes will fall within a certain range of values. There is a 99.73% probability that the solar eclipse will occur between 12h02m28.09212s and 12h02m28.09226s. Or: there is a 78% probability that a group of 10 men has an average height greater than that of a group of 10 women of the same age group and from the same population; if the groups had 100 men and 100 women, the probability would change completely. This is the proper way to represent the results of studies: rather than saying that a drug was superior to a placebo, it is correct to say that the results of the study suggest a certain probability that the drug is superior to the placebo. It is also important to report the probability that this probability is correct. For example:

One study shows that there is an 80% ± 40% probability that people who drink more than 6 glasses of water a day are more likely to develop cancer. This is very different from a study showing that there is an 80% ± 2% probability that people who smoke more than 6 cigarettes a day are more likely to develop cancer. In both cases the probability is 80%, but in the first case the uncertainty in this probability is 40%, while in the second case it is only 2%. So in the second case it is very likely that the “correct” probability is quite close to 80%, while in the first case there is a very high risk that the correct probability is much less than 80%, even less than 50%, which would indicate that the opposite is more likely, that is, that people who drink more than 6 glasses of water a day are not more likely to develop cancer. Note that this is different from saying that people who drink more than 6 glasses of water a day are more likely not to develop cancer.
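Putting the two intervals side by side makes the contrast concrete (illustrative arithmetic on the raw probability scale; as noted earlier, a bounded-scale transform would be needed for rigorous bands):

```python
p = 0.80
u_water, u_smoke = 0.40, 0.02    # the two stated uncertainties

print(round(p - u_water, 2), round(p + u_water, 2))  # 0.4 1.2
print(round(p - u_smoke, 2), round(p + u_smoke, 2))  # 0.78 0.82
# The first band dips well below 0.5 (and even spills past 1.0, which is
# precisely why a transform to an appropriate scale is needed); the second
# band is pinned tightly around 0.8.
```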

Another very important point to consider is that the level of statistical significance does not say anything about the magnitude of the difference between the variables to be compared. Statistical significance only tells us the probabilities that one variable is greater or less than the other. 
For example: the 1-real coins produced in 1995 had an average mass of 4.2744 g, while those produced in 1996 had 4.2706 g. When the masses of 30,000 coins from 1995 are compared with those of 20,000 coins from 1996, the difference between them turns out to be statistically significant at the 0.05 level. The difference is very small, less than 0.004 g, but it is statistically significant because the sample is very large and the dispersion in the masses is small, which gives the test high sensitivity.
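A back-of-the-envelope version of the coin comparison. The means and sample sizes are from the text; the per-coin dispersion of 0.010 g is an assumed, illustrative value the article does not state:

```python
import math

m95, m96 = 4.2744, 4.2706        # mean masses (g), 1995 and 1996 coins
s = 0.010                        # ASSUMED std dev of individual coin masses (g)
n95, n96 = 30_000, 20_000        # sample sizes

# standard error of the difference between the two sample means
se = math.sqrt(s**2 / n95 + s**2 / n96)
z = (m95 - m96) / se
print(round(z, 1))   # 41.6: the 0.0038 g gap is dozens of standard errors wide
```

With samples this large and dispersion this small, even a sub-0.1% difference towers over its standard error, which is exactly why the test flags it.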

Another example: João and Pedro are salespeople in the same store and in the same department. At the end of the first year, João sold, on average, R$10,000 per day, while Pedro sold R$8,000 per day. When surveying all sales of each one over the year, it is found that the difference observed between them is not statistically significant at a degree of 0.05.  

Why was the tiny difference in mass between the coins, less than 0.004 g, representing less than 0.1% of each coin's total mass, statistically significant at the 0.05 level, while the difference of R$ 2,000 between the salespeople, representing 20% to 25% of each one's daily average, was not?

This is because the dispersion in the daily performances of the sellers (or of at least one of them) was large compared to the difference between them, and the sample was not large enough for the standard deviation of the mean to be small compared to the observed difference. This does not mean one should conclude that both were equally competent. It just means that there is less than a 95% probability that João was actually more competent.
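A similar sketch for the salespeople. The means are from the text; the daily dispersion of R$ 15,000 and the 250 working days are assumptions chosen only to illustrate how large dispersion swallows a large-looking difference:

```python
import math

mj, mp = 10_000, 8_000    # mean daily sales (R$), João and Pedro
s = 15_000                # ASSUMED day-to-day standard deviation (R$)
n = 250                   # ASSUMED working days in the year

# standard error of the difference between the two annual means
se = s * math.sqrt(2 / n)
z = (mj - mp) / se
print(round(z, 2))   # 1.49, below the ~1.65 one-sided cut-off for degree 0.05
```

A 20-25% gap in averages ends up at only about 1.5 standard errors, so under these assumed numbers the difference indeed fails the 0.05 criterion, unlike the coins.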

In the case of the coins, the variations in the masses of those produced in 1995, compared with each other, are very small, and the same occurs among those produced in 1996; in addition, the number of coins considered was very large, which made the uncertainty in the masses measured for each year very small. Minute differences are therefore detectable and relevant, indicating that probably (with more than 95% probability) there was some change in the production process, composition, climate, or something else that made the 1996 coins actually lighter.
Finally, the significance level is a very useful piece of information, when interpreted correctly. And when the results are interpreted correctly, the problem alleged by the authors of the article published in Nature does not exist. There are even other parameters that can be derived directly from the significance level, which can better serve certain purposes. For example:

Instead of the significance level, which determines the probability that the observed difference is greater than 0, tests could be performed to calculate the probability that the observed difference is greater than a certain value. This could be more useful in many situations, as well as being easier to interpret, leading to less confusion among researchers.  

In any case, it would be important for researchers with doubts about certain statistical tools, before condemning their use, to seek a better understanding of how those tools should be used, how to interpret their results, and how to make valid and useful inferences.
