Statistical significance of differences between samples. General population and sample study.

The significance level in statistics is an important indicator reflecting the degree of confidence in the accuracy and truth of the obtained (predicted) data. The concept is widely used in a variety of fields, from sociological research to the statistical testing of scientific hypotheses.

Definition

The level of statistical significance (or a statistically significant result) shows the probability that the indicators under study arose by chance. The overall statistical significance of a phenomenon is expressed by the p-value (p-level). In any experiment or observation there is a possibility that the data obtained were due to sampling errors. This is especially true for sociology.

That is, a statistically significant value is one whose probability of random occurrence is extremely small or tends to an extreme. The extreme in this context is the degree to which the statistic deviates from the null hypothesis (the hypothesis that is tested for consistency with the obtained sample data). In scientific practice, the significance level is selected before data collection and, as a rule, is set at 0.05 (5%). For systems where precise values are critical, it may be 0.01 (1%) or less.

Background

The concept of a significance level was introduced by the British statistician and geneticist Ronald Fisher in 1925, when he was developing a technique for testing statistical hypotheses. When analyzing any process, there is a certain probability of particular phenomena. Difficulties arise when working with small (or non-obvious) probabilities that fall under the concept of "measurement error."

When working with statistical data that is not specific enough to test, scientists face the problem of the null hypothesis, which "prevents" operating with small quantities. For such systems, Fisher proposed setting the probability of events at 5% (0.05) as a convenient cutoff that allows the null hypothesis to be rejected in calculations.

Introduction of fixed odds

In 1933, the scientists Jerzy Neyman and Egon Pearson recommended in their works that a certain significance level be set in advance (before data collection). Examples of the use of these rules are clearly visible during elections. Say there are two candidates, one very popular and the other little known. It is obvious that the first candidate will win the election, while the chances of the second tend to zero. They tend to zero, but are not equal to it: there is always the possibility of force majeure, sensational information, or unexpected decisions that could change the predicted election results.

Neyman and Pearson agreed that Fisher's significance level of 0.05 (denoted by α) was the most appropriate. However, Fisher himself opposed fixing this value in 1956. He believed that the level of α should be set according to the specific circumstances. For example, in particle physics it is 0.01.

p-level value

The term p-value was first used by Brownlee in 1960. The p-level (p-value) is an indicator inversely related to the reliability of the results: a higher p-value corresponds to a lower level of confidence in the relationship between variables found in the sample.

This value reflects the likelihood of error in interpreting the results. Suppose p-level = 0.05 (1/20). This indicates a five percent probability that the relationship between the variables found in the sample is just a random feature of the sample. That is, if this relationship is absent in the population, then in repeated similar experiments, on average, every twentieth study could be expected to show the same or a greater relationship between the variables. The p-level is often seen as a "margin" for the error rate.
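To make the "1 in 20" intuition concrete, here is a minimal simulation sketch (all parameters invented for illustration): two groups of 50 are drawn repeatedly from the same normal population, so the null hypothesis is true by construction, and a large-sample z-test still declares "significance" about once in twenty tries.

```python
import random
import statistics
from math import sqrt

random.seed(42)

def two_sample_p(a, b):
    """Two-sided p-value of a large-sample z-test for equal means."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

# Draw many pairs of samples from the SAME population (null hypothesis true)
# and count how often p < 0.05, i.e. how often a "relationship" appears
# purely by chance.  The long-run rate should be close to 1 in 20.
trials = 2000
false_alarms = sum(
    two_sample_p([random.gauss(100, 15) for _ in range(50)],
                 [random.gauss(100, 15) for _ in range(50)]) < 0.05
    for _ in range(trials)
)
print(false_alarms / trials)  # close to 0.05
```

The observed rate fluctuates around 5%, exactly the "margin for the error rate" described above.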

Note that the p-value may not reflect the real relationship between variables, but only shows a certain average value within the assumptions. In particular, the final analysis of the data will also depend on the chosen value of this coefficient: at p-level = 0.05 there will be some results, and at 0.01 there may be different ones.

Testing statistical hypotheses

The level of statistical significance is especially important when testing hypotheses. For example, in a two-sided test the rejection region is divided equally between the two tails of the sampling distribution (symmetrically about zero), and the plausibility of the resulting data is assessed.

Suppose, when monitoring a certain process (phenomenon), it turned out that new statistical information indicates small changes relative to previous values. At the same time, the discrepancies in the results are small, not obvious, but important for the study. The specialist is faced with a dilemma: are changes really occurring or are these sampling errors (measurement inaccuracy)?

In this case, they either accept or reject the null hypothesis (attribute everything to error, or recognize the change in the system as a fait accompli). The decision is based on the ratio of the overall statistical significance (p-value) to the significance level (α). If p-level < α, the null hypothesis is rejected. The smaller the p-value, the more significant the test statistic.
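The decision rule just described can be sketched as a tiny helper function (an illustrative sketch, not a library API):

```python
def decide(p_value, alpha=0.05):
    """Standard decision rule: reject H0 only when p falls below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.03))   # reject H0
print(decide(0.20))   # fail to reject H0
```

Note that "fail to reject" is not the same as "prove H0": the rule only controls how often a true null hypothesis is rejected by chance.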

Values used

The level of significance depends on the material being analyzed. In practice, the following fixed values are used:

  • α = 0.1 (or 10%);
  • α = 0.05 (or 5%);
  • α = 0.01 (or 1%);
  • α = 0.001 (or 0.1%).

The more accurate the required calculations, the lower the α coefficient used. Naturally, statistical forecasts in physics, chemistry, pharmaceuticals, and genetics require greater accuracy than those in political science and sociology.
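For reference, each fixed α above corresponds to a two-sided critical value of a standard normal test statistic; a short sketch using Python's standard library:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution

# Two-sided critical value: reject H0 when |z| exceeds z_crit.
for alpha in (0.1, 0.05, 0.01, 0.001):
    z_crit = nd.inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha:<6} -> |z| must exceed {z_crit:.2f}")
```

This reproduces the familiar thresholds 1.64, 1.96, 2.58, and 3.29: the smaller α is, the further out in the tails the statistic must fall.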

Significance thresholds in specific areas

In high-precision fields such as particle physics and manufacturing, statistical significance is often expressed as a number of standard deviations (denoted sigma, σ) of a normal probability distribution (Gaussian distribution). σ is a statistical indicator of the dispersion of the values of a quantity relative to its mathematical expectation, and it is used to plot the probabilities of events.

Depending on the field of knowledge, the σ threshold varies greatly. For example, when confirming the existence of the Higgs boson the threshold was σ = 5, which corresponds to p-value ≈ 1/3.5 million. In genome studies a significance level of 5 × 10⁻⁸ is not uncommon.
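The correspondence between σ thresholds and tail probabilities can be checked directly; a small sketch computing the one-sided tail of the standard normal:

```python
from statistics import NormalDist

nd = NormalDist()

# One-sided tail probability beyond k standard deviations.
for k in (1, 2, 3, 5):
    p = 1 - nd.cdf(k)
    print(f"{k} sigma -> p = {p:.3g}  (about 1 in {1 / p:,.0f})")
```

For k = 5 this gives p ≈ 2.87 × 10⁻⁷, i.e. roughly "1 in 3.5 million", matching the Higgs discovery convention quoted above.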

Efficiency

It must be taken into account that the coefficients α and p-value are not exact characteristics. Whatever the significance level of the phenomenon under study, it is not an unconditional basis for accepting the hypothesis. For example, the smaller the value of α, the greater the confidence that an established effect is significant. However, there is a risk of error, which reduces the statistical power (sensitivity) of the study.

Researchers who focus solely on statistically significant results may reach erroneous conclusions. At the same time, it is difficult to double-check their work, since they apply assumptions (which, in fact, is what the α and p-values are). Therefore, along with calculating statistical significance, it is always recommended to determine another indicator: the effect size, a quantitative measure of the strength of an effect.

The main features of any relationship between variables.

We can note the two simplest properties of the relationship between variables: (a) the magnitude of the relationship and (b) the reliability of the relationship.

- Magnitude. The magnitude of a relationship is easier to understand and measure than its reliability. For example, if every man in the sample had a higher white blood cell count (WCC) than every woman, then you can say that the relationship between the two variables (Gender and WCC) is very strong. In other words, you could predict the values of one variable from the values of the other.

- Reliability ("truth"). The reliability of a relationship is a less intuitive concept than its magnitude, but it is extremely important. The reliability of a relationship is directly related to the representativeness of the sample on the basis of which conclusions are drawn. In other words, reliability refers to how likely it is that the relationship would be rediscovered (in other words, confirmed) using data from another sample drawn from the same population.

It should be remembered that the ultimate goal is almost never to study this particular sample of values; a sample is of interest only insofar as it provides information about the entire population. If the study satisfies certain specific criteria, then the reliability of the found relationships between sample variables can be quantified and presented using a standard statistical measure.

Magnitude and reliability represent two different characteristics of relationships between variables. However, it cannot be said that they are completely independent: the greater the magnitude of the relationship between variables in a sample of normal size, the more reliable it is (see the next section).

The statistical significance of a result (p-level) is an estimated measure of confidence in its "truth" (in the sense of "representativeness of the sample"). More technically, the p-level is a measure that decreases as the reliability of the result increases: a higher p-level corresponds to a lower level of confidence in the relationship between variables found in the sample. Namely, the p-level represents the probability of error associated with extending the observed result to the entire population.

For example, p-level = 0.05 (i.e. 1/20) indicates that there is a 5% chance that the relationship between the variables found in the sample is just a random feature of that sample. In many studies, a p-level of 0.05 is considered an "acceptable margin" for the level of error.

There is no way to avoid arbitrariness in deciding what level of significance should truly be considered "significant". The choice of a certain significance level above which results are rejected as false is quite arbitrary.



In practice, the final decision usually depends on whether the result was predicted a priori (i.e., before the experiment was conducted) or discovered a posteriori as a result of many analyses and comparisons performed on a variety of data, as well as on the tradition of the field of study.

Generally, in many fields a result of p ≤ .05 is an acceptable cutoff for statistical significance, but keep in mind that this level still includes a fairly large margin of error (5%).

Results significant at the p ≤ .01 level are generally considered statistically significant, while results at the p ≤ .005 or p ≤ .001 level are considered highly significant. However, it should be understood that this classification of significance levels is quite arbitrary and is just an informal convention adopted on the basis of practical experience in a particular field of study.

It is clear that the larger the number of analyses carried out on the totality of the collected data, the greater the number of significant (at the selected level) results that will be discovered purely by chance.

Some statistical methods that involve many comparisons, and thus have a significant chance of such errors, make a special adjustment or correction for the total number of comparisons. However, many statistical methods (especially simple exploratory data analysis methods) do not offer any way to solve this problem.

If the relationship between variables is “objectively” weak, then there is no other way to test such a relationship other than to study a large sample. Even if the sample is perfectly representative, the effect will not be statistically significant if the sample is small. Likewise, if a relationship is “objectively” very strong, then it can be detected with a high degree of significance even in a very small sample.

The weaker the relationship between variables, the larger the sample size required to detect it meaningfully.

There are many different measures of the relationship between variables. The choice of a particular measure in a particular study depends on the number of variables, the measurement scales used, the nature of the relationships, etc.

Most of these measures, however, follow a general principle: they attempt to estimate the observed relationship by comparing it with the "maximum conceivable relationship" between the variables under consideration. Technically, the usual way to make such estimates is to look at how the values of the variables vary and then calculate how much of the total variation present can be explained by "common" ("joint") variation in two (or more) variables.
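As an illustration of "shared variation", here is a sketch that computes the Pearson correlation by hand for invented paired data; r² is then the share of total variation the two variables have in common:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: joint variation relative to total variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]   # hypothetical paired measurements

r = pearson_r(x, y)
print(f"r = {r:.3f}, share of variation explained r^2 = {r * r:.3f}")
```

Here r ≈ 0.829, so roughly 69% of the total variation is "joint" variation; the remaining 31% is variation the measure cannot attribute to the relationship.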

Significance depends mainly on the sample size. As already explained, in very large samples even very weak relationships between variables will be significant, while in small samples even very strong relationships are not reliable.

Thus, in order to determine the level of statistical significance, a function is needed that represents the relationship between the “magnitude” and “significance” of the relationship between variables for each sample size.

Such a function would indicate exactly "how likely it is to obtain a relationship of a given magnitude (or greater) in a sample of a given size, assuming that there is no such relationship in the population." In other words, this function would give the significance level (p-level) and, therefore, the probability of erroneously rejecting the assumption that this relationship is absent in the population.

This "alternative" hypothesis (that there is no relationship in the population) is usually called the null hypothesis.

It would be ideal if the function that calculates the probability of error were linear and simply had different slopes for different sample sizes. Unfortunately, this function is much more complex and is not always exactly the same. However, in most cases its form is known and can be used to determine significance levels in studies of samples of a given size. Most of these functions are associated with the class of distributions called normal.
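For the correlation case, one well-known function of this kind is based on Fisher's z-transformation of r, which is approximately normal; a sketch (an approximation, with invented values of r and n):

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_p_level(r, n):
    """Approximate two-sided p-level for a sample correlation r from n pairs,
    using Fisher's z-transformation and the normal distribution."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same magnitude of relationship is significant only in a large enough sample.
print(correlation_p_level(0.30, 20))   # small sample: p well above 0.05
print(correlation_p_level(0.30, 100))  # large sample: p well below 0.05
```

This makes the magnitude/significance trade-off explicit: the same r = 0.30 is unconvincing with 20 pairs but highly significant with 100.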

Task 3. Five preschoolers are given a test. The time taken to solve each task is recorded. Will statistically significant differences be found between the times taken to solve the first three test items?


Reference material

This assignment is based on the theory of analysis of variance. In general, the task of analysis of variance is to identify the factors that have a significant impact on the result of an experiment. Analysis of variance can also be used to compare the means of several samples when there are more than two; one-way analysis of variance serves this purpose.

To solve the assigned task, the following convention is accepted: if the variance of the obtained values of the optimization parameter under the influence of a factor differs from the variance of the results in the absence of that influence, the factor is considered significant.

As can be seen from the formulation of the problem, methods for testing statistical hypotheses are used here, namely the comparison of two empirical variances. Therefore, analysis of variance is based on comparing variances using Fisher's test. In this task it is necessary to check whether the differences between the times taken by each of the five preschoolers to solve the first three test tasks are statistically significant.
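Since the original data table is not reproduced here, the calculation scheme can be sketched in Python with hypothetical times (all numbers invented for illustration; five test-takers, three tasks):

```python
# One-way ANOVA by hand for hypothetical data: the time (in seconds) each of
# five preschoolers took on three test tasks.  The numbers are invented for
# illustration, since the original data table is not available.
groups = [
    [30, 35, 40, 45, 50],   # task 1
    [28, 33, 38, 43, 48],   # task 2
    [32, 37, 42, 47, 52],   # task 3
]

q = len(groups[0])                       # test-takers per task
p = len(groups)                          # number of tasks (factor levels)
grand = sum(sum(g) for g in groups) / (p * q)

# Factorial (between-group) and residual (within-group) sums of squares.
ss_between = sum(q * (sum(g) / q - grand) ** 2 for g in groups)
ss_within = sum((x - sum(g) / q) ** 2 for g in groups for x in g)

ms_between = ss_between / (p - 1)        # factor variance estimate
ms_within = ss_within / (p * (q - 1))    # residual variance estimate
F = ms_between / ms_within               # Fisher's test statistic
print(f"F = {F:.2f}")                    # compare with F_crit(2, 12) ≈ 3.89 at α = 0.05
```

With these invented numbers F ≈ 0.32, far below the critical value, so the null hypothesis of no difference between the tasks would not be rejected.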

The hypothesis put forward is called the null (main) hypothesis, H0. Its essence comes down to the assumption that the difference between the compared parameters is zero (hence the name of the hypothesis) and that the observed differences are random.

A competing (alternative) hypothesis, H1, is one that contradicts the null hypothesis.

Solution:

Using the analysis of variance method at a significance level of α = 0.05, we will test the null hypothesis (H0) that there are no statistically significant differences between the times taken by the five preschoolers to solve the first three test tasks.

Let's look at the table of task conditions, which gives the average time to solve each of the three test tasks.

[Table of task conditions: for each subject (factor levels), the time in seconds to solve the first, second, and third test tasks, together with the group averages; the numerical data were not preserved.]

Finding the overall average: x̄ = (Σ x_ij) / (p·q), i.e. the mean of all p·q recorded times.

To take into account the significance of the time differences in each test, the total sample variance is divided into two parts: the first is called the factorial (between-group) component, and the second the residual (within-group) component.

Let's calculate the total sum of squared deviations from the overall average using the formula

Q = Σ (x_ij − x̄)² over all i, j, or equivalently Q = Σ x_ij² − p·q·x̄²,

where p is the number of time measurements for solving test tasks and q is the number of test takers. To do this, let's create a table of squares.

[Table of squares: the same layout as the table of task conditions, with each time value squared; the numerical data were not preserved.]

What do you think makes your "other half" special and meaningful? Is it related to his or her personality, or to the feelings you have for this person? Or perhaps to the simple fact that, as studies show, the hypothesis that your sympathy is random has a probability of less than 5%? If we consider that last statement reliable, then successful dating sites could not exist in principle.

When you conduct split testing or any other analysis of your website, misunderstanding “statistical significance” can lead to misinterpretation of the results and, therefore, incorrect actions in the conversion optimization process. This is true for the thousands of other statistical tests performed every day in every existing industry.

To understand what “statistical significance” is, you need to dive into the history of the term, learn its true meaning, and understand how this “new” old understanding will help you correctly interpret the results of your research.

A little history

Although humanity has used statistics to solve various problems for many centuries, the modern understanding of statistical significance, hypothesis testing, randomization, and even Design of Experiments (DOE) began to take shape only at the beginning of the 20th century and is inextricably linked with the name of Sir Ronald Fisher (1890-1962).

Ronald Fisher was an evolutionary biologist and statistician who had a special passion for the study of evolution and natural selection in the animal and plant kingdoms. During his illustrious career, he developed and popularized many useful statistical tools that we still use today.

Fisher used the techniques he developed to explain biological processes such as dominance, mutations, and genetic deviations. We can use the same tools today to optimize and improve the content of web resources. It seems quite surprising that these analysis tools can be applied to objects that did not even exist at the time of their creation. Just as surprising as the complex calculations people once performed without calculators or computers.

To describe the results of a statistical experiment as having a high probability of being true, Fisher used the word “significance.”

Also, one of Fisher's most interesting developments is the "sexy son" hypothesis. According to this theory, women prefer sexually promiscuous men because the sons born to such men will have the same predisposition and produce more offspring (note that this is just a theory).

But no one, not even brilliant scientists, is immune from mistakes. Fisher's flaws still plague specialists to this day. But remember the words of Albert Einstein: "A person who never made a mistake never tried anything new."

Before moving on to the next point, remember: statistical significance is when the difference in test results is so large that it cannot be explained by random factors.

What is your hypothesis?

To understand what “statistical significance” means, you first need to understand what “hypothesis testing” is, since the two terms are closely intertwined.
A hypothesis is just a theory. Once you have developed a theory, you need to establish a process for collecting enough evidence, and then actually collect it. There are two types of hypotheses.

Apples or oranges - which is better?

Null hypothesis

As a rule, this is where many people have difficulty. Keep in mind that a null hypothesis is not something you need to prove, the way you prove that a certain change on a website will lead to an increase in conversions; it is the opposite. The null hypothesis is the theory that states that if you make any changes to the site, nothing will happen. The goal of the researcher is to refute this theory, not to prove it.

If we look at criminal investigations, where detectives also form hypotheses about who the criminal is, the null hypothesis takes the form of the presumption of innocence: the concept according to which the accused is presumed innocent until proven guilty in a court of law.

If the null hypothesis is that two objects are equal in their properties, and you are trying to prove that one is better (for example, A is better than B), you need to reject the null hypothesis in favor of the alternative. For example, you are comparing one or another conversion optimization tool. In the null hypothesis, they both have the same effect (or no effect) on the target. In the alternative, the effect of one of them is better.

Your alternative hypothesis may contain a numerical value, such as B − A > 20%. In this case, the null and alternative hypotheses can take the following form:

H0: B − A ≤ 20%
H1: B − A > 20%

Another name for an alternative hypothesis is a research hypothesis because the researcher is always interested in proving this particular hypothesis.

Statistical significance and p value

Let's return again to Ronald Fisher and his concept of statistical significance.

Now that you have a null hypothesis and an alternative, how can you prove one and disprove the other?

Since statistics, by its very nature, involves the study of a sample from a population, you can never be 100% sure of the results obtained. A good example: election results often differ from the results of preliminary polls and even exit polls.

Dr. Fisher wanted a dividing line that would tell you whether your experiment was a success or not. This is how the significance threshold appeared: the level we adopt to say what we consider "significant" and what we do not. If p, the significance index, is 0.05 or less, the results are considered reliable.

Don't worry, it's actually not as confusing as it seems.

Gaussian probability distribution. At the edges are the less probable values of the variable; in the center are the most probable. The p-value (green shaded area) is the probability of the observed outcome occurring by chance.

The normal probability distribution (Gaussian distribution) is a representation of all possible values of a certain variable on a graph (in the figure above) and their frequencies. If you conduct your research correctly and then plot all your answers on a graph, you get exactly this distribution. According to the normal distribution, you will receive a large percentage of similar answers, while the remaining options will be located at the edges of the graph (the so-called "tails"). This distribution of values is often found in nature, which is why it is called "normal."

Using an equation based on your sample and test results, you can calculate what is called a "test statistic," which indicates how far your results deviate and how close you are to the null hypothesis being true.
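For a split test comparing two conversion rates, the test statistic is typically a two-proportion z-score; here is a minimal sketch with invented traffic numbers:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical split test: 100 conversions out of 1000 visitors on the
# control page (A), 130 out of 1000 on the variant (B).
conv_a, n_a = 100, 1000
conv_b, n_b = 130, 1000

p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)          # conversion rate under H0
se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se                              # the test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")          # p < 0.05 -> reject H0
```

With these invented numbers z ≈ 2.10 and p ≈ 0.036, so at the conventional 0.05 threshold the variant's lift would be called statistically significant.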

To help you get your head around it, use online calculators to calculate statistical significance:


The letter "p" represents the probability of obtaining a result at least this extreme if the null hypothesis were true. A small number indicates a difference between the test groups, whereas the null hypothesis holds that they are the same. Graphically, your test statistic will be closer to one of the tails of the bell-shaped distribution.

Dr. Fisher decided to set the significance threshold at p ≤ 0.05. However, this statement is controversial, since it leads to two difficulties:

1. First, the fact that you have shown the null hypothesis to be false does not mean that you have proven the alternative hypothesis. Significance by itself does not definitively prove either A or B.

2. Secondly, if the p-value is 0.049, there is still a 4.9% chance of obtaining such a result by chance. This means your test conclusion could turn out to be either right or wrong.

You can use the p-value, or you can abandon it; but then, in every particular case, you will need to calculate the probability of error and decide whether it is large enough to keep you from making the changes you planned and tested.

The most common scenario for conducting a statistical test today is to set a significance threshold of p ≤ 0.05 before running the test itself. Just be sure to look closely at the p-value when checking your results.

Type 1 and Type 2 errors

So much time has passed that errors that can occur when using the statistical significance metric have even been given their own names.

Type 1 Errors

As mentioned above, a threshold of 0.05 means accepting a 5% chance of seeing such a result even when the null hypothesis is true. If you reject a null hypothesis that is actually true, you make a Type 1 error (a false positive): the results say your new website increased conversions, but there is a 5% chance that it did not.

Type 2 Errors

This error is the opposite of a Type 1 error: you accept the null hypothesis when it is false. For example, the test results tell you that the changes made to the site did not bring any improvement, when in fact there was one. As a result, you miss the opportunity to improve your performance.

This error is common in tests with an insufficient sample size, so remember: the larger the sample, the more reliable the result.
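This point is easy to demonstrate by simulation: the same true effect is detected far more often with a larger sample. A sketch with an invented effect size and noise level:

```python
import random
import statistics
from math import sqrt

random.seed(7)

def detects(n, effect=5.0, sd=15.0, alpha=0.05):
    """One simulated experiment: is a REAL mean difference flagged as significant?"""
    a = [random.gauss(100, sd) for _ in range(n)]
    b = [random.gauss(100 + effect, sd) for _ in range(n)]
    se = sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(b) - statistics.mean(a)) / se
    return 2 * (1 - statistics.NormalDist().cdf(abs(z))) < alpha

trials = 400
power_small = sum(detects(20) for _ in range(trials)) / trials
power_large = sum(detects(200) for _ in range(trials)) / trials
print(power_small, power_large)  # the larger sample catches the effect far more often
```

The small-sample test misses the real effect most of the time (a Type 2 error), while the large-sample test detects it reliably.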

Conclusion

Perhaps no term is as popular among researchers as statistical significance. When test results are misjudged, the consequences range from missed increases in conversion rates to the collapse of a company.

And since marketers use this term when optimizing their resources, you need to know what it really means. Test conditions may vary, but sample size and success criteria are always important. Remember this.

Statistical significance, or the p-level of significance, is the main result of a statistical hypothesis test. Technically speaking, it is the probability of obtaining a given result of a sample study, provided that the null statistical hypothesis is in fact true for the population, i.e. that there is no relationship. In other words, it is the probability that the detected relationship is random and not a property of the population. It is statistical significance, the p-level, that gives a quantitative assessment of the reliability of a relationship: the lower this probability, the more reliable the relationship.

Suppose that when comparing two sample means a statistical significance level of p = 0.05 was obtained. This means that the test of the statistical hypothesis of equality of the means in the population showed that, if it is true, the probability of the detected differences arising by chance is no more than 5%. In other words, if two samples were repeatedly drawn from the same population, then in 1 out of 20 cases the same or a greater difference between the means of these samples would be found. That is, there is a 5% chance that the differences found are random in character and are not a property of the population.

In relation to a scientific hypothesis, the level of statistical significance is a quantitative indicator of the degree of distrust in the conclusion that a relationship exists, calculated from the results of a sample-based, empirical test of this hypothesis. The lower the p-level value, the higher the statistical significance of a research result confirming a scientific hypothesis.

It is useful to know what influences the significance level. Other things being equal, significance is higher (the p-level value is lower) if:

  • the magnitude of the relationship (difference) is greater;
  • the variability of the trait(s) is lower;
  • the sample size(s) is larger.
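All three factors can be seen in a deterministic sketch of a one-sample z-test (all numbers invented for illustration):

```python
from math import sqrt
from statistics import NormalDist

def p_level(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for a mean difference."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

base = p_level(effect=2.0, sd=10.0, n=25)
print(p_level(4.0, 10.0, 25) < base)   # bigger difference  -> lower p
print(p_level(2.0, 5.0, 25) < base)    # less variability   -> lower p
print(p_level(2.0, 10.0, 100) < base)  # larger sample      -> lower p
```

Each comparison prints True: increasing the effect, reducing the spread, or enlarging the sample all drive the p-level down.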

One-sided and two-sided significance tests

If the purpose of the study is to identify differences in the parameters of two populations that correspond to its different natural conditions (living conditions, age of the subjects, etc.), it is often unknown which of these parameters will be greater and which smaller.

For example, if you are interested in the variability of results in the control and experimental groups, then, as a rule, there is no certainty about the sign of the difference in the variances or standard deviations from which variability is assessed. In this case, the null hypothesis is that the variances are equal, and the purpose of the study is to prove the opposite, i.e. the presence of differences between the variances. The difference is allowed to be of either sign. Such hypotheses are called two-sided.

But sometimes the task is to prove an increase or decrease in a parameter; for example, that the average result in the experimental group is higher than in the control group. In this case it is no longer allowed that the difference may be of the opposite sign. Such hypotheses are called one-sided.

Significance tests used to test two-sided hypotheses are called two-sided, and those used for one-sided hypotheses are called one-sided.
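The practical difference between the two tests is a factor of two in the p-value; a short sketch with a hypothetical test statistic:

```python
from statistics import NormalDist

nd = NormalDist()
z = 1.75  # an observed test statistic (hypothetical)

p_one_sided = 1 - nd.cdf(z)             # H1: the parameter increased
p_two_sided = 2 * (1 - nd.cdf(abs(z)))  # H1: the parameter differs in either direction

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
# At α = 0.05 the one-sided test rejects H0 (0.0401 < 0.05),
# while the two-sided test does not (0.0801 > 0.05).
```

This is exactly why the choice of criterion must be fixed before the experiment: the same data can cross the threshold under one test and not the other.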

The question arises which criterion should be chosen in a given case. The answer lies beyond formal statistical methods and depends entirely on the goals of the study. Under no circumstances should one choose a criterion after the experiment has been conducted, based on analysis of the experimental data, as this may lead to incorrect conclusions. If, before conducting the experiment, it is assumed that the difference between the compared parameters may be either positive or negative, then a two-sided test should be used.