How is the average calculated? Calculation of averages


The average value is a general indicator characterizing the typical level of a phenomenon. It expresses the value of a characteristic per unit of the population.

The average value is:

1) the most typical value of the attribute for the population;

2) the volume of the population attribute, distributed equally among the units of the population.

The characteristic for which the average value is calculated is called “averaged” in statistics.

The average always generalizes the quantitative variation of a trait: in average values, individual differences between units of the population that arise from random circumstances are eliminated. In contrast to the average, an absolute value that characterizes the level of a characteristic for an individual unit does not allow the characteristic to be compared across units belonging to different populations. For example, if you need to compare the level of remuneration of workers at two enterprises, you cannot do so by comparing two individual workers from different enterprises, because the pay of the workers selected for comparison may not be typical of those enterprises. If we compare the size of the wage funds at the two enterprises, the number of employees is not taken into account, so it is impossible to determine where the level of wages is higher. Ultimately, only average indicators can be compared, i.e. how much one employee earns on average at each enterprise. Thus, there is a need to calculate the average value as a generalizing characteristic of the population.

It is important to note that during the averaging process, the total value of the attribute levels or its final value (in the case of calculating average levels in a dynamics series) must remain unchanged. In other words, when calculating the average value, the volume of the characteristic under study should not be distorted, and the expressions compiled when calculating the average must necessarily make sense.

Calculating the average is one of the common generalization techniques; the average reflects what is common (typical) for all units of the population being studied, while ignoring the differences between individual units. In every phenomenon and its development there is a combination of chance and necessity. When averages are calculated, random deviations cancel out and balance each other by virtue of the law of large numbers, so it is possible to abstract from the unimportant features of the phenomenon and from the quantitative values of the attribute in each specific case. The scientific value of averages as generalizing characteristics of populations lies in this ability to abstract from the randomness of individual values and fluctuations.

In order for the average to be truly representative, it must be calculated taking into account certain principles.

Let's look at some general principles for the application of average values.

1. The average must be determined for populations consisting of qualitatively homogeneous units.

2. The average must be calculated for a population consisting of a sufficiently large number of units.

3. The average must be calculated for a population whose units are in a normal, natural state.

4. The average should be calculated taking into account the economic content of the indicator under study.

5.2. Types of averages and methods for calculating them

Let us now consider the types of average values, features of their calculation and areas of application. Average values ​​are divided into two large classes: power averages, structural averages.

Power means include the most well-known and frequently used types, such as geometric mean, arithmetic mean and square mean.

The mode and median are considered as structural averages.

Let's focus on power means. Depending on how the source data are presented, power means can be simple or weighted. The simple mean is calculated from ungrouped data and has the following general form:

$$\bar{X} = \sqrt[m]{\frac{\sum_{i=1}^{n} X_i^{m}}{n}},$$

where $X_i$ is the variant (value) of the characteristic being averaged;

$m$ is the exponent of the mean;

$n$ is the number of variants.

The weighted mean is calculated from grouped data and has the general form

$$\bar{X} = \sqrt[m]{\frac{\sum_{i} X_i^{m} f_i}{\sum_{i} f_i}},$$

where $X_i$ is the variant (value) of the characteristic being averaged or the midpoint of the interval in which the variant is measured;

$m$ is the exponent of the mean;

$f_i$ is the frequency showing how many times the $i$-th value of the averaged characteristic occurs.

If all types of means are calculated from the same initial data, their values will differ. The majorant rule of means applies here: as the exponent m increases, the corresponding mean also increases:

$$\bar{X}_{harm} \le \bar{X}_{geom} \le \bar{X}_{arith} \le \bar{X}_{quad} \le \bar{X}_{cub}.$$

In statistical practice, arithmetic means and harmonic weighted means are used more often than other types of weighted averages.
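The ordering of power means by exponent can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `power_mean` and the sample data are assumptions made for illustration, and the case m = 0 is handled as the geometric mean (the limit of the power mean as m approaches zero).

```python
import math

def power_mean(values, m, weights=None):
    """Weighted power mean of order m for positive values.

    With weights=None every value gets weight 1 (the simple form).
    m = 0 is treated as the geometric mean, the limit case of the formula.
    """
    if weights is None:
        weights = [1.0] * len(values)
    total_weight = sum(weights)
    if m == 0:
        log_sum = sum(w * math.log(x) for x, w in zip(values, weights))
        return math.exp(log_sum / total_weight)
    powered = sum(w * x ** m for x, w in zip(values, weights))
    return (powered / total_weight) ** (1.0 / m)

# Hypothetical positive data, chosen only for illustration.
data = [2.0, 4.0, 8.0]
names = {-1: "harmonic", 0: "geometric", 1: "arithmetic", 2: "quadratic", 3: "cubic"}
means = {m: power_mean(data, m) for m in names}

for m, name in names.items():
    print(f"m = {m:>2} ({name}): {means[m]:.4f}")
```

For data that are not all equal, the printed values grow strictly with m, which is the majorant rule in action.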

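Types of power means

| Kind of power mean | Exponent (m) | Simple formula | Weighted formula |
|---|---|---|---|
| Harmonic | −1 | $\bar{X} = \dfrac{n}{\sum \frac{1}{X_i}}$ | $\bar{X} = \dfrac{\sum f_i}{\sum \frac{f_i}{X_i}}$ |
| Geometric | 0 | $\bar{X} = \sqrt[n]{\prod X_i}$ | $\bar{X} = \sqrt[\sum f_i]{\prod X_i^{f_i}}$ |
| Arithmetic | 1 | $\bar{X} = \dfrac{\sum X_i}{n}$ | $\bar{X} = \dfrac{\sum X_i f_i}{\sum f_i}$ |
| Quadratic | 2 | $\bar{X} = \sqrt{\dfrac{\sum X_i^2}{n}}$ | $\bar{X} = \sqrt{\dfrac{\sum X_i^2 f_i}{\sum f_i}}$ |
| Cubic | 3 | $\bar{X} = \sqrt[3]{\dfrac{\sum X_i^3}{n}}$ | $\bar{X} = \sqrt[3]{\dfrac{\sum X_i^3 f_i}{\sum f_i}}$ |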

The harmonic mean has a more complex structure than the arithmetic mean. It is used when the weights are not the units of the population (the carriers of the characteristic) but the products of these units and the values of the characteristic (i.e. w = Xf). The simple harmonic mean should be used, for example, to determine the average expenditure of labor, time, or materials per unit of output (per part) for two, three, four or more enterprises or workers engaged in manufacturing the same type of product or part.

The main requirement for the formula used to calculate an average is that every stage of the calculation must have a real, meaningful justification; the resulting average should replace the individual values of the attribute for each object without breaking the connection between the individual and summary indicators. In other words, the average must be calculated so that when each individual value of the averaged indicator is replaced by the average, some final summary indicator connected with the averaged indicator remains unchanged. This summary indicator is called the determining indicator, since the nature of its relationship with the individual values determines the specific formula for the average. Let us demonstrate this rule using the example of the geometric mean.

The geometric mean formula

$$\bar{x}_{geom} = \sqrt[n]{x_1 \cdot x_2 \cdot \ldots \cdot x_n}$$

is used most often when calculating the average from individual relative indicators of dynamics.

The geometric mean is used if a sequence of chain relative indicators of dynamics is given, indicating, for example, the growth of production volume compared with the level of the previous year: $i_1, i_2, i_3, \ldots, i_n$. Obviously, the volume of production in the last year is determined by its initial level ($q_0$) and the subsequent growth over the years:

$$q_n = q_0 \times i_1 \times i_2 \times \ldots \times i_n.$$

Taking $q_n$ as the determining indicator and replacing the individual values of the dynamics indicators with their average, we arrive at the relation

$$q_n = q_0 \times \bar{i}^{\,n}.$$

From here

$$\bar{i} = \sqrt[n]{i_1 \times i_2 \times \ldots \times i_n}.$$

A special type of averages, structural averages, is used to study the internal structure of the distribution series of attribute values, and also to estimate the average value (of the power type) when it cannot be calculated from the available statistical data (for example, if there were no data on both the volume of production and the amount of costs for groups of enterprises).

The indicators most often used as structural averages are the mode, the most frequently repeated value of the attribute, and the median, the value of the characteristic that divides the ordered sequence of its values into two equal parts. As a result, for one half of the units in the population the value of the attribute does not exceed the median level, and for the other half it is not less than it.

If the characteristic being studied takes discrete values, calculating the mode and the median poses no particular difficulty. If the data on the values of attribute X are presented as ordered intervals of its change (an interval series), the calculation of the mode and median becomes somewhat more complicated. Since the median divides the entire population into two equal parts, it falls into one of the intervals of characteristic X. The value of the median within this median interval is found by interpolation:

$$Me = X_{Me} + h_{Me} \cdot \frac{\frac{\sum m}{2} - S_{Me-1}}{m_{Me}},$$

where $X_{Me}$ is the lower limit of the median interval;

$h_{Me}$ is its width;

$\frac{\sum m}{2}$ is half of the total number of observations, or half the volume of the indicator used as a weight in the formulas for calculating the average (in absolute or relative terms);

$S_{Me-1}$ is the sum of observations (or the volume of the weighting attribute) accumulated before the beginning of the median interval;

$m_{Me}$ is the number of observations or the volume of the weighting characteristic in the median interval (also in absolute or relative terms).

When calculating the modal value of a characteristic from an interval series, it is necessary to make sure that the intervals are of equal width, since the frequency of the values of characteristic X depends on this. For an interval series with equal intervals, the mode is determined as

$$Mo = X_{Mo} + h \cdot \frac{m_{Mo} - m_{Mo-1}}{(m_{Mo} - m_{Mo-1}) + (m_{Mo} - m_{Mo+1})},$$

where $X_{Mo}$ is the lower limit of the modal interval;

$m_{Mo}$ is the number of observations or the volume of the weighting characteristic in the modal interval (in absolute or relative terms);

$m_{Mo-1}$ is the same for the interval preceding the modal one;

$m_{Mo+1}$ is the same for the interval following the modal one;

$h$ is the width of the interval of change of the characteristic in the groups.
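The interpolation formulas for the median and the mode of an interval series can be sketched in code. This is an illustrative sketch under stated assumptions: the function names and the sample series are hypothetical, all intervals are assumed to have the same width, and the mode function assumes the modal frequency is strictly greater than at least one of its neighbours (otherwise the denominator would be zero).

```python
def interval_median(lower_bounds, width, freqs):
    """Interpolated median of an interval series with equal interval width."""
    half = sum(freqs) / 2
    cumulative = 0  # observations accumulated before the current interval
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= half:
            # This is the median interval: interpolate inside it.
            return lower + width * (half - cumulative) / f
        cumulative += f
    raise ValueError("the series is empty")

def interval_mode(lower_bounds, width, freqs):
    """Interpolated mode of an interval series with equal interval width."""
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal interval
    f_mo = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0
    return lower_bounds[k] + width * (f_mo - f_prev) / ((f_mo - f_prev) + (f_mo - f_next))

# Hypothetical interval series: [0, 10), [10, 20), [20, 30), [30, 40)
lower_bounds = [0, 10, 20, 30]
freqs = [5, 15, 20, 10]

median = interval_median(lower_bounds, 10, freqs)
mode = interval_mode(lower_bounds, 10, freqs)
print(f"median = {median:.2f}, mode = {mode:.2f}")
```

In this series both values fall in the interval from 20 to 30, which is both the median interval and the modal interval.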

TASK 1

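The following data are available for a group of industrial enterprises for the reporting year:

| Enterprise No. | Product volume, million rubles | Average annual value of fixed assets, million rubles | Average number of employees, people | Profit, thousand rubles |
|---|---|---|---|---|
| 1 | 197.7 | 10.0 | | 13.5 |
| 2 | | 22.8 | 1500 | 136.2 |
| 3 | 465.5 | 18.4 | 1412 | 97.6 |
| 4 | 296.2 | 12.6 | 1200 | 44.4 |
| 5 | 584.1 | 22.0 | 1485 | 146.0 |
| 6 | 480.0 | 119.0 | 1420 | 110.4 |
| 7 | 578.5 | 21.6 | 1390 | 138.7 |
| 8 | 204.7 | | | 30.6 |
| 9 | 466.8 | 19.4 | 1375 | 111.8 |
| 10 | 292.2 | 113.6 | 1200 | 49.6 |
| 11 | 423.1 | 17.6 | 1365 | 105.8 |
| 12 | 192.6 | | | 30.7 |
| 13 | 360.5 | 14.0 | 1290 | 64.8 |
| 14 | 280.3 | 10.2 | | 33.3 |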
It is required to group the enterprises by volume of production, taking the following intervals:

1. up to 200 million rubles;

2. from 200 to 400 million rubles;

3. from 400 to 600 million rubles.

For each group and for all groups together, determine the number of enterprises, the volume of production, the average number of employees, and the average output per employee. Present the grouping results in the form of a statistical table. Formulate a conclusion.

SOLUTION

We group the enterprises by product volume and calculate the number of enterprises, the volume of production, and the average number of employees using the simple average formula. The results of the grouping and calculations are summarized in a table.

| Groups by product volume | Enterprise Nos. | Product volume, million rubles | Average annual value of fixed assets, million rubles | Average number of employees, people | Profit, thousand rubles | Average output per employee |
|---|---|---|---|---|---|---|
| Group 1: up to 200 million rubles | 1, 8, 12 | 197.7; 204.7; 192.6 | 10.0; 9.4; 8.8 | 900; 817 | 13.5; 30.6; 30.7 | |
| Group 1 total | | | 28.2 | 2567 | 74.8 | 0.23 |
| Group 1 average level | | 198.3 | | | 24.9 | |
| Group 2: from 200 to 400 million rubles | 4, 10, 13, 14 | 196.2; 292.2; 360.5; 280.3 | 12.6; 113.6; 14.0; 10.2 | 1200; 1200; 1290 | 44.4; 49.6; 64.8; 33.3 | |
| Group 2 total | | 1129.2 | 150.4 | 4590 | 192.1 | 0.25 |
| Group 2 average level | | 282.3 | 37.6 | 1530 | 64.0 | |
| Group 3: from 400 to 600 million rubles | 2, 3, 5, 6, 7, 9, 11 | 592; 465.5; 584.1; 480.0; 578.5; 466.8; 423.1 | 22.8; 18.4; 22.0; 119.0; 21.6; 19.4; 17.6 | 1500; 1412; 1485; 1420; 1390; 1375; 1365 | 136.2; 97.6; 146.0; 110.4; 138.7; 111.8; 105.8 | |
| Group 3 total | | 3590 | 240.8 | 9974 | 846.5 | 0.36 |
| Group 3 average level | | 512.9 | 34.4 | 1421 | 120.9 | |
| Total in aggregate | | 5314.2 | 419.4 | 17131 | 1113.4 | 0.31 |
| On average | | 379.6 | 59.9 | 1223.6 | 79.5 | |

Conclusion. In the population under consideration, the largest number of enterprises by production volume fell into the third group: seven, or half of all enterprises. This group also has the largest total value of fixed assets and the largest number of employees (9974 people). Enterprises of the first group are the least profitable.
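The grouping procedure of Task 1 can be sketched as follows. This is a minimal illustration, not the author's calculation: it uses only a subset of rows (product volume and number of employees) for which both values appear in the data, it assumes that each interval includes its lower bound and excludes its upper bound, and the names `summarize` and `table` are hypothetical. Average output per employee is taken as the group's total volume divided by its total number of employees.

```python
# (product volume in million rubles, average number of employees) for a subset of enterprises
enterprises = [
    (197.7, 900),
    (296.2, 1200), (292.2, 1200), (360.5, 1290),
    (465.5, 1412), (584.1, 1485), (480.0, 1420),
]

# Assumption: lower bound inclusive, upper bound exclusive.
groups = [("up to 200", 0, 200), ("200 to 400", 200, 400), ("400 to 600", 400, 600)]

def summarize(rows):
    """Count, totals and averages for one group of enterprises."""
    count = len(rows)
    volume = sum(v for v, _ in rows)
    staff = sum(n for _, n in rows)
    return {
        "count": count,
        "volume": volume,
        "employees": staff,
        "avg_employees": staff / count if count else 0.0,
        "output_per_employee": volume / staff if staff else 0.0,
    }

table = {
    label: summarize([row for row in enterprises if low <= row[0] < high])
    for label, low, high in groups
}
table["total"] = summarize(enterprises)

for label, row in table.items():
    print(f"{label:>10}: n={row['count']}, volume={row['volume']:.1f}, "
          f"employees={row['employees']}, output/employee={row['output_per_employee']:.3f}")
```

Dividing total volume by total employees, rather than averaging per-enterprise ratios, keeps the determining indicator (total output) unchanged, which is the requirement the text places on any average.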

    TASK 2

    The following data is available on the company's enterprises

    Number of the enterprise included in the company

    I quarter

    II quarter

    Product output, thousand rubles.

    Man-days worked by workers

    Average output per worker per day, rub.

    59390,13

The average value is a general indicator that characterizes a qualitatively homogeneous population in terms of a certain quantitative characteristic. For example, the average age of persons convicted of theft.

In judicial statistics, average values ​​are used to characterize:

Average time for consideration of cases of this category;

Average claim size;

Average number of defendants per case;

Average damage;

Average workload of judges, etc.

The average is always a named value and has the same dimension as the characteristic of an individual unit of the population. Each average value characterizes the population being studied according to any one varying characteristic, therefore, behind each average value lies a series of distribution of units of this population according to the characteristic being studied. The choice of the type of average is determined by the content of the indicator and the initial data for calculating the average value.

All types of averages used in statistical research are divided into two categories:

1) power averages;

2) structural averages.

The first category of averages includes the arithmetic mean, the harmonic mean, the geometric mean and the quadratic mean (root mean square). The second category includes the mode and the median. Each of the listed types of power means can have two forms: simple and weighted. The simple form is used to obtain the average value of the characteristic being studied when the calculation is carried out on ungrouped statistical data, or when each variant in the population occurs only once. Weighted averages take into account that variants of the attribute may occur different numbers of times, so each variant has to be multiplied by the corresponding frequency. In other words, each variant is "weighted" by its frequency. The frequency is called the statistical weight.

The simple arithmetic mean is the most common type of average. It is equal to the sum of the individual values of the characteristic divided by the total number of these values:

$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N},$$

where $x_1, x_2, \ldots, x_N$ are the individual values of the varying characteristic (variants), and $N$ is the number of units in the population.

The weighted arithmetic mean is used when the data are presented in the form of distribution series or groupings. It is calculated as the sum of the products of the variants and their corresponding frequencies, divided by the sum of the frequencies of all variants:

$$\bar{x} = \frac{\sum_{i} x_i f_i}{\sum_{i} f_i},$$

where $x_i$ is the value of the $i$-th variant of the characteristic and $f_i$ is the frequency of the $i$-th variant.

Thus, each variant value is weighted by its frequency, which is why frequencies are sometimes called statistical weights.
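As an illustration of weighting by frequency, the sketch below computes a weighted arithmetic mean from a small distribution series and checks that it matches the simple mean of the same data written out observation by observation. The frequencies are hypothetical (they are not the data of Table 12), and the function name `weighted_mean` is an assumption made for this example.

```python
def weighted_mean(values, freqs):
    """Weighted arithmetic mean: sum(x_i * f_i) / sum(f_i)."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

# Hypothetical distribution series: number of defendants per case and number of such cases.
defendants_per_case = [1, 2, 3, 4]
number_of_cases = [10, 6, 3, 1]

avg = weighted_mean(defendants_per_case, number_of_cases)

# Expanding the series back into individual observations gives the same result
# with the simple arithmetic mean.
expanded = [x for x, f in zip(defendants_per_case, number_of_cases) for _ in range(f)]
simple = sum(expanded) / len(expanded)

print(f"weighted mean = {avg:.2f}, simple mean of expanded data = {simple:.2f}")
```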


Comment. When we talk about an arithmetic mean without indicating its type, we mean the simple arithmetic mean.

Table 12.

Solution. To calculate, we use the weighted arithmetic average formula:

Thus, on average there are two defendants per criminal case.

If the average is calculated from data grouped into interval distribution series, you first need to determine the midpoint of each interval, $x'_i$, and then calculate the average using the weighted arithmetic mean formula, substituting $x'_i$ for $x_i$.

Example. Data on the age of criminals convicted of theft are presented in the table:

Table 13.

Determine the average age of criminals convicted of theft.

Solution. To determine the average age of criminals from an interval variation series, it is first necessary to find the midpoints of the intervals. Since the first and last intervals of the given series are open, their widths are taken to be equal to the widths of the adjacent closed intervals. In our case, the widths of the first and last intervals are equal to 10.

Now we find the average age of criminals using the weighted arithmetic average formula:

Thus, the average age of criminals convicted of theft is approximately 27 years.

The simple harmonic mean is the reciprocal of the arithmetic mean of the reciprocal values of the characteristic:

$$\bar{x}_{harm} = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}},$$

where $1/x_i$ are the reciprocals of the variants, and $N$ is the number of units in the population.

Example. To determine the average annual workload of judges of a district court when considering criminal cases, the workload of 5 judges of this court was studied. The average time spent on one criminal case by each of the surveyed judges was (in days): 6.0, 5.6, 6.3, 4.9, 5.4. Find the average time spent on one criminal case and the average annual workload of the judges of this district court when considering criminal cases.

Solution. To determine the average time spent on one criminal case, we use the harmonic mean formula:

$$\bar{x}_{harm} = \frac{5}{\frac{1}{6.0} + \frac{1}{5.6} + \frac{1}{6.3} + \frac{1}{4.9} + \frac{1}{5.4}} \approx 5.60 \text{ (days)}.$$

To simplify the calculations, in the example we take the number of days in a year to be 365, including weekends (this does not affect the calculation methodology; when calculating a similar indicator in practice, the number of working days in the particular year must be substituted for 365). Then the average annual workload of the judges of this district court when considering criminal cases will be: 365 (days) : 5.60 ≈ 65.2 (cases).

If we used the simple arithmetic mean formula to determine the average time spent on one criminal case, we would get:

$$\bar{x} = \frac{6.0 + 5.6 + 6.3 + 4.9 + 5.4}{5} = 5.64 \text{ (days)},$$

365 (days) : 5.64 ≈ 64.7 (cases), i.e. the average workload of the judges would turn out to be lower.

Let's check the validity of this approach. To do this, we use the data on the time spent on one criminal case by each judge and calculate the number of criminal cases considered by each of them per year.

We get, respectively:

365 (days) : 6.0 ≈ 60.8 (cases), 365 (days) : 5.6 ≈ 65.2 (cases), 365 (days) : 6.3 ≈ 57.9 (cases),

365 (days) : 4.9 ≈ 74.5 (cases), 365 (days) : 5.4 ≈ 67.6 (cases).

Now let's calculate the average annual workload of the judges of this district court when considering criminal cases:

$$\frac{60.8 + 65.2 + 57.9 + 74.5 + 67.6}{5} = \frac{326.0}{5} = 65.2 \text{ (cases)}.$$

That is, the average annual workload is the same as the one obtained with the harmonic mean.

Thus, using the arithmetic mean in this case is incorrect.
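The check described above can be reproduced in a few lines. This is a sketch using Python's standard `statistics` module; the per-judge times are the ones used in the per-judge calculation (6.0, 5.6, 6.3, 4.9, 5.4 days), the 365-day year is the simplification adopted in the text, and the variable names are illustrative.

```python
from statistics import harmonic_mean, mean

days_per_case = [6.0, 5.6, 6.3, 4.9, 5.4]  # average days per case for each judge
DAYS_PER_YEAR = 365                        # simplification used in the example

h = harmonic_mean(days_per_case)           # harmonic mean of time per case
a = mean(days_per_case)                    # simple arithmetic mean of time per case

# Direct check: cases per year for each judge, then their simple average.
cases_per_judge = [DAYS_PER_YEAR / d for d in days_per_case]
direct = mean(cases_per_judge)

print(f"harmonic mean:   {h:.2f} days -> {DAYS_PER_YEAR / h:.1f} cases per year")
print(f"arithmetic mean: {a:.2f} days -> {DAYS_PER_YEAR / a:.1f} cases per year")
print(f"direct average of per-judge workloads: {direct:.1f} cases per year")
```

The workload derived from the harmonic mean coincides with the direct average, while the one derived from the arithmetic mean is lower, which is why the harmonic mean is the appropriate choice here.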

When the variants of a characteristic and their volumetric values (the product of a variant and its frequency) are known, but the frequencies themselves are unknown, the weighted harmonic mean formula is used:

$$\bar{x}_{harm} = \frac{\sum_{i} w_i}{\sum_{i} \frac{w_i}{x_i}},$$

where $x_i$ are the values of the variants of the attribute, and $w_i$ are the volumetric values of the variants ($w_i = x_i f_i$).
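To see why the volumetric values act as weights, consider the sketch below. The prices and sales amounts are hypothetical (they are not the data of Table 14), and `weighted_harmonic_mean` is an illustrative name. Dividing each sales amount by its price recovers the unknown number of units, so the weighted harmonic mean equals total sales divided by total units sold.

```python
def weighted_harmonic_mean(values, volumes):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i), where w_i = x_i * f_i."""
    return sum(volumes) / sum(w / x for x, w in zip(values, volumes))

# Hypothetical data: unit price and total sales amount for three producers.
prices = [10.0, 20.0, 25.0]
sales = [1000.0, 4000.0, 2500.0]

avg_price = weighted_harmonic_mean(prices, sales)

# The implied (unknown) frequencies are the units sold: f_i = w_i / x_i.
units = [w / x for x, w in zip(prices, sales)]
check = sum(sales) / sum(units)

naive = sum(prices) / len(prices)  # simple arithmetic mean of prices, ignores volumes

print(f"weighted harmonic mean price = {avg_price:.2f}")
print(f"total sales / total units    = {check:.2f}")
print(f"simple mean of prices        = {naive:.2f}")
```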

Example. Data on the price of a unit of the same type of product produced by various institutions of the penal system, and on the volume of its sales are given in Table 14.

Table 14

Find the average selling price of the product.

Solution. When calculating the average price, we must use the ratio of the sales amount to the number of units sold. We do not know the number of units sold, but we know the amount of sales of goods. Therefore, to find the average price of goods sold, we will use the weighted harmonic average formula. We get

If you use the arithmetic average formula here, you can get an average price that will be unrealistic:

The geometric mean is calculated by extracting the root of degree N from the product of all values of the variants of the attribute:

$$\bar{x}_{geom} = \sqrt[N]{x_1 \cdot x_2 \cdot \ldots \cdot x_N},$$

where $x_1, x_2, \ldots, x_N$ are the individual values of the varying characteristic (variants), and

$N$ is the number of units in the population.

This type of average is used to calculate the average growth rates of time series.
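A short sketch of the average growth rate calculation follows. The chain growth factors and the initial level are hypothetical values chosen for illustration; the check at the end shows that replacing every yearly factor with the geometric mean leaves the final level unchanged.

```python
import math

# Hypothetical chain growth factors (each year relative to the previous one).
growth = [1.10, 1.20, 0.95]
q0 = 100.0  # hypothetical initial level

n = len(growth)
avg_growth = math.prod(growth) ** (1 / n)  # geometric mean of the chain factors

q_n = q0 * math.prod(growth)               # final level from the actual factors
q_n_from_avg = q0 * avg_growth ** n        # final level if every year grew at the average rate

print(f"average growth factor = {avg_growth:.4f}")
print(f"final level: actual = {q_n:.2f}, via average = {q_n_from_avg:.2f}")
```

The arithmetic mean of the same factors would be slightly larger and would overstate the final level, which is why growth rates are averaged geometrically.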

The quadratic mean (root mean square) is used to calculate the standard deviation, which is an indicator of variation and will be discussed below.

To determine the structure of the population, special average indicators are used: the median and the mode, the so-called structural averages. Whereas the arithmetic mean is calculated from all the values of the attribute, the median and mode characterize the value of the variant that occupies a certain position in the ranked (ordered) series. The units of a statistical population can be ordered in ascending or descending order of the variants of the characteristic being studied.

Median (Me)- this is the value that corresponds to the option located in the middle of the ranked series. Thus, the median is that version of the ranked series, on both sides of which in this series there should be an equal number of population units.

To find the median, you first need to determine its position in the ranked series using the formula:

$$N_{Me} = \frac{N + 1}{2},$$

where N is the volume of the series (the number of units in the population).

If the series consists of an odd number of terms, the median is equal to the variant with number $N_{Me}$. If the series consists of an even number of terms, the median is defined as the arithmetic mean of the two adjacent variants located in the middle.

Example. Given the ranked series 1, 2, 3, 3, 6, 7, 9, 9, 10. The volume of the series is N = 9, which means $N_{Me}$ = (9 + 1) / 2 = 5. Therefore, Me = 6, i.e. the fifth variant. If the series 1, 5, 7, 9, 11, 14, 15, 16 is given, i.e. a series with an even number of terms (N = 8), then $N_{Me}$ = (8 + 1) / 2 = 4.5. This means that the median is equal to half the sum of the fourth and fifth variants, i.e. Me = (9 + 11) / 2 = 10.
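The position rule can be expressed directly in code. The sketch below applies it to the two series from the example; the function name `ranked_median` is an assumption made for illustration.

```python
def ranked_median(values):
    """Median found through its 1-based position (N + 1) / 2 in the ranked series."""
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2
    if n % 2 == 1:
        # Odd number of terms: the variant at the median position.
        return ranked[int(position) - 1]
    # Even number of terms: mean of the two variants around the fractional position.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2

odd_series = [1, 2, 3, 3, 6, 7, 9, 9, 10]
even_series = [1, 5, 7, 9, 11, 14, 15, 16]

print(ranked_median(odd_series))   # fifth variant of the first series
print(ranked_median(even_series))  # half the sum of the fourth and fifth variants
```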

In a discrete variation series, the median is determined from the cumulative frequencies. The frequencies of the variants, starting with the first, are summed until the median position is exceeded. The value of the last variant added is the median.

Example. Find the median number of accused per criminal case using the data in Table 12.

Solution. In this case, the volume of the variation series is N = 154, therefore, N Me = (154 + 1) / 2 = 77.5. Having summed up the frequencies of the first and second options, we get: 75 + 43 = 118, i.e. we have surpassed the median number. So Me = 2.

In an interval variation series, the interval in which the median is located is identified first. It is called the median interval. This is the first interval whose cumulative frequency exceeds half the volume of the interval variation series. Then the numerical value of the median is determined by the formula:

$$Me = x_{Me} + i \cdot \frac{\frac{N}{2} - S_{Me-1}}{f_{Me}},$$

where $x_{Me}$ is the lower limit of the median interval; $i$ is the width of the median interval; $N$ is the volume of the series; $S_{Me-1}$ is the cumulative frequency of the interval preceding the median interval; $f_{Me}$ is the frequency of the median interval.

Example. Find the median age of offenders convicted of theft based on the statistics presented in Table 13.

Solution. The statistical data are presented as an interval variation series, so we first determine the median interval. The volume of the population is N = 162; therefore, the median interval is 18-28, because this is the first interval whose cumulative frequency (15 + 90 = 105) exceeds half the volume (162 : 2 = 81) of the interval variation series. Now we determine the numerical value of the median using the above formula:

$$Me = 18 + 10 \cdot \frac{81 - 15}{90} \approx 25.3.$$

Thus, half of those convicted of theft are under 25 years of age.

The mode (Mo) is the value of a characteristic that occurs most often among the units of the population. The mode is used to identify the most widespread value of a characteristic. For a discrete series, the mode is the variant with the highest frequency. For example, for the discrete series presented in Table 12, Mo = 1, since this value corresponds to the highest frequency, 75. To determine the mode of an interval series, first determine the modal interval (the interval with the highest frequency). Then, within this interval, the value of the characteristic that is the mode is found.

Its value is found using the formula:

$$Mo = x_{Mo} + i \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $x_{Mo}$ is the lower limit of the modal interval; $i$ is the width of the modal interval; $f_{Mo}$ is the frequency of the modal interval; $f_{Mo-1}$ is the frequency of the interval preceding the modal one; $f_{Mo+1}$ is the frequency of the interval following the modal one.

Example. Find the modal age of the criminals convicted of theft, using the data presented in Table 13.

Solution. The highest frequency corresponds to the interval 18-28, therefore, the mode should be in this interval. Its value is determined by the above formula:

Thus, the largest number of criminals convicted of theft are 24 years old.

The average value provides a general characteristic of the entire population of the phenomenon being studied. However, two populations that have the same average may differ significantly from each other in the degree of fluctuation (variation) of the characteristic being studied. For example, in one court the following terms of imprisonment were imposed: 3, 3, 3, 4, 5, 5, 5, 12, 12, 15 years, and in another: 5, 5, 6, 6, 7, 7, 7, 8, 8, 8 years. In both cases, the arithmetic mean is 6.7 years. However, these populations differ significantly in the spread of the individual terms of imprisonment around the average.

And for the first court, where this spread is quite large, the average term of imprisonment does not reflect the entire population. Thus, if the individual values ​​of a characteristic differ little from each other, then the arithmetic mean will be a fairly indicative characteristic of the properties of a given population. Otherwise, the arithmetic mean will be an unreliable characteristic of this population and its use in practice will be ineffective. Therefore, it is necessary to take into account the variation in the values ​​of the characteristic being studied.

Variation- these are differences in the values ​​of any characteristic among different units of a given population at the same period or point in time. The term “variation” is of Latin origin - variatio, which means difference, change, fluctuation. It arises as a result of the fact that individual values ​​of a characteristic are formed under the combined influence of various factors (conditions), which are combined differently in each special case. To measure the variation of a characteristic, various absolute and relative indicators are used.

The main indicators of variation include the following:

1) range of variation;

2) average linear deviation;

3) dispersion;

4) standard deviation;

5) coefficient of variation.

Let's briefly look at each of them.

The range of variation R is the simplest absolute indicator to calculate. It is defined as the difference between the largest and smallest values of a characteristic among the units of a given population:

$$R = x_{max} - x_{min}.$$

The range of variation (range of fluctuations) is an important indicator of the variability of a characteristic, but it shows only the extreme deviations, which limits the scope of its application. To characterize the variation of a characteristic more accurately, other indicators are used.

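The mean linear deviation is the arithmetic mean of the absolute values of the deviations of the individual values of a characteristic from the average. It is determined by the formulas:

1) For ungrouped data

$$\bar{d} = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N};$$

2) For variation series

$$\bar{d} = \frac{\sum_{i} |x_i - \bar{x}| f_i}{\sum_{i} f_i}.$$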

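However, the most widely used measure of variation is the variance. It characterizes the spread of the values of the characteristic being studied around its average. The variance is defined as the average of the squared deviations.

Simple variance for ungrouped data:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}.$$

Weighted variance for a variation series:

$$\sigma^2 = \frac{\sum_{i} (x_i - \bar{x})^2 f_i}{\sum_{i} f_i}.$$

Comment. In practice, it is more convenient to calculate the variance using the following formulas:

For simple variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \bar{x}^2.$$

For weighted variance

$$\sigma^2 = \frac{\sum_{i} x_i^2 f_i}{\sum_{i} f_i} - \bar{x}^2.$$

The standard deviation is the square root of the variance:

$$\sigma = \sqrt{\sigma^2}.$$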

The standard deviation is a measure of the reliability of the mean. The smaller the standard deviation, the more homogeneous the population and the better the arithmetic mean reflects the entire population.

The measures of spread discussed above (range of variation, variance, standard deviation) are absolute indicators, from which it is not always possible to judge the degree of variability of a characteristic. In some problems it is necessary to use relative measures of spread, one of which is the coefficient of variation.

The coefficient of variation is the ratio of the standard deviation to the arithmetic mean, expressed as a percentage:

$$V = \frac{\sigma}{\bar{x}} \cdot 100\%.$$

The coefficient of variation is used not only for a comparative assessment of the variation of different characteristics or the same characteristic in different populations, but also to characterize the homogeneity of the population. A statistical population is considered quantitatively homogeneous if the coefficient of variation does not exceed 33% (for distributions close to the normal distribution).
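The indicators of variation can be computed together for the two courts compared earlier, whose terms of imprisonment have the same mean of 6.7 years. This is a minimal sketch: the function name `variation` is illustrative, and the variance is the population variance (division by N), as in the formulas above.

```python
import math

def variation(values):
    """Range, mean linear deviation, variance, standard deviation and coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n  # population variance
    std = math.sqrt(variance)
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "mean_linear_deviation": sum(abs(x - mean) for x in values) / n,
        "variance": variance,
        "std": std,
        "cv_percent": std / mean * 100,
    }

court_1 = [3, 3, 3, 4, 5, 5, 5, 12, 12, 15]
court_2 = [5, 5, 6, 6, 7, 7, 7, 8, 8, 8]

for name, terms in (("court 1", court_1), ("court 2", court_2)):
    stats = variation(terms)
    verdict = "homogeneous" if stats["cv_percent"] <= 33 else "heterogeneous"
    print(f"{name}: mean={stats['mean']:.1f}, R={stats['range']}, "
          f"variance={stats['variance']:.2f}, std={stats['std']:.2f}, "
          f"V={stats['cv_percent']:.1f}% ({verdict})")
```

With the 33% threshold from the text, the first court's terms come out as heterogeneous and the second court's as homogeneous, even though the means coincide.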

Example. The following data are available on the terms of imprisonment of 50 convicts delivered to serve a sentence imposed by the court in a correctional institution of the penal system: 5, 4, 2, 1, 6, 3, 4, 3, 2, 2, 5, 6, 4, 3 , 10, 5, 4, 1, 2, 3, 3, 4, 1, 6, 5, 3, 4, 3, 5, 12, 4, 3, 2, 4, 6, 4, 4, 3, 1 , 5, 4, 3, 12, 6, 7, 3, 4, 5, 5, 3.

1. Construct a series of distributions by terms of imprisonment.

2. Find the mean, variance and standard deviation.

3. Calculate the coefficient of variation and make a conclusion about the homogeneity or heterogeneity of the population being studied.

Solution. To construct a discrete distribution series, it is necessary to determine options and frequencies. The option in this problem is the term of imprisonment, and the frequency is the number of individual options. Having calculated the frequencies, we obtain the following discrete distribution series:

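Let's find the mean and variance. Since the statistical data are represented by a discrete variation series, we use the formulas for the weighted arithmetic mean and the weighted variance. We get:

$$\bar{x} = \frac{\sum_{i} x_i f_i}{\sum_{i} f_i} = 4.1;$$

$$\sigma^2 = 5.21.$$

Now we calculate the standard deviation:

$$\sigma = \sqrt{5.21} \approx 2.28.$$

Finding the coefficient of variation:

$$V = \frac{2.28}{4.1} \cdot 100\% \approx 55.6\%.$$

Consequently, since the coefficient of variation exceeds 33%, the statistical population is quantitatively heterogeneous.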

Discipline: Statistics

Option No. 2

Average values ​​used in statistics

Introduction

Theoretical task

Average value in statistics, its essence and conditions of application.

1.1. The essence of the average value and conditions of use

1.2. Types of averages

Practical task

Tasks 1, 2, 3

Conclusion

List of references

Introduction

This test consists of two parts – theoretical and practical. In the theoretical part, such an important statistical category as the average value will be examined in detail in order to identify its essence and conditions of application, as well as highlight the types of averages and methods for their calculation.

Statistics, as we know, studies massive socio-economic phenomena. Each of these phenomena may have a different quantitative expression of the same characteristic. For example, wages of workers of the same profession or market prices for the same product, etc. Average values ​​characterize the qualitative indicators of commercial activity: distribution costs, profit, profitability, etc.

To study any population according to varying (quantitatively changing) characteristics, statistics uses average values.

The essence of the average value

The average value is a generalizing quantitative characteristic of a set of similar phenomena based on one varying characteristic. In economic practice, a wide range of indicators are used, calculated as average values.

The most important property of the average value is that it represents the value of a certain characteristic in the entire population with one number, despite its quantitative differences in individual units of the population, and expresses what is common to all units of the population under study. Thus, through the characteristics of a unit of a population, it characterizes the entire population as a whole.

Average values ​​are related to the law of large numbers. The essence of this connection is that during averaging, random deviations of individual values, due to the action of the law of large numbers, cancel each other out and the main development trend, necessity, and pattern are revealed in the average. Average values ​​allow you to compare indicators related to populations with different numbers of units.

In the modern conditions of developing market relations, averages serve as a tool for studying the objective patterns of socio-economic phenomena. However, economic analysis cannot be limited to average indicators alone, since favorable overall averages may conceal both serious shortcomings in the activities of individual economic entities and the first shoots of something new and progressive. For example, the distribution of the population by income makes it possible to identify the formation of new social groups. Therefore, along with average statistical data, it is necessary to take into account the characteristics of individual units of the population.

The average value is the resultant of all factors influencing the phenomenon under study. That is, when calculating average values, the influence of random (perturbation, individual) factors cancels out and, thus, it is possible to determine the pattern inherent in the phenomenon under study. Adolphe Quetelet emphasized that the significance of the method of averages is the possibility of transition from the individual to the general, from the random to the regular, and the existence of averages is a category of objective reality.

Statistics studies mass phenomena and processes. Each of these phenomena has both properties common to the entire set and special, individual properties. The difference between individual phenomena is called variation. Another property of mass phenomena is the inherent similarity of the characteristics of individual phenomena. Thus, the interaction of the elements of a set limits the variation of at least some of their properties. This tendency exists objectively, and it is this objectivity that explains the wide use of average values in practice and in theory.

The average value in statistics is a general indicator that characterizes the typical level of a phenomenon in specific conditions of place and time, reflecting the value of a varying characteristic per unit of a qualitatively homogeneous population.

Using the method of averages, statistics solves many problems.

The main significance of averages lies in their generalizing function, that is, the replacement of many different individual values ​​of a characteristic with an average value that characterizes the entire set of phenomena.

If the average value generalizes qualitatively homogeneous values of a characteristic, then it is a typical characteristic of that attribute in the given population.

However, it is incorrect to reduce the role of average values only to characterizing typical values in populations that are homogeneous for a given characteristic. In practice, modern statistics far more often uses average values that generalize clearly heterogeneous phenomena.

Average national income per capita, average grain yield across the country, average consumption of various food products: these are characteristics of the state as a single national economic system, the so-called system averages.

System averages can characterize both spatial or object systems that exist simultaneously (state, industry, region, planet Earth, etc.), and dynamic systems, extended in time (year, decade, season, etc.).

The most important property of the average value is that it reflects what is common to all units of the population under study. The attribute values ​​of individual units of the population fluctuate in one direction or another under the influence of many factors, among which there may be both basic and random. For example, the stock price of a corporation as a whole is determined by its financial position. At the same time, on certain days and on certain exchanges, these shares, due to prevailing circumstances, may be sold at a higher or lower rate. The essence of the average lies in the fact that it cancels out the deviations of the characteristic values ​​of individual units of the population caused by the action of random factors, and takes into account the changes caused by the action of the main factors. This allows the average to reflect the typical level of the trait and abstract from individual characteristics, inherent in individual units.

Calculating the average is one of the most common generalization techniques; the average indicator reflects what is common (typical) for all units of the population being studied, while at the same time it ignores the differences of individual units. In every phenomenon and its development there is a combination of chance and necessity.

The average is a summary characteristic of the laws of the process in the conditions in which it occurs.

Each average characterizes the population under study according to any one characteristic, but to characterize any population, describe its typical features and qualitative features, a system of average indicators is needed. Therefore, in the practice of domestic statistics, to study socio-economic phenomena, as a rule, a system of average indicators is calculated. So, for example, the average wage indicator is assessed together with indicators of average output, capital-labor ratio and energy-labor ratio, the degree of mechanization and automation of work, etc.

The average should be calculated taking into account the economic content of the indicator under study. Therefore, for a specific indicator used in socio-economic analysis, only one true value of the average can be calculated based on the scientific method of calculation.

The average value is one of the most important generalizing statistical indicators, characterizing a set of similar phenomena according to some quantitatively varying characteristic. Averages in statistics are general indicators, numbers expressing the typical characteristic dimensions of social phenomena according to one quantitatively varying characteristic.

Types of averages

The types of average values ​​differ primarily in what property, what parameter of the initial varying mass of individual values ​​of the attribute must be kept unchanged.

Arithmetic mean

The arithmetic mean is the average value of a characteristic whose calculation leaves the total volume of the characteristic in the population unchanged. In other words, the arithmetic mean is the average summand: when it is calculated, the total volume of the attribute is mentally distributed equally among all units of the population.

The arithmetic mean is used if the values ​​of the characteristic being averaged (x) and the number of population units with a certain characteristic value (f) are known.

The arithmetic average can be simple or weighted.

Simple arithmetic mean

The simple arithmetic mean is used if each value of attribute x occurs once, i.e. the frequency of each x is f = 1, or if the source data are not ordered and it is unknown how many units have each attribute value.

The formula for the simple arithmetic mean is:

$$\bar{x} = \frac{\sum x}{n},$$

where $\bar{x}$ is the average value; x is the value of the averaged characteristic (variant); n is the number of units of the population being studied.

Weighted arithmetic mean

Unlike the simple mean, the weighted arithmetic mean is used if each value of attribute x occurs several times, i.e. the frequency of each value is f ≠ 1. This average is widely used for calculations based on a discrete distribution series:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i},$$

where k is the number of groups, x is the value of the characteristic being averaged, and f is the weight of the characteristic value (the frequency, if f is the number of units in the population; the relative frequency, if f is the proportion of units with variant x in the total volume of the population).
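The weighted arithmetic mean described here can be sketched in a few lines of Python. The data and the function name `weighted_mean` are invented for illustration. The sketch shows that weighting by counts and weighting by shares give the same average, and that with all frequencies equal to 1 the weighted mean reduces to the simple mean.

```python
def weighted_mean(variants, weights):
    """Weighted arithmetic mean: sum(x * f) / sum(f)."""
    return sum(x * f for x, f in zip(variants, weights)) / sum(weights)

# Hypothetical discrete series: three variants and how many units have each.
variants = [10, 20, 30]
counts = [2, 5, 3]

# The same weights expressed as shares of the population total.
shares = [f / sum(counts) for f in counts]

mean_from_counts = weighted_mean(variants, counts)
mean_from_shares = weighted_mean(variants, shares)

# With every frequency equal to 1 the weighted mean is the simple mean.
simple_mean = weighted_mean(variants, [1] * len(variants))

print(mean_from_counts, mean_from_shares, simple_mean)
```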

Harmonic mean

Along with the arithmetic mean, statistics uses the harmonic mean, which is the reciprocal of the arithmetic mean of the reciprocal values of the attribute. Like the arithmetic mean, it can be simple or weighted. It is used when the necessary weights (f_i) are not given directly in the initial data but enter as a factor into one of the available indicators (i.e., when the numerator of the initial ratio of the average is known, but its denominator is not).

Weighted harmonic mean

The product xf gives the volume of the averaged characteristic x for a set of units and is denoted w. If the source data contain the values of the characteristic x and the volumes w, then the weighted harmonic mean is used to calculate the average:

$$\bar{x} = \frac{\sum w}{\sum \frac{w}{x}},$$

where x is the value of the averaged characteristic (variant); w is the weight of variant x, i.e. the volume of the averaged characteristic.

Simple (unweighted) harmonic mean

This form of the average, used much less frequently, is:

$$\bar{x} = \frac{n}{\sum \frac{1}{x}},$$

where x is the value of the characteristic being averaged; n is the number of x values.

That is, it is the reciprocal of the simple arithmetic mean of the reciprocal values of the attribute.

In practice, the simple harmonic mean is rarely used; it applies in cases where the values of w are equal for all population units.
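The link between the weighted harmonic mean and the weighted arithmetic mean can be checked numerically. The sketch below uses invented data (wage levels and numbers of workers in three groups) and hypothetical function names. When the volumes w = x·f are known instead of the frequencies f, the harmonic form weighted by w gives the same average as the arithmetic form weighted by f.

```python
def weighted_arithmetic_mean(x, f):
    """sum(x * f) / sum(f): used when the frequencies f are known."""
    return sum(xi * fi for xi, fi in zip(x, f)) / sum(f)

def weighted_harmonic_mean(x, w):
    """sum(w) / sum(w / x): used when only the volumes w = x * f are known."""
    return sum(w) / sum(wi / xi for xi, wi in zip(x, w))

def simple_harmonic_mean(x):
    """n / sum(1 / x): the special case of equal volumes w."""
    return len(x) / sum(1 / xi for xi in x)

# Hypothetical data (assumption): wage level and number of workers per group.
wages = [200, 250, 400]
workers = [10, 6, 4]
wage_funds = [x * f for x, f in zip(wages, workers)]  # w = x * f

mean_from_workers = weighted_arithmetic_mean(wages, workers)
mean_from_funds = weighted_harmonic_mean(wages, wage_funds)
unweighted = simple_harmonic_mean(wages)

print(mean_from_workers, mean_from_funds, round(unweighted, 2))
```

The unweighted result differs from the other two because the wage funds in this example are not equal.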

Mean square and mean cubic

In a number of cases in economic practice, the average of a characteristic expressed in square or cubic units of measurement must be calculated. The mean square is then used (for example, to calculate the average side length of n square plots, or the average diameter of pipes, tree trunks, etc.), as is the mean cubic (for example, to determine the average edge length of n cubes).

If, when replacing individual values ​​of a characteristic with an average value, it is necessary to keep the sum of the squares of the original values ​​unchanged, then the average will be a quadratic average value, simple or weighted.

Simple mean square

The simple mean square is used if each value of the attribute x occurs once; in general it has the form:

$$\bar{x}_{q} = \sqrt{\frac{\sum x^2}{n}},$$

where $x^2$ is the square of the values of the characteristic being averaged; n is the number of units in the population.

Weighted mean square

The weighted mean square is applied if each value of the averaged characteristic x occurs f times:

$$\bar{x}_{q} = \sqrt{\frac{\sum x^2 f}{\sum f}},$$

where f is the weight of variants x.

Simple and weighted cubic mean

The simple cubic mean is the cube root of the quotient of the sum of the cubes of the individual attribute values divided by their number:

$$\bar{x}_{c} = \sqrt[3]{\frac{\sum x^3}{n}},$$

where x are the values of the attribute and n is their number.

Weighted cubic mean:

$$\bar{x}_{c} = \sqrt[3]{\frac{\sum x^3 f}{\sum f}},$$

where f is the weight of variants x.

The square and cubic means have limited use in statistical practice. The mean square is widely used, however, not for the variants x themselves but for their deviations from the average when calculating measures of variation.
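As a small illustration of what these two means keep unchanged, the sketch below uses invented side lengths of square plots and edge lengths of cubes (both assumptions). Replacing every side by the mean square preserves the total area, and replacing every edge by the cubic mean preserves the total volume.

```python
import math

def quadratic_mean(values):
    """Square root of the mean of the squares."""
    return math.sqrt(sum(x ** 2 for x in values) / len(values))

def cubic_mean(values):
    """Cube root of the mean of the cubes."""
    return (sum(x ** 3 for x in values) / len(values)) ** (1 / 3)

sides = [3, 4, 5]  # hypothetical side lengths of square plots
edges = [1, 2, 3]  # hypothetical edge lengths of cubes

q = quadratic_mean(sides)
c = cubic_mean(edges)

total_area = sum(s ** 2 for s in sides)
total_volume = sum(e ** 3 for e in edges)

# Areas and volumes rebuilt from the averages match the original totals.
area_from_mean = len(sides) * q ** 2
volume_from_mean = len(edges) * c ** 3

print(round(q, 3), round(c, 3), total_area, total_volume)
```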

The average can be calculated not for all, but for some part of the units in the population. An example of such an average could be the progressive average as one of the partial averages, calculated not for everyone, but only for the “best” (for example, for indicators above or below individual averages).

Geometric mean

If the values ​​of the characteristic being averaged are significantly different from each other or are specified by coefficients (growth rates, price indices), then the geometric mean is used for calculation.

The geometric mean is calculated by extracting the n-th root of the product of the individual values (variants) of the characteristic x:

$$\bar{x}_{g} = \sqrt[n]{\prod x_i},$$

where n is the number of variants; Π is the product sign.

The geometric mean is most widely used to determine the average rate of change in dynamics series, as well as in distribution series.
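A minimal sketch of the average rate of change, assuming an invented series of four levels: the geometric mean of the chain growth coefficients is the single coefficient that, applied at every step, takes the first level to the last one.

```python
import math

def geometric_mean(coefficients):
    """n-th root of the product of n positive coefficients."""
    return math.prod(coefficients) ** (1 / len(coefficients))

# Hypothetical levels of a dynamics series (assumption, not from the text).
levels = [100.0, 110.0, 121.0, 145.2]

# Chain growth coefficients: each level divided by the previous one.
chain = [cur / prev for prev, cur in zip(levels, levels[1:])]

avg_growth = geometric_mean(chain)

# Applying the average coefficient at every step restores the last level.
restored = levels[0] * avg_growth ** len(chain)

print([round(k, 3) for k in chain], round(avg_growth, 4), round(restored, 1))
```

The arithmetic mean of the same coefficients would overstate the final level, which is why the geometric form is preferred for growth rates.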

Average values ​​are general indicators in which the effect of general conditions and the pattern of the phenomenon being studied are expressed. Statistical averages are calculated on the basis of mass data from correctly statistically organized mass observation (continuous or sample). However, the statistical average will be objective and typical if it is calculated from mass data for a qualitatively homogeneous population (mass phenomena). The use of averages should proceed from a dialectical understanding of the categories of general and individual, mass and individual.

The combination of general means with group means makes it possible to limit qualitatively homogeneous populations. By dividing the mass of objects that make up this or that complex phenomenon into internally homogeneous, but qualitatively different groups, characterizing each of the groups with its average, it is possible to reveal the reserves of the process of an emerging new quality. For example, the distribution of the population by income allows us to identify the formation of new social groups. In the analytical part, we looked at a particular example of using the average value. To summarize, we can say that the scope and use of averages in statistics is quite wide.

Practical task

Task No. 1

Determine the average purchase rate and the average sale rate of one US dollar.

Average purchase rate

Average selling rate

Task No. 2

The dynamics of the volume of own public catering products in the Chelyabinsk region for 1996-2004 are presented in the table in comparable prices (million rubles)

Link series A and B. To analyze the dynamics of finished-product output, calculate:

1. Absolute increments, and chain and base growth rates and rates of increase

2. Average annual production of finished products

3. Average annual growth rate and increase in the company’s products

4. Perform analytical alignment of the dynamics series and calculate the forecast for 2005

5. Graphically depict a series of dynamics

6. Draw a conclusion based on the dynamics results

1) Base absolute increment: Δyi B = yi − y1; chain absolute increment: Δyi C = yi − y(i−1)

Δy2 B = 2.175 − 2.04 = 0.135; Δy2 C = 2.175 − 2.04 = 0.135

Δy3 B = 2.505 − 2.04 = 0.465; Δy3 C = 2.505 − 2.175 = 0.33

Δy4 B = 2.73 − 2.04 = 0.69; Δy4 C = 2.73 − 2.505 = 0.225

Δy5 B = 1.5 − 2.04 = −0.54; Δy5 C = 1.5 − 2.73 = −1.23

Δy6 B = 3.34 − 2.04 = 1.30; Δy6 C = 3.34 − 1.5 = 1.84

Δy7 B = 3.63 − 2.04 = 1.59; Δy7 C = 3.63 − 3.34 = 0.29

Δy8 B = 3.96 − 2.04 = 1.92; Δy8 C = 3.96 − 3.63 = 0.33

Δy9 B = 4.41 − 2.04 = 2.37; Δy9 C = 4.41 − 3.96 = 0.45

Growth rates for i = 2, ..., 9: base Tr Bi = yi / y1; chain Tr Ci = yi / y(i−1)

Rate of increase: Tpr = (Tr · 100%) − 100%

Tpr B2 = (1.066 · 100%) − 100% = 6.6%

Tpr C3 = (1.151 · 100%) − 100% = 15.1%
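The base and chain indicators can be recomputed in Python. The nine levels below are the values that appear in the subtractions above; assigning them to the years 1996 to 2004 in order is an assumption, since the source table did not survive.

```python
# Levels y1..y9 as they appear in the calculations above (million rubles).
levels = [2.04, 2.175, 2.505, 2.73, 1.5, 3.34, 3.63, 3.96, 4.41]
first = levels[0]

# Absolute increments: base (against y1) and chain (against the previous level).
base_increments = [y - first for y in levels[1:]]
chain_increments = [cur - prev for prev, cur in zip(levels, levels[1:])]

# Growth coefficients: base and chain.
base_growth = [y / first for y in levels[1:]]
chain_growth = [cur / prev for prev, cur in zip(levels, levels[1:])]

# Rates of increase in percent: growth rate in percent minus 100%.
base_increase_pct = [(g - 1) * 100 for g in base_growth]
chain_increase_pct = [(g - 1) * 100 for g in chain_growth]

# Assumed year labels: the second level is taken to be 1997.
for year, db, dc, gb, gc in zip(range(1997, 2005), base_increments,
                                chain_increments, base_growth, chain_growth):
    print(year, round(db, 3), round(dc, 3), round(gb, 3), round(gc, 3))
```

A useful check is that the chain increments add up to the last base increment, and that the product of the chain coefficients equals the last base coefficient.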

2) ȳ = 2.921 million rubles – average annual output

Trend values from the fitted line ŷt = 2.921 + 0.294·t:

ŷ1 = 2.921 + 0.294·(−4) = 2.921 − 1.176 = 1.745

ŷ2 = 2.921 + 0.294·(−3) = 2.921 − 0.882 = 2.039

(ŷ1 − y1)² = (1.745 − 2.04)² = 0.087

(ŷ1 − ȳ)² = (1.745 − 2.921)² = 1.383

(y1 − ȳ)² = (2.04 − 2.921)² = 0.776
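A sketch of the analytical alignment step, assuming a straight-line trend fitted by least squares with time centered on the middle of the series (t = −4, ..., 4). The levels are the nine values used in the increments above. With centered time the intercept is simply the mean level, and the fit gives values close to the 2.921 and 0.294 used in the trend calculations.

```python
# Levels y1..y9 as they appear in the calculations above (million rubles).
levels = [2.04, 2.175, 2.505, 2.73, 1.5, 3.34, 3.63, 3.96, 4.41]
n = len(levels)

# Centered time index for an odd number of levels: -4, -3, ..., 4.
t = [i - (n - 1) // 2 for i in range(n)]

# Least-squares line y = a0 + a1 * t; because sum(t) == 0,
# a0 is the mean level and a1 = sum(t * y) / sum(t ** 2).
a0 = sum(levels) / n
a1 = sum(ti * yi for ti, yi in zip(t, levels)) / sum(ti ** 2 for ti in t)

# Trend (aligned) values for every point of the series.
trend = [a0 + a1 * ti for ti in t]

print(round(a0, 3), round(a1, 3), [round(v, 3) for v in trend[:2]])
```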

Tp

By

y2005 = 2.921 + 1.496·4 = 2.921 + 5.984 = 8.905

8.905 + 2.306·1.496 = 12.354

8.905 − 2.306·1.496 = 5.456

5.456 ≤ y2005 ≤ 12.354


Task No. 3

Statistical data on wholesale supplies of food and non-food items and the retail trade network of the region in 2003 and 2004 are presented in the corresponding graphs.

According to Tables 1 and 2, it is required

1. Find the general index of the wholesale supply of food products in actual prices;

2. Find the general index of the physical volume of food supply;

3. Compare general indices and draw the appropriate conclusion;

4. Find the general index of supply of non-food products in actual prices;

5. Find the general index of the physical volume of supply of non-food products;

6. Compare the obtained indices and draw conclusions on non-food products;

7. Find the consolidated general supply indexes of the entire commodity mass in actual prices;

8. Find the consolidated general index of physical volume (for the entire commodity mass of goods);

9. Compare the resulting summary indices and draw the appropriate conclusion.
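The index calculations requested in this list can be sketched as follows. All figures are invented, since Tables 1 and 2 did not survive, and the aggregate formulas are the standard ones, assumed here rather than taken from the source: the index in actual prices is Σp1q1 / Σp0q0, and the physical volume index is Σp0q1 / Σp0q0, where Σp0q1 is the supply of the reporting period valued at base-period prices.

```python
def general_index(numerator_values, denominator_values):
    """Aggregate index: ratio of two totals over the same list of goods."""
    return sum(numerator_values) / sum(denominator_values)

# Hypothetical supplies for two product groups (assumption).
base_value = [100.0, 200.0]                # p0 * q0, base period
reporting_value = [150.0, 260.0]           # p1 * q1, reporting period
reporting_at_base_prices = [120.0, 210.0]  # p0 * q1

# Index of supply in actual prices and index of physical volume.
value_index = general_index(reporting_value, base_value)
volume_index = general_index(reporting_at_base_prices, base_value)

# Their ratio isolates the effect of price changes.
price_index = value_index / volume_index

print(round(value_index, 3), round(volume_index, 3), round(price_index, 3))
```

Comparing the two indices shows how much of the growth in actual prices comes from larger volumes and how much from higher prices.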

Base period

Reporting period (2004)

Supplies of the reporting period at prices of the base period

1,291-0,681=0,61= - 39

Conclusion

In conclusion, let's summarize. Average values ​​are general indicators in which the effect of general conditions and the pattern of the phenomenon being studied are expressed. Statistical averages are calculated on the basis of mass data from correctly statistically organized mass observation (continuous or sample). However, the statistical average will be objective and typical if it is calculated from mass data for a qualitatively homogeneous population (mass phenomena). The use of averages should proceed from a dialectical understanding of the categories of general and individual, mass and individual.

The average reflects what is common in each individual, individual object; therefore, the average becomes of great importance for identifying patterns inherent in mass social phenomena and invisible in individual phenomena.

The deviation of the individual from the general is a manifestation of the development process. In some isolated cases, elements of the new and advanced may be present. In this case, it is specific factors, taken against the background of average values, that characterize the development process. Therefore, the average reflects the characteristic, typical, real level of the phenomena being studied. Characterizing these levels and their changes in time and space is one of the main tasks of averages. Thus, averages reveal, for example, what is characteristic of enterprises at a certain stage of economic development; changes in the well-being of the population are reflected in average wages, in family income overall and for individual social groups, and in the level of consumption of products, goods and services.

The average indicator is a typical value (ordinary, normal, prevailing as a whole), but it is such because it is formed in the normal, natural conditions of the existence of a specific mass phenomenon, considered as a whole. The average reflects an objective property of the phenomenon. In reality, often only deviating phenomena exist, and the average as a phenomenon may not exist, although the concept of typicality is borrowed from reality. The average value reflects the value of the characteristic being studied and is therefore measured in the same units as that characteristic. However, there are various ways of approximately determining the level of a distribution in order to compare summary characteristics that are not directly comparable with each other, for example, the average population in relation to territory (average population density). The content of the average is also determined by which factor needs to be eliminated.

Average values ​​refer to general statistical indicators that provide a summary (final) characteristic of mass social phenomena, since they are built on the basis of a large number of individual values ​​of a varying characteristic. To clarify the essence of the average value, it is necessary to consider the peculiarities of the formation of the values ​​of the signs of those phenomena, according to the data of which the average value is calculated.

It is known that units of each mass phenomenon have numerous characteristics. Whichever of these characteristics we take, its values ​​will be different for individual units; they change, or, as they say in statistics, vary from one unit to another. For example, an employee’s salary is determined by his qualifications, nature of work, length of service and a number of other factors, and therefore varies within very wide limits. The combined influence of all factors determines the amount of earnings of each employee, however, we can talk about the average monthly salary of workers in different sectors of the economy. Here we operate with a typical, characteristic value of a varying characteristic, assigned to a unit of a large population.

The average value reflects what is general and typical for all units of the population being studied. At the same time, it balances the influence of all factors acting on the value of the characteristic in individual units, as if mutually cancelling them. The level (or size) of any social phenomenon is determined by the action of two groups of factors. Some are general and principal, act constantly, are closely related to the nature of the phenomenon or process being studied, and form what is typical for all units of the population, which is reflected in the average value. Others are individual; their effect is less pronounced and is episodic and random. They act in the opposite direction, causing differences between the quantitative characteristics of individual units and tending to change the constant value of the characteristics being studied. The effect of individual factors is cancelled out in the average value. In the combined influence of typical and individual factors, which is balanced and mutually cancelled in general characteristics, the law of large numbers, a fundamental principle known from mathematical statistics, manifests itself in general form.

In the aggregate, the individual values of the characteristic merge into a common mass and, as it were, dissolve. Hence the average value is “impersonal”: it can deviate from the individual values of the characteristic without coinciding quantitatively with any of them. The average value reflects what is general, characteristic and typical for the entire population because random, atypical differences between its individual units cancel each other out, its value being determined, as it were, by the common resultant of all causes.

However, in order for the average value to reflect the most typical value of a characteristic, it should not be determined for any population, but only for populations consisting of qualitatively homogeneous units. This requirement is the main condition for the scientifically based use of averages and implies a close connection between the method of averages and the method of groupings in the analysis of socio-economic phenomena. Consequently, the average value is a general indicator characterizing the typical level of a varying characteristic per unit of a homogeneous population under specific conditions of place and time.

In thus defining the essence of average values, it is necessary to emphasize that the correct calculation of any average value presupposes the fulfillment of the following requirements:

  • the population from which the average value is calculated must be qualitatively homogeneous. This means that the calculation of average values should be based on the grouping method, which ensures the identification of homogeneous, similar phenomena;
  • the influence of random, purely individual causes and factors on the calculation of the average value must be excluded. This is achieved when the calculation of the average is based on sufficiently massive material in which the law of large numbers manifests itself and all randomness cancels out;
  • the purpose of calculating the average value must be established, together with the so-called defining indicator (property) on which it should be oriented.
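The second requirement, that randomness cancels out on sufficiently massive material, can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: individual wages are modelled as a typical level of 500 plus a random normal deviation with spread 100, and the helper name `sample_mean_wage` is hypothetical.

```python
import random

def sample_mean_wage(n, rng, typical=500.0, spread=100.0):
    """Mean of n simulated wages: typical level plus a random individual deviation."""
    return sum(rng.gauss(typical, spread) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Distance between the sample mean and the typical level for growing samples.
deviations = {n: abs(sample_mean_wage(n, rng) - 500.0)
              for n in (10, 1_000, 100_000)}

for n, dev in deviations.items():
    print(n, round(dev, 3))
```

For large samples the mean settles very close to the typical level, although for any single small sample the deviation can still be noticeable.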

The defining indicator can be the sum of the values of the characteristic being averaged, the sum of its reciprocal values, the product of its values, etc. The relationship between the defining indicator and the average value is as follows: if all values of the characteristic being averaged are replaced by the average value, then their sum or product, i.e. the defining indicator, does not change. Based on this connection, an initial quantitative relationship is constructed for the direct calculation of the average value. The ability of average values to preserve the properties of statistical populations is called the defining property.

The average value calculated for the population as a whole is called the general average; average values calculated for each group are called group averages. The general average reflects the common features of the phenomenon being studied, while the group average characterizes the phenomenon as it develops under the specific conditions of a given group.

Calculation methods may be different, therefore in statistics there are several types of averages, the main ones being the arithmetic mean, the harmonic mean and the geometric mean.

In economic analysis, the use of averages is the main tool for assessing the results of scientific and technological progress, social events, and searching for reserves for economic development. At the same time, it should be remembered that excessive reliance on average indicators can lead to biased conclusions when conducting economic and statistical analysis. This is due to the fact that average values, being general indicators, extinguish and ignore those differences in the quantitative characteristics of individual units of the population that actually exist and may be of independent interest.

Types of averages

In statistics, various types of averages are used, which are divided into two large classes:

  • power means (harmonic mean, geometric mean, arithmetic mean, quadratic mean, cubic mean);
  • structural means (mode, median).

To calculate power means, all available values of the characteristic must be used. The mode and the median are determined only by the structure of the distribution, which is why they are called structural, or positional, averages. The median and the mode are often used as the average characteristic in populations where calculating a power mean is impossible or impractical.

The most common type of average is the arithmetic mean. The arithmetic mean is understood as the value of a characteristic that each unit of the population would have if the total sum of all values of the characteristic were distributed evenly among all units. Its calculation comes down to summing all the values of the varying characteristic and dividing the resulting sum by the total number of units in the population. For example, five workers fulfilled an order for the production of parts: the first produced 5 parts, the second 7, the third 4, the fourth 10, and the fifth 12. Since each variant occurs only once in the source data, the average output of one worker is determined by the simple arithmetic mean formula:

$$\bar{x} = \frac{\sum x_i}{n},$$

i.e. in our example, the average output of one worker is equal to

$$\bar{x} = \frac{5 + 7 + 4 + 10 + 12}{5} = \frac{38}{5} = 7.6 \text{ parts}.$$

Along with the simple arithmetic mean, the weighted arithmetic mean is also studied. For example, let's calculate the average age of students in a group of 20 people, whose ages range from 18 to 22 years, where x_i are the variants of the characteristic being averaged and f_i is the frequency, which shows how many times the i-th value occurs in the population (Table 5.1).

Table 5.1

Average age of students

Applying the weighted arithmetic mean formula, we get:


There is a rule for choosing the weighted arithmetic mean: if there is a series of data on two indicators, for one of which the average value must be calculated, and the numerical values of the denominator of its logical formula are known while the values of the numerator are unknown but can be found as the product of these indicators, then the average value should be calculated using the weighted arithmetic mean formula.

In some cases, the nature of the initial statistical data is such that calculating the arithmetic mean loses its meaning, and the only possible generalizing indicator is another type of average, the harmonic mean. Nowadays the computational properties of the arithmetic mean have lost their relevance for calculating general statistical indicators because of the widespread introduction of electronic computing. The harmonic mean, which can also be simple or weighted, has acquired great practical importance. If the numerical values of the numerator of the logical formula are known while the values of the denominator are unknown but can be found as the quotient of dividing one indicator by the other, then the average value is calculated using the weighted harmonic mean formula.

For example, suppose a car covered the first 210 km at a speed of 70 km/h and the remaining 150 km at a speed of 75 km/h. The average speed over the entire journey of 360 km cannot be determined with the arithmetic mean formula. Since the variants are the speeds on the individual sections, x_1 = 70 km/h and x_2 = 75 km/h, and the weights (f_i) are the corresponding sections of the path, the products of the variants and the weights have neither physical nor economic meaning. What does have meaning here are the quotients of the sections of the path divided by the corresponding speeds (variants x_i), i.e. the time spent on the individual sections (f_i / x_i). If the sections of the path are denoted by f_i, then the entire path is Σf_i and the time spent on the entire path is Σ(f_i / x_i). The average speed can then be found as the quotient of the entire path divided by the total time spent:

$$\bar{x} = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}.$$

In our example we get:

$$\bar{x} = \frac{210 + 150}{\frac{210}{70} + \frac{150}{75}} = \frac{360}{5} = 72 \text{ km/h}.$$

If, when using the harmonic mean, the weights of all options (f) are equal, then the simple (unweighted) harmonic mean can be used instead of the weighted one:

$$\bar{x} = \frac{n}{\sum \frac{1}{x_i}}$$

where xi are the individual options and n is the number of variants of the averaged characteristic. In the speed example, the simple harmonic mean could be applied if the path segments traveled at different speeds were of equal length.
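The car example can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and not part of the original text.

```python
def weighted_harmonic_mean(values, weights):
    """Return sum(f_i) / sum(f_i / x_i) for options x_i and weights f_i."""
    total_weight = sum(weights)
    total_ratio = sum(f / x for x, f in zip(values, weights))
    return total_weight / total_ratio


def simple_harmonic_mean(values):
    """Return n / sum(1 / x_i), the special case of equal weights."""
    return len(values) / sum(1 / x for x in values)


speeds = [70, 75]        # km/h on each section (from the example)
distances = [210, 150]   # km covered at each speed (the weights f_i)

average_speed = weighted_harmonic_mean(speeds, distances)
print(average_speed)     # total distance 360 km divided by total time 5 h
```

If both sections had the same length, `simple_harmonic_mean(speeds)` would give the same result as the weighted version.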

Any average value must be calculated so that when it replaces each variant of the averaged characteristic, the value of some final, general indicator that is associated with the averaged indicator does not change. Thus, when replacing actual speeds on individual sections of the route with their average value (average speed), the total distance should not change.

The form (formula) of the average value is determined by the nature (mechanism) of the relationship between this final indicator and the averaged one. The final indicator, whose value must not change when the options are replaced by their average value, is therefore called the defining indicator. To derive the formula for the average, you need to set up and solve an equation using the relationship between the averaged indicator and the defining one. This equation is constructed by replacing the variants of the characteristic (indicator) being averaged with their average value.

In addition to the arithmetic mean and the harmonic mean, other types (forms) of the mean are used in statistics. They are all special cases of the power mean. If all types of power means are calculated for the same data, their values will not be the same; the majorant rule of means applies here: as the exponent of the mean increases, the mean itself increases. The formulas for the various types of power means most frequently used in practical research are presented in Table 5.2.

Table 5.2


The geometric mean is used when there are n growth coefficients; the individual values of the characteristic are then, as a rule, relative dynamics values constructed as chain values, i.e. as the ratio of each level in a time series to the previous level. The mean thus characterizes the average growth rate. The simple geometric mean is calculated by the formula

$$\bar{x}_{geom} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \sqrt[n]{\prod x_i}$$

The weighted geometric mean formula has the following form:

$$\bar{x}_{geom} = \sqrt[\sum f_i]{\prod x_i^{f_i}}$$

The above formulas are identical, but one is applied at current coefficients or growth rates, and the second - at absolute values ​​of series levels.

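The quadratic mean (mean square) is used in calculations involving values of quadratic functions, in particular to measure the degree of fluctuation of individual values of a characteristic around the arithmetic mean in a distribution series. It is calculated by the formula

$$\bar{x}_{q} = \sqrt{\frac{\sum x_i^2}{n}}$$

The weighted quadratic mean is calculated using another formula:

$$\bar{x}_{q} = \sqrt{\frac{\sum x_i^2 f_i}{\sum f_i}}$$

The cubic mean is used in calculations involving values of cubic functions and is calculated by the formula

$$\bar{x}_{cub} = \sqrt[3]{\frac{\sum x_i^3}{n}}$$

The weighted cubic mean:

$$\bar{x}_{cub} = \sqrt[3]{\frac{\sum x_i^3 f_i}{\sum f_i}}$$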

All the average values discussed above can be expressed by a general formula:

$$\bar{x} = \sqrt[k]{\frac{\sum x_i^k}{n}}$$

where $\bar{x}$ is the average value; $x_i$ is an individual value; n is the number of units of the population being studied; k is the exponent that determines the type of average.

When using the same source data, the larger k is in the general power mean formula, the larger the average value. It follows that there is a regular relationship between the values of power means:

$$\bar{x}_{harm} \le \bar{x}_{geom} \le \bar{x}_{arithm} \le \bar{x}_{q} \le \bar{x}_{cub}$$
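The growth of the power mean with its exponent k can be verified on a small data set. The sketch below is illustrative only: the data are hypothetical, and the case k = 0 is handled separately on the assumption (a standard result, not stated in the text) that the geometric mean is the limit of the power mean as k tends to zero.

```python
import math


def power_mean(values, k):
    """Power mean of positive values with exponent k (k = 0: geometric mean)."""
    n = len(values)
    if k == 0:
        # The limit k -> 0 of the power mean is the geometric mean.
        return math.exp(sum(math.log(x) for x in values) / n)
    return (sum(x ** k for x in values) / n) ** (1 / k)


data = [2.0, 4.0, 8.0]  # hypothetical positive values

# k = -1 harmonic, 0 geometric, 1 arithmetic, 2 quadratic, 3 cubic
means = [power_mean(data, k) for k in (-1, 0, 1, 2, 3)]
print(means)  # the values increase from left to right
```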

The average values described above give a generalized idea of the population being studied, and from this point of view their theoretical, applied and educational significance is indisputable. But the average value may not coincide with any of the options that actually exist. Therefore, in addition to the averages considered, it is advisable in statistical analysis to use the values of specific options that occupy a well-defined position in the ordered (ranked) series of attribute values. Among these quantities, the most commonly used are the structural, or descriptive, averages: the mode (Mo) and the median (Me).

The mode (Mo) is the value of a characteristic that occurs most often in a given population. For a variation series, the mode is the most frequently occurring value of the ranked series, that is, the option with the highest frequency. The mode can be used to determine which stores are visited most often or the most common price for a product. It shows the size of the characteristic typical of a significant part of the population and, for an interval series, is determined by the formula

$$Mo = x_0 + h \cdot \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})}$$

where x0 is the lower limit of the modal interval; h is the interval size; fm is the frequency of the modal interval; fm-1 is the frequency of the preceding interval; fm+1 is the frequency of the following interval.

The median (Me) is the option located in the center of the ranked series. The median divides the series into two equal parts, so that there is the same number of population units on either side of it. One half of the units in the population has a value of the varying characteristic less than the median, and the other half has a value greater than it. The median is used when we need the element whose value is greater than or equal to one half of the elements of a distribution series and at the same time less than or equal to the other half. The median gives a general idea of where the values of the attribute are concentrated, in other words, where their center is located.

The descriptive nature of the median is manifested in the fact that it characterizes the quantitative limit of the values of a varying characteristic possessed by half of the units in the population. Finding the median for a discrete variation series is easy. If all units of the series are given serial numbers, then the serial number of the median option is (n + 1) / 2 when the number of members n is odd. If the number of members of the series is even, the median is the average of the two options with serial numbers n / 2 and n / 2 + 1.

When determining the median in an interval variation series, first determine the interval in which it is located (the median interval). This is the first interval whose accumulated sum of frequencies equals or exceeds half the sum of all frequencies of the series. The median of an interval variation series is calculated using the formula

$$Me = x_0 + h \cdot \frac{\frac{1}{2}\sum f - S_{m-1}}{f_m}$$

where x0 is the lower limit of the median interval; h is the interval size; fm is the frequency of the median interval; Σf is the number of members of the series (the sum of all frequencies); Sm-1 is the accumulated sum of the frequencies of the intervals preceding the median one.
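The interval formulas for the mode and the median can be sketched as follows. The grouped data are hypothetical, the function names are illustrative, and equal interval widths are assumed.

```python
def interval_mode(lower_bounds, width, freqs):
    """Mode of an interval series with equal interval widths."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # modal interval
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                # preceding interval
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0   # following interval
    return lower_bounds[m] + width * (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next))


def interval_median(lower_bounds, width, freqs):
    """Median of an interval series with equal interval widths."""
    half = sum(freqs) / 2
    accumulated = 0                      # S_{m-1}: frequencies before the interval
    for x0, f in zip(lower_bounds, freqs):
        if f > 0 and accumulated + f >= half:
            return x0 + width * (half - accumulated) / f
        accumulated += f
    raise ValueError("the series has no observations")


# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50, 50-60
bounds = [10, 20, 30, 40, 50]
frequencies = [4, 10, 16, 8, 2]

mode_value = interval_mode(bounds, 10, frequencies)      # 30 + 10 * 6 / 14
median_value = interval_median(bounds, 10, frequencies)  # 30 + 10 * (20 - 14) / 16
```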

Along with the median, other values of options that occupy a well-defined position in the ranked series are used to characterize the structure of the population more fully. These include quartiles and deciles. Quartiles divide the series by the sum of frequencies into 4 equal parts, and deciles into 10 equal parts. There are three quartiles and nine deciles.

The median and mode, unlike the arithmetic mean, do not eliminate individual differences in the values ​​of a variable characteristic and therefore are additional and very important characteristics of the statistical population. In practice, they are often used instead of the average or along with it. It is especially advisable to calculate the median and mode in cases where the population under study contains a certain number of units with a very large or very small value of the varying characteristic. These values ​​of the options, which are not very characteristic of the population, while influencing the value of the arithmetic mean, do not affect the values ​​of the median and mode, which makes the latter very valuable indicators for economic and statistical analysis.

Variation indicators

The purpose of statistical research is to identify the basic properties and patterns of the statistical population being studied. In the process of summary processing of statistical observation data, they build distribution series. There are two types of distribution series - attributive and variational, depending on whether the characteristic taken as the basis for the grouping is qualitative or quantitative.

Distribution series constructed on a quantitative basis are called variational. The values of quantitative characteristics in individual units of the population are not constant; they differ more or less from each other. This difference in the value of a characteristic is called variation. The individual numerical values of a characteristic found in the population being studied are called variants. Variation among the units of the population is caused by the influence of a large number of factors on the level of the characteristic. Studying the nature and degree of variation of characteristics in individual units of the population is the most important issue of any statistical research. Variation indicators are used to describe the degree of variability of a characteristic.

Another important task of statistical research is to determine the role of individual factors or their groups in the variation of certain characteristics of the population. To solve this problem, statistics uses special methods for studying variation, based on the use of a system of indicators with which variation is measured. In practice, a researcher is faced with a fairly large number of variants of attribute values, which does not give an idea of ​​the distribution of units by attribute value in the aggregate. To do this, arrange all variants of characteristic values ​​in ascending or descending order. This process is called ranking the series. The ranked series immediately gives a general idea of ​​the values ​​that the feature takes in the aggregate.

The insufficiency of the average value for an exhaustive description of the population forces us to supplement the average values ​​with indicators that allow us to assess the typicality of these averages by measuring the variability (variation) of the characteristic being studied. The use of these indicators of variation makes it possible to make statistical analysis more complete and meaningful and thereby gain a deeper understanding of the essence of the social phenomena being studied.

The simplest indicators of variation are the minimum and maximum: the smallest and largest values of the attribute in the population. The number of repetitions of an individual variant of the characteristic is called its frequency. Let us denote the frequency of the i-th attribute value by fi; the sum of the frequencies, equal to the volume of the population being studied, is then

$$\sum_{i=1}^{k} f_i = n$$

where k is the number of distinct attribute values. It is convenient to replace the frequencies with relative frequencies wi. A relative frequency can be expressed as a fraction of one or as a percentage, and it makes it possible to compare variation series with different numbers of observations. Formally we have:

$$w_i = \frac{f_i}{\sum_{i=1}^{k} f_i}, \qquad \sum_{i=1}^{k} w_i = 1$$

To measure the variation of a characteristic, various absolute and relative indicators are used. Absolute indicators of variation include mean linear deviation, range of variation, dispersion, and standard deviation.

The range of variation (R) is the difference between the maximum and minimum values of the attribute in the population being studied: R = Xmax - Xmin. This indicator gives only the most general idea of the variability of the characteristic, since it shows the difference only between the extreme values of the options. It is completely unrelated to the frequencies in the variation series, i.e. to the nature of the distribution, and its dependence only on the extreme values of the characteristic can make it unstable and random. The range of variation provides no information about the features of the populations under study and does not allow us to assess how typical the obtained average values are. The scope of application of this indicator is limited to fairly homogeneous populations; variation is characterized more precisely by an indicator that takes into account the variability of all values of the characteristic.

To characterize the variation of a characteristic, it is necessary to generalize the deviations of all values from some value typical for the population being studied. Indicators of variation such as the average linear deviation, dispersion and standard deviation are based on the deviations of the characteristic values of individual units of the population from the arithmetic mean.

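The average linear deviation is the arithmetic mean of the absolute values of the deviations of individual options from their arithmetic mean:

$$\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}, \qquad \bar{d} = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i}$$

where $|x_i - \bar{x}|$ is the absolute value (modulus) of the deviation of a variant from the arithmetic mean and f is the frequency.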

The first formula is applied if each of the options occurs in the aggregate only once, and the second - in series with unequal frequencies.

There is another way of averaging the deviations of options from the arithmetic mean. This very common method in statistics comes down to calculating the squared deviations of the options from the average value with their subsequent averaging. In this case, we obtain a new indicator of variation - dispersion.

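Dispersion (σ²), also called variance, is the average of the squared deviations of the attribute values from their average value:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}, \qquad \sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$$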

The second formula is applied if the options have their own weights (or frequencies of the variation series).

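In economic and statistical analysis, the variation of a characteristic is most often evaluated using the standard deviation. The standard deviation (σ) is the square root of the variance:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}, \qquad \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$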

Average linear and standard deviations show how much the value of a characteristic fluctuates on average among units of the population under study, and are expressed in the same units of measurement as the options.

In statistical practice, there is often a need to compare the variation of different characteristics. For example, it is of great interest to compare the variation in the age of personnel and their qualifications, in length of service and wages, and so on. The absolute indicators of variability, the average linear deviation and the standard deviation, are not suitable for such comparisons. It is in fact impossible to compare the fluctuation of length of service, expressed in years, with the fluctuation of wages, expressed in rubles and kopecks.

When comparing the variability of different characteristics, it is convenient to use relative measures of variation. These indicators are calculated as the ratio of an absolute indicator to the arithmetic mean (or median). Using the range of variation, the average linear deviation, and the standard deviation as the absolute indicator, we obtain the relative indicators of variability: the coefficient of oscillation, the linear coefficient of variation, and the coefficient of variation:

$$V_R = \frac{R}{\bar{x}} \cdot 100\%, \qquad V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%, \qquad V_\sigma = \frac{\sigma}{\bar{x}} \cdot 100\%$$

The coefficient of variation is the most commonly used indicator of relative variability and characterizes the homogeneity of the population. The population is considered homogeneous if the coefficient of variation does not exceed 33% for distributions close to normal.
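The absolute and relative indicators of variation for a series with frequencies can be computed together. The following is a minimal sketch with hypothetical data; the function and key names are illustrative.

```python
import math


def variation_indicators(values, freqs):
    """Weighted variation indicators for options `values` with frequencies `freqs`."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    linear_deviation = sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n
    variance = sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n
    std_deviation = math.sqrt(variance)
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "linear_deviation": linear_deviation,
        "variance": variance,
        "std_deviation": std_deviation,
        # relative indicator, in percent of the arithmetic mean
        "coefficient_of_variation": std_deviation / mean * 100,
    }


# Hypothetical series: the option 4 occurs twice, 2 and 6 once each
result = variation_indicators([2, 4, 6], [1, 2, 1])
print(result)
```

With these data the coefficient of variation is about 35%, slightly above the 33% homogeneity threshold mentioned above.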

In most cases, data is concentrated around some central point. Thus, to describe any set of data, it is enough to indicate the average value. Let us consider sequentially three numerical characteristics that are used to estimate the average value of the distribution: arithmetic mean, median and mode.

Average

The arithmetic mean (often called simply the mean) is the most common estimate of the center of a distribution. It is the result of dividing the sum of all observed numerical values by their number. For a sample consisting of the numbers X1, X2, …, Xn, the sample mean (denoted by $\bar{X}$) equals $\bar{X}$ = (X1 + X2 + … + Xn) / n, or

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

where $\bar{X}$ is the sample mean, n is the sample size, and Xi is the i-th element of the sample.


Consider calculating the arithmetic mean of the five-year average annual returns of 15 mutual funds with a very high level of risk (Fig. 1).

Fig. 1. Average annual returns of 15 very high-risk mutual funds

The sample mean is the sum of the 15 returns divided by 15, which gives 6.08.

This is a good return, especially compared to the 3-4% return that bank or credit union depositors received over the same time period. If we sort the returns, it is easy to see that eight funds have returns above the average, and seven - below the average. The arithmetic mean acts as the equilibrium point, so that funds with low returns balance out funds with high returns. All elements of the sample are involved in calculating the average. None of the other estimates of the mean of a distribution have this property.

When should you calculate the arithmetic mean? Since the arithmetic mean depends on all elements of the sample, the presence of extreme values significantly affects the result. In such situations, the arithmetic mean can distort the meaning of the numerical data. Therefore, when describing a data set that contains extreme values, report either the median alone or both the arithmetic mean and the median. For example, if we remove the RS Emerging Growth fund's return from the sample, the sample mean of the remaining 14 funds' returns decreases by almost one percentage point, to 5.19%.

Median

The median is the middle value of an ordered array of numbers. If the array contains no repeated numbers, half of its elements are less than the median and half are greater. If the sample contains extreme values, it is better to use the median rather than the arithmetic mean to estimate the center. To calculate the median of a sample, the sample must first be ordered. The median is then the element with rank

$$\text{Median} = \frac{n + 1}{2}\text{-th element of the ordered sample}$$

How this formula is applied depends on whether the sample size n is even or odd:

  • If the sample contains an odd number of elements, the median is (n+1)/2-th element.
  • If the sample contains an even number of elements, the median lies between the two middle elements of the sample and is equal to the arithmetic mean calculated over these two elements.
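The two cases above can be sketched in Python as follows. The function name and the data are illustrative.

```python
def median(sample):
    """Median of a sample: sort first, then take the middle element(s)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample is empty")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: element number (n + 1) / 2, counting from 1
        return ordered[middle]
    # Even n: arithmetic mean of the two middle elements
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median([7, 1, 5])       # ordered: 1, 5, 7
even_case = median([4, 1, 3, 2])   # ordered: 1, 2, 3, 4
```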

To calculate the median of the sample containing the returns of 15 very high-risk mutual funds, you first need to sort the raw data (Fig. 2). The median is then the middle element of the sorted sample, in our example element No. 8, since (15 + 1) / 2 = 8. Excel has a special function =MEDIAN() that also works with unordered arrays.

Fig. 2. Median of 15 funds

Thus, the median is 6.5. This means that the return on one half of the very high-risk funds does not exceed 6.5, and the return on the other half exceeds it. Note that the median of 6.5 is not much larger than the mean of 6.08.

If we remove the return of the RS Emerging Growth fund from the sample, then the median of the remaining 14 funds decreases to 6.2%, that is, not as significantly as the arithmetic mean (Figure 3).

Fig. 3. Median of 14 funds

Mode

The term was introduced by Pearson in 1894. The mode is the number that occurs most often in a sample (the most "fashionable" one). The mode describes well, for example, the typical reaction of drivers to a traffic light signal to stop. A classic example of the use of the mode is the choice of shoe size or wallpaper color. If a distribution has several modes, it is said to be multimodal (it has two or more "peaks"). A multimodal distribution gives important information about the nature of the variable being studied. For example, in sociological surveys, if a variable represents a preference or attitude towards something, multimodality may mean that there are several distinctly different opinions. Multimodality also indicates that the sample is not homogeneous and that the observations may have been generated by two or more overlapping distributions. Unlike the arithmetic mean, the mode is not affected by outliers. For continuously distributed random variables, such as the average annual return of mutual funds, the mode sometimes does not exist (or makes no sense) at all: since such indicators can take on very different values, repeated values are extremely rare.
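The mode and multimodality can be illustrated with the standard Python `statistics` module. The shoe-size data are hypothetical.

```python
import statistics

# Hypothetical shoe sizes sold during one day
shoe_sizes = [38, 40, 40, 41, 42, 42, 43]

# multimode returns every value that shares the highest frequency,
# in the order of first occurrence
modes = statistics.multimode(shoe_sizes)
is_multimodal = len(modes) > 1
print(modes, is_multimodal)
```

Here two sizes occur equally often, so the sample has two modes.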

Quartiles

Quartiles are the metrics most often used to evaluate the distribution of data when describing the properties of large numerical samples. While the median splits the ordered array in half (50% of the array's elements are less than the median and 50% are greater), quartiles split the ordered data set into four parts. The values ​​of Q 1 , median and Q 3 are the 25th, 50th and 75th percentiles, respectively. The first quartile Q 1 is a number that divides the sample into two parts: 25% of the elements are less than, and 75% are greater than, the first quartile.

The third quartile Q3 is a number that also divides the sample into two parts: 75% of the elements are less than, and 25% are greater than, the third quartile.

To calculate quartiles in Excel 2007 and earlier, use the =QUARTILE(array,quart) function. Starting with Excel 2010, two functions are available:

  • =QUARTILE.INC(array,quart)
  • =QUARTILE.EXC(array,quart)

These two functions give slightly different values (Fig. 4). For example, for the sample containing the average annual returns of 15 very high-risk mutual funds, Q1 = 1.8 or –0.7 for QUARTILE.INC and QUARTILE.EXC, respectively. The older QUARTILE function corresponds to the modern QUARTILE.INC function. To calculate quartiles in Excel using these functions, the data array does not need to be ordered.
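The difference between an inclusive and an exclusive quartile calculation can also be seen outside Excel. The sketch below uses `statistics.quantiles` from the Python standard library with hypothetical data; treating its "inclusive" and "exclusive" methods as counterparts of the two Excel functions is an assumption made here for illustration.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample; need not be sorted

# Each call returns the three cut points Q1, Q2 (median), Q3
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")

print(q_inclusive)
print(q_exclusive)
```

The median is the same in both cases, while Q1 and Q3 differ slightly.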

Fig. 4. Calculating quartiles in Excel

Let us emphasize once more: Excel can calculate quartiles for a univariate discrete series containing the values of a random variable. The calculation of quartiles for a frequency-based distribution is covered in a later section.

Geometric mean

Unlike the arithmetic mean, the geometric mean allows you to estimate the degree of change in a variable over time. The geometric mean is the n-th root of the product of n quantities (in Excel, the =GEOMEAN function is used):

$$G = (X_1 \cdot X_2 \cdots X_n)^{1/n}$$

A similar parameter, the geometric mean rate of return, is determined by the formula:

$$G = [(1 + R_1) \cdot (1 + R_2) \cdots (1 + R_n)]^{1/n} - 1$$

where Ri is the rate of return for the i-th time period.

For example, suppose the initial investment is $100,000. By the end of the first year it falls to $50,000, and by the end of the second year it recovers to the initial level of $100,000. The rate of return on this investment over the two-year period equals 0, since the initial and final amounts are equal. However, the arithmetic mean of the annual rates of return is (–0.5 + 1) / 2 = 0.25, or 25%, since the rate of return in the first year is R1 = (50,000 – 100,000) / 100,000 = –0.5, and in the second year R2 = (100,000 – 50,000) / 50,000 = 1. At the same time, the geometric mean rate of return for the two years is G = [(1 – 0.5) * (1 + 1)]^(1/2) – 1 = 1^(1/2) – 1 = 1 – 1 = 0. Thus, the geometric mean reflects the change (more precisely, the absence of change) in the value of the investment over the two-year period more accurately than the arithmetic mean.
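The investment example can be reproduced in a few lines of Python. The function names are illustrative.

```python
import math


def arithmetic_mean_return(returns):
    """Plain average of the per-period rates of return."""
    return sum(returns) / len(returns)


def geometric_mean_return(returns):
    """[(1 + R_1) * ... * (1 + R_n)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


annual_returns = [-0.5, 1.0]  # year 1: -50%, year 2: +100% (from the example)

print(arithmetic_mean_return(annual_returns))  # 0.25
print(geometric_mean_return(annual_returns))   # 0.0
```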

Interesting facts. First, the geometric mean of positive numbers is always less than the arithmetic mean of the same numbers, except when all the numbers are equal to each other, in which case the two means coincide. Second, the properties of a right triangle show why this mean is called geometric. The altitude of a right triangle dropped onto the hypotenuse is the mean proportional between the projections of the legs onto the hypotenuse, and each leg is the mean proportional between the hypotenuse and its own projection onto the hypotenuse (Fig. 5). This gives a geometric way to construct the geometric mean of two segments: construct a circle with the sum of the two segments as its diameter; the perpendicular raised from the point where the segments join, up to its intersection with the circle, gives the desired length:

Fig. 5. Geometric nature of the geometric mean (figure from Wikipedia)

The second important property of numerical data is their variation, which characterizes the degree of dispersion of the data. Two different samples may differ in both their means and their variation. However, as shown in Fig. 6 and 7, two samples may have the same variation but different means, or the same means and completely different variation. The data corresponding to polygon B in Fig. 7 vary much less than the data on which polygon A was constructed.

Fig. 6. Two symmetrical bell-shaped distributions with the same spread and different mean values

Fig. 7. Two symmetrical bell-shaped distributions with the same mean values and different spreads

There are five estimates of data variation:

  • range,
  • interquartile range,
  • variance,
  • standard deviation,
  • coefficient of variation.

Range

The range is the difference between the largest and smallest elements of the sample:

Range = XMax – XMin

The range of a sample containing the average annual returns of 15 very high-risk mutual funds can be calculated using the ordered array (see Figure 4): Range = 18.5 – (–6.1) = 24.6. This means that the difference between the highest and lowest average annual returns of very high-risk funds is 24.6%.

Range measures the overall spread of data. Although sample range is a very simple estimate of the overall spread of the data, its weakness is that it does not take into account exactly how the data are distributed between the minimum and maximum elements. This effect is clearly visible in Fig. 8, which illustrates samples having the same range. Scale B demonstrates that if a sample contains at least one extreme value, the sample range is a very imprecise estimate of the spread of the data.

Fig. 8. Comparison of three samples with the same range; the triangle symbolizes the support of the scale, and its location corresponds to the sample mean

Interquartile range

The interquartile range, or midspread, is the difference between the third and first quartiles of the sample:

Interquartile range = Q 3 – Q 1

This value allows us to estimate the scatter of 50% of the elements and not take into account the influence of extreme elements. The interquartile range of a sample containing the average annual returns of 15 very high-risk mutual funds can be calculated using the data in Fig. 4 (for example, for the QUARTILE.EXC function): Interquartile range = 9.8 – (–0.7) = 10.5. The interval bounded by the numbers 9.8 and -0.7 is often called the middle half.

It should be noted that the values ​​of Q 1 and Q 3 , and hence the interquartile range, do not depend on the presence of outliers, since their calculation does not take into account any value that would be less than Q 1 or greater than Q 3 . Summary measures such as the median, first and third quartiles, and interquartile range that are not affected by outliers are called robust measures.
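The robustness of the interquartile range, as opposed to the range, can be demonstrated on a small example. The data are hypothetical, and the inclusive quartile method of Python's `statistics.quantiles` is an arbitrary choice made for this sketch.

```python
import statistics


def value_range(sample):
    """Difference between the largest and smallest elements."""
    return max(sample) - min(sample)


def interquartile_range(sample):
    """Q3 - Q1, using the inclusive quartile method."""
    q1, _median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return q3 - q1


regular = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # largest element replaced by an outlier

print(value_range(regular), value_range(with_outlier))
print(interquartile_range(regular), interquartile_range(with_outlier))
```

The range changes from 8 to 99, while the interquartile range stays the same.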

Although the range and the interquartile range provide estimates of the overall and middle spread of a sample, respectively, neither takes into account exactly how the data are distributed. Variance and standard deviation are free of this drawback. These indicators assess the degree to which the data fluctuate around the mean. The sample variance is approximately the arithmetic mean of the squared differences between each sample element and the sample mean. For a sample X1, X2, …, Xn, the sample variance (denoted by S²) is the sum of the squared differences between the sample elements and the sample mean, divided by the sample size minus one:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

where $\bar{X}$ is the arithmetic mean, n is the sample size, and Xi is the i-th element of the sample X. In Excel 2007 and earlier, the =VAR() function was used to calculate the sample variance; since version 2010, the =VAR.S() function is used.

The most practical and widely accepted estimate of the spread of data is the sample standard deviation. It is denoted by the symbol S and equals the square root of the sample variance:

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

In Excel 2007 and earlier, the =STDEV() function was used to calculate the sample standard deviation; since version 2010, the =STDEV.S() function is used. The data array passed to these functions does not need to be ordered.
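The sample variance and sample standard deviation can be computed directly from their definitions and compared with the functions of Python's `statistics` module. The data are hypothetical and the function name is illustrative.

```python
import math
import statistics


def sample_variance(sample):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample, mean 5

s2 = sample_variance(data)   # 32 / 7
s = math.sqrt(s2)

# The library functions use the same n - 1 denominator
print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```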

Neither the sample variance nor the sample standard deviation can be negative. The only situation in which the indicators S 2 and S can be zero is if all elements of the sample are equal to each other. In this completely improbable case, the range and interquartile range are also zero.

Numerical data is inherently variable. Any variable can take many different values. For example, different mutual funds have different rates of return and loss. Because numerical data vary, it is very important to study not only estimates of the mean, which are summary in nature, but also estimates of variation, which characterize the spread of the data.

Variance and standard deviation allow you to evaluate the spread of data around the mean, in other words, how far the sample elements above the mean and below the mean deviate from it. Variance has some valuable mathematical properties. However, its value is expressed in squared units of measurement: square percent, square dollars, square inches, etc. Therefore, the natural measure of spread is the standard deviation, which is expressed in the original units: percent of return, dollars, or inches.

Standard deviation allows you to estimate the amount of variation of sample elements around the average value. In almost all situations, the majority of observed values ​​lie within the range of plus or minus one standard deviation from the mean. Consequently, knowing the arithmetic mean of the sample elements and the standard sample deviation, it is possible to determine the interval to which the bulk of the data belongs.

The standard deviation of returns for the 15 very high-risk mutual funds is 6.6 (Fig. 9). This means that the return of the bulk of the funds differs from the mean by no more than 6.6% (i.e. it lies in the range from $\bar{X}$ – S = 6.2 – 6.6 = –0.4 to $\bar{X}$ + S = 12.8). In fact, the five-year average annual return of 53.3% (8 out of 15) of the funds lies within this range.

Fig. 9. Sample standard deviation

Note that when summing the squared differences, sample items that are further away from the mean are weighted more heavily than items that are closer to the mean. This property is the main reason why the arithmetic mean is most often used to estimate the mean of a distribution.

The coefficient of variation

Unlike the previous estimates of spread, the coefficient of variation is a relative estimate. It is always measured as a percentage and not in the units of the original data. The coefficient of variation, denoted by CV, measures the dispersion of the data relative to the mean. It equals the standard deviation divided by the arithmetic mean and multiplied by 100%:

$$CV = \frac{S}{\bar{X}} \cdot 100\%$$

where S is the sample standard deviation and $\bar{X}$ is the sample mean.

The coefficient of variation allows you to compare two samples whose elements are expressed in different units of measurement. For example, the manager of a mail delivery service intends to renew the fleet of trucks. When loading packages, there are two restrictions to consider: the weight (in pounds) and the volume (in cubic feet) of each package. Suppose that in a sample of 200 packages, the mean weight is 26.0 pounds, the standard deviation of weight is 3.9 pounds, the mean volume is 8.8 cubic feet, and the standard deviation of volume is 2.2 cubic feet. How can the variation in the weight and volume of the packages be compared?

Since weight and volume are measured in different units, the manager must compare the relative spread of these quantities. The coefficient of variation of weight is CVW = 3.9 / 26.0 * 100% = 15%, and the coefficient of variation of volume is CVV = 2.2 / 8.8 * 100% = 25%. Thus, the relative variation in the volume of the packages is much greater than the relative variation in their weight.
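The package example can be checked with a short calculation. The function name is illustrative.

```python
def coefficient_of_variation(std_deviation, mean):
    """Standard deviation as a percentage of the mean."""
    return std_deviation / mean * 100


cv_weight = coefficient_of_variation(3.9, 26.0)   # pounds
cv_volume = coefficient_of_variation(2.2, 8.8)    # cubic feet

# Rounded to hide floating-point noise
print(round(cv_weight, 2), round(cv_volume, 2))
```

Both results are percentages, so they can be compared directly even though the original units differ.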

Distribution form

The third important property of a sample is the shape of its distribution. This distribution may be symmetrical or asymmetrical. To describe the shape of a distribution, it is necessary to calculate its mean and median. If the two are the same, the variable is considered symmetrically distributed. If the mean value of a variable is greater than the median, its distribution has a positive skewness (Fig. 10). If the median is greater than the mean, the distribution of the variable is negatively skewed. Positive skewness occurs when the mean increases to unusually high values. Negative skewness occurs when the mean decreases to unusually small values. A variable is symmetrically distributed if it does not take any extreme values ​​in either direction, so that large and small values ​​of the variable cancel each other out.

Fig. 10. Three types of distributions

The data shown on scale A are negatively skewed. The figure shows a long tail and a skew to the left caused by the presence of unusually small values. These extremely small values shift the mean to the left, making it less than the median. The data shown on scale B are distributed symmetrically. The left and right halves of the distribution are mirror images of each other. Large and small values balance each other, and the mean and median are equal. The data shown on scale C are positively skewed. The figure shows a long tail and a skew to the right caused by the presence of unusually high values. These extremely large values shift the mean to the right, making it larger than the median.
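The mean-versus-median comparison described above can be sketched in Python as follows. The helper `skew_direction` and the three small data sets are hypothetical and exist only to illustrate the rule; real data would rarely give exactly equal mean and median, hence the tolerance argument.

```python
from statistics import mean, median

def skew_direction(values, tolerance=1e-9):
    """Classify the shape of a data set by comparing its mean and median."""
    m, md = mean(values), median(values)
    if abs(m - md) <= tolerance:
        return "symmetric"
    # Mean above the median: unusually high values pull the mean to the right.
    return "positive" if m > md else "negative"

# Hypothetical samples, one for each scale in the figure.
left_tailed = [-20, 1, 2, 2, 3, 3, 3, 4]   # one unusually small value
balanced = [1, 2, 3, 4, 5]                 # mirror image around 3
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # one unusually large value

for name, sample in [("A", left_tailed), ("B", balanced), ("C", right_tailed)]:
    print(name, skew_direction(sample))
```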

In Excel, descriptive statistics can be obtained using the Analysis ToolPak add-in. Go to Data → Data Analysis, select Descriptive Statistics in the window that opens, and click OK. In the Descriptive Statistics window, be sure to specify the Input Range (Fig. 11). If you want to see the descriptive statistics on the same sheet as the original data, select the Output Range radio button and specify the cell where the upper left corner of the output should be placed (in our example, $C$1). If you want to output the data to a new sheet or a new workbook, simply select the corresponding radio button. Check the box next to Summary statistics. If desired, you can also select Confidence Level for Mean, Kth Smallest, and Kth Largest.

If you do not see the Data Analysis icon in the Analysis group on the Data tab, you first need to install the Analysis ToolPak add-in.

Fig. 11. Descriptive statistics of the five-year average annual returns of funds with very high levels of risk, calculated using the Excel Data Analysis add-in

Excel calculates a number of the statistics discussed above: mean, median, mode, standard deviation, variance, range, minimum, maximum, and sample size (count). Excel also calculates some statistics that are new to us: standard error, kurtosis, and skewness. The standard error is equal to the standard deviation divided by the square root of the sample size. Skewness characterizes the deviation of the distribution from symmetry and is a function of the cubed differences between the sample elements and the mean. Kurtosis is a measure of the relative concentration of data around the mean compared with the tails of the distribution and depends on the differences between the sample elements and the mean raised to the fourth power.
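The three new statistics can be sketched in Python as shown here. The standard error follows the definition in the text. For skewness and kurtosis the text only says they depend on third and fourth powers of the deviations; the sketch assumes the bias-adjusted sample formulas that Excel's SKEW and KURT functions are documented to use, so treat the exact coefficients as an assumption. The function names and the data are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return stdev(values) / sqrt(len(values))

def sample_skewness(values):
    """Bias-adjusted skewness (assumed to match Excel's SKEW); needs n >= 3."""
    n = len(values)
    m, s = mean(values), stdev(values)
    cubes = sum(((x - m) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes

def sample_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis (assumed to match Excel's KURT); needs n >= 4."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourths = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourths - correction

data = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric sample
print("standard error:", standard_error(data))
print("skewness:", sample_skewness(data))
print("kurtosis:", sample_excess_kurtosis(data))
```

For a symmetric sample the cubed deviations cancel, so the skewness is zero up to rounding; a negative kurtosis value indicates a flatter shape than a bell curve.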

Calculating descriptive statistics for a population

The mean, spread, and shape of the distribution discussed above are characteristics determined from a sample. However, if the data set contains numerical measurements of the entire population, its parameters can be calculated. These parameters include the expected value, variance, and standard deviation of the population.

The expected value is equal to the sum of all values in the population divided by the size of the population:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $\mu$ is the expected value, $X_i$ is the $i$-th observation of the variable $X$, and $N$ is the size of the population. In Excel, the expected value is calculated with the same function as the arithmetic mean: =AVERAGE().

The population variance is equal to the sum of the squared differences between the elements of the population and the expected value, divided by the size of the population:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where $\sigma^2$ is the population variance. In Excel up to version 2007, the population variance is calculated with the function =VARP(); starting with version 2010, with =VAR.P().

The population standard deviation is equal to the square root of the population variance:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

In Excel up to version 2007, the population standard deviation is calculated with the function =STDEVP(); starting with version 2010, with =STDEV.P(). Note that the formulas for the population variance and standard deviation differ from the formulas for the sample variance and standard deviation. When calculating the sample statistics S² and S, the denominator of the fraction is n – 1, whereas when calculating the parameters σ² and σ, it is the size of the population N.
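The difference between the two denominators is easy to see in a short Python sketch that uses the standard library `statistics` module, where `pvariance` and `pstdev` divide by N and `variance` and `stdev` divide by n – 1. The eight values are hypothetical and are treated first as a whole population and then as a sample.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, N = n = 8

mu = mean(values)           # expected value of the population
sigma2 = pvariance(values)  # population variance: divides by N
sigma = pstdev(values)      # population standard deviation
s2 = variance(values)       # sample variance: divides by n - 1
s = stdev(values)           # sample standard deviation

print(f"mu = {mu}, sigma^2 = {sigma2}, sigma = {sigma}")
print(f"S^2 = {s2:.4f}, S = {s:.4f}")
```

The sample statistics come out larger than the population parameters for the same numbers, because the sum of squared deviations is divided by a smaller denominator.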

Rule of thumb

In most situations, a large proportion of observations are concentrated around the median, forming a cluster. In data sets with positive skewness, this cluster is located to the left of (i.e., below) the mathematical expectation, and in sets with negative skewness, it is located to the right of (i.e., above) the mathematical expectation. For symmetric data, the mean and median are the same, and observations cluster around the mean, forming a bell-shaped distribution. If the distribution is not clearly skewed and the data are concentrated around a center of gravity, a rule of thumb can be used to estimate variability: if the data have a bell-shaped distribution, then approximately 68% of the observations lie within one standard deviation of the mathematical expectation, approximately 95% lie within two standard deviations, and approximately 99.7% lie within three standard deviations.

Thus, the standard deviation, which estimates the average variation around the mathematical expectation, helps to understand how observations are distributed and to identify outliers. According to the rule of thumb, for bell-shaped distributions only one value in twenty differs from the mathematical expectation by more than two standard deviations. Therefore, values outside the interval µ ± 2σ can be considered outliers. In addition, only three out of 1000 observations differ from the mathematical expectation by more than three standard deviations. Thus, values outside the interval µ ± 3σ are almost always outliers. For distributions that are highly skewed or not bell-shaped, the Bienaymé-Chebyshev rule can be applied.
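A minimal Python sketch of this outlier check is given here. It assumes the data are a complete population, so it uses the population standard deviation; the function name `outside_k_sigma` and the ten values are hypothetical.

```python
from statistics import mean, pstdev

def outside_k_sigma(values, k=2.0):
    """Return the values lying outside the interval mu ± k*sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [x for x in values if abs(x - mu) > k * sigma]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]  # hypothetical observations
print("outside mu ± 2 sigma:", outside_k_sigma(data, k=2))
```

Keep in mind that the rule presupposes a roughly bell-shaped distribution, and that in a small data set a single extreme value inflates σ itself, which makes the 3σ threshold harder to exceed.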

More than a hundred years ago, the mathematicians Bienaymé and Chebyshev independently discovered a useful property of the standard deviation. They found that for any data set, regardless of the shape of the distribution, the percentage of observations lying within k standard deviations of the mathematical expectation is at least (1 – 1/k²) · 100%.

For example, if k = 2, the Bienaymé-Chebyshev rule states that at least (1 – (1/2)²) · 100% = 75% of observations must lie in the interval µ ± 2σ. The rule holds for any k greater than one. The Bienaymé-Chebyshev rule is very general and valid for distributions of any type. It gives the minimum share of observations whose distance from the mathematical expectation does not exceed a specified value. However, if the distribution is bell-shaped, the rule of thumb estimates the concentration of data around the mathematical expectation more accurately.
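The bound is simple enough to tabulate with a few lines of Python. The function name `chebyshev_min_share` is an illustrative choice.

```python
def chebyshev_min_share(k):
    """Minimum percentage of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("the rule applies only for k greater than one")
    return (1 - 1 / k ** 2) * 100

for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_min_share(k):.2f}% of observations")
```

Comparing the k = 2 result with the roughly 95% given by the rule of thumb shows how much weaker the guarantee becomes once no assumption is made about the shape of the distribution.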

Calculating Descriptive Statistics for a Frequency-Based Distribution

If the original data are not available, the frequency distribution becomes the only source of information. In such situations, it is possible to calculate approximate values of quantitative indicators of the distribution, such as the arithmetic mean, standard deviation, and quartiles.

If sample data are represented as a frequency distribution, an approximation of the arithmetic mean can be calculated by assuming that all values within each class are concentrated at the class midpoint:

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$$

where $\bar{X}$ is the sample mean, $n$ is the number of observations (the sample size), $c$ is the number of classes in the frequency distribution, $m_j$ is the midpoint of the $j$-th class, and $f_j$ is the frequency of the $j$-th class.

To calculate the standard deviation from a frequency distribution, it is also assumed that all values within each class are concentrated at the class midpoint.
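Both approximations can be sketched in Python as follows. The mean follows the midpoint formula described above. For the standard deviation, the sketch assumes the usual grouped-data form, the square root of Σ (m_j – X̄)² f_j divided by n – 1, which matches the sample denominator used earlier in this post but is not spelled out in the text. The function name and the three classes are hypothetical.

```python
from math import sqrt

def grouped_mean_and_std(midpoints, frequencies):
    """Approximate the sample mean and standard deviation from class data."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    # Every observation in a class is treated as if it sat at the midpoint.
    squares = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return mean, sqrt(squares / (n - 1))  # n - 1: sample denominator (assumption)

# Hypothetical classes 0-10, 10-20, 20-30 with their frequencies.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

approx_mean, approx_std = grouped_mean_and_std(midpoints, frequencies)
print(f"mean ≈ {approx_mean}, standard deviation ≈ {approx_std:.2f}")
```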

To understand how quartiles of a series are determined based on frequencies, consider the calculation of the lower quartile based on data for 2013 on the distribution of the Russian population by average per capita monetary income (Fig. 12).

Fig. 12. Share of the Russian population by average per capita cash income per month, rubles

To calculate the first quartile of an interval variation series, you can use the formula:

$$Q_1 = x_{Q_1} + i \cdot \frac{\tfrac{1}{4}\sum f - S_{Q_1 - 1}}{f_{Q_1}}$$

where $Q_1$ is the value of the first quartile; $x_{Q_1}$ is the lower limit of the interval containing the first quartile (the interval is identified by the accumulated frequency that first exceeds 25%); $i$ is the width of the interval; $\sum f$ is the sum of the frequencies over the entire sample (equal to 100% when the frequencies are expressed as percentage shares); $S_{Q_1 - 1}$ is the accumulated frequency of the interval preceding the one that contains the lower quartile; and $f_{Q_1}$ is the frequency of the interval containing the lower quartile. The formula for the third quartile differs only in that Q3 is used everywhere instead of Q1, and ¾ is substituted for ¼.

In our example (Fig. 12), the lower quartile is in the range 7000.1 – 10,000, the accumulated frequency of which is 26.4%. The lower limit of this interval is 7000 rubles, the value of the interval is 3000 rubles, the accumulated frequency of the interval preceding the interval containing the lower quartile is 13.4%, the frequency of the interval containing the lower quartile is 13.0%. Thus: Q1 = 7000 + 3000 * (¼ * 100 – 13.4) / 13 = 9677 rub.
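The same calculation can be written as a small Python function. The name `grouped_quantile` and its parameter names are illustrative; the numbers passed in are the ones quoted in the example, and frequencies are assumed to be percentage shares that sum to 100.

```python
def grouped_quantile(lower_limit, width, cum_before, freq, share, total=100.0):
    """Interpolate a quantile inside the interval that contains it.

    lower_limit: lower limit of the interval containing the quantile
    width:       width of that interval
    cum_before:  accumulated frequency of the preceding interval
    freq:        frequency of the interval containing the quantile
    share:       0.25 for the first quartile, 0.75 for the third
    total:       sum of all frequencies (100 when given in percent)
    """
    return lower_limit + width * (share * total - cum_before) / freq

q1 = grouped_quantile(lower_limit=7000, width=3000,
                      cum_before=13.4, freq=13.0, share=0.25)
print(f"Q1 ≈ {q1:.0f} rubles")
```

Passing `share=0.75` together with the limits and frequencies of the interval that contains the third quartile gives Q3 in the same way.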

Pitfalls Associated with Descriptive Statistics

In this post, we looked at how to describe a data set using various statistics that evaluate its mean, spread, and distribution shape. The next step is data analysis and interpretation. Until now, we have studied the objective properties of data, and now we move on to their subjective interpretation. Two errors lie in wait for the researcher: an incorrectly chosen subject of analysis and an incorrect interpretation of the results.

The analysis of the returns of 15 very high-risk mutual funds is quite unbiased. It led to completely objective conclusions: all the mutual funds have different returns, the fund returns range from -6.1 to 18.5, and the average return is 6.08. The objectivity of data analysis is ensured by the correct choice of summary quantitative indicators of the distribution. Several methods for estimating the mean and spread of data were considered, and their advantages and disadvantages were indicated. How do you choose the right statistics to provide an objective and impartial analysis? If the data distribution is slightly skewed, should you choose the median rather than the mean? Which indicator characterizes the spread of the data more accurately: the standard deviation or the range? Should we point out that the distribution is positively skewed?

On the other hand, data interpretation is a subjective process. Different people come to different conclusions when interpreting the same results. Everyone has their own point of view. Some consider the average annual returns of the 15 funds with a very high level of risk to be good and are quite satisfied with the income received. Others may feel that these funds have returns that are too low. Thus, subjectivity should be compensated for by honesty, neutrality, and clarity of conclusions.

Ethical issues

Data analysis is inextricably linked to ethical issues. You should be critical of information disseminated by newspapers, radio, television and the Internet. Over time, you will learn to be skeptical not only of the results, but also of the goals, subject matter and objectivity of the research. The famous British politician Benjamin Disraeli said it best: “There are three kinds of lies: lies, damned lies and statistics.”

As noted earlier, ethical issues arise when choosing which results should be presented in a report. You should publish both positive and negative results. In addition, when giving an oral or written report, the results must be presented honestly, neutrally, and objectively. A distinction should be made between unsuccessful and dishonest presentations. To do this, it is necessary to determine what the speaker's intentions were. Sometimes the speaker omits important information out of ignorance, and sometimes the omission is deliberate (for example, if he uses the arithmetic mean to estimate the center of clearly skewed data in order to obtain the desired result). It is also dishonest to suppress results that do not correspond to the researcher's point of view.

Based on material from the book: Levine et al. Statistics for Managers. Moscow: Williams, 2004, pp. 178–209.

The QUARTILE function has been retained for compatibility with earlier versions of Excel.