
Selçuk Journal of Applied Mathematics (Selçuk J. Appl. Math.), Special Issue, pp. 3-19, 2011

Key Properties and Central Theorems in Probability and Statistics - Corroborated by Simulations and Animations

Manfred Borovcnik

Alpen-Adria-University Klagenfurt, Department of Statistics, Klagenfurt e-mail: manfred.borovcnik@uni-klu.ac.at

Abstract. Probability and the methods of statistical inference are characterized by theoretical concepts that are far from intuitive conceptions. A more direct approach beyond the mathematical exposition of the theorems is a basic requirement of educational statistics, not only for students from studies other than mathematics. Moreover, the focus within mathematics lies heavily on the derivation of the mathematical connections and their logical proof relative to axioms and optimizing criteria. For example, the central limit theorem is hardly open to a full proof even for mathematics students. And in the proof, the concepts used — the characteristic function, e.g. — preclude understanding of the most relevant parts. It is not only the convergence of the distribution of the standardized statistic under scrutiny to the standard normal distribution that matters. The central limit theorem also incorporates the speed of convergence to the limiting distribution, which is highly influenced by the shape of the distribution of a single random variable. Clarifying such issues enhances the central limit theorem and the resulting importance of the normal distribution (even for non-parametric statistics). In the lecture, a spreadsheet is used to implement the simulations and animations.

Key words: Modelling, simulation, animation in educational statistics, EXCEL.

2000 Mathematics Subject Classification: 46N30, 68U20.

1. Laws of Large Numbers

1.1. Random Variation — Bernoulli Law of Large Numbers

There are a lot of misleading intuitions about the pattern of random numbers. Also, a sound comprehension of the variation of random numbers is often missing. It is important to establish intuitive thought right from the beginning of teaching probability: patterns can be quite awkward when they stem from a random source. The fluctuation of frequencies may be quite high when frequencies are based on a small sample, but it narrows with increasing sample size.


The weak law of large numbers says that for a Bernoulli sequence $X_i$, i.e. random variables $X_i \sim_{iid} B(1, p)$, it holds for every $\varepsilon > 0$ that
$$\lim_{n\to\infty} P\left(\left|\bar{X}_n - p\right| \geq \varepsilon\right) = 0,$$
where $\bar{X}_n$ denotes the relative frequency of successes in the first $n$ trials.

This is the ex-post justification for the interpretation of probability as relative frequency and the basis for estimating probabilities via relative frequencies from random samples. While a proof is relatively easy with the help of Chebyshev’s inequality, it may be supported by early simulation experiments. Such experiments would also illustrate the size of the random fluctuation, which is roughly ±3% with a sample size of 1000, while it is of the size of ±10% with only 100 data. They also make clear that such a decrease of the band of fluctuation is based on the assumption of randomness of the samples drawn. Such an experiment also paves the way right from the beginning to confidence intervals for an unknown probability from a random sample.

In what follows, we simulate random digits from 0 to 9 and show the frequency distribution after a sample of 50. The result of the simulation may easily be renewed in a spreadsheet (using a key button like F9 in EXCEL). To witness the fluctuation in the repeated simulation, like an animation, gives a clear view of the size of the random fluctuation. In Fig. 1a two different samples of 50 digits are shown. It is not unusual to have one digit completely missing in the statistics of the experiment. Then the whole experiment is renewed — now with a sample of 1000 random digits. It is clearly seen from the graphs in Fig. 1b that the fluctuation has narrowed to a small band — which gives a clear reason for samples of size 1000 as are quite often used in opinion polls.
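A minimal sketch of this simulation in Python rather than in the paper's spreadsheet (the digit range 0-9 and the sample sizes 50 and 1000 follow the text; re-running the script plays the role of pressing F9):

```python
# Minimal sketch: frequency distribution of simulated random digits 0-9 for
# sample sizes 50 and 1000; re-running the script plays the role of pressing F9.
import numpy as np

rng = np.random.default_rng()
for n in (50, 1000):
    digits = rng.integers(0, 10, size=n)              # digits 0, ..., 9
    freq = np.bincount(digits, minlength=10) / n      # relative frequencies
    print(f"n={n}: frequencies = {np.round(freq, 3)}, "
          f"spread = {freq.max() - freq.min():.3f}")
```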

Sometimes, in simulation studies, the diagrams displayed show the dependence of the relative frequency of one digit (event) on the sample size. Such displays are problematic: firstly, after a while no further fluctuation is visible; secondly, they suggest the convergence of the relative frequency to an obscure limit (the underlying probability). Here, the focus is on the purpose: estimating the unknown probability (whether it is known or not) is more effective from samples based on more data — if the samples are drawn randomly. And this leads at an early stage to confidence intervals instead of a point estimate of an unknown parameter (such as the probability p here).

1.2. Random Fluctuation — Early Explorations towards Statistical Tests

We have seen that bigger samples give a more accurate account of the underlying probability. Suppose we draw from an “infinite” population a sample of 10. If — note the indirect approach of inferential statistics — there were only 30% with “opinion A” (p = 0.3), what is the risk to have 7 or more people drawn with opinion A? The risk may be seen to be small but visible in Fig. 2a.

In Fig. 2b one may see that the same risk for a population with 60% opinion A (p = 0.6) is substantial. This will lead to samples compatible with hypotheses and the selection of rejection regions in statistical tests later in the course. For 100 data, one may read from Fig. 3a that there is nearly no risk to have 70% or more with opinion A in the sample in case of p = 0.3; there remains, however, some visible risk to still have 70 or more with opinion A in the sample if p = 0.6 is taken as a basis.
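A sketch of these tail risks in code (the cut-offs of 7 out of 10 and 70 out of 100 and the probabilities p = 0.3 and p = 0.6 are taken from the text):

```python
# Minimal sketch: risk of observing 7 or more of 10 (or 70 or more of 100) with
# opinion A when the population proportion is p = 0.3 or p = 0.6 (values from the text).
from scipy.stats import binom

for n, k in [(10, 7), (100, 70)]:
    for p in (0.3, 0.6):
        risk = binom.sf(k - 1, n, p)   # P(X >= k) = 1 - P(X <= k - 1)
        print(f"n={n}, p={p}: P(X >= {k}) = {risk:.4f}")
```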


Early investigations into the binomial distribution easily lead to questions of statistical tests. Again, a simulation study gives an insight into how random samples and their size influence the judgement. A difference between population and sample percentage (in case of a binomial variable) of 10 percentage points is roughly the biggest one has to take into account for samples of 100. With a sample as small as 10, such a difference might be much bigger. The bigger samples get, the more they reflect the properties of the population.

That is why we call random samples representative of the population. Despite the fact that all possible samples have the same probability of being drawn, extreme samples get less probable with increasing sample size. Demoscopic institutes or market research companies would alternatively choose quota samples with the aim to fulfil the quota (the proportions) with respect to important variables (like gender, age, city or rural area, etc.). They are quite skilful in doing so. A random sample, however, if it is possible to draw one, finally lets one calculate the risk of obtaining a non-representative sample. And a simple selection procedure may be improved by random sampling within specified subgroups, which might lead to a strong reduction in sampling fluctuation and a big improvement in the precision of the estimates from the sample for population characteristics.


2. Mean, Variance, and Functional Parameters of Distributions

2.1. Mean and Variance of Sums of Random Variables

For the sum of random variables X1, X2, ..., Xn it holds:

E(X1 + X2 + ... + Xn) = E(X1) + E(X2) + ... + E(Xn).

If the variables are independent, the additivity holds also for the variances, i.e.

V(X1 + X2 + ... + Xn) = V(X1) + V(X2) + ... + V(Xn).

For the mean $\bar{X}_n = \frac{1}{n}\sum X_i$ of a random sample $X_i \sim_{iid} X$ it therefore follows that $E(\bar{X}_n) = E(X)$, and for its variance it holds that $V(\bar{X}_n) = V(X)/n$.

The key property for the variances is the independence of the single random variables, which amounts, therefore, to the most important property of a random sample. However, it is in serious doubt in many applications, leading to the usual flaws of statistics. It is remarkable that the independence assumption is not needed for the additivity of the expected value. This runs counter to intuitions.

The additivity of expectation is thus not a probabilistic feature; it is merely a property of the sums or integrals involved (in the case of continuous distributions). On the contrary, the additivity of variances is a genuine stochastic property and holds only in the case of independence of the single summands. To illustrate this, the simplest special case of two spinners may be used.

If the following two wheels are spun separately (see Fig. 5a), the expected values of payment are easily calculated to be p and q. There is intuitively no doubt about the expected value of playing both: it is just the sum p + q! However, if you put the two wheels one over the other and decide the payment in one spin as in Fig. 5b, the amount of payment becomes dependent, and people refuse to accept that the expected value of payments remains the same.

The situation is as follows (as in the spreadsheet in Fig. 6): we denote by x the overlap where both spinners lead to a payment of 1; the joint distribution of G1 and G2 is then given by the following tables:

It is easy to calculate the distribution of the total payment G1 + G2 and its expectation and variance in dependence on the chosen overlap x. From the result one sees that the expected value is additive while the variance is not. To corroborate the result one may play with the overlap and change it (in a spreadsheet, this is easily done by a slider control) — the additivity of the expectation remains unaltered; there is a special choice of the overlap which yields additivity also for the variances: for x = 0.25 · 0.60 = 0.15, the product of the two marginal probabilities of payment, it holds that P(G1 = 1, G2 = 1) = P(G1 = 1) · P(G2 = 1), i.e., the two random variables are independent. The aim of the sequel here is to give an “argument” much more compelling to many students than a mathematical proof. Sometimes students behave differently when they have to answer tests and when they feel free from such a demand, i.e., they would leave the statistics class and still believe that the random variables have to be independent if the expectations are to be added.
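A minimal sketch of this calculation (assuming, from the reconstructed numbers, that the two wheels pay with probabilities p = 0.25 and q = 0.60; the overlap x = P(both pay) is varied, with x = 0.15 corresponding to independence):

```python
# Minimal sketch (assumption: the two wheels pay 1 with probabilities p = 0.25 and
# q = 0.60, as reconstructed from the text; x = P(both pay) is the overlap).
def moments_of_sum(p=0.25, q=0.60, x=0.15):
    # Joint distribution of (G1, G2) determined by the overlap x
    joint = {(1, 1): x, (1, 0): p - x, (0, 1): q - x, (0, 0): 1 - p - q + x}
    e = sum((g1 + g2) * pr for (g1, g2), pr in joint.items())
    v = sum((g1 + g2) ** 2 * pr for (g1, g2), pr in joint.items()) - e ** 2
    return e, v, p * (1 - p) + q * (1 - q)             # E, V, and V(G1) + V(G2)

for x in (0.05, 0.15, 0.25):                           # 0.15 = 0.25 * 0.60 -> independence
    e, v, v_add = moments_of_sum(x=x)
    print(f"x={x:.2f}: E(G1+G2)={e:.2f} (= p+q), V(G1+G2)={v:.4f}, V(G1)+V(G2)={v_add:.4f}")
```

The expectation stays at p + q = 0.85 for every overlap, while the two variance columns agree only at x = 0.15.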

2.2. Influence of the Parameters on the Shape of Distributions

With software, one can show the graph of the density function of a random variable, interactively change the parameters, and investigate the effect on the expected value μ and the standard deviation σ as well as on the interval [μ − 2σ, μ + 2σ] and the shape of the distribution. This gives a much clearer view of the impact of the parameters than single diagrams could do. The exploration on the screen and the immediate view of the new shape, however, cannot be authentically reproduced in print here. Nevertheless, by some selected values for the parameters of the gamma distribution X ∼ γ(α, β), we show that the parameter α resembles a simple scale parameter (as if “time” were run faster or slower), whereas the parameter β influences the shape (the skewness) of the distribution.
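A sketch of such an exploration without the interactive slider (assumption: following the text, the paper's α is read as the scale parameter and β as the shape parameter; scipy's own parameterization is gamma(a = shape, scale = scale), which may differ from the paper's notation):

```python
# Minimal sketch: mean, standard deviation and skewness of gamma distributions for a
# few parameter choices; the skewness depends only on the shape parameter.
from scipy.stats import gamma

for beta in (1, 2, 5):              # shape parameter: governs the skewness
    for alpha in (0.5, 1.0, 2.0):   # scale parameter: "time" run faster or slower
        mean, var, skew = (float(m) for m in
                           gamma.stats(a=beta, scale=alpha, moments="mvs"))
        print(f"beta={beta}, alpha={alpha}: mu={mean:.2f}, sigma={var ** 0.5:.2f}, "
              f"skewness={skew:.2f}")
```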


3. Central Limit Theorems

3.1. Approximations of Distributions

There are a lot of rules of thumb for the approximation between different distributions. Some are justified by the central limit theorem — as is the normal approximation to the binomial or to the χ². Others are clear by definition, as is the binomial approximation to the hypergeometric distribution. In a spreadsheet, such approximations may be empirically justified just by eye inspection. It is also possible to get a feeling of when such an approximation is sufficiently accurate. For the normal approximation to the binomial, the rule of thumb is that the variance of the binomial fulfils n·p·(1 − p) > 9.
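A minimal sketch of such an eye-inspection in code: it compares the binomial probabilities with the approximating normal density and reports the largest deviation (the parameter values are illustrative; the rule of thumb n·p·(1 − p) > 9 is from the text):

```python
# Minimal sketch: compare the binomial probabilities with the approximating normal
# density and report the largest deviation; the rule n*p*(1-p) > 9 is from the text.
import numpy as np
from scipy.stats import binom, norm

for n, p in [(20, 0.5), (100, 0.1), (100, 0.5)]:
    k = np.arange(n + 1)
    approx = norm.pdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    max_err = np.max(np.abs(binom.pmf(k, n, p) - approx))
    print(f"n={n}, p={p}: n*p*(1-p) = {n * p * (1 - p):.1f}, max deviation = {max_err:.4f}")
```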


3.2. Speed of Convergence in the Central Limit Theorem

The central limit theorem (CLT) gives the limiting distribution of the standardized sum of a sequence of independent, identically distributed (iid) random variables; in its simplest version it says:

If $X_i \sim_{iid} X$ with finite variance $\mathrm{Var}(X) < \infty$, then
$$\lim_{n\to\infty} P(U_n \leq u) = \Phi(u),$$
with $\Phi(u)$ the cumulative distribution function of the standard normal distribution and $U_n$ the standardized sum
$$U_n = \frac{\sum X_i - E\left(\sum X_i\right)}{\sqrt{\mathrm{Var}\left(\sum X_i\right)}}.$$

The first proof was provided by de Moivre and Laplace for the special case of single summands $X_i \sim_{iid} B(1, p)$, which justifies approximating the binomial distribution by the normal for larger n. The theorem is usually proven with the help of characteristic functions, which is highly sophisticated. The original proof by de Moivre and Laplace is so clumsy that — despite its conceptual simplicity — it gives no real clue of what is going on. It is, however, not only the convergence per se to the normal distribution which is important. More important, from the perspective of applications, is the speed of convergence. This speed is highly influenced by the shape of the distribution of the single summands.

A scenario of simulating 1000 different samples of size n = 20 and then n = 40 from two different distributions (see Fig. 8) may be seen from Fig. 9 (the scenario is taken from Borovcnik and Kapadia 2011). The distributions of the single summands are hypothesized to be

1. either equally distributed on the numbers 1, 2, . . . , 7 (Fig. 9a);

2. or to follow a distribution which falls apart into two components (see Fig. 9b).

The effect of the input distribution on the “normalization” of the mean of 20 or 40 data is shown by simulation, see Fig. 10. The equi-distribution for the single summands gives a much better fit to the normal distribution. In the scenario, one may wish to change the input distribution. The options provided in a spreadsheet by the authors are to increase the numbers for the equi-distribution (with the effect of much faster normalization) and to shift the right component of the two-component distribution further away from the main part (with the effect of slowing down the speed of normalization).

Such investigations give a good feeling about the requirements on the distribution of the single summands so that the sum or the mean may be approximated by a normal distribution. The effect of the single summands on the speed of normalization may be studied interactively.
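A minimal sketch of the scenario (the equi-distribution on 1, ..., 7 follows the text; the concrete two-component distribution of Fig. 8 is not reproduced here, so the second input is only an assumed stand-in, and the skewness of the 1000 simulated means serves as a rough index of how close to normal they look):

```python
# Minimal sketch: 1000 simulated means of n = 20 or 40 summands, either from the
# equi-distribution on 1, ..., 7 (from the text) or from an assumed stand-in for the
# two-component distribution of Fig. 8; skewness ~ 0 indicates a nearly normal shape.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
equi = np.arange(1, 8)
two_vals, two_probs = np.array([1, 2, 3, 20]), np.array([0.4, 0.3, 0.2, 0.1])

for n in (20, 40):
    m_equi = rng.choice(equi, size=(1000, n)).mean(axis=1)
    m_two = rng.choice(two_vals, size=(1000, n), p=two_probs).mean(axis=1)
    print(f"n={n}: skewness (equi) = {skew(m_equi):+.2f}, "
          f"skewness (two-component) = {skew(m_two):+.2f}")
```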

4. Key Properties of Inferential Statistical Procedures

4.1. Coverage Property of Confidence Intervals

There is some confusion about the correct interpretation of classical confidence intervals. Based on a sample $X_1, X_2, \ldots, X_n \sim_{iid} X$ with distribution function $F_X(\cdot \mid \theta)$, a confidence interval with coverage probability γ is a two-dimensional function of the sample which, irrespective of the value of θ, fulfils
$$P\left(L(X_1, X_2, \ldots, X_n) \leq \theta \leq U(X_1, X_2, \ldots, X_n) \mid \theta\right) = \gamma,$$
with L and U the lower and upper bounds of the interval.

This coverage is not a property of one single confidence interval but of the whole process of repeatedly drawing samples. For an illustration of this coverage in the long run we simulate, from the parent distribution $X \sim N(0, \sigma^2 = 16)$, samples of size 5 and 20 and calculate the usual confidence interval for the parameter μ as $[\bar{x} - \Delta, \bar{x} + \Delta]$ with $\Delta = z_{0.975} \cdot \sigma / \sqrt{n}$ for known standard deviation σ and γ = 0.95. In Fig. 11, the confidence intervals of repeated samples are signified by vertical bars of the length of the interval. If such a bar does not cross the zero line, the interval does not contain the parameter μ; this is marked by a dark dot. One may see from Fig. 11a that 13 intervals do not cover the “unknown” parameter μ, which amounts to a coverage rate of 93.5%. The coverage for samples of 20 in Fig. 11b is — by chance — a bit higher with 96.5%. However, the essential feature of the result of the experiment is that the intervals are much shorter, which shows the very use of confidence intervals: from larger samples, the unknown parameter may be estimated more precisely.
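A minimal sketch of this coverage experiment (200 repeated samples are assumed here, consistent with 13 misses amounting to a coverage rate of 93.5%; the parent distribution N(0, σ² = 16), the sample sizes 5 and 20, and γ = 0.95 follow the text):

```python
# Minimal sketch of the coverage experiment: 200 repeated samples (consistent with
# 13 misses = 6.5%) from N(0, sigma^2 = 16), known sigma, gamma = 0.95 (text values).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
mu, sigma, z = 0.0, 4.0, norm.ppf(0.975)

for n in (5, 20):
    xbars = rng.normal(mu, sigma, size=(200, n)).mean(axis=1)
    delta = z * sigma / np.sqrt(n)                     # half-width of the interval
    covered = np.mean((xbars - delta <= mu) & (mu <= xbars + delta))
    print(f"n={n}: half-width = {delta:.2f}, empirical coverage = {covered:.1%}")
```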

4.2. Properties of Estimation of Unknown Parameters

Parameter estimators should have some basic properties such as being unbiased, consistent, and of minimal variance. Usually, estimators derived by the method of maximum likelihood are asymptotically unbiased, their variance decreases to 0 with increasing sample size (consistency), and, under some regularity conditions on the family of distributions of the variable X under scrutiny, they have minimal variance and — finally — they are asymptotically normally distributed.
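These asymptotic properties may also be corroborated by simulation. A minimal sketch for one concrete case (the exponential distribution is chosen here only as an illustration, not as an example from the paper): the maximum-likelihood estimator of the rate λ is the reciprocal of the sample mean, and its bias and standard deviation shrink with growing sample size:

```python
# Minimal sketch (illustration only, not an example from the paper): the ML estimator
# of the exponential rate lambda is 1 / sample mean; its bias and spread shrink with n.
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0                                              # true rate

for n in (10, 100, 1000):
    estimates = 1.0 / rng.exponential(1 / lam, size=(5000, n)).mean(axis=1)
    print(f"n={n}: mean estimate = {estimates.mean():.3f} (true {lam}), "
          f"sd = {estimates.std():.3f}")
```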


4.3. Bayesian Estimation of an Unknown Parameter

Usually, the Bayesian concepts are somewhat excluded from the basic curriculum in statistics. This leads to some inconsistencies in applications as well as in the interpretation of the probabilities involved. For example, in the Bayesian formula, the prior probabilities are usually not open to a frequentist interpretation. See the discussion of Witmer et al. (1997) or the approaches towards statistics from a Bayesian viewpoint by Berry (1997) or Albert (1997). We will only illustrate the approach by one scenario, in which we use repeated samples for updating the estimate of the unknown parameter. In the long run, Bayesians come to similar conclusions as classical statisticians. However, their prior knowledge leads to substantially different estimates as long as samples are small. We will focus on the scenario of state lotteries.

In a lottery of drawing n numbers out of N, the total number N of balls is supposed to be unknown. The reader might imagine coming to a foreign country without knowing the lottery system there. We will model “complete” ignorance about this number N by a uniform distribution on the interval [31, 80]. By Bayes’ formula a posterior distribution may be derived on the basis of a week’s drawn numbers. It depends only on the week’s maximum. The following exploration (from Vancsó 2009) shows the impact of the week’s maximum number by three different results, Fig. 13a.


The next exploration accumulates the knowledge of several weeks on the drawn numbers. Each posterior distribution on the unknown N is used as a prior distribution for the next week to calculate a new posterior distribution on N .
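A minimal sketch of this accumulation (assumptions: a 6-out-of-N lottery and two hypothetical weekly draws; the uniform prior on [31, 80] follows the text). The likelihood of a week's draw given N is 1/C(N, 6) if all drawn numbers are at most N and 0 otherwise, so the posterior indeed depends only on the week's maximum:

```python
# Minimal sketch (assumptions: a 6-out-of-N lottery and two hypothetical weekly draws;
# the uniform prior for N on [31, 80] follows the text). The likelihood of a week's
# draw given N is 1/C(N, 6) if all drawn numbers are <= N and 0 otherwise.
from math import comb
import numpy as np

n_drawn = 6
support = np.arange(31, 81)                            # candidate values for N
posterior = np.full(support.size, 1.0 / support.size)  # uniform prior

weeks = [[3, 12, 27, 33, 41, 44], [5, 9, 18, 22, 30, 52]]  # hypothetical draws
for draw in weeks:
    week_max = max(draw)
    likelihood = np.array([1.0 / comb(N, n_drawn) if N >= week_max else 0.0
                           for N in support])
    posterior *= likelihood                            # last posterior = new prior
    posterior /= posterior.sum()
    print(f"week max = {week_max}: posterior mode N = {support[posterior.argmax()]}, "
          f"E[N] = {np.dot(support, posterior):.1f}")
```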

4.4. Some Key Properties of Statistical Tests

In what follows, we will illustrate a statistical test within the statistical model of a binomial distribution. A significance test is investigated for the null hypothesis p = 0.5 against the alternative p ≠ 0.5 at size α (e.g. α = 0.05); this size also implies a β error (type II error), which depends on the actual value of p. The following exploration is embedded in a context of usual bugs that take no notice of a special source of odour when “crossing a labyrinth” and therefore have a probability of p = 0.5 to arrive at the exit where the source of odour is (randomly) placed, and of “special” bugs that are attracted (p > 0.5) or distracted (p < 0.5) by the odour, see Fig. 14a.

In an animation, the various binomial distributions according to the hypotheses are depicted and the α and β errors due to the specifications are displayed. The special choice of α = 0.10 leads to the situation in Fig. 15a and b — the rejection region is visualized by black bars. One may see that with 20 data, no sound decision may be made as the β error is too big. With 100 data, the situation has improved considerably.
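A minimal sketch of such a computation (α = 0.10 and the sample sizes 20 and 100 follow the text; the alternative value p = 0.7 is only an assumed example for the β error):

```python
# Minimal sketch: two-sided rejection region for H0: p = 0.5 at alpha = 0.10 (as in
# Fig. 15) and the resulting beta error at an assumed alternative p = 0.7.
from scipy.stats import binom

p0, p_alt, alpha = 0.5, 0.7, 0.10
for n in (20, 100):
    lower = int(binom.ppf(alpha / 2, n, p0)) - 1       # reject if X <= lower
    upper = int(binom.isf(alpha / 2, n, p0)) + 1       # reject if X >= upper
    beta = binom.cdf(upper - 1, n, p_alt) - binom.cdf(lower, n, p_alt)
    print(f"n={n}: reject if X <= {lower} or X >= {upper}, "
          f"beta error at p = {p_alt}: {beta:.3f}")
```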


4.5. The Resampling Alternative to Statistical Inference

Simulation is a good starting point for replacing the whole approach towards statistical inference based on the assumption of parametric models. From a first sample an estimate $\hat{F}$ of the distribution function F of the random variable X under scrutiny is derived. As the “true” distribution function F is not known, one may wish to sample from the estimate $\hat{F}$ instead. If the first sample is drawn without a bias, then the error should not be too big. That is the key to the approach of resampling. If we aim at a confidence interval for an unknown expected value of a random variable X and we base our method on the result of a random sample X1, X2, ..., Xn, we proceed as follows:

We take a “sample” with replacement from the first data — some suggest using the same sample size, others prefer a smaller one, e.g., only half of it. The process of resampling is continued. With each “re-sample” an estimate of the unknown expected value is associated (the mean of the resampled data). By the process of resampling, we establish a data base of such estimates of the unknown expected value. From this data base, we might take the 2.5% and 97.5% quantiles to form a resampling interval for the expected value: 95% of our resampled estimates vary within these bounds. This is a resampling interval replacing the classical confidence interval, which is based on the assumption of a parametric model.
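A minimal sketch of this resampling interval (the data vector is hypothetical; the full resample size and the 2.5%/97.5% quantiles follow the text):

```python
# Minimal sketch of the resampling interval for the expected value; the data vector is
# hypothetical, the 2.5% and 97.5% quantiles and the full resample size follow the text.
import numpy as np

rng = np.random.default_rng(11)
data = rng.exponential(scale=3.0, size=30)             # "first sample" (illustrative)

boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(5000)])          # estimates from the re-samples
lower, upper = np.quantile(boot_means, [0.025, 0.975])
print(f"sample mean = {data.mean():.2f}, "
      f"95% resampling interval = [{lower:.2f}, {upper:.2f}]")
```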

The following exploration illustrates the procedure. Such resampling intervals may be derived for any parameter of a random variable; they may also be derived for the correlation coefficient of two variables, or for the difference of an outcome in a treatment-control group comparison, as is done in Stephenson et al. (2010) for the difference of two proportions with the software R. For further details of the resampling approach, see, e.g., Christie (2004).


5. Conclusions

In using a modelling approach with probability, the subject may be taught more realistically and in a more motivating way; see, e.g., Borovcnik and Kapadia (2011). Here, the use of suitable software becomes crucial. Not all problems in modelling may be solved analytically; furthermore, graphical support of the model and the analysis is often helpful. With the simulation approach towards probability and statistics, sophisticated mathematical concepts may be put on a more concrete level, which might enhance comprehension greatly. Sometimes, simulation is used in the sense of illustrating that relative frequencies converge to an — obscure — limit value, namely the underlying probability. Here, such an investigation is replaced by illustrating the variability at one sample size (by repeating the sample several times) and then switching to a larger sample size. This will show that larger samples are more reliable to draw conclusions from — if they are drawn randomly.

Dunsworth and Atkinson (2007), Sklar and Zwick (2009), or Mayer and Moreno (2002) analyze the merits of multi-media teaching. Simons (1984) uses analogies to support teaching. Analogies give a context to the concepts, which gives strong associations for the “correct” properties of these concepts. In a way, they help to “manipulate” comprehension. We used a spreadsheet like EXCEL for our purpose to animate properties of the concepts, or to simulate to show their key features. Others such as Eudey et al. (2010) or Stephenson et al. (2010) prefer R. The degree of interactivity is an argument in favour of a spreadsheet, while the continuation in a special study of statistics would make R appear more promising. For deliberations on including Bayesian methods into statistical education as well, the reader may wish to consult Witmer et al. (1997); the famous discussion in The American Statistician is still worth reading again.

It is not only more motivating to integrate animations, analogies, and simulations, but sometimes it enhances the concepts much more than a one-sided mathematical approach can do. However, sometimes using multi-media might also lead astray. Or, it may give over-confidence, as is the case, e.g., if patterns in a simulation study seem convincing but lack the basis for generalization. On the whole, as more students are confronted with statistics in their studies and lack a stronger background in mathematics, they might be fostered by the integration of approaches as illustrated here.

References

1. Albert J. (1997): Teaching Bayes’ rule: a data-oriented approach. The American Statistician 51(3), 247-253.

2. Berry D. A. (1997): Teaching elementary Bayesian statistics with real applications in science. The American Statistician 51(3), 241-246.

3. Borovcnik M., Kapadia R. (2011): Modelling in Probability and Statistics - Key Ideas and Innovative Examples. In Maaß J., O’Donoghue J. (Eds.), Real-World Problems for Secondary School Students - Case Studies. Rotterdam: Sense Publishers.

4. Christie D. (2004): Resampling with Excel. Teaching Statistics 26(1), 9-14.

5. Dunsworth Q., Atkinson R. K. (2007): Fostering multimedia learning of science: Exploring the role of an animated agent’s image. Computers and Education 49(3), 677-690.

6. Duffy S. (2010): Random Numbers Demonstrate the Frequency of Type I Errors: Three Spreadsheets for Class Instruction. Journal of Statistics Education 18(2), http://www.amstat.org/publications/jse/v18n2/duffy.pdf

7. Eudey T. L., Kerr J. D., Trumbo B. E. (2010): Using R to Simulate Permutation Distributions for Some Elementary Experimental Designs. Journal of Statistics Education 18(1), http://www.amstat.org/publications/jse/v18n1/eudey.pdf

8. Mayer R. E., Moreno R. (2002): Animation as an aid to multimedia learning. Educational Psychology Review 14(1), 87-99.

9. Simons P. R. (1984): Instructing with analogies. Journal of Educational Psychology 76(3), 513-527.

10. Sklar J. C., Zwick R. (2009): Multimedia Presentations in Educational Measurement and Statistics: Design Considerations and Instructional Approaches. Journal of Statistics Education 17(3), www.amstat.org/publications/jse/v17n3/sklar.html

11. Stephenson W. R., Froelich A. G., Duckworth W. M. (2010): Using Resampling to Compare Two Proportions. Teaching Statistics 32(3), 66-71.

12. Vancsó Ö. (2009): Parallel Discussion of Classical and Bayesian Ways as an Introduction to Statistical Inference. International Electronic Journal of Mathematics Education 4(3), 291-322.

13. Witmer J., Short T. H., Lindley D. V., Freedman D. A., Scheaffer R. L. (1997): Teacher’s corner. Discussion of papers by Berry, Albert, and Moore, with replies from the authors. The American Statistician 51(3), 262-274.
