Theoretical Probability Distributions

(1)


PhD Özgür Tosun

(2)

THEORETICAL PROBABILITY DISTRIBUTIONS

Probability is a measure of chance.

Probability distributions help us to study the probabilities associated with outcomes of the variable under study.

Probability theory is the foundation for statistical inference. A probability distribution is a device for indicating the values that a random variable may have.

(3)

THEORETICAL PROBABILITY DISTRIBUTIONS

Several theoretical probability distributions are important in biostatistics:

I) Binomial
II) Poisson
III) Normal

Discrete probability distributions (Binomial, Poisson): the variable takes only integer values.

Continuous probability distribution (Normal): the variable has values measured on a continuous scale.

(4)

THE BINOMIAL DISTRIBUTION:

• Variable has only binary/dichotomous outcomes (male – female; diseased – not diseased; positive – negative), denoted A and B.

• The probability of A is denoted by p.

P(A) = p and P(B) = 1 - p

(5)

THE BINOMIAL DISTRIBUTION:

• When the experiment is repeated n times, p remains constant and the outcome of each trial is independent of the other trials.

Such a variable is said to follow a BINOMIAL DISTRIBUTION.

Daniel Bernoulli

(6)

The question is:

What is the probability that outcome A occurs x times?

Or

What proportion of n outcomes will be A?

The probability of x outcomes A in a group of size n, if each outcome has probability p and is independent of all other outcomes, is given by the Binomial Probability Function:

$$P(A = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$
(7)

Example

For families with 5 children each, what is the probability that

i) There will be one male child?

$$P(A = 1) = \binom{5}{1} (0.50)^{1} (1 - 0.50)^{5 - 1} = 0.16$$

Among families with 5 children each, 0.16 have one male child.

(8)

ii) There will be at least one male child?

$$P(A \ge 1) = P(A = 1) + P(A = 2) + P(A = 3) + P(A = 4) + P(A = 5)$$

$$P(A \ge 1) = 1 - P(A = 0) = 1 - \binom{5}{0} (0.5)^{0} (1 - 0.5)^{5 - 0} = 1 - 0.03 = 0.97$$
(9)

Using the probabilities associated with possible outcomes, we can draw a probability distribution for the event under study:

[Figure: bar chart of the probability distribution of the number of male children in families with 5 children; x-axis: number of male children (0 to 5); y-axis: probability.]
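The family example can be checked numerically. The following is a minimal sketch, assuming P(male) = 0.5 as in the example; the function name `binomial_pmf` is illustrative and not part of the original slides.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(A = x): probability of exactly x outcomes A in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 5, 0.5  # 5 children, P(male) = 0.5 as in the example

# Probability of 0, 1, ..., 5 male children
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

p_exactly_one = distribution[1]
p_at_least_one = 1 - distribution[0]  # complement of "no male child"

for x, prob in distribution.items():
    print(f"P(A = {x}) = {prob:.4f}")
print(f"P(A = 1)  = {p_exactly_one:.2f}")
print(f"P(A >= 1) = {p_at_least_one:.2f}")
```

The six probabilities sum to 1, and their heights are what a bar chart of the distribution would show.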

(10)

Example:

Among men with a localized prostate tumor and PSA < 10, the 5-year survival is known to be 0.8. We can use the Binomial Distribution to calculate the probability that any particular number (A), out of n, will survive 5 years. For example, for a new series of 6 such men:

None will survive 5 years : P(A=0) = 0.000064
Only 1 will survive 5 years : P(A=1) = 0.0015
2 will survive 5 years : P(A=2) = 0.015
3 will survive 5 years : P(A=3) = 0.082
4 will survive 5 years : P(A=4) = 0.246
5 will survive 5 years : P(A=5) = 0.393
All will survive 5 years : P(A=6) = 0.262
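The seven probabilities above can be reproduced directly from the binomial probability function. This is a small sketch under the example's assumptions (n = 6, p = 0.8); the variable names are illustrative.

```python
from math import comb

n, p = 6, 0.8  # 6 men, 5-year survival probability 0.8 (from the example)

# P(A = x) for x = 0..6 using the binomial probability function
survival_probs = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

for x, prob in enumerate(survival_probs):
    print(f"{x} survive 5 years: P(A={x}) = {prob:.6f}")

# The most likely number of survivors is the x with the largest probability
most_likely = max(range(n + 1), key=lambda x: survival_probs[x])
print("Most likely number of survivors:", most_likely)
```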

(11)

[Figure: Binomial distribution for n = 6 and p = 0.8; x-axis: number surviving 5 years (0 to 6); y-axis: probability.]

(12)

(13)

THE POISSON DISTRIBUTION:

Like the Binomial, the Poisson distribution is a discrete distribution, applicable when the outcome is the “number of times an event occurs”.

If the average number of occurrences of the event is given instead of the probability of an outcome, the associated probabilities can be calculated using the Poisson Distribution Function, which is defined as:

$$P(A = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

where λ (lambda) is the average number of occurrences and x is the number of occurrences of interest.

Siméon D. Poisson (1781-1840)
(14)

Example.

If the average number of hospitalizations for a group of patients is calculated as 3.22, the probability that a patient in the group has zero hospitalizations is

$$P(A = 0) = \frac{3.22^{0} e^{-3.22}}{0!} = 0.04$$

(15)

The probability that a patient has exactly one hospitalization is

$$P(A = 1) = \frac{3.22^{1} e^{-3.22}}{1!} = 0.129$$

The probability that a patient will be hospitalized more than 3 times, since the upper limit is unknown, is calculated as

$$P(A > 3) = 1 - P(A \le 3)$$
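The hospitalization probabilities can be worked out with a few lines of code. This is a sketch assuming the average of 3.22 from the example; `poisson_pmf` is an illustrative name, not something defined in the slides.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """P(A = x) when the average number of occurrences is lam."""
    return lam**x * exp(-lam) / factorial(x)


lam = 3.22  # average number of hospitalizations from the example

p_zero = poisson_pmf(0, lam)
p_one = poisson_pmf(1, lam)

# More than 3: use the complement, because the count has no upper limit
p_more_than_3 = 1 - sum(poisson_pmf(x, lam) for x in range(4))

print(f"P(A = 0) = {p_zero:.2f}")
print(f"P(A = 1) = {p_one:.3f}")
print(f"P(A > 3) = {p_more_than_3:.3f}")
```

Summing the four terms for 0 to 3 hospitalizations and subtracting from 1 avoids an infinite sum over 4, 5, 6, and so on.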

(16)

EXAMPLE

(17)

(18)

(19)

Normal Distribution

Karl F. Gauss (1777-1855)

Abraham de Moivre (1667-1754)

(20)

NORMAL DISTRIBUTION

Normal (Gaussian) distribution is the most famous probability distribution of continuous variables.

The two parameters of the normal distribution are the mean (μ) and the standard deviation (σ).

The graph has a familiar bell-shaped curve.

The function of the normal distribution curve is as follows:

$$f(x_i) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^{2}}$$
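To see how the two parameters shape the curve, the function can be evaluated directly. The sketch below uses illustrative values (mean 0 with standard deviations 1 and 2) that are not taken from the slides; `normal_pdf` is a hypothetical helper name.

```python
from math import exp, pi, sqrt


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


# Illustrative values only (not from the slides): same mean, two spreads
peak_narrow = normal_pdf(0.0, mu=0.0, sigma=1.0)
peak_wide = normal_pdf(0.0, mu=0.0, sigma=2.0)

print(f"peak height with sigma = 1: {peak_narrow:.4f}")
print(f"peak height with sigma = 2: {peak_wide:.4f}")
```

Doubling the standard deviation halves the peak height, because the total area under the curve stays equal to 1 while the curve spreads out.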

(21)

The normal distribution is completely defined by the mean and standard deviation of a set of quantitative data:

• The mean determines the location of the curve on the x axis of a graph

• The standard deviation determines the spread of the curve, and therefore its height on the y axis

There are an infinite number of normal distributions, one for every possible combination of a mean and standard deviation.

(22)

Pr(X) on the y-axis refers to either frequency or probability.

Examples of Normal Distributions

(23)

Examples of Normal Distributions

(24)

[Figure: histogram of heart rate (BPM) with the mean marked; x-axis: heart rate (BPM), 55 to 90; y-axis: frequency, 0 to 25.]

Many (but not all) continuous variables are approximately normally distributed. For such variables, as sample size increases, the shape of the frequency distribution generally comes closer to the normal curve.

(25)

When data are normally distributed, the mode, median, and mean are identical and are located at the center of the distribution.

[Figure: normal curve with the mode, median, and mean marked at the center; y-axis: frequency of occurrence.]

(26)

Quantitative variables may also have a skewed distribution:

When distributions are skewed, they have more extreme values in one direction than the other, resulting in a long tail on one side of the distribution.

The direction of the tail determines whether a distribution is positively or negatively skewed.

A positively skewed distribution has a long tail on the right, or positive side of the curve.

A negatively skewed distribution has the tail on the left, or negative side of the curve.

(27)

(28)

For a normally distributed variable:

~68.3% of the observations lie within ± 1 standard deviation of the mean

~95.4% lie within ± 2 standard deviations of the mean

~99.7% lie within ± 3 standard deviations of the mean

[Figure: normal curve with the mode, median, and mean at the center and the central areas 68.3%, 95.4%, and 99.7% marked.]
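These three percentages do not depend on the particular mean or standard deviation. As a minimal sketch, they can be computed from the error function in Python's standard library; `area_within` is an illustrative name.

```python
from math import erf, sqrt


def area_within(k: float) -> float:
    """Area under any normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k} SD: {area_within(k) * 100:.1f}%")
```

Values read from a printed z table can differ from these in the last digit (for example 0.9974 instead of 0.9973), because table entries are rounded before being doubled.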

(29)

$$P(\mu - \sigma < x < \mu + \sigma) = 0.6826 \quad (68.26\%)$$

$$P(\mu - 2\sigma < x < \mu + 2\sigma) = 0.9544 \quad (95.44\%)$$

$$P(\mu - 3\sigma < x < \mu + 3\sigma) = 0.9974 \quad (99.74\%)$$

(30)

(31)

Sex HR   Sex HR   Sex HR   Sex HR   Sex HR   Sex HR   Sex HR
F   55   M   66   F   70   M   73   F   77   M   79   F   82
M   57   F   67   F   70   M   73   F   77   M   79   M   82
M   59   F   67   M   70   M   73   F   77   M   79   F   83
F   61   F   68   M   70   M   73   M   77   F   80   M   83
M   61   F   68   F   71   F   74   M   77   F   80   M   83
M   62   F   68   F   71   F   74   F   78   M   80   F   84
M   62   M   68   M   71   F   74   F   78   F   81   F   84
F   63   F   69   M   71   M   74   F   78   F   81   M   85
F   64   M   69   F   72   F   75   F   78   F   81   F   86
M   64   M   69   M   72   F   75   M   78   M   81   F   86
M   64   M   69   F   73   M   75   M   78   F   82   M   89
M   66   F   70   M   73   M   76   M   79   F   82   M   89

[Figure: histogram of the heart rate data; x-axis: heart rate (BPM), 55 to 90; y-axis: frequency, 0 to 25.]

For the heart rate data for 84 adults:

Mean HR = 74.0 bpm, SD = 7.5 bpm

Mean ± 1SD = 74.0 ± 7.5 = 66.5-81.5 bpm

Mean ± 2SD = 74.0 ± 15.0 = 59.0-89.0 bpm

Mean ± 3SD = 74.0 ± 22.5 = 51.5-96.5 bpm

(32)

HR Data:

• 57/84 (67.9%) subjects are between mean ± 1SD

• 82/84 (97.6%) are between mean ± 2SD

• 84/84 (100%) are between mean ± 3SD
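The counts above can be verified against the raw data. The sketch below copies the 84 heart rates from the data table and uses the rounded mean and SD reported on the slide, with interval end points counted as inside (an assumption that reproduces the reported 82/84).

```python
# Heart rates (bpm) of the 84 adults, copied from the data table (column by column)
heart_rates = [
    55, 57, 59, 61, 61, 62, 62, 63, 64, 64, 64, 66,
    66, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 70,
    70, 70, 70, 70, 71, 71, 71, 71, 72, 72, 73, 73,
    73, 73, 73, 73, 74, 74, 74, 74, 75, 75, 75, 76,
    77, 77, 77, 77, 77, 78, 78, 78, 78, 78, 78, 79,
    79, 79, 79, 80, 80, 80, 81, 81, 81, 81, 82, 82,
    82, 82, 83, 83, 83, 84, 84, 85, 86, 86, 89, 89,
]

mean, sd = 74.0, 7.5  # rounded summary values reported on the slide


def count_within(k: int) -> int:
    """Number of subjects whose heart rate lies within mean ± k SD (inclusive)."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= hr <= high for hr in heart_rates)


counts = {k: count_within(k) for k in (1, 2, 3)}
for k, count in counts.items():
    share = 100 * count / len(heart_rates)
    print(f"mean ± {k}SD: {count}/{len(heart_rates)} ({share:.1f}%)")
```

The observed shares (67.9%, 97.6%, 100%) are close to the theoretical 68.3%, 95.4%, and 99.7%, which is what one expects for approximately normal data.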

[Figure: heart rate (bpm, 45 to 100) plotted against subject number (0 to 84), with horizontal lines at the mean and at ±1 SD, ±2 SD, and ±3 SD.]

(33)

The “normal” range in medical measurements is the central 95% of the values for a reference population, and is usually determined from large samples representative of the population.

The central 95% is approximately the mean ± 2 sd*

Some examples of established reference ranges are:

Serum              “Normal” range
fasting glucose    70-110 mg/dL
sodium             135-146 mEq/L
triglycerides      35-160 mg/dL

*Note: The value is actually 1.96 sd but for convenience this is usually rounded to 2 sd.

(34)

The Standard Normal Distribution

A normal distribution with a mean of 0 and a standard deviation of 1.

The distribution is also called the z distribution.

Any normal distribution can be converted to the standard normal distribution using the z transformation.

Each value in a distribution is converted to the number of standard deviations the value is from the mean.

The transformed value is called a z score.

(35)

Formula for the z transformation:

$$z = \frac{x - \mu}{\sigma}$$

Once the data are transformed to z-scores, the standard normal distribution can be used to determine areas under the curve for any normal distribution.

(36)

Example of a z-transformation

If the population mean heart rate is 74 bpm, and the standard deviation is 7.5, the z score for an individual with HR = 80 bpm is:

$$z = \frac{x - \mu}{\sigma} = \frac{80 - 74}{7.5} = 0.8$$

The individual’s HR of 80 bpm is 0.8 standard deviations above the mean.

(37)

The z-value can be looked up in a table for the standard normal distribution to determine the lower and upper areas defined by a z-score of 0.8 (the areas are the lower 78.8% and upper 21.2%).
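The table lookup can also be done in code. This is a minimal sketch using `statistics.NormalDist` from Python's standard library in place of a printed table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 74.0, 7.5  # population mean and SD of heart rate (bpm)
x = 80.0               # the individual's heart rate

z = (x - mu) / sigma                # z transformation
lower_area = NormalDist().cdf(z)    # area under N(0,1) to the left of z
upper_area = 1 - lower_area         # area to the right of z

print(f"z = {z:.1f}")
print(f"lower area = {lower_area:.1%}, upper area = {upper_area:.1%}")
```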

(38)

Using Table

(…)

.0082 is the area under N(0,1) left of z = -2.40

.0080 is the area under N(0,1) left of z = -2.41

0.0069 is the area under N(0,1) left of z = -2.46

(39)

The standard Normal distribution

Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(μ, σ) into the standard Normal curve N(0,1).

For each x we calculate a new value, z (called a z-score):

$$z = \frac{x - \mu}{\sigma}$$

[Figure: the Normal curve N(64.5, 2.5) is transformed into the standard Normal curve N(0,1); x-axis after the transformation: standardized height (no units).]

(40)

The total area under the normal distribution curve is 1:

90% of the area is between ± 1.645 sd
95% of the area is between ± 1.960 sd
99% of the area is between ± 2.575 sd

[Figure: standard normal curve centered at 0 with the central areas marked: 90% between -1.645 and +1.645, 95% between -1.960 and +1.960, 99% between -2.575 and +2.575.]
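The cut-off values for a given central area can be recovered with the inverse of the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist.inv_cdf`; the helper name `central_z` is illustrative. Note that the computed 99% value rounds to 2.576, while many printed tables, including the one used here, give 2.575.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def central_z(area: float) -> float:
    """z such that the given central area lies between -z and +z."""
    tail = (1 - area) / 2          # area left over in each tail
    return std_normal.inv_cdf(1 - tail)


for area in (0.90, 0.95, 0.99):
    print(f"{area:.0%} of the area is between ± {central_z(area):.3f} sd")
```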

(41)

The Normal Distribution & Confidence Intervals

90% of the area is between ± 1.645 sd

95% of the area is between ± 1.960 sd

99% of the area is between ± 2.575 sd

These are the most commonly used areas for defining Confidence Intervals, which are used in inferential statistics to estimate population values from sample data.

If a certain interval is a 95% confidence interval, then we can say that if we repeated the procedure of drawing random samples and computing confidence intervals over and over again, 95% of those confidence intervals would include the true value from the population.
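The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: a hypothetical normal population (mean 74, SD 7.5, borrowed from the heart rate example), samples of 25, a known population SD, and the interval sample mean ± 1.96 × SD / √n, none of which are specified in the slides.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population for illustration only
true_mean, sigma = 74.0, 7.5
n, repeats = 25, 2000
z = 1.96  # cut-off for the central 95% of the standard normal distribution

covered = 0
for _ in range(repeats):
    # Draw a random sample and compute its 95% confidence interval
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    sample_mean = sum(sample) / n
    half_width = z * sigma / sqrt(n)  # population SD treated as known
    if sample_mean - half_width <= true_mean <= sample_mean + half_width:
        covered += 1

coverage = covered / repeats
print(f"{coverage:.1%} of the intervals include the true mean")
```

The proportion printed should be close to 95%, not exactly 95%, because the simulation itself is subject to sampling variation.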

(42)

Birth weight (x_i)    z_i = (x_i - 3300) / 600
3200                  -0.167
3450                  0.25
2980                  -0.533
4100                  1.333
2900                  -0.667
3500                  0.333
:                     :
3400                  0.167

μ = 3300 ; σ = 600    μ = 0 ; σ = 1.0
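The z values in the table follow from applying the z transformation to each birth weight. A minimal sketch, using only the seven weights that are listed (the elided rows are left out):

```python
# Birth weights in grams, copied from the listed rows of the table
birth_weights = [3200, 3450, 2980, 4100, 2900, 3500, 3400]
mu, sigma = 3300, 600  # mean and standard deviation of birth weight

# z transformation: number of standard deviations each value is from the mean
z_scores = [(x - mu) / sigma for x in birth_weights]

for x, z in zip(birth_weights, z_scores):
    print(f"{x} g -> z = {z:.3f}")
```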

(43)

If it is known that the birth weights of infants are normally distributed with a mean of 3300 g and a standard deviation of 600 g, what is the probability that a randomly selected infant will weigh less than 3000 g?

$$z_i = \frac{x_i - \mu}{\sigma} = \frac{3000 - 3300}{600} = -0.5$$

$$P(X_i < 3000) = P(Z_i < -0.5) = 0.31$$

More than 3000 g? Ans: 0.19 + 0.50 = 0.69
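Both answers can be checked without a table. This sketch uses `statistics.NormalDist` with the mean and standard deviation given in the example; the variable names are illustrative.

```python
from statistics import NormalDist

# Birth weight model from the example: mean 3300 g, SD 600 g
birth_weight = NormalDist(mu=3300, sigma=600)

p_less = birth_weight.cdf(3000)  # P(X < 3000)
p_more = 1 - p_less              # P(X > 3000)

print(f"P(X < 3000) = {p_less:.2f}")
print(f"P(X > 3000) = {p_more:.2f}")
```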

(44)

Area between 0 and z

z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1  0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2  0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3  0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4  0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5  0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6  0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7  0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8  0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9  0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0  0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1  0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2  0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3  0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4  0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5  0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6  0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7  0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8  0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9  0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0  0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1  0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2  0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3  0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4  0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5  0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6  0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7  0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8  0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9  0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0  0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
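Each entry of this table is the standard normal cumulative probability at z minus 0.5. As a rough sketch, the first few rows can be regenerated with the standard library; `area_0_to_z` is an illustrative name, and only four rows are printed to keep the output short.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def area_0_to_z(z: float) -> float:
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return std_normal.cdf(z) - 0.5


# Row label = z to one decimal place, column = second decimal place of z
header = "z    " + " ".join(f"{col / 100:<6.2f}" for col in range(10))
print(header)
for row in range(4):
    cells = [f"{area_0_to_z(row / 10 + col / 100):.4f}" for col in range(10)]
    print(f"{row / 10:.1f}  " + " ".join(cells))
```

To read a value such as z = 1.96, take row 1.9 and column 0.06; the same function gives `area_0_to_z(1.96)`, about 0.4750.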

(45)

If the mean and the standard deviation of the BMI of adult women are 24 and 6 units respectively, what proportion of women will have BMI > 30 (what proportion of women will be classified as obese)?

$$z_i = \frac{x_i - \mu}{\sigma} = \frac{30 - 24}{6} = 1.0$$

$$P(X_i > 30) = P(Z_i > 1.0) = 0.16$$

16% of the adult women will be classified as obese.

(46)

(47)

$$P(x < 18) = P\left(z < \frac{18 - 22}{4}\right) = P(z < -1) = 0.5 - 0.3413 = 0.1587$$

(48)

Example

• In a normal curve with mean = 30, s = 5, what is the proportion of scores below 27?

$$Z_{27} = \frac{27 - 30}{5} = -0.6$$

Mean to Z equals 0.2257 and 0.5 - 0.2257 = 0.2743

Smaller portion of a Z of 0.6 is 0.2743

Portion ≈ 27%

[Figure: standard normal curve, z axis from -4 to 4, with the score 27 marked.]

(49)

Example

• In a normal curve with mean = 30, s = 5, what proportion of scores fall between 26 and 35?

$$Z_{26} = \frac{26 - 30}{5} = -0.8$$

$$Z_{35} = \frac{35 - 30}{5} = 1$$

Mean to a Z of 0.8 is 0.2881
Mean to a Z of 1 is 0.3413
0.2881 + 0.3413 = 0.6294
Portion = 62.94% or ≈ 63%

[Figure: standard normal curve, z axis from -4 to 4, with the scores 26 and 35 marked and the areas .2881 and .3413 on either side of the mean.]

(50)

Example

• The Stanford-Binet IQ test has a mean of 100 and an SD of 15. How many people (out of 1000) have IQs between 120 and 140?

$$Z_{120} = \frac{120 - 100}{15} \approx 1.33$$

$$Z_{140} = \frac{140 - 100}{15} \approx 2.66$$

Mean to a Z of 2.66 is 0.4961
Mean to a Z of 1.33 is 0.4082
0.4961 - 0.4082 = 0.0879
Portion = 8.79% or ≈ 9%
0.0879 * 1000 = 87.9 or ≈ 88 people

[Figure: standard normal curve, z axis from -4 to 4, with the scores 120 and 140 marked and the areas .4082 and .4961 measured from the mean.]

(51)

When the numbers are on the same side of the mean: subtract
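The add-or-subtract rule for "mean to z" areas can be written as a small function. This is a sketch, not the slides' own method: it computes the tabulated areas with `statistics.NormalDist` instead of reading them from the table, and the helper names are illustrative. Because it uses unrounded z values, the IQ result comes out near 87 people rather than the 88 obtained from the table entries for 1.33 and 2.66.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def mean_to_z(z: float) -> float:
    """Area between the mean and z (the quantity tabulated as 'area between 0 and z')."""
    return std_normal.cdf(abs(z)) - 0.5


def area_between(z_low: float, z_high: float) -> float:
    """Same side of the mean: subtract the two areas. Opposite sides: add them."""
    if z_low * z_high > 0:
        return abs(mean_to_z(z_high) - mean_to_z(z_low))
    return mean_to_z(z_low) + mean_to_z(z_high)


# Scores between 26 and 35 (mean 30, s = 5): opposite sides of the mean, so add
share_scores = area_between((26 - 30) / 5, (35 - 30) / 5)

# IQs between 120 and 140 (mean 100, SD 15): same side of the mean, so subtract
share_iq = area_between((120 - 100) / 15, (140 - 100) / 15)
people = share_iq * 1000

print(f"scores between 26 and 35: {share_scores:.1%}")
print(f"IQ between 120 and 140: {share_iq:.1%}, about {people:.0f} of 1000 people")
```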
