(1)

An Introduction to

Graphical Models and Monte Carlo methods

A. Taylan Cemgil

Signal Processing and Communications Lab.

Birkbeck School of Economics, Mathematics and Statistics

June 19, 2007

(2)

Goals of this Tutorial

To Provide ...

• a basic understanding of underlying principles of probabilistic modeling and inference

• an introduction to Graphical models and associated concepts

• a succinct overview of (perhaps interesting) applications from engineering and computer science

– Statistical Signal Processing, Pattern Recognition
– Machine Learning, Artificial Intelligence

• an initial orientation in the broad literature of Monte Carlo methods

(3)

First Part, Basic Concepts and MCMC

• Introduction

– Bayes’ Theorem,

– Trivial toy example to clarify notation

• Graphical Models

– Bayesian Networks

– Undirected Graphical models, Markov Random Fields
– Factor graphs

• Maximum Likelihood and Bayesian Learning

(4)

– (classical AI) Medical Expert systems, (Statistics) Variable selection, (Engineering-CS) Computer vision
– Time Series – terminology and applications
– Audio processing

– Non Bayesian applications

• Probability Models

– Exponential family, Conjugacy

– Motivation for Approximate Inference

• Markov Chain Monte Carlo
– A Gaussian toy example
– The Gibbs sampler

– Sketch of Markov Chain theory

(5)

– Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler

– Optimisation versus Integration: Simulated annealing and iterative improvement

(6)

Second Part, Time Series Models and SMC

• Latent State-Space Models

– Hidden Markov Models (HMM)
– Kalman Filter Models
– Switching State Space models
– Changepoint models

• Inference in HMM

– Forward Backward Algorithm
– Viterbi

– Exact inference in Graphical models by message passing

• Sequential Monte Carlo

(7)

– Particle Filtering

• Final Remarks and Bibliography

(8)

Bayes’ Theorem

Thomas Bayes (1702-1761)

“What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.”

p(λ|D) = p(D|λ) p(λ) / p(D)

Posterior = (Likelihood × Prior) / Evidence

(9)

An application of Bayes’ Theorem: “Source Separation”

Given two fair dice with outcomes λ and y ,

D = λ + y

What is λ when D = 9 ?

(10)

An application of Bayes’ Theorem: “Source Separation”

D = λ + y = 9

D = λ + y y = 1 y = 2 y = 3 y = 4 y = 5 y = 6

λ = 1 2 3 4 5 6 7

λ = 2 3 4 5 6 7 8

λ = 3 4 5 6 7 8 9

λ = 4 5 6 7 8 9 10

λ = 5 6 7 8 9 10 11

λ = 6 7 8 9 10 11 12

Bayes’ theorem “upgrades” p(λ) into p(λ|D).

But you have to provide an observation model: p(D|λ)

(11)

“Bureaucratic” derivation

Formally we write

p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(D|λ, y) = δ(D − (λ + y))

where δ is the Kronecker delta, denoting a degenerate (deterministic) distribution:

δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0

p(λ, y|D) = (1/p(D)) × p(D|λ, y) × p(λ) p(y)

Posterior = (1/Evidence) × Likelihood × Prior

(12)

Prior

p(y)p(λ)

p(y) × p(λ)   y = 1    y = 2    y = 3    y = 4    y = 5    y = 6
λ = 1         1/36     1/36     1/36     1/36     1/36     1/36
λ = 2         1/36     1/36     1/36     1/36     1/36     1/36
λ = 3         1/36     1/36     1/36     1/36     1/36     1/36
λ = 4         1/36     1/36     1/36     1/36     1/36     1/36
λ = 5         1/36     1/36     1/36     1/36     1/36     1/36
λ = 6         1/36     1/36     1/36     1/36     1/36     1/36

• A table with indices λ and y

• Each cell denotes the probability p(λ, y)

(13)

Likelihood

p(D = 9|λ, y)

p(D = 9|λ, y) y = 1 y = 2 y = 3 y = 4 y = 5 y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1

λ = 4 0 0 0 0 1 0

λ = 5 0 0 0 1 0 0

λ = 6 0 0 1 0 0 0

• A table with indices λ and y

(14)

Likelihood × Prior

φ D (λ, y) = p(D = 9|λ, y)p(λ)p(y)

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /36

λ = 4 0 0 0 0 1 /36 0

λ = 5 0 0 0 1/36 0 0

λ = 6 0 0 1/36 0 0 0

(15)

Evidence

p(D = 9) = Σ_{λ,y} p(D = 9|λ, y) p(λ) p(y)

= 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0

= 4/36 = 1/9

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /36

λ = 4 0 0 0 0 1/36 0

λ = 5 0 0 0 1/36 0 0

(16)

Posterior

p(λ, y|D = 9) = (1/p(D)) p(D = 9|λ, y) p(λ) p(y)

p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /4

λ = 4 0 0 0 0 1/4 0

λ = 5 0 0 0 1/4 0 0

λ = 6 0 0 1 /4 0 0 0

1/4 = (1/36)/(1/9)

(17)

Marginal Posterior

p(λ|D) = Σ_y (1/p(D)) p(D|λ, y) p(λ) p(y)

(first column: the marginal p(λ|D = 9); remaining columns: the joint posterior p(λ, y|D = 9))

           p(λ|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1      0            0       0       0       0       0       0
λ = 2      0            0       0       0       0       0       0
λ = 3      1/4          0       0       0       0       0       1/4
λ = 4      1/4          0       0       0       0       1/4     0
λ = 5      1/4          0       0       0       1/4     0       0
λ = 6      1/4          0       0       1/4     0       0       0

(18)

The “proportional to” notation

p(λ|D = 9) ∝ p(λ, D = 9) = Σ_y p(D = 9|λ, y) p(λ) p(y)

           p(λ, D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1      0             0       0       0       0       0       0
λ = 2      0             0       0       0       0       0       0
λ = 3      1/36          0       0       0       0       0       1/36
λ = 4      1/36          0       0       0       0       1/36    0
λ = 5      1/36          0       0       0       1/36    0       0
λ = 6      1/36          0       0       1/36    0       0       0
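The whole calculation above fits in a few lines of code. A minimal NumPy sketch (variable names are my own) that rebuilds the prior, the likelihood for D = 9, the evidence and the posterior tables:

import numpy as np

faces = np.arange(1, 7)
prior = np.full((6, 6), 1.0 / 36)                    # p(lambda) p(y), a 6 x 6 table
lik = np.array([[1.0 if lam + y == 9 else 0.0
                 for y in faces] for lam in faces])  # p(D = 9 | lambda, y)

phi = lik * prior                                    # likelihood x prior
evidence = phi.sum()                                 # p(D = 9) = 4/36 = 1/9
posterior = phi / evidence                           # p(lambda, y | D = 9)
marginal = posterior.sum(axis=1)                     # p(lambda | D = 9)

print(evidence)                                      # 0.111...
print(marginal)                                      # [0, 0, 0.25, 0.25, 0.25, 0.25]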

(19)

Exercise

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

1. Find the following quantities

• Marginals: p(x 1 ), p(x 2 )

• Conditionals: p(x 1 |x 2 ), p(x 2 |x 1 )

• Posterior: p(x 1 , x 2 = 2), p(x 1 |x 2 = 2)

• Evidence: p(x 2 = 2)

• p({})

• Max: max_{x1} p(x1 | x2 = 1)

• Mode: x1 = argmax_{x1} p(x1 | x2 = 1)

• Max-marginal: max_{x1} p(x1, x2)

2. Are x 1 and x 2 independent ? (i.e., Is p(x 1 , x 2 ) = p(x 1 )p(x 2 ) ?)

(20)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Marginals:

p(x1):    x1 = 1 → 0.6,    x1 = 2 → 0.4
p(x2):    x2 = 1 → 0.4,    x2 = 2 → 0.6

• Conditionals:

p(x1|x2)     x2 = 1    x2 = 2
x1 = 1       0.75      0.5
x1 = 2       0.25      0.5

p(x2|x1)     x2 = 1    x2 = 2
x1 = 1       0.5       0.5
x1 = 2       0.25      0.75

(21)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Posterior:

p(x1, x2 = 2):    x1 = 1 → 0.3,    x1 = 2 → 0.3
p(x1|x2 = 2):     x1 = 1 → 0.5,    x1 = 2 → 0.5

• Evidence:

p(x2 = 2) = Σ_{x1} p(x1, x2 = 2) = 0.6

• Normalisation constant:

p({}) = Σ_{x1} Σ_{x2} p(x1, x2) = 1

(22)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Max: (get the value)

max_{x1} p(x1|x2 = 1) = 0.75

• Mode: (get the index)

argmax_{x1} p(x1|x2 = 1) = 1

• Max-marginal: (get the “skyline”)

max_{x1} p(x1, x2):    x2 = 1 → 0.3,    x2 = 2 → 0.3
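A short numerical check of the answers above; a hedged NumPy sketch with names of my own choosing:

import numpy as np

p = np.array([[0.3, 0.3],                # rows: x1 = 1, 2;  columns: x2 = 1, 2
              [0.1, 0.3]])

p_x1 = p.sum(axis=1)                     # [0.6, 0.4]
p_x2 = p.sum(axis=0)                     # [0.4, 0.6]
p_x1_given_x2 = p / p_x2                 # columns sum to 1
p_x2_given_x1 = p / p_x1[:, None]        # rows sum to 1

evidence = p[:, 1].sum()                 # p(x2 = 2) = 0.6
post = p[:, 1] / evidence                # p(x1 | x2 = 2) = [0.5, 0.5]

max_val = p_x1_given_x2[:, 0].max()      # 0.75
mode = p_x1_given_x2[:, 0].argmax() + 1  # x1 = 1
max_marginal = p.max(axis=0)             # [0.3, 0.3]

print(np.allclose(p, np.outer(p_x1, p_x2)))   # False: x1 and x2 are not independent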

(23)

Another application of Bayes’ Theorem: “Model Selection”

Given an unknown number of fair dice with outcomes λ 1 , λ 2 , . . . , λ n ,

D = Σ_{i=1}^{n} λ_i

How many dice are there when D = 9 ?

Assume that any number n is equally likely

(24)

Another application of Bayes’ Theorem: “Model Selection”

Given all n are equally likely (i.e., p(n) is flat), we calculate (formally)

p(n|D = 9) = p(D = 9|n) p(n) / p(D) ∝ p(D = 9|n)

p(D|n = 1) = Σ_{λ1} p(D|λ1) p(λ1)

p(D|n = 2) = Σ_{λ1} Σ_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)

. . .

p(D|n = n′) = Σ_{λ1,...,λn′} p(D|λ1, . . . , λn′) Π_{i=1}^{n′} p(λi)

(25)

p(D|n) = Σ_λ p(D|λ, n) p(λ|n)

[Figure: p(D|n) for n = 1, . . . , 5, plotted over D = 1, . . . , 20; each panel on a 0–0.2 vertical scale.]

(26)

Another application of Bayes’ Theorem: “Model Selection”

[Figure: the posterior p(n|D = 9) as a function of n = Number of Dice, n = 1, . . . , 9.]

• Complex models are more flexible but they spread their probability mass

• Bayesian inference inherently prefers “simpler models” – Occam’s razor

• Computational burden: We need to sum over all parameters λ
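The marginal likelihoods p(D|n) can be computed exactly by convolving the single-die distribution with itself n times. A small sketch (the truncation of the candidate set to n = 1, . . . , 9 is mine, matching the range shown in the figure):

import numpy as np

die = np.zeros(7)
die[1:] = 1.0 / 6                          # p(lambda_i): outcomes 1..6, equally likely

def p_D_given_n(n, d_max=60):
    # distribution of the sum of n fair dice, by repeated convolution
    p = np.array([1.0])                    # point mass at 0
    for _ in range(n):
        p = np.convolve(p, die)
    out = np.zeros(d_max + 1)
    out[:len(p)] = p[:d_max + 1]
    return out

ns = np.arange(1, 10)                               # flat prior p(n) over these candidates
lik = np.array([p_D_given_n(n)[9] for n in ns])     # p(D = 9 | n)
post = lik / lik.sum()                              # p(n | D = 9)
print(np.round(post, 3))                            # most of the mass sits on n = 2 and n = 3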

(27)

Probabilistic Inference

A huge spectrum of applications – all boil down to computation of

• expectations of functions under probability distributions (Integration):

⟨f(x)⟩ = ∫_X dx p(x) f(x)          ⟨f(x)⟩ = Σ_{x∈X} p(x) f(x)

• modes of functions under probability distributions (Optimization):

x = argmax_{x∈X} p(x) f(x)

• any “mix” of the above: e.g.,

x = argmax_x p(x) = argmax_x ∫ dz p(z) p(x|z)
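When these integrals are not available in closed form, the Monte Carlo idea is to replace them by sample averages; this is the theme of the rest of the tutorial. A two-line illustration with a toy p and f of my own choosing:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # draws from p(x) = N(x; 1, 4)
print(np.mean(x**2))                               # estimates <x^2> = 1 + 4 = 5, up to MC error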

(28)

Graphical Models

“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.” A.N. Whitehead

(29)

Graphical Models

• formal languages for specification of probability distributions and associated inference algorithms

• historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge

• today, a standard tool in machine learning, statistics and signal processing

(30)

Graphical Models

• provide graph based algorithms for derivations and computation

• pedagogical insight/motivation for model/algorithm construction

– Statistics:

“Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”

– Signal processing:

“Fast Fourier transform is an instance of sum-product algorithm on a factor graph”

– Computer Science:

“Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”

• Automated tools for code generation are starting to emerge, making the design/implement/test cycle shorter

(31)

Important types of Graphical Models

• Useful for Model Construction

– Directed Acyclic Graphs (DAG), Bayesian Networks
– Undirected Graphs, Markov Networks, Random Fields
– Influence diagrams
– ...

• Useful for Inference

– Factor Graphs
– Junction/Clique graphs
– Region graphs

– ...

(32)

Directed Acyclic Graphical (DAG) Models and Factor Graphs

(33)

Directed Graphical models

• Each random variable is associated with a node in the graph,

• We draw an arrow A → B if A appears as a conditioning variable in p(B| . . . , A, . . . ), i.e., A ∈ parent(B),

The edges tell us qualitatively about the factorization of the joint probability

• For N random variables x1, . . . , xN, the distribution admits the factorization

p(x1, . . . , xN) = Π_{i=1}^{N} p(xi | parent(xi))

• Describes in a compact way an algorithm to “generate” the data, by sampling each variable after its parents (see the sketch below)
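As a concrete illustration of the “generate” reading, here is a minimal ancestral-sampling sketch for a Markov(1) chain p(x1) Π_k p(xk|xk−1); the particular Gaussian conditionals are my own choice:

import numpy as np

rng = np.random.default_rng(0)

def sample_markov1(K=4):
    # visit each node after its parent and draw from p(x_i | parent(x_i))
    x = [rng.normal(0.0, 1.0)]                      # x_1 ~ p(x_1)
    for _ in range(K - 1):
        x.append(rng.normal(0.9 * x[-1], 1.0))      # x_k | x_{k-1} ~ N(0.9 x_{k-1}, 1)
    return np.array(x)

print(sample_markov1())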

(34)

DAG Example: Two dice

[DAG: λ → D ← y, with factors p(λ), p(y) and p(D|λ, y).]

p(D, λ, y) = p(D|λ, y) p(λ) p(y)

(35)

DAG with observations

[Same DAG with D observed (clamped to D = 9): factors p(λ), p(y) and p(D = 9|λ, y).]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)

(36)

Examples

Model          Factorization

Full           p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)

Markov(2)      p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)

Markov(1)      p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

               p(x1) p(x2|x1) p(x3|x1) p(x4)

Factorized     p(x1) p(x2) p(x3) p(x4)

[The corresponding graphs over x1, x2, x3, x4 are not reproduced here.]

(37)

Undirected Graphical Models

• Define a distribution by local compatibility functions φ(x_α):

p(x) = (1/Z) Π_α φ(x_α)

where α runs over the cliques (fully connected subsets) of the graph

• Markov Random Fields

(38)

Undirected Graphical Models

• Examples

[Two undirected graphs over x1, x2, x3, x4: a 4-cycle with pairwise cliques, and a graph with two overlapping triangle cliques.]

p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)          p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)

(39)

Factor graphs (Kschischang et al.)

• A bipartite graph; a powerful graphical representation of the inference problem

– Factor nodes (black squares): factor potentials (local functions) defining the posterior
– Variable nodes (white nodes): collections of random variables
– Edges: denote membership; a variable node is connected to a factor node if a member variable is an argument of the local function

[Factor graph for the two-dice example: factor nodes p(λ), p(y) and p(D = 9|λ, y), connected to variable nodes λ and y.]

(40)

Exercise

• For the following Graphical models, write down the factors of the joint distribution and plot an equivalent factor graph.

Full:          x1, x2, x3, x4

Markov(1):     x1, x2, x3, x4

HMM:           h1, h2, h3, h4 and x1, x2, x3, x4

MIX:           h and x1, x2, x3, x4

IFA:           h1, h2 and x1, x2, x3, x4

Factorized:    x1, x2, x3, x4

[The graph drawings for each model are not reproduced here.]

(41)

Answer (Markov(1))

The joint distribution factorizes as

p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

[Equivalent factor graph: variable nodes x1, x2, x3, x4; factor nodes p(x1), p(x2|x1), p(x3|x2), p(x4|x3), where p(x1) is attached to x1 and each p(xk|xk−1) is attached to xk−1 and xk.]

(42)

Answer (IFA – Factorial)

The joint distribution factorizes as

p(h1) p(h2) Π_{i=1}^{4} p(xi|h1, h2)

[Equivalent factor graph: variable nodes h1, h2, x1, . . . , x4; factor nodes p(h1), p(h2) and p(xi|h1, h2) for i = 1, . . . , 4, each of the latter attached to h1, h2 and xi.]

(43)

Answer (IFA – Factorial)

[The same IFA factor graph, shown with the two hidden variables merged into one node.]

• We can also cluster nodes together, e.g. treat (h1, h2) as a single variable node

(44)

Inference and Learning

• Data set

D = {x 1 , . . . x N }

• Model with parameter λ

p(D|λ)

• Maximum Likelihood (ML)

λ_ML = argmax_λ log p(D|λ)

• Predictive distribution

p(x_{N+1}|D) ≈ p(x_{N+1}|λ_ML)

(45)

Regularisation

• Prior

p(λ)

• Maximum a-posteriori (MAP): Regularised Maximum Likelihood

λ_MAP = argmax_λ log p(D|λ) p(λ)

• Predictive distribution

p(x_{N+1}|D) ≈ p(x_{N+1}|λ_MAP)

(46)

Bayesian Learning

• We treat parameters on the same footing as all other variables

• We integrate over unknown parameters rather than using point estimates (remember the many-dice example)

– Avoids overfitting

– Natural setup for online adaptation

– Model selection

(47)

Bayesian Learning

• Predictive distribution

p(x_{N+1}|D) = ∫ dλ p(x_{N+1}|λ) p(λ|D)

[DAG: the parameter λ is a parent of every observation x1, x2, . . . , xN and of the future observation x_{N+1}.]

• Bayesian learning is just inference ...

(48)

Some Applications

(49)

Medical Expert Systems

[The “Asia” network: cause nodes A (visit to Asia) and S (smoking), disease nodes T (tuberculosis), L (lung cancer), B (bronchitis), the node E (either T or L), and symptom nodes X (positive X-ray) and D (dyspnoea).]

(50)

Medical Expert Systems

Visit to Asia? Smoking?

Tuberculosis? Lung Cancer? Bronchitis?

Either T or L?

Positive X Ray? Dyspnoea?

(51)

Medical Expert Systems

Marginal probabilities with no evidence entered:

Visit to Asia?      P(0) = 99%      P(1) = 1%
Smoking?            P(0) = 50%      P(1) = 50%
Tuberculosis?       P(0) = 99%      P(1) = 1%
Lung Cancer?        P(0) = 94.5%    P(1) = 5.5%
Bronchitis?         P(0) = 55%      P(1) = 45%
Either T or L?      P(0) = 93.5%    P(1) = 6.5%
Positive X Ray?     P(0) = 89%      P(1) = 11%
Dyspnoea?           P(0) = 56.4%    P(1) = 43.6%

(52)

Medical Expert Systems

Marginal probabilities with Positive X Ray? observed (clamped to 1):

Visit to Asia?      P(0) = 98.7%    P(1) = 1.3%
Smoking?            P(0) = 31.2%    P(1) = 68.8%
Tuberculosis?       P(0) = 90.8%    P(1) = 9.2%
Lung Cancer?        P(0) = 51.1%    P(1) = 48.9%
Bronchitis?         P(0) = 49.4%    P(1) = 50.6%
Either T or L?      P(0) = 42.4%    P(1) = 57.6%
Positive X Ray?     P(0) = 0%       P(1) = 100%    (observed)
Dyspnoea?           P(0) = 35.9%    P(1) = 64.1%

(53)

Medical Expert Systems

Marginal probabilities with both Positive X Ray? (clamped to 1) and Smoking? (clamped to 0) observed:

Visit to Asia?      P(0) = 98.5%    P(1) = 1.5%
Smoking?            P(0) = 100%     P(1) = 0%      (observed)
Tuberculosis?       P(0) = 85.2%    P(1) = 14.8%
Lung Cancer?        P(0) = 85.8%    P(1) = 14.2%
Bronchitis?         P(0) = 70%      P(1) = 30%
Either T or L?      P(0) = 71.1%    P(1) = 28.9%
Positive X Ray?     P(0) = 0%       P(1) = 100%    (observed)
Dyspnoea?           P(0) = 56%      P(1) = 44%

(54)

Model Selection: Variable selection in Polynomial Regression

• Given D = {t j , x(t j )} j=1...J , what is the order N of the polynomial?

x(t) = Σ_{i=0}^{N} s_{i+1} t^i + ǫ(t)

[Figure: an example data set {t_j, x(t_j)} on t ∈ [−1, 1].]

(55)

Bayesian Variable Selection

[Graphical model: indicators r_1, . . . , r_W with priors C(r_i; π); coefficients s_i ∼ N(s_i; µ(r_i), Σ(r_i)); observation x ∼ N(x; C s_{1:W}, R).]

• Generalized Linear Model – the columns of C are the basis vectors

• The exact posterior is a mixture of 2^W Gaussians

• When W is large, computation of posterior features becomes intractable.

(56)

Regression

t = [t_1  t_2  . . .  t_J]^⊤          C ≡ [t^0  t^1  . . .  t^{W−1}]

>> C = fliplr(vander(0:4))   % Vandermonde matrix: column i+1 holds t.^i for t = 0:4

     1     0     0     0     0
     1     1     1     1     1
     1     2     4     8    16
     1     3     9    27    81
     1     4    16    64   256

r_i ∼ C(r_i; 0.5, 0.5),    r_i ∈ {on, off}
s_i | r_i ∼ N(s_i; 0, Σ(r_i))
x | s_{1:W} ∼ N(x; C s_{1:W}, R)

(57)

Regression

To find the “active” basis functions we need to calculate

r_{1:W} ≡ argmax_{r_{1:W}} p(r_{1:W}|x) = argmax_{r_{1:W}} ∫ ds_{1:W} p(x|s_{1:W}) p(s_{1:W}|r_{1:W}) p(r_{1:W})

Then, the reconstruction is given by

x̂(t) = ⟨ Σ_{i=0}^{W−1} s_{i+1} t^i ⟩_{p(s_{1:W}|x, r_{1:W})} = Σ_{i=0}^{W−1} ⟨s_{i+1}⟩_{p(s_{i+1}|x, r_{1:W})} t^i
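Because the exact posterior is a mixture of 2^W Gaussians, the MAP configuration can be found by brute force when W is small. The sketch below is my own toy instance of the model on the previous slides (all numbers and names are assumptions, not taken from the slides); it scores each on/off configuration by its marginal likelihood p(x|r) = N(x; 0, C diag(Σ(r)) C^T + R I), which follows from the linear-Gaussian structure:

import numpy as np
from itertools import product

rng = np.random.default_rng(1)

t = np.linspace(-1, 1, 15)
W = 4
C = np.vander(t, W, increasing=True)       # basis: columns 1, t, t^2, t^3
R = 0.01                                   # observation noise variance
var_on, var_off = 10.0, 1e-6               # Sigma(r_i) for r_i = on / off

s_true = np.array([0.5, 0.0, 1.0, 0.0])    # only basis functions 0 and 2 are active
x = C @ s_true + np.sqrt(R) * rng.normal(size=len(t))

def log_marginal(r):
    # log p(x | r) for x | r ~ N(0, C diag(var) C^T + R I)
    var = np.where(r, var_on, var_off)
    S = (C * var) @ C.T + R * np.eye(len(t))
    sign, logdet = np.linalg.slogdet(2 * np.pi * S)
    return -0.5 * (logdet + x @ np.linalg.solve(S, x))

# Flat prior on r, so argmax_r p(r|x) = argmax_r p(x|r); enumerate all 2^W configurations.
configs = list(product([0, 1], repeat=W))
best = max(configs, key=lambda r: log_marginal(np.array(r, bool)))
print(best)                                # with these settings, typically (1, 0, 1, 0)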

(58)

Regression

[Figure: values of p(x, r_{1:W}) for candidate configurations r_{1:W}, against the basis index i = 0, . . . , 4.]

(59)

Regression

[Figure: the data, the true function, and the approximate reconstruction x̂(t).]

(60)

Clustering

(61)

Clustering

π                          Label probability
c_1, c_2, . . . , c_N      Labels ∈ {a, b}
x_1, x_2, . . . , x_N      Data Points
µ_a, µ_b                   Cluster Centers

(µ_a, µ_b, π) = argmax_{µ_a, µ_b, π} Σ_{c_{1:N}} Π_{i=1}^{N} p(x_i | µ_a, µ_b, c_i) p(c_i | π)

(62)

Computer vision / Cognitive Science

How many rectangles are there in this image?

[Figure: an image (roughly 60 × 40 pixels) built from overlapping coloured rectangles.]

(63)

Computer vision / Cognitive Science

π_1, π_2, . . . , π_N      Label probabilities
c_1, c_2, . . . , c_N      Labels ∈ {a, b, . . .}
x_1, x_2, . . . , x_N      Pixel Values
µ_a, µ_b, . . .            Rectangle Colors

[Figure: the rectangle image again; each pixel x_i is to be assigned a label c_i.]

(64)

Computer Vision

How many people are there in these images?

(65)

Visual Tracking


(66)

Navigation, Robotics


(67)

Navigation, Robotics

GPS?_t             GPS status
G_t                GPS reading
. . .              Other sensors (magnetic, pressure, etc.)
l_t                Linear acceleration sensor
ω_t                Gyroscope
E_{t−1}, E_t       Attitude Variables
X_{t−1}, X_t       Linear Kinematic Variables
{·_{1:Nt}}_t       Set of feature points (Camera Frame)
{x_{1:Mt}}_t       Set of feature points (World Coordinates)
ρ(x)               Global Static Map (Intensity function)

(68)

Time series models and Inference, Terminology

Generic structure of dynamical system models

[DAG: a Markov chain x_0 → x_1 → · · · → x_{k−1} → x_k → · · · → x_K over the latent states; each observation y_k is a child of x_k.]

x_k ∼ p(x_k | x_{k−1})      Transition Model
y_k ∼ p(y_k | x_k)          Observation Model

• x are the latent states

• y are the observations

• In a full Bayesian setting, x includes unknown model parameters

(69)

Online Inference, Terminology

Filtering: p(x k |y 1:k )

– Distribution of the current state given all past information
– Realtime/Online/Sequential Processing

x 0 x 1 . . . x k−1 x k . . . x K

y 1 . . . y k−1 y k . . . y K

(70)

Online Inference, Terminology

Prediction p(y k:K , x k:K |y 1:k−1 )

– evaluation of possible future outcomes; like filtering without observations

x 0 x 1 . . . x k−1 x k . . . x K

y 1 . . . y k−1 y k . . . y K

• Tracking, Restoration

(71)

Offline Inference, Terminology

Smoothing p(x 0:K |y 1:K ),

Most likely trajectory (Viterbi path): argmax_{x_{0:K}} p(x_{0:K}|y_{1:K})

– a better estimate of past states, essential for learning

[State-space DAG as above, with all observations y_1, . . . , y_K available.]

Interpolation p(y_k, x_k | y_{1:k−1}, y_{k+1:K})

– fill in lost observations given past and future

[State-space DAG as above, with observation y_k missing.]

(72)

Time Series Analysis

• Stationary

[Figure: a stationary time series.]

– What is the true state of the process given noisy data?
– Parameters?
– Markovian? Order?

(73)

Time Series Analysis

• Nonstationary, time varying variance – stochastic volatility

[Figure: latent volatility v_k and observations y_k over time; true values and the VB estimate.]

(74)

Time Series Analysis

• Nonstationary, time varying intensity – nonhomogeneous Poisson Process

[Figure: latent intensity λ_k and the observed process c_k over time; true values and the VB estimate.]

(75)

Time Series Analysis

• Piecewise constant

[Figure: a piecewise constant time series.]

(76)

Time Series Analysis

• Piecewise linear

[Figure: a piecewise linear time series with changepoints.]

• Segmentation and Changepoint detection

– What is the true state of the process given noisy data?
– Where are the changepoints?

– How many changepoints ?

(77)

Audio Processing

[Figure: waveforms x_t of a speech signal and a piano signal.]

x = [x_1 . . . x_t . . . ]

(78)

Audio Restoration

• During download or transmission, some samples of audio are lost

• Estimate missing samples given clean ones

[Figure: an audio waveform with missing samples.]

(79)

Examples: Audio Restoration

p(x_¬κ | x_κ) ∝ ∫ dH p(x_¬κ | H) p(x_κ | H) p(H)        H ≡ (parameters, hidden states)

[DAG: H is a parent of both x_¬κ (Missing) and x_κ (Observed).]

(80)

Probabilistic Phase Vocoder (Cemgil and Godsill 2005)

[DAG: for each band ν = 0 . . . W−1, a chain s_{ν,0} → · · · → s_{ν,k} → · · · → s_{ν,K−1} with parameters A_ν, Q_ν; the observations x_0, . . . , x_k, . . . , x_{K−1} depend on the states of all bands.]

s_{ν,k} ∼ N(s_{ν,k}; A_ν s_{ν,k−1}, Q_ν)

A_ν ∼ N( A_ν; [ cos(ω_ν)  −sin(ω_ν) ;  sin(ω_ν)  cos(ω_ν) ],  Ψ )



(81)

Restoration

• Piano

– Signal with missing samples (37%)
– Reconstruction, 7.68 dB improvement
– Original

• Trumpet

– Signal with missing samples (37%)

– Reconstruction, 7.10 dB improvement

– Original

(82)

Pitch Tracking

Monophonic Pitch Tracking = Online estimation (filtering) of p(r t , m t |y 1:t ) .

[Figure: an audio waveform and the corresponding pitch estimate over time.]

(83)

Pitch Tracking

r 0 r 1 . . . r T

m 0 m 1 . . . m T

s 0 s 1 . . . s T

y 1 . . . y T

(84)

Monophonic transcription

• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TSALP)

[Figure: exact inference results for the monophonic transcription model.]

(85)

Tracking Pitch Variations

• Allow m to change with k.


(86)

Source Separation

[DAG: for each time frame k = 1 . . . K, sources s_{k,1}, . . . , s_{k,N} generate observations x_{k,1}, . . . , x_{k,M}; the mixing/channel parameters a_1, r_1, . . . , a_M, r_M are shared across frames.]

• Joint estimation of the sources, the channel noise and the mixing system

x_{k,1:M} ∼ N(x_{k,1:M}; A s_{k,1:N}, R)

(87)

Spectrogram

[Figure: spectrograms (time t/sec versus frequency f/Hz) of a speech signal and a piano signal.]

• A linear expansion using a collection of basis functions φ(t; τ, ω) centered around time τ and frequency ω

x_t = Σ_{τ,ω} α(τ, ω) φ(t; τ, ω)

(88)

Source Separation

[Figure: spectrograms of the speech, piano and guitar sources and of the observed mixture.]

(89)

Reconstructions

[Figure: spectrograms of the reconstructed sources.]

(90)

Polyphonic Music Transcription

• from sound ...

[Figure: spectrogram of a polyphonic recording.]

• ... to score

(91)

Generative Models for Music

(92)

Generative Models for Music

[Hierarchy: Score and Expression → Piano-Roll → Signal.]

(93)

Hierarchical Modeling of Music

[Figure: a hierarchical graphical model for music. Several coupled chains of latent variables evolve over time t (score-level and expression-level variables at the top, per-note variables g_{j,t}, r_{j,t}, n_{j,t} below), generating the signal-level variables x_{j,t} and observations y_{j,t}.]

(94)

A few non-Bayesian applications

where Monte Carlo is useful

(95)

Combinatorics

• Counting

Example : What is the probability that a solitaire laid out with 52 cards comes out successfully given all permutations have equal probability ?

|A| = Σ_{x∈X} [x ∈ A]          [x ∈ A] ≡ 1 if x ∈ A, 0 otherwise

p(x ∈ A) = |A| / |X| = ?          |X| = 52! ≈ 2^225

(96)

Geometry

• Given a simplex S in N-dimensional space,

S = {x : Ax ≤ b, x ∈ R^N },

find the volume |S|

(97)

Rare Events

• Given a graph with random edge lengths x i ∼ p(x i )

Find the probability that the shortest path from A to B is larger than γ .

[Figure: a small network between nodes A and B with five edges of random lengths x_1, . . . , x_5.]

(98)

Rare Events

x_1, x_2, x_3, x_4, x_5      Edge Lengths
L = ShortestPath(A, B)

Pr(L ≥ γ) = ∫ dx_{1:5} [L(x_{1:5}) ≥ γ] p(x_{1:5})

(99)

Rare Events

[Figure: the same network, with mean edge lengths ⟨x_1⟩ = 4, ⟨x_2⟩ = 1, ⟨x_3⟩ = 1, ⟨x_4⟩ = 1, ⟨x_5⟩ = 4.]

x_i ∼ E(x_i; u_i) ≡ (1/u_i) exp(−x_i/u_i)          u_i = ⟨x_i⟩ ≡ ∫ x_i p(x_i) dx_i

[Figure: histogram (counts) of simulated shortest-path lengths.]
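A crude Monte Carlo estimate of Pr(L ≥ γ) is just the sample average of the indicator. The sketch below uses the mean edge lengths above, but the network topology is my own assumption (the usual five-edge “bridge” between A and B), since the figure is not reproduced here:

import numpy as np

rng = np.random.default_rng(0)
means = np.array([4.0, 1.0, 1.0, 1.0, 4.0])                  # <x_1>, ..., <x_5> from the slide

gamma, N = 10.0, 200_000
x1, x2, x3, x4, x5 = rng.exponential(means, size=(N, 5)).T   # x_i ~ E(x_i; u_i)

# Assumed layout: A--x1--C, A--x2--D, C--x3--D, C--x4--B, D--x5--B.
L = np.minimum.reduce([x1 + x4, x2 + x5, x1 + x3 + x5, x2 + x3 + x4])
print(np.mean(L >= gamma))                                   # crude MC estimate of Pr(L >= gamma)
# For much larger gamma the event becomes rare and a plain average like this
# needs an impractical number of samples (one motivation for smarter samplers).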

(100)

Probability Models

(101)

Example: AR(1) model

[Figure: a sample realisation x_0, . . . , x_K of the AR(1) process.]

x_k = A x_{k−1} + ǫ_k,    k = 1 . . . K

ǫ_k is i.i.d., zero mean and normal with variance R.

Estimation problem: given x_0, . . . , x_K, determine the coefficient A and the variance R (both scalars).

(102)

AR(1) model, Generative Model notation

A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
x_k | x_{k−1}, A, R ∼ N(x_k; A x_{k−1}, R)        x_0 = x̂_0

[DAG: A and R are parents of every transition in the chain x_0 → x_1 → · · · → x_{k−1} → x_k → · · · → x_K.]

Observed variables are shown with double circles
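A short generative sketch of this model, with fixed parameter values of my own rather than draws of A and R from their priors:

import numpy as np

rng = np.random.default_rng(0)
A, R, K = 0.9, 0.1, 100
x = np.zeros(K + 1)                                  # x_0 = 0 (an arbitrary starting value)
for k in range(1, K + 1):
    x[k] = A * x[k - 1] + np.sqrt(R) * rng.normal()  # x_k = A x_{k-1} + eps_k,  eps_k ~ N(0, R)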

(103)

Example, Univariate Gaussian

The Gaussian distribution with mean m and covariance S has the form

N(x; m, S) = (2πS)^{−1/2} exp{ −(x − m)²/(2S) }
           = exp{ −(x² + m² − 2xm)/(2S) − (1/2) log(2πS) }
           = exp{ (m/S) x − (1/(2S)) x² − [ (1/2) log(2πS) + m²/(2S) ] }
           = exp{ θ^⊤ ψ(x) − c(θ) },      θ = [ m/S,  −1/(2S) ]^⊤,      ψ(x) = [ x,  x² ]^⊤

Hence, by matching coefficients, we have

exp{ −(1/2) K x² + h x + g }   ⇔   S = K^{−1},   m = K^{−1} h

(104)

Example, Gaussian

(105)

The Multivariate Gaussian Distribution

µ is the mean and P is the covariance:

N(s; µ, P) = |2πP|^{−1/2} exp{ −(1/2) (s − µ)^⊤ P^{−1} (s − µ) }
           = exp{ −(1/2) s^⊤ P^{−1} s + µ^⊤ P^{−1} s − (1/2) µ^⊤ P^{−1} µ − (1/2) log |2πP| }

log N(s; µ, P) = −(1/2) s^⊤ P^{−1} s + µ^⊤ P^{−1} s + const
               = −(1/2) Tr P^{−1} s s^⊤ + µ^⊤ P^{−1} s + const
               =+ −(1/2) Tr P^{−1} s s^⊤ + µ^⊤ P^{−1} s

Notation:  log f(x) =+ g(x) ⟺ f(x) ∝ exp(g(x)) ⟺ ∃c ∈ R : f(x) = c exp(g(x))

log p(s) =+ −(1/2) Tr K s s^⊤ + h^⊤ s   ⇒   p(s) = N(s; K^{−1} h, K^{−1})

(106)

Example, Inverse Gamma

The inverse Gamma distribution with shape a and scale b:

IG(r; a, b) = (1/(Γ(a) b^a)) r^{−(a+1)} exp(−1/(b r))
            = exp{ −(a + 1) log r − 1/(b r) − log Γ(a) − a log b }
            = exp{ [ −(a + 1),  −1/b ]^⊤ [ log r,  1/r ] − log Γ(a) − a log b }

Hence, by matching coefficients, we have

exp{ α log r + β (1/r) + c }   ⇔   a = −α − 1,   b = −1/β

(107)

Example, Inverse Gamma

[Figure: inverse Gamma densities for (a = 1, b = 1), (a = 1, b = 0.5) and (a = 2, b = 1).]

(108)

Basic Distributions : Exponential Family

• The following distributions are often used as elementary building blocks:

– Gaussian
– Gamma, Inverse Gamma, (Exponential, Chi-square, Wishart)
– Dirichlet
– Discrete (Categorical), Bernoulli, Multinomial

• All of these distributions can be written as

p(x|θ) = exp{ θ^⊤ ψ(x) − c(θ) }

c(θ) = log ∫_X dx exp(θ^⊤ ψ(x))     log-partition function
θ                                   canonical parameters
ψ(x)                                sufficient statistics

(109)

Conjugate priors: Posterior is in the same family as the prior.

Example: posterior inference for the variance R of a zero mean Gaussian.

p(x|R) = N(x; 0, R)        p(R) = IG(R; a, b)

p(R|x) ∝ p(R) p(x|R)
       ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(x²/2)(1/R) − (1/2) log R }
       = exp{ [ −(a + 1 + 1/2),  −(1/b + x²/2) ]^⊤ [ log R,  1/R ] }
       ∝ IG( R;  a + 1/2,  2/(x² + 2/b) )

(110)

Conjugate priors: Posterior is in the same family as the prior.

Example: posterior inference of variance R from x 1 , . . . , x N .

[DAG: R is a parent of x_1, x_2, . . . , x_N and of x_{N+1}.]

p(R|x_{1:N}) ∝ p(R) Π_{i=1}^{N} p(x_i|R)
             ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(1/2)(Σ_i x_i²)(1/R) − (N/2) log R }
             = exp{ [ −(a + 1 + N/2),  −(1/b + (1/2) Σ_i x_i²) ]^⊤ [ log R,  1/R ] }
             ∝ IG( R;  a + N/2,  2/(Σ_i x_i² + 2/b) )
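A quick numerical check of this conjugate update, with prior parameters and data of my own choosing:

import numpy as np

rng = np.random.default_rng(0)
R_true, N = 2.0, 1000
x = rng.normal(0.0, np.sqrt(R_true), size=N)       # x_i ~ N(0, R_true)

a, b = 1.0, 1.0                                    # prior IG(R; a, b)
a_post = a + N / 2
b_post = 2.0 / (np.sum(x**2) + 2.0 / b)            # posterior IG(R; a_post, b_post)

# In this parametrisation the mean of IG(a, b) is 1 / (b (a - 1)) for a > 1.
print(1.0 / (b_post * (a_post - 1)))               # close to R_true for large N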

(111)

Inverse Gamma posterior, Σ_i x_i² = 10, N = 10

[Figure: the posterior density of R.]

(112)

Inverse Gamma posterior, Σ_i x_i² = 100, N = 100

[Figure: the posterior density of R.]

(113)

Inverse Gamma posterior, Σ_i x_i² = 1000, N = 1000

[Figure: the posterior density of R; as N grows the posterior concentrates, here around R = 1.]

(114)

Example: AR(1) model

[Figure: a sample realisation x_0, . . . , x_K of the AR(1) process.]

x_k = A x_{k−1} + ǫ_k,    k = 1 . . . K

ǫ k is i.i.d., zero mean and normal with variance R.

Estimation problem:

Given x 0 , . . . , x K , determine coefficient A and variance R (both scalars).

(115)

AR(1) model, Generative Model notation

A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
x_k | x_{k−1}, A, R ∼ N(x_k; A x_{k−1}, R)        x_0 = x̂_0

[DAG: A and R are parents of every transition in the chain x_0 → x_1 → · · · → x_K.]

Gaussian:  N(x; µ, V) ≡ |2πV|^{−1/2} exp( −(x − µ)²/(2V) )

(116)

AR(1) Model. Bayesian Posterior Inference

p(A, R|x_0, x_1, . . . , x_K) ∝ p(x_1, . . . , x_K|x_0, A, R) p(A, R)

Posterior ∝ Likelihood × Prior

Using the Markovian (conditional independence) structure we have

p(A, R|x_0, x_1, . . . , x_K) ∝ ( Π_{k=1}^{K} p(x_k|x_{k−1}, A, R) ) p(A) p(R)

[DAG: A, R → x_0 → x_1 → · · · → x_K as before.]

(117)

Numerical Example

Suppose K = 1.

[DAG: A, R → x_1, with x_0 and x_1 observed.]

By Bayes’ Theorem and the structure of the AR(1) model,

p(A, R|x_0, x_1) ∝ p(x_1|x_0, A, R) p(A) p(R)
                 = N(x_1; A x_0, R) N(A; 0, P) IG(R; ν, β/ν)

(118)

Numerical Example

p(A, R|x_0, x_1) ∝ p(x_1|x_0, A, R) p(A) p(R)
                 = N(x_1; A x_0, R) N(A; 0, P) IG(R; ν, β/ν)
                 ∝ exp{ −(1/2) x_1²/R + x_0 x_1 A/R − (1/2) x_0² A²/R − (1/2) log 2πR }
                   × exp{ −(1/2) A²/P }
                   × exp{ −(ν + 1) log R − (ν/β)(1/R) }

This posterior has a nonstandard form

exp{ α_1 (1/R) + α_2 (A/R) + α_3 (A²/R) + α_4 log R + α_5 A² }

(119)

Numerical Example, the prior p(A, R)

[Figure: equiprobability contours of the prior p(A) p(R) in the (A, R) plane; A ∈ [−8, 6], R on a log scale from 10^{−4} to 10^{4}.]

A ∼ N(A; 0, 1.2)        R ∼ IG(R; 0.4, 250)

(120)

Numerical Example, the posterior p(A, R|x)

[Figure: equiprobability contours of the posterior p(A, R|x) in the same (A, R) plane.]

Note the bimodal posterior with x_0 = 1, x_1 = −6:

• A ≈ −6 ⇔ low noise variance R
• A ≈ 0 ⇔ high noise variance R
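The bimodality is easy to reproduce by evaluating the unnormalised log posterior from the previous slides on a grid over (A, R); the grid ranges below mirror the contour plots, the rest of the sketch is mine:

import numpy as np

x0, x1 = 1.0, -6.0
P, a, b = 1.2, 0.4, 250.0          # A ~ N(0, P),  R ~ IG(R; a, b), as on the prior slide

def log_post(A, R):
    # log N(x1; A x0, R) + log N(A; 0, P) + log IG(R; a, b), up to an additive constant
    return (-0.5 * (x1 - A * x0) ** 2 / R - 0.5 * np.log(2 * np.pi * R)
            - 0.5 * A ** 2 / P
            - (a + 1) * np.log(R) - 1.0 / (b * R))

A_grid = np.linspace(-8, 6, 200)
R_grid = np.logspace(-4, 4, 200)
L = log_post(A_grid[:, None], R_grid[None, :])     # evaluate on the whole grid at once
i, j = np.unravel_index(np.argmax(L), L.shape)
print(A_grid[i], R_grid[j])    # the grid maximiser; the surface has the two modes described above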

(121)

Remarks

• Even very simple models can lead easily to complicated posterior distributions

• Ambiguous data usually leads to a multimodal posterior, each mode corresponding to one possible explanation

• A priori independent variables often become dependent a posteriori (“explaining away”)

• (Unfortunately) exact posterior inference is only possible for a few special cases

⇒ We need numerical approximate inference methods

(122)

Approximate Inference by

Markov Chain Monte Carlo

(123)

Outline of this section

• A Gaussian toy example

• The Gibbs sampler

• Sketch of Markov Chain theory

• Metropolis-Hastings, MCMC Transition Kernels,

• Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler

• Optimisation versus Integration: Simulated annealing and iterative improvement

(124)

Toy Example : “Source separation”

[DAG: s_1 → x ← s_2, with factors p(s_1), p(s_2), p(x|s_1, s_2).]

This graph encodes the joint:  p(x, s_1, s_2) = p(x|s_1, s_2) p(s_1) p(s_2)

s_1 ∼ p(s_1) = N(s_1; µ_1, P_1)
s_2 ∼ p(s_2) = N(s_2; µ_2, P_2)
x | s_1, s_2 ∼ p(x|s_1, s_2) = N(x; s_1 + s_2, R)

(125)

Toy example

Suppose we observe x = x̂.

[Same DAG, with x observed: factors p(s_1), p(s_2), p(x = x̂|s_1, s_2).]

• By Bayes’ theorem, the posterior is given by:

P ≡ p(s_1, s_2|x = x̂) = (1/Z_x̂) p(x = x̂|s_1, s_2) p(s_1) p(s_2) ≡ (1/Z_x̂) φ(s_1, s_2)
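The first-part outline announces the Gibbs sampler for exactly this kind of target. As a minimal sketch (scalar sources; the numerical values are mine, since the slides leave them unspecified), alternate draws from the two Gaussian full conditionals p(s1|s2, x = x̂) and p(s2|s1, x = x̂):

import numpy as np

rng = np.random.default_rng(0)

mu1, P1 = 0.0, 1.0                       # prior of s1
mu2, P2 = 0.0, 1.0                       # prior of s2
R, x_hat = 0.1, 3.0                      # observation noise and the observed value

def gibbs(n_iter=5000):
    s1, s2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    for i in range(n_iter):
        k1 = 1 / P1 + 1 / R                         # precision of p(s1 | s2, x = x_hat)
        m1 = (mu1 / P1 + (x_hat - s2) / R) / k1
        s1 = rng.normal(m1, np.sqrt(1 / k1))
        k2 = 1 / P2 + 1 / R                         # precision of p(s2 | s1, x = x_hat)
        m2 = (mu2 / P2 + (x_hat - s1) / R) / k2
        s2 = rng.normal(m2, np.sqrt(1 / k2))
        samples[i] = s1, s2
    return samples

S = gibbs()
print(S[1000:].mean(axis=0))    # posterior means of (s1, s2); about 1.43 each with these numbers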
