(1)

An Introduction to

Graphical Models and Monte Carlo methods

A. Taylan Cemgil

Signal Processing and Communications Lab.

Birkbeck School of Economics, Mathematics and Statistics

June 19, 2007

(2)

Goals of this Tutorial

To Provide ...

• a basic understanding of underlying principles of probabilistic modeling and inference

• an introduction to Graphical models and associated concepts

• a succinct overview of (perhaps interesting) applications from engineering and computer science

– Statistical Signal Processing, Pattern Recognition
– Machine Learning, Artificial Intelligence

• an initial orientation in the broad literature of Monte Carlo methods

(3)

First Part, Basic Concepts and MCMC

• Introduction

– Bayes’ Theorem,

– Trivial toy example to clarify notation

• Graphical Models

– Bayesian Networks

– Undirected Graphical models, Markov Random Fields
– Factor graphs

• Maximum Likelihood and Bayesian Learning

(4)

– (classical AI) Medical Expert systems, (Statistics) Variable selection, (Engineering-CS) Computer vision
– Time Series – terminology and applications
– Audio processing

– Non Bayesian applications

• Probability Models

– Exponential family, Conjugacy

– Motivation for Approximate Inference

• Markov Chain Monte Carlo
– A Gaussian toy example
– The Gibbs sampler

– Sketch of Markov Chain theory

(5)

– Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler

– Optimisation versus Integration: Simulated annealing and iterative improvement

(6)

Second Part, Time Series Models and SMC

• Latent State-Space Models

– Hidden Markov Models (HMM)
– Kalman Filter Models
– Switching State Space models
– Changepoint models

• Inference in HMM

– Forward Backward Algorithm
– Viterbi

– Exact inference in Graphical models by message passing

• Sequential Monte Carlo

(7)

– Particle Filtering

• Final Remarks and Bibliography

(8)

Bayes’ Theorem

Thomas Bayes (1702-1761)

“What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.”

p(λ|D) = p(D|λ) p(λ) / p(D)

Posterior = (Likelihood × Prior) / Evidence

(9)

An application of Bayes’ Theorem: “Source Separation”

Given two fair dice with outcomes λ and y ,

D = λ + y

What is λ when D = 9 ?

(10)

An application of Bayes’ Theorem: “Source Separation”

D = λ + y = 9

D = λ + y y = 1 y = 2 y = 3 y = 4 y = 5 y = 6

λ = 1 2 3 4 5 6 7

λ = 2 3 4 5 6 7 8

λ = 3 4 5 6 7 8 9

λ = 4 5 6 7 8 9 10

λ = 5 6 7 8 9 10 11

λ = 6 7 8 9 10 11 12

Bayes’ theorem “upgrades” p(λ) into p(λ|D).

But you have to provide an observation model: p(D|λ)

(11)

“Bureaucratic” derivation

Formally we write

p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(D|λ, y) = δ(D − (λ + y))

where δ is the Kronecker delta, denoting a degenerate (deterministic) distribution:

δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0

p(λ, y|D) = (1/p(D)) × p(D|λ, y) × p(λ) p(y)

Posterior = (1/Evidence) × Likelihood × Prior

(12)

Prior

p(y)p(λ)

p(y) × p(λ)   y = 1    y = 2    y = 3    y = 4    y = 5    y = 6
λ = 1         1/36     1/36     1/36     1/36     1/36     1/36
λ = 2         1/36     1/36     1/36     1/36     1/36     1/36
λ = 3         1/36     1/36     1/36     1/36     1/36     1/36
λ = 4         1/36     1/36     1/36     1/36     1/36     1/36
λ = 5         1/36     1/36     1/36     1/36     1/36     1/36
λ = 6         1/36     1/36     1/36     1/36     1/36     1/36

• A table with indices λ and y

• Each cell denotes the probability p(λ, y)

(13)

Likelihood

p(D = 9|λ, y)

p(D = 9|λ, y) y = 1 y = 2 y = 3 y = 4 y = 5 y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1

λ = 4 0 0 0 0 1 0

λ = 5 0 0 0 1 0 0

λ = 6 0 0 1 0 0 0

• A table with indices λ and y

(14)

Likelihood × Prior

φ D (λ, y) = p(D = 9|λ, y)p(λ)p(y)

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /36

λ = 4 0 0 0 0 1 /36 0

λ = 5 0 0 0 1/36 0 0

λ = 6 0 0 1/36 0 0 0

(15)

Evidence

p(D = 9) = Σ_{λ,y} p(D = 9|λ, y) p(λ) p(y)

= 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0

= 4/36 = 1/9

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /36

λ = 4 0 0 0 0 1/36 0

λ = 5 0 0 0 1/36 0 0

(16)

Posterior

p(λ, y|D = 9) = (1/p(D)) p(D = 9|λ, y) p(λ) p(y)

p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6

λ = 1 0 0 0 0 0 0

λ = 2 0 0 0 0 0 0

λ = 3 0 0 0 0 0 1 /4

λ = 4 0 0 0 0 1/4 0

λ = 5 0 0 0 1/4 0 0

λ = 6 0 0 1 /4 0 0 0

1/4 = (1/36)/(1/9)

(17)

Marginal Posterior

p(λ|D) = Σ_y (1/p(D)) p(D|λ, y) p(λ) p(y)

(first column: the marginal p(λ|D = 9); remaining columns: the joint posterior p(λ, y|D = 9))

           p(λ|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1      0            0       0       0       0       0       0
λ = 2      0            0       0       0       0       0       0
λ = 3      1/4          0       0       0       0       0       1/4
λ = 4      1/4          0       0       0       0       1/4     0
λ = 5      1/4          0       0       0       1/4     0       0
λ = 6      1/4          0       0       1/4     0       0       0

(18)

The “proportional to” notation

p(λ|D = 9) ∝ p(λ, D = 9) = Σ_y p(D = 9|λ, y) p(λ) p(y)

           p(λ, D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1      0             0       0       0       0       0       0
λ = 2      0             0       0       0       0       0       0
λ = 3      1/36          0       0       0       0       0       1/36
λ = 4      1/36          0       0       0       0       1/36    0
λ = 5      1/36          0       0       0       1/36    0       0
λ = 6      1/36          0       0       1/36    0       0       0
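The whole calculation above fits in a few lines of code. A minimal NumPy sketch (variable names are my own) that rebuilds the prior, the likelihood for D = 9, the evidence and the posterior tables:

import numpy as np

faces = np.arange(1, 7)
prior = np.full((6, 6), 1.0 / 36)                    # p(lambda) p(y), a 6 x 6 table
lik = np.array([[1.0 if lam + y == 9 else 0.0
                 for y in faces] for lam in faces])  # p(D = 9 | lambda, y)

phi = lik * prior                                    # likelihood x prior
evidence = phi.sum()                                 # p(D = 9) = 4/36 = 1/9
posterior = phi / evidence                           # p(lambda, y | D = 9)
marginal = posterior.sum(axis=1)                     # p(lambda | D = 9)

print(evidence)                                      # 0.111...
print(marginal)                                      # [0, 0, 0.25, 0.25, 0.25, 0.25]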

(19)

Exercise

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

1. Find the following quantities

• Marginals: p(x 1 ), p(x 2 )

• Conditionals: p(x 1 |x 2 ), p(x 2 |x 1 )

• Posterior: p(x 1 , x 2 = 2), p(x 1 |x 2 = 2)

• Evidence: p(x 2 = 2)

• p({})

• Max: max_{x1} p(x1 | x2 = 1)

• Mode: x1 = argmax_{x1} p(x1 | x2 = 1)

• Max-marginal: max_{x1} p(x1, x2)

2. Are x 1 and x 2 independent ? (i.e., Is p(x 1 , x 2 ) = p(x 1 )p(x 2 ) ?)

(20)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Marginals:

p(x1):    x1 = 1 → 0.6,    x1 = 2 → 0.4
p(x2):    x2 = 1 → 0.4,    x2 = 2 → 0.6

• Conditionals:

p(x1|x2)     x2 = 1    x2 = 2
x1 = 1       0.75      0.5
x1 = 2       0.25      0.5

p(x2|x1)     x2 = 1    x2 = 2
x1 = 1       0.5       0.5
x1 = 2       0.25      0.75

(21)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Posterior:

p(x1, x2 = 2):    x1 = 1 → 0.3,    x1 = 2 → 0.3
p(x1|x2 = 2):     x1 = 1 → 0.5,    x1 = 2 → 0.5

• Evidence:

p(x2 = 2) = Σ_{x1} p(x1, x2 = 2) = 0.6

• Normalisation constant:

p({}) = Σ_{x1} Σ_{x2} p(x1, x2) = 1

(22)

Answers

p(x1, x2)    x2 = 1    x2 = 2
x1 = 1       0.3       0.3
x1 = 2       0.1       0.3

• Max: (get the value)

max_{x1} p(x1|x2 = 1) = 0.75

• Mode: (get the index)

argmax_{x1} p(x1|x2 = 1) = 1

• Max-marginal: (get the “skyline”)

max_{x1} p(x1, x2):    x2 = 1 → 0.3,    x2 = 2 → 0.3
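A short numerical check of the answers above; a hedged NumPy sketch with names of my own choosing:

import numpy as np

p = np.array([[0.3, 0.3],                # rows: x1 = 1, 2;  columns: x2 = 1, 2
              [0.1, 0.3]])

p_x1 = p.sum(axis=1)                     # [0.6, 0.4]
p_x2 = p.sum(axis=0)                     # [0.4, 0.6]
p_x1_given_x2 = p / p_x2                 # columns sum to 1
p_x2_given_x1 = p / p_x1[:, None]        # rows sum to 1

evidence = p[:, 1].sum()                 # p(x2 = 2) = 0.6
post = p[:, 1] / evidence                # p(x1 | x2 = 2) = [0.5, 0.5]

max_val = p_x1_given_x2[:, 0].max()      # 0.75
mode = p_x1_given_x2[:, 0].argmax() + 1  # x1 = 1
max_marginal = p.max(axis=0)             # [0.3, 0.3]

print(np.allclose(p, np.outer(p_x1, p_x2)))   # False: x1 and x2 are not independent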

(23)

Another application of Bayes’ Theorem: “Model Selection”

Given an unknown number of fair dice with outcomes λ 1 , λ 2 , . . . , λ n ,

D = Σ_{i=1}^{n} λ_i

How many dice are there when D = 9 ?

Assume that any number n is equally likely

(24)

Another application of Bayes’ Theorem: “Model Selection”

Given all n are equally likely (i.e., p(n) is flat), we calculate (formally)

p(n|D = 9) = p(D = 9|n) p(n) / p(D) ∝ p(D = 9|n)

p(D|n = 1) = Σ_{λ1} p(D|λ1) p(λ1)

p(D|n = 2) = Σ_{λ1} Σ_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)

. . .

p(D|n = n′) = Σ_{λ1,...,λn′} p(D|λ1, . . . , λn′) Π_{i=1}^{n′} p(λi)

(25)

p(D|n) = Σ_λ p(D|λ, n) p(λ|n)

[Figure: p(D|n) for n = 1, . . . , 5, plotted over D = 1, . . . , 20; each panel on a 0–0.2 vertical scale.]

(26)

Another application of Bayes’ Theorem: “Model Selection”

[Figure: the posterior p(n|D = 9) as a function of n = Number of Dice, n = 1, . . . , 9.]

• Complex models are more flexible but they spread their probability mass

• Bayesian inference inherently prefers “simpler models” – Occam’s razor

• Computational burden: We need to sum over all parameters λ
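The marginal likelihoods p(D|n) can be computed exactly by convolving the single-die distribution with itself n times. A small sketch (the truncation of the candidate set to n = 1, . . . , 9 is mine, matching the range shown in the figure):

import numpy as np

die = np.zeros(7)
die[1:] = 1.0 / 6                          # p(lambda_i): outcomes 1..6, equally likely

def p_D_given_n(n, d_max=60):
    # distribution of the sum of n fair dice, by repeated convolution
    p = np.array([1.0])                    # point mass at 0
    for _ in range(n):
        p = np.convolve(p, die)
    out = np.zeros(d_max + 1)
    out[:len(p)] = p[:d_max + 1]
    return out

ns = np.arange(1, 10)                               # flat prior p(n) over these candidates
lik = np.array([p_D_given_n(n)[9] for n in ns])     # p(D = 9 | n)
post = lik / lik.sum()                              # p(n | D = 9)
print(np.round(post, 3))                            # most of the mass sits on n = 2 and n = 3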

(27)

Probabilistic Inference

A huge spectrum of applications – all boil down to computation of

• expectations of functions under probability distributions (Integration):

⟨f(x)⟩ = ∫_X dx p(x) f(x)          ⟨f(x)⟩ = Σ_{x∈X} p(x) f(x)

• modes of functions under probability distributions (Optimization):

x = argmax_{x∈X} p(x) f(x)

• any “mix” of the above: e.g.,

x = argmax_x p(x) = argmax_x ∫ dz p(z) p(x|z)
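When these integrals are not available in closed form, the Monte Carlo idea is to replace them by sample averages; this is the theme of the rest of the tutorial. A two-line illustration with a toy p and f of my own choosing:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # draws from p(x) = N(x; 1, 4)
print(np.mean(x**2))                               # estimates <x^2> = 1 + 4 = 5, up to MC error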

(28)

Graphical Models

“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.” A.N. Whitehead

(29)

Graphical Models

• formal languages for specification of probability distributions and associated inference algorithms

• historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge

• today, a standard tool in machine learning, statistics and signal processing

(30)

Graphical Models

• provide graph based algorithms for derivations and computation

• pedagogical insight/motivation for model/algorithm construction

– Statistics:

“Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”

– Signal processing:

“Fast Fourier transform is an instance of sum-product algorithm on a factor graph”

– Computer Science:

“Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”

• Automated tools for code generation are starting to emerge, making the design/implement/test cycle shorter

(31)

Important types of Graphical Models

• Useful for Model Construction

– Directed Acyclic Graphs (DAG), Bayesian Networks
– Undirected Graphs, Markov Networks, Random Fields
– Influence diagrams
– ...

• Useful for Inference

– Factor Graphs
– Junction/Clique graphs
– Region graphs

– ...

(32)

Directed Acyclic Graphical (DAG) Models and Factor Graphs

(33)

Directed Graphical models

• Each random variable is associated with a node in the graph,

• We draw an arrow A → B if A appears as a conditioning variable in p(B| . . . , A, . . . ), i.e., A ∈ parent(B),

The edges tell us qualitatively about the factorization of the joint probability

• For N random variables x1, . . . , xN, the distribution admits the factorization

p(x1, . . . , xN) = Π_{i=1}^{N} p(xi | parent(xi))

• Describes in a compact way an algorithm to “generate” the data, by sampling each variable after its parents (see the sketch below)
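As a concrete illustration of the “generate” reading, here is a minimal ancestral-sampling sketch for a Markov(1) chain p(x1) Π_k p(xk|xk−1); the particular Gaussian conditionals are my own choice:

import numpy as np

rng = np.random.default_rng(0)

def sample_markov1(K=4):
    # visit each node after its parent and draw from p(x_i | parent(x_i))
    x = [rng.normal(0.0, 1.0)]                      # x_1 ~ p(x_1)
    for _ in range(K - 1):
        x.append(rng.normal(0.9 * x[-1], 1.0))      # x_k | x_{k-1} ~ N(0.9 x_{k-1}, 1)
    return np.array(x)

print(sample_markov1())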

(34)

DAG Example: Two dice

[DAG: λ → D ← y, with factors p(λ), p(y) and p(D|λ, y).]

p(D, λ, y) = p(D|λ, y) p(λ) p(y)

(35)

DAG with observations

[Same DAG with D observed (clamped to D = 9): factors p(λ), p(y) and p(D = 9|λ, y).]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)

(36)

Examples

Model          Factorization

Full           p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)

Markov(2)      p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)

Markov(1)      p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

               p(x1) p(x2|x1) p(x3|x1) p(x4)

Factorized     p(x1) p(x2) p(x3) p(x4)

[The corresponding graphs over x1, x2, x3, x4 are not reproduced here.]

(37)

Undirected Graphical Models

• Define a distribution by local compatibility functions φ(x_α):

p(x) = (1/Z) Π_α φ(x_α)

where α runs over the cliques (fully connected subsets) of the graph

• Markov Random Fields

(38)

Undirected Graphical Models

• Examples

[Two undirected graphs over x1, x2, x3, x4: a 4-cycle with pairwise cliques, and a graph with two overlapping triangle cliques.]

p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)          p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)

(39)

Factor graphs (Kschischang et al.)

• A bipartite graph; a powerful graphical representation of the inference problem

– Factor nodes (black squares): factor potentials (local functions) defining the posterior
– Variable nodes (white nodes): collections of random variables
– Edges: denote membership; a variable node is connected to a factor node if a member variable is an argument of the local function

[Factor graph for the two-dice example: factor nodes p(λ), p(y) and p(D = 9|λ, y), connected to variable nodes λ and y.]

(40)

Exercise

• For the following Graphical models, write down the factors of the joint distribution and plot an equivalent factor graph.

Full:          x1, x2, x3, x4

Markov(1):     x1, x2, x3, x4

HMM:           h1, h2, h3, h4 and x1, x2, x3, x4

MIX:           h and x1, x2, x3, x4

IFA:           h1, h2 and x1, x2, x3, x4

Factorized:    x1, x2, x3, x4

[The graph drawings for each model are not reproduced here.]

(41)

Answer (Markov(1))

The joint distribution factorizes as

p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

[Equivalent factor graph: variable nodes x1, x2, x3, x4; factor nodes p(x1), p(x2|x1), p(x3|x2), p(x4|x3), where p(x1) is attached to x1 and each p(xk|xk−1) is attached to xk−1 and xk.]

(42)

Answer (IFA – Factorial)

The joint distribution factorizes as

p(h1) p(h2) Π_{i=1}^{4} p(xi|h1, h2)

[Equivalent factor graph: variable nodes h1, h2, x1, . . . , x4; factor nodes p(h1), p(h2) and p(xi|h1, h2) for i = 1, . . . , 4, each of the latter attached to h1, h2 and xi.]

(43)

Answer (IFA – Factorial)

[The same IFA factor graph, shown with the two hidden variables merged into one node.]

• We can also cluster nodes together, e.g. treat (h1, h2) as a single variable node

(44)

Inference and Learning

• Data set

D = {x 1 , . . . x N }

• Model with parameter λ

p(D|λ)

• Maximum Likelihood (ML)

λ_ML = argmax_λ log p(D|λ)

• Predictive distribution

p(x_{N+1}|D) ≈ p(x_{N+1}|λ_ML)

(45)

Regularisation

• Prior

p(λ)

• Maximum a-posteriori (MAP): Regularised Maximum Likelihood

λ_MAP = argmax_λ log p(D|λ) p(λ)

• Predictive distribution

p(x_{N+1}|D) ≈ p(x_{N+1}|λ_MAP)

(46)

Bayesian Learning

• We treat parameters on the same footing as all other variables

• We integrate over unknown parameters rather than using point estimates (remember the many-dice example)

– Avoids overfitting

– Natural setup for online adaptation

– Model selection

(47)

Bayesian Learning

• Predictive distribution

p(x_{N+1}|D) = ∫ dλ p(x_{N+1}|λ) p(λ|D)

[DAG: the parameter λ is a parent of every observation x1, x2, . . . , xN and of the future observation x_{N+1}.]

• Bayesian learning is just inference ...

(48)

Some Applications

(49)

Medical Expert Systems

[The “Asia” network: cause nodes A (visit to Asia) and S (smoking), disease nodes T (tuberculosis), L (lung cancer), B (bronchitis), the node E (either T or L), and symptom nodes X (positive X-ray) and D (dyspnoea).]

(50)

Medical Expert Systems

Visit to Asia? Smoking?

Tuberculosis? Lung Cancer? Bronchitis?

Either T or L?

Positive X Ray? Dyspnoea?

(51)

Medical Expert Systems

Marginal probabilities with no evidence entered:

Visit to Asia?      P(0) = 99%      P(1) = 1%
Smoking?            P(0) = 50%      P(1) = 50%
Tuberculosis?       P(0) = 99%      P(1) = 1%
Lung Cancer?        P(0) = 94.5%    P(1) = 5.5%
Bronchitis?         P(0) = 55%      P(1) = 45%
Either T or L?      P(0) = 93.5%    P(1) = 6.5%
Positive X Ray?     P(0) = 89%      P(1) = 11%
Dyspnoea?           P(0) = 56.4%    P(1) = 43.6%

(52)

Medical Expert Systems

Marginal probabilities with Positive X Ray? observed (clamped to 1):

Visit to Asia?      P(0) = 98.7%    P(1) = 1.3%
Smoking?            P(0) = 31.2%    P(1) = 68.8%
Tuberculosis?       P(0) = 90.8%    P(1) = 9.2%
Lung Cancer?        P(0) = 51.1%    P(1) = 48.9%
Bronchitis?         P(0) = 49.4%    P(1) = 50.6%
Either T or L?      P(0) = 42.4%    P(1) = 57.6%
Positive X Ray?     P(0) = 0%       P(1) = 100%    (observed)
Dyspnoea?           P(0) = 35.9%    P(1) = 64.1%

(53)

Medical Expert Systems

Marginal probabilities with both Positive X Ray? (clamped to 1) and Smoking? (clamped to 0) observed:

Visit to Asia?      P(0) = 98.5%    P(1) = 1.5%
Smoking?            P(0) = 100%     P(1) = 0%      (observed)
Tuberculosis?       P(0) = 85.2%    P(1) = 14.8%
Lung Cancer?        P(0) = 85.8%    P(1) = 14.2%
Bronchitis?         P(0) = 70%      P(1) = 30%
Either T or L?      P(0) = 71.1%    P(1) = 28.9%
Positive X Ray?     P(0) = 0%       P(1) = 100%    (observed)
Dyspnoea?           P(0) = 56%      P(1) = 44%

(54)

Model Selection: Variable selection in Polynomial Regression

• Given D = {t j , x(t j )} j=1...J , what is the order N of the polynomial?

x(t) = Σ_{i=0}^{N} s_{i+1} t^i + ǫ(t)

[Figure: an example data set {t_j, x(t_j)} on t ∈ [−1, 1].]

(55)

Bayesian Variable Selection

[Graphical model: indicators r_1, . . . , r_W with priors C(r_i; π); coefficients s_i ∼ N(s_i; µ(r_i), Σ(r_i)); observation x ∼ N(x; C s_{1:W}, R).]

• Generalized Linear Model – the columns of C are the basis vectors

• The exact posterior is a mixture of 2^W Gaussians

• When W is large, computation of posterior features becomes intractable.

(56)

Regression

t = [t_1  t_2  . . .  t_J]^⊤          C ≡ [t^0  t^1  . . .  t^{W−1}]

>> C = fliplr(vander(0:4))   % Vandermonde matrix: column i+1 holds t.^i for t = 0:4

     1     0     0     0     0
     1     1     1     1     1
     1     2     4     8    16
     1     3     9    27    81
     1     4    16    64   256

r_i ∼ C(r_i; 0.5, 0.5),    r_i ∈ {on, off}
s_i | r_i ∼ N(s_i; 0, Σ(r_i))
x | s_{1:W} ∼ N(x; C s_{1:W}, R)

(57)

Regression

To find the “active” basis functions we need to calculate

r_{1:W} ≡ argmax_{r_{1:W}} p(r_{1:W}|x) = argmax_{r_{1:W}} ∫ ds_{1:W} p(x|s_{1:W}) p(s_{1:W}|r_{1:W}) p(r_{1:W})

Then, the reconstruction is given by

x̂(t) = ⟨ Σ_{i=0}^{W−1} s_{i+1} t^i ⟩_{p(s_{1:W}|x, r_{1:W})} = Σ_{i=0}^{W−1} ⟨s_{i+1}⟩_{p(s_{i+1}|x, r_{1:W})} t^i
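Because the exact posterior is a mixture of 2^W Gaussians, the MAP configuration can be found by brute force when W is small. The sketch below is my own toy instance of the model on the previous slides (all numbers and names are assumptions, not taken from the slides); it scores each on/off configuration by its marginal likelihood p(x|r) = N(x; 0, C diag(Σ(r)) C^T + R I), which follows from the linear-Gaussian structure:

import numpy as np
from itertools import product

rng = np.random.default_rng(1)

t = np.linspace(-1, 1, 15)
W = 4
C = np.vander(t, W, increasing=True)       # basis: columns 1, t, t^2, t^3
R = 0.01                                   # observation noise variance
var_on, var_off = 10.0, 1e-6               # Sigma(r_i) for r_i = on / off

s_true = np.array([0.5, 0.0, 1.0, 0.0])    # only basis functions 0 and 2 are active
x = C @ s_true + np.sqrt(R) * rng.normal(size=len(t))

def log_marginal(r):
    # log p(x | r) for x | r ~ N(0, C diag(var) C^T + R I)
    var = np.where(r, var_on, var_off)
    S = (C * var) @ C.T + R * np.eye(len(t))
    sign, logdet = np.linalg.slogdet(2 * np.pi * S)
    return -0.5 * (logdet + x @ np.linalg.solve(S, x))

# Flat prior on r, so argmax_r p(r|x) = argmax_r p(x|r); enumerate all 2^W configurations.
configs = list(product([0, 1], repeat=W))
best = max(configs, key=lambda r: log_marginal(np.array(r, bool)))
print(best)                                # with these settings, typically (1, 0, 1, 0)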

(58)

Regression

[Figure: values of p(x, r_{1:W}) for candidate configurations r_{1:W}, against the basis index i = 0, . . . , 4.]

(59)

Regression

[Figure: the data, the true function, and the approximate reconstruction x̂(t).]

(60)

Clustering

(61)

Clustering

π                          Label probability
c_1, c_2, . . . , c_N      Labels ∈ {a, b}
x_1, x_2, . . . , x_N      Data Points
µ_a, µ_b                   Cluster Centers

(µ_a, µ_b, π) = argmax_{µ_a, µ_b, π} Σ_{c_{1:N}} Π_{i=1}^{N} p(x_i | µ_a, µ_b, c_i) p(c_i | π)

(62)

Computer vision / Cognitive Science

How many rectangles are there in this image?

[Figure: an image (roughly 60 × 40 pixels) built from overlapping coloured rectangles.]

(63)

Computer vision / Cognitive Science

π_1, π_2, . . . , π_N      Label probabilities
c_1, c_2, . . . , c_N      Labels ∈ {a, b, . . .}
x_1, x_2, . . . , x_N      Pixel Values
µ_a, µ_b, . . .            Rectangle Colors

[Figure: the rectangle image again; each pixel x_i is to be assigned a label c_i.]

(64)

Computer Vision

How many people are there in these images?

(65)

Visual Tracking


(66)

Navigation, Robotics


(67)

Navigation, Robotics

GPS?_t             GPS status
G_t                GPS reading
. . .              Other sensors (magnetic, pressure, etc.)
l_t                Linear acceleration sensor
ω_t                Gyroscope
E_{t−1}, E_t       Attitude Variables
X_{t−1}, X_t       Linear Kinematic Variables
{·_{1:Nt}}_t       Set of feature points (Camera Frame)
{x_{1:Mt}}_t       Set of feature points (World Coordinates)
ρ(x)               Global Static Map (Intensity function)

(68)

Time series models and Inference, Terminology

Generic structure of dynamical system models

[DAG: a Markov chain x_0 → x_1 → · · · → x_{k−1} → x_k → · · · → x_K over the latent states; each observation y_k is a child of x_k.]

x_k ∼ p(x_k | x_{k−1})      Transition Model
y_k ∼ p(y_k | x_k)          Observation Model

• x are the latent states

• y are the observations

• In a full Bayesian setting, x includes unknown model parameters

(69)

Online Inference, Terminology

Filtering: p(x k |y 1:k )

– Distribution of the current state given all past information
– Realtime/Online/Sequential Processing

x 0 x 1 . . . x k−1 x k . . . x K

y 1 . . . y k−1 y k . . . y K

(70)

Online Inference, Terminology

Prediction p(y k:K , x k:K |y 1:k−1 )

– evaluation of possible future outcomes; like filtering without observations

x 0 x 1 . . . x k−1 x k . . . x K

y 1 . . . y k−1 y k . . . y K

• Tracking, Restoration

(71)

Offline Inference, Terminology

Smoothing p(x 0:K |y 1:K ),

Most likely trajectory (Viterbi path): argmax_{x_{0:K}} p(x_{0:K}|y_{1:K})

– a better estimate of past states, essential for learning

[State-space DAG as above, with all observations y_1, . . . , y_K available.]

Interpolation p(y_k, x_k | y_{1:k−1}, y_{k+1:K})

– fill in lost observations given past and future

[State-space DAG as above, with observation y_k missing.]

(72)

Time Series Analysis

• Stationary

[Figure: a stationary time series.]

– What is the true state of the process given noisy data?
– Parameters?
– Markovian? Order?

(73)

Time Series Analysis

• Nonstationary, time varying variance – stochastic volatility

[Figure: latent volatility v_k and observations y_k over time; true values and the VB estimate.]

(74)

Time Series Analysis

• Nonstationary, time varying intensity – nonhomogeneous Poisson Process

[Figure: latent intensity λ_k and the observed process c_k over time; true values and the VB estimate.]

(75)

Time Series Analysis

• Piecewise constant

[Figure: a piecewise constant time series.]

(76)

Time Series Analysis

• Piecewise linear

[Figure: a piecewise linear time series with changepoints.]

• Segmentation and Changepoint detection

– What is the true state of the process given noisy data?
– Where are the changepoints?

– How many changepoints ?

(77)

Audio Processing

[Figure: waveforms x_t of a speech signal and a piano signal.]

x = [x_1 . . . x_t . . . ]

(78)

Audio Restoration

• During download or transmission, some samples of audio are lost

• Estimate missing samples given clean ones

[Figure: an audio waveform with missing samples.]

(79)

Examples: Audio Restoration

p(x_¬κ | x_κ) ∝ ∫ dH p(x_¬κ | H) p(x_κ | H) p(H)        H ≡ (parameters, hidden states)

[DAG: H is a parent of both x_¬κ (Missing) and x_κ (Observed).]

(80)

Probabilistic Phase Vocoder (Cemgil and Godsill 2005)

[DAG: for each band ν = 0 . . . W−1, a chain s_{ν,0} → · · · → s_{ν,k} → · · · → s_{ν,K−1} with parameters A_ν, Q_ν; the observations x_0, . . . , x_k, . . . , x_{K−1} depend on the states of all bands.]

s_{ν,k} ∼ N(s_{ν,k}; A_ν s_{ν,k−1}, Q_ν)

A_ν ∼ N( A_ν; [ cos(ω_ν)  −sin(ω_ν) ;  sin(ω_ν)  cos(ω_ν) ],  Ψ )



(81)

Restoration

• Piano

– Signal with missing samples (37%)
– Reconstruction, 7.68 dB improvement
– Original

• Trumpet

– Signal with missing samples (37%)

– Reconstruction, 7.10 dB improvement

– Original

(82)

Pitch Tracking

Monophonic Pitch Tracking = Online estimation (filtering) of p(r t , m t |y 1:t ) .

[Figure: an audio waveform and the corresponding pitch estimate over time.]

(83)

Pitch Tracking

r 0 r 1 . . . r T

m 0 m 1 . . . m T

s 0 s 1 . . . s T

y 1 . . . y T

(84)

Monophonic transcription

• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TSALP)

[Figure: exact inference results for the monophonic transcription model.]

(85)

Tracking Pitch Variations

• Allow m to change with k.


(86)

Source Separation

[DAG: for each time frame k = 1 . . . K, sources s_{k,1}, . . . , s_{k,N} generate observations x_{k,1}, . . . , x_{k,M}; the mixing/channel parameters a_1, r_1, . . . , a_M, r_M are shared across frames.]

• Joint estimation of the sources, the channel noise and the mixing system

x_{k,1:M} ∼ N(x_{k,1:M}; A s_{k,1:N}, R)

(87)

Spectrogram

[Figure: spectrograms (time t/sec versus frequency f/Hz) of a speech signal and a piano signal.]

• A linear expansion using a collection of basis functions φ(t; τ, ω) centered around time τ and frequency ω

x_t = Σ_{τ,ω} α(τ, ω) φ(t; τ, ω)

(88)

Source Separation

[Figure: spectrograms of the speech, piano and guitar sources and of the observed mixture.]

(89)

Reconstructions

[Figure: spectrograms of the reconstructed sources.]

(90)

Polyphonic Music Transcription

• from sound ...

[Figure: spectrogram of a polyphonic recording.]

• ... to score

(91)

Generative Models for Music

(92)

Generative Models for Music

[Hierarchy: Score and Expression → Piano-Roll → Signal.]

(93)

Hierarchical Modeling of Music

[Figure: a hierarchical graphical model for music. Several coupled chains of latent variables evolve over time t (score-level and expression-level variables at the top, per-note variables g_{j,t}, r_{j,t}, n_{j,t} below), generating the signal-level variables x_{j,t} and observations y_{j,t}.]

(94)

A few non-Bayesian applications

where Monte Carlo is useful

(95)

Combinatorics

• Counting

Example : What is the probability that a solitaire laid out with 52 cards comes out successfully given all permutations have equal probability ?

|A| = Σ_{x∈X} [x ∈ A]          [x ∈ A] ≡ 1 if x ∈ A, 0 otherwise

p(x ∈ A) = |A| / |X| = ?          |X| = 52! ≈ 2^225

(96)

Geometry

• Given a simplex S in N-dimensional space,

S = {x : Ax ≤ b, x ∈ R^N },

find the volume |S|

(97)

Rare Events

• Given a graph with random edge lengths x i ∼ p(x i )

Find the probability that the shortest path from A to B is larger than γ .

[Figure: a small network between nodes A and B with five edges of random lengths x_1, . . . , x_5.]

(98)

Rare Events

x_1, x_2, x_3, x_4, x_5      Edge Lengths
L = ShortestPath(A, B)

Pr(L ≥ γ) = ∫ dx_{1:5} [L(x_{1:5}) ≥ γ] p(x_{1:5})

(99)

Rare Events

[Figure: the same network, with mean edge lengths ⟨x_1⟩ = 4, ⟨x_2⟩ = 1, ⟨x_3⟩ = 1, ⟨x_4⟩ = 1, ⟨x_5⟩ = 4.]

x_i ∼ E(x_i; u_i) ≡ (1/u_i) exp(−x_i/u_i)          u_i = ⟨x_i⟩ ≡ ∫ x_i p(x_i) dx_i

[Figure: histogram (counts) of simulated shortest-path lengths.]
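A crude Monte Carlo estimate of Pr(L ≥ γ) is just the sample average of the indicator. The sketch below uses the mean edge lengths above, but the network topology is my own assumption (the usual five-edge “bridge” between A and B), since the figure is not reproduced here:

import numpy as np

rng = np.random.default_rng(0)
means = np.array([4.0, 1.0, 1.0, 1.0, 4.0])                  # <x_1>, ..., <x_5> from the slide

gamma, N = 10.0, 200_000
x1, x2, x3, x4, x5 = rng.exponential(means, size=(N, 5)).T   # x_i ~ E(x_i; u_i)

# Assumed layout: A--x1--C, A--x2--D, C--x3--D, C--x4--B, D--x5--B.
L = np.minimum.reduce([x1 + x4, x2 + x5, x1 + x3 + x5, x2 + x3 + x4])
print(np.mean(L >= gamma))                                   # crude MC estimate of Pr(L >= gamma)
# For much larger gamma the event becomes rare and a plain average like this
# needs an impractical number of samples (one motivation for smarter samplers).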

(100)

Probability Models

(101)

Example: AR(1) model

[Figure: a sample realisation x_0, . . . , x_K of the AR(1) process.]

x_k = A x_{k−1} + ǫ_k,    k = 1 . . . K

ǫ_k is i.i.d., zero mean and normal with variance R.

Estimation problem: given x_0, . . . , x_K, determine the coefficient A and the variance R (both scalars).

(102)

AR(1) model, Generative Model notation

A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
x_k | x_{k−1}, A, R ∼ N(x_k; A x_{k−1}, R)        x_0 = x̂_0

[DAG: A and R are parents of every transition in the chain x_0 → x_1 → · · · → x_{k−1} → x_k → · · · → x_K.]

Observed variables are shown with double circles
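A short generative sketch of this model, with fixed parameter values of my own rather than draws of A and R from their priors:

import numpy as np

rng = np.random.default_rng(0)
A, R, K = 0.9, 0.1, 100
x = np.zeros(K + 1)                                  # x_0 = 0 (an arbitrary starting value)
for k in range(1, K + 1):
    x[k] = A * x[k - 1] + np.sqrt(R) * rng.normal()  # x_k = A x_{k-1} + eps_k,  eps_k ~ N(0, R)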

(103)

Example, Univariate Gaussian

The Gaussian distribution with mean m and covariance S has the form

N(x; m, S) = (2πS)^{−1/2} exp{ −(x − m)²/(2S) }
           = exp{ −(x² + m² − 2xm)/(2S) − (1/2) log(2πS) }
           = exp{ (m/S) x − (1/(2S)) x² − [ (1/2) log(2πS) + m²/(2S) ] }
           = exp{ θ^⊤ ψ(x) − c(θ) },      θ = [ m/S,  −1/(2S) ]^⊤,      ψ(x) = [ x,  x² ]^⊤

Hence, by matching coefficients, we have

exp{ −(1/2) K x² + h x + g }   ⇔   S = K^{−1},   m = K^{−1} h

(104)

Example, Gaussian

(105)

The Multivariate Gaussian Distribution

µ is the mean and P is the covariance:

N(s; µ, P) = |2πP|^{−1/2} exp{ −(1/2) (s − µ)^⊤ P^{−1} (s − µ) }
           = exp{ −(1/2) s^⊤ P^{−1} s + µ^⊤ P^{−1} s − (1/2) µ^⊤ P^{−1} µ − (1/2) log |2πP| }

log N(s; µ, P) = −(1/2) s^⊤ P^{−1} s + µ^⊤ P^{−1} s + const
               = −(1/2) Tr P^{−1} s s^⊤ + µ^⊤ P^{−1} s + const
               =+ −(1/2) Tr P^{−1} s s^⊤ + µ^⊤ P^{−1} s

Notation:  log f(x) =+ g(x) ⟺ f(x) ∝ exp(g(x)) ⟺ ∃c ∈ R : f(x) = c exp(g(x))

log p(s) =+ −(1/2) Tr K s s^⊤ + h^⊤ s   ⇒   p(s) = N(s; K^{−1} h, K^{−1})

(106)

Example, Inverse Gamma

The inverse Gamma distribution with shape a and scale b:

IG(r; a, b) = (1/(Γ(a) b^a)) r^{−(a+1)} exp(−1/(b r))
            = exp{ −(a + 1) log r − 1/(b r) − log Γ(a) − a log b }
            = exp{ [ −(a + 1),  −1/b ]^⊤ [ log r,  1/r ] − log Γ(a) − a log b }

Hence, by matching coefficients, we have

exp{ α log r + β (1/r) + c }   ⇔   a = −α − 1,   b = −1/β

(107)

Example, Inverse Gamma

[Figure: inverse Gamma densities for (a = 1, b = 1), (a = 1, b = 0.5) and (a = 2, b = 1).]

(108)

Basic Distributions : Exponential Family

• The following distributions are often used as elementary building blocks:

– Gaussian
– Gamma, Inverse Gamma, (Exponential, Chi-square, Wishart)
– Dirichlet
– Discrete (Categorical), Bernoulli, Multinomial

• All of these distributions can be written as

p(x|θ) = exp{ θ^⊤ ψ(x) − c(θ) }

c(θ) = log ∫_X dx exp(θ^⊤ ψ(x))     log-partition function
θ                                   canonical parameters
ψ(x)                                sufficient statistics

(109)

Conjugate priors: Posterior is in the same family as the prior.

Example: posterior inference for the variance R of a zero mean Gaussian.

p(x|R) = N(x; 0, R)        p(R) = IG(R; a, b)

p(R|x) ∝ p(R) p(x|R)
       ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(x²/2)(1/R) − (1/2) log R }
       = exp{ [ −(a + 1 + 1/2),  −(1/b + x²/2) ]^⊤ [ log R,  1/R ] }
       ∝ IG( R;  a + 1/2,  2/(x² + 2/b) )

(110)

Conjugate priors: Posterior is in the same family as the prior.

Example: posterior inference of variance R from x 1 , . . . , x N .

[DAG: R is a parent of x_1, x_2, . . . , x_N and of x_{N+1}.]

p(R|x_{1:N}) ∝ p(R) Π_{i=1}^{N} p(x_i|R)
             ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(1/2)(Σ_i x_i²)(1/R) − (N/2) log R }
             = exp{ [ −(a + 1 + N/2),  −(1/b + (1/2) Σ_i x_i²) ]^⊤ [ log R,  1/R ] }
             ∝ IG( R;  a + N/2,  2/(Σ_i x_i² + 2/b) )
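A quick numerical check of this conjugate update, with prior parameters and data of my own choosing:

import numpy as np

rng = np.random.default_rng(0)
R_true, N = 2.0, 1000
x = rng.normal(0.0, np.sqrt(R_true), size=N)       # x_i ~ N(0, R_true)

a, b = 1.0, 1.0                                    # prior IG(R; a, b)
a_post = a + N / 2
b_post = 2.0 / (np.sum(x**2) + 2.0 / b)            # posterior IG(R; a_post, b_post)

# In this parametrisation the mean of IG(a, b) is 1 / (b (a - 1)) for a > 1.
print(1.0 / (b_post * (a_post - 1)))               # close to R_true for large N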

(111)

Inverse Gamma posterior, Σ_i x_i² = 10, N = 10

[Figure: the posterior density of R.]

(112)

Inverse Gamma posterior, Σ_i x_i² = 100, N = 100

[Figure: the posterior density of R.]

(113)

Inverse Gamma posterior, Σ_i x_i² = 1000, N = 1000

[Figure: the posterior density of R; as N grows the posterior concentrates, here around R = 1.]

(114)

Example: AR(1) model

[Figure: a sample realisation x_0, . . . , x_K of the AR(1) process.]

x_k = A x_{k−1} + ǫ_k,    k = 1 . . . K

ǫ k is i.i.d., zero mean and normal with variance R.

Estimation problem:

Given x 0 , . . . , x K , determine coefficient A and variance R (both scalars).

(115)

AR(1) model, Generative Model notation

A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
x_k | x_{k−1}, A, R ∼ N(x_k; A x_{k−1}, R)        x_0 = x̂_0

[DAG: A and R are parents of every transition in the chain x_0 → x_1 → · · · → x_K.]

Gaussian:  N(x; µ, V) ≡ |2πV|^{−1/2} exp( −(x − µ)²/(2V) )

(116)

AR(1) Model. Bayesian Posterior Inference

p(A, R|x_0, x_1, . . . , x_K) ∝ p(x_1, . . . , x_K|x_0, A, R) p(A, R)

Posterior ∝ Likelihood × Prior

Using the Markovian (conditional independence) structure we have

p(A, R|x_0, x_1, . . . , x_K) ∝ ( Π_{k=1}^{K} p(x_k|x_{k−1}, A, R) ) p(A) p(R)

[DAG: A, R → x_0 → x_1 → · · · → x_K as before.]

(117)

Numerical Example

Suppose K = 1.

[DAG: A, R → x_1, with x_0 and x_1 observed.]

By Bayes’ Theorem and the structure of the AR(1) model,

p(A, R|x_0, x_1) ∝ p(x_1|x_0, A, R) p(A) p(R)
                 = N(x_1; A x_0, R) N(A; 0, P) IG(R; ν, β/ν)

(118)

Numerical Example

p(A, R|x_0, x_1) ∝ p(x_1|x_0, A, R) p(A) p(R)
                 = N(x_1; A x_0, R) N(A; 0, P) IG(R; ν, β/ν)
                 ∝ exp{ −(1/2) x_1²/R + x_0 x_1 A/R − (1/2) x_0² A²/R − (1/2) log 2πR }
                   × exp{ −(1/2) A²/P }
                   × exp{ −(ν + 1) log R − (ν/β)(1/R) }

This posterior has a nonstandard form

exp{ α_1 (1/R) + α_2 (A/R) + α_3 (A²/R) + α_4 log R + α_5 A² }

(119)

Numerical Example, the prior p(A, R)

[Figure: equiprobability contours of the prior p(A) p(R) in the (A, R) plane; A ∈ [−8, 6], R on a log scale from 10^{−4} to 10^{4}.]

A ∼ N(A; 0, 1.2)        R ∼ IG(R; 0.4, 250)

(120)

Numerical Example, the posterior p(A, R|x)

[Figure: equiprobability contours of the posterior p(A, R|x) in the same (A, R) plane.]

Note the bimodal posterior with x_0 = 1, x_1 = −6:

• A ≈ −6 ⇔ low noise variance R
• A ≈ 0 ⇔ high noise variance R
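The bimodality is easy to reproduce by evaluating the unnormalised log posterior from the previous slides on a grid over (A, R); the grid ranges below mirror the contour plots, the rest of the sketch is mine:

import numpy as np

x0, x1 = 1.0, -6.0
P, a, b = 1.2, 0.4, 250.0          # A ~ N(0, P),  R ~ IG(R; a, b), as on the prior slide

def log_post(A, R):
    # log N(x1; A x0, R) + log N(A; 0, P) + log IG(R; a, b), up to an additive constant
    return (-0.5 * (x1 - A * x0) ** 2 / R - 0.5 * np.log(2 * np.pi * R)
            - 0.5 * A ** 2 / P
            - (a + 1) * np.log(R) - 1.0 / (b * R))

A_grid = np.linspace(-8, 6, 200)
R_grid = np.logspace(-4, 4, 200)
L = log_post(A_grid[:, None], R_grid[None, :])     # evaluate on the whole grid at once
i, j = np.unravel_index(np.argmax(L), L.shape)
print(A_grid[i], R_grid[j])    # the grid maximiser; the surface has the two modes described above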

(121)

Remarks

• Even very simple models can lead easily to complicated posterior distributions

• Ambiguous data usually leads to a multimodal posterior, each mode corresponding to one possible explanation

• A priori independent variables often become dependent a posteriori (“explaining away”)

• (Unfortunately) exact posterior inference is only possible for a few special cases

⇒ We need numerical approximate inference methods

(122)

Approximate Inference by

Markov Chain Monte Carlo

(123)

Outline of this section

• A Gaussian toy example

• The Gibbs sampler

• Sketch of Markov Chain theory

• Metropolis-Hastings, MCMC Transition Kernels,

• Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler

• Optimisation versus Integration: Simulated annealing and iterative improvement

(124)

Toy Example : “Source separation”

[DAG: s_1 → x ← s_2, with factors p(s_1), p(s_2), p(x|s_1, s_2).]

This graph encodes the joint:  p(x, s_1, s_2) = p(x|s_1, s_2) p(s_1) p(s_2)

s_1 ∼ p(s_1) = N(s_1; µ_1, P_1)
s_2 ∼ p(s_2) = N(s_2; µ_2, P_2)
x | s_1, s_2 ∼ p(x|s_1, s_2) = N(x; s_1 + s_2, R)

(125)

Toy example

Suppose we observe x = x̂.

[Same DAG, with x observed: factors p(s_1), p(s_2), p(x = x̂|s_1, s_2).]

• By Bayes’ theorem, the posterior is given by:

P ≡ p(s_1, s_2|x = x̂) = (1/Z_x̂) p(x = x̂|s_1, s_2) p(s_1) p(s_2) ≡ (1/Z_x̂) φ(s_1, s_2)
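The first-part outline announces the Gibbs sampler for exactly this kind of target. As a minimal sketch (scalar sources; the numerical values are mine, since the slides leave them unspecified), alternate draws from the two Gaussian full conditionals p(s1|s2, x = x̂) and p(s2|s1, x = x̂):

import numpy as np

rng = np.random.default_rng(0)

mu1, P1 = 0.0, 1.0                       # prior of s1
mu2, P2 = 0.0, 1.0                       # prior of s2
R, x_hat = 0.1, 3.0                      # observation noise and the observed value

def gibbs(n_iter=5000):
    s1, s2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    for i in range(n_iter):
        k1 = 1 / P1 + 1 / R                         # precision of p(s1 | s2, x = x_hat)
        m1 = (mu1 / P1 + (x_hat - s2) / R) / k1
        s1 = rng.normal(m1, np.sqrt(1 / k1))
        k2 = 1 / P2 + 1 / R                         # precision of p(s2 | s1, x = x_hat)
        m2 = (mu2 / P2 + (x_hat - s1) / R) / k2
        s2 = rng.normal(m2, np.sqrt(1 / k2))
        samples[i] = s1, s2
    return samples

S = gibbs()
print(S[1000:].mean(axis=0))    # posterior means of (s1, s2); about 1.43 each with these numbers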
