An Introduction to
Graphical Models and Monte Carlo methods
A. Taylan Cemgil
Signal Processing and Communications Lab.
Birkbeck School of Economics, Mathematics and Statistics
June 19, 2007
Goals of this Tutorial
To Provide ...
• a basic understanding of underlying principles of probabilistic modeling and inference
• an introduction to Graphical models and associated concepts
• a succinct overview of (perhaps interesting) applications from engineering and computer science
– Statistical Signal Processing, Pattern Recognition
– Machine Learning, Artificial Intelligence
• an initial orientation in the broad literature of Monte Carlo methods
First Part, Basic Concepts and MCMC
• Introduction
– Bayes’ Theorem,
– Trivial toy example to clarify notation
• Graphical Models
– Bayesian Networks
– Undirected Graphical models, Markov Random Fields
– Factor graphs
• Maximum Likelihood and Bayesian Learning
• Some Applications
– (classical AI) Medical Expert systems, (Statistics) Variable selection, (Engineering-CS) Computer vision,
– Time Series – terminology and applications
– Audio processing
– Non Bayesian applications
• Probability Models
– Exponential family, Conjugacy
– Motivation for Approximate Inference
• Markov Chain Monte Carlo
– A Gaussian toy example
– The Gibbs sampler
– Sketch of Markov Chain theory
– Metropolis-Hastings, MCMC Transition Kernels
– Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler
– Optimisation versus Integration: Simulated annealing and iterative improvement
Second Part, Time Series Models and SMC
• Latent State-Space Models
– Hidden Markov Models (HMM)
– Kalman Filter Models
– Switching State Space models
– Changepoint models
• Inference in HMM
– Forward-Backward Algorithm
– Viterbi
– Exact inference in Graphical models by message passing
• Sequential Monte Carlo
– Importance Sampling
– Particle Filtering
• Final Remarks and Bibliography
Bayes’ Theorem
Thomas Bayes (1702-1761)
“What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.”¹

p(λ|D) = p(D|λ) p(λ) / p(D)

Posterior = Likelihood × Prior / Evidence
¹ (Jaynes 2003, ed. by Bretthorst; MacKay 2003)
An application of Bayes’ Theorem: “Source Separation”
Given two fair dice with outcomes λ and y,
D = λ + y
What is λ when D = 9 ?
An application of Bayes’ Theorem: “Source Separation”
D = λ + y = 9
D = λ + y   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1       2       3       4       5       6       7
λ = 2       3       4       5       6       7       8
λ = 3       4       5       6       7       8       9
λ = 4       5       6       7       8       9       10
λ = 5       6       7       8       9       10      11
λ = 6       7       8       9       10      11      12

Bayes’ theorem “upgrades” p(λ) into p(λ|D).
But you have to provide an observation model: p(D|λ)
“Bureaucratic” derivation
Formally we write
p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(D|λ, y) = δ(D − (λ + y))

• δ is the Kronecker delta function, denoting a degenerate (deterministic) distribution:
δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0
p(λ, y|D) = (1/p(D)) × p(D|λ, y) × p(y) p(λ)        Posterior = (1/Evidence) × Likelihood × Prior

p(λ|D) = Σ_y p(λ, y|D)                              Posterior Marginal

Prior
p(y) p(λ)

p(y) × p(λ)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 1/36 1/36 1/36 1/36 1/36 1/36
λ = 2 1/36 1/36 1/36 1/36 1/36 1/36
λ = 3 1/36 1/36 1/36 1/36 1/36 1/36
λ = 4 1/36 1/36 1/36 1/36 1/36 1/36
λ = 5 1/36 1/36 1/36 1/36 1/36 1/36
λ = 6 1/36 1/36 1/36 1/36 1/36 1/36
• A table with indices λ and y
• Each cell denotes the probability p(λ, y)
Likelihood
p(D = 9|λ, y)
p(D = 9|λ, y) y = 1 y = 2 y = 3 y = 4 y = 5 y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1
λ = 4 0 0 0 0 1 0
λ = 5 0 0 0 1 0 0
λ = 6 0 0 1 0 0 0
• A table with indices λ and y
• The likelihood is not a probability distribution, but a positive function.
Likelihood × Prior
φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/36
λ = 4 0 0 0 0 1/36 0
λ = 5 0 0 0 1/36 0 0
λ = 6 0 0 1/36 0 0 0
Evidence
p(D = 9) = Σ_{λ,y} p(D = 9|λ, y) p(λ) p(y)
         = 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0
         = 1/9

p(D = 9|λ, y) p(λ) p(y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/36
λ = 4 0 0 0 0 1/36 0
λ = 5 0 0 0 1/36 0 0
λ = 6 0 0 1/36 0 0 0
Posterior
p(λ, y|D = 9) = (1/p(D = 9)) p(D = 9|λ, y) p(λ) p(y)

p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/4
λ = 4 0 0 0 0 1/4 0
λ = 5 0 0 0 1/4 0 0
λ = 6 0 0 1/4 0 0 0
1/4 = (1/36)/(1/9)
Marginal Posterior
p(λ|D) = Σ_y (1/p(D)) p(D|λ, y) p(λ) p(y)

        p(λ|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1   0            0       0       0       0       0       0
λ = 2   0            0       0       0       0       0       0
λ = 3   1/4          0       0       0       0       0       1/4
λ = 4   1/4          0       0       0       0       1/4     0
λ = 5   1/4          0       0       0       1/4     0       0
λ = 6   1/4          0       0       1/4     0       0       0

(The first column is the marginal p(λ|D = 9); the remaining columns show the joint posterior p(λ, y|D = 9).)
The “proportional to” notation
p(λ|D = 9) ∝ p(λ, D = 9) = Σ_y p(D = 9|λ, y) p(λ) p(y)

        p(λ, D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1   0             0       0       0       0       0       0
λ = 2   0             0       0       0       0       0       0
λ = 3   1/36          0       0       0       0       0       1/36
λ = 4   1/36          0       0       0       0       1/36    0
λ = 5   1/36          0       0       0       1/36    0       0
λ = 6   1/36          0       0       1/36    0       0       0

(The first column is p(λ, D = 9); the remaining columns show p(λ, y, D = 9). Normalisation can be deferred to the end.)
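The whole derivation above can be reproduced numerically by brute-force enumeration of the 36 outcome pairs. Below is a minimal MATLAB/Octave sketch (ours, not part of the original slides); the variable names are our own.

% Posterior p(lambda | D = 9) for two fair dice, by enumeration
prior      = ones(6, 6) / 36;         % p(lambda, y): each pair equally likely
[lam, y]   = ndgrid(1:6, 1:6);        % lam(i,j) = i, y(i,j) = j
lik        = double(lam + y == 9);    % p(D = 9 | lambda, y), a 0/1 table
phi        = lik .* prior;            % likelihood x prior, phi_D(lambda, y)
evidence   = sum(phi(:));             % p(D = 9) = 4/36 = 1/9
post_joint = phi / evidence;          % p(lambda, y | D = 9), entries 0 or 1/4
post_lam   = sum(post_joint, 2)       % marginal p(lambda | D = 9) = [0 0 1/4 1/4 1/4 1/4]'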
Exercise
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

1. Find the following quantities

• Marginals: p(x1), p(x2)
• Conditionals: p(x1|x2), p(x2|x1)
• Posterior: p(x1, x2 = 2), p(x1|x2 = 2)
• Evidence: p(x2 = 2)
• Normalisation constant: p({})
• Max: p(x1*) = max_{x1} p(x1|x2 = 1)
• Mode: x1* = argmax_{x1} p(x1|x2 = 1)
• Max-marginal: max_{x1} p(x1, x2)

2. Are x1 and x2 independent? (i.e., is p(x1, x2) = p(x1) p(x2)?)
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Marginals:

p(x1)
x1 = 1   0.6
x1 = 2   0.4

p(x2)   x2 = 1   x2 = 2
        0.4      0.6

• Conditionals:

p(x1|x2)   x2 = 1   x2 = 2
x1 = 1     0.75     0.5
x1 = 2     0.25     0.5

p(x2|x1)   x2 = 1   x2 = 2
x1 = 1     0.5      0.5
x1 = 2     0.25     0.75
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Posterior:

p(x1, x2 = 2)
x1 = 1   0.3
x1 = 2   0.3

p(x1|x2 = 2)
x1 = 1   0.5
x1 = 2   0.5

• Evidence:

p(x2 = 2) = Σ_{x1} p(x1, x2 = 2) = 0.6

• Normalisation constant:

p({}) = Σ_{x1} Σ_{x2} p(x1, x2) = 1
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Max: (get the value)
max_{x1} p(x1|x2 = 1) = 0.75

• Mode: (get the index)
argmax_{x1} p(x1|x2 = 1) = 1

• Max-marginal: (get the “skyline”)
max_{x1} p(x1, x2)   x2 = 1   x2 = 2
                     0.3      0.3
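All of these answers can be checked mechanically from the 2×2 table. A small MATLAB/Octave sketch (ours, not from the slides):

% Joint table p(x1, x2); rows index x1, columns index x2
P      = [0.3 0.3; 0.1 0.3];
p_x1   = sum(P, 2)                        % marginal p(x1) = [0.6; 0.4]
p_x2   = sum(P, 1)                        % marginal p(x2) = [0.4 0.6]
p_x1_given_x2 = P ./ repmat(p_x2, 2, 1)   % conditional p(x1|x2), columns sum to 1
p_x2_given_x1 = P ./ repmat(p_x1, 1, 2)   % conditional p(x2|x1), rows sum to 1
evidence      = sum(P(:, 2))              % p(x2 = 2) = 0.6
posterior     = P(:, 2) / evidence        % p(x1 | x2 = 2) = [0.5; 0.5]
[mx, ix]      = max(p_x1_given_x2(:, 1))  % max 0.75 and mode x1 = 1 of p(x1 | x2 = 1)
max_marginal  = max(P, [], 1)             % max-marginal over x1: [0.3 0.3]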
Another application of Bayes’ Theorem: “Model Selection”
Given an unknown number of fair dice with outcomes λ1, λ2, . . . , λn,

D = Σ_{i=1}^{n} λi

How many dice are there when D = 9? Assume that any number n is equally likely.
Another application of Bayes’ Theorem: “Model Selection”
Given that all n are equally likely (i.e., p(n) is flat), we calculate (formally)

p(n|D = 9) = p(D = 9|n) p(n) / p(D = 9) ∝ p(D = 9|n)

p(D|n = 1) = Σ_{λ1} p(D|λ1) p(λ1)
p(D|n = 2) = Σ_{λ1} Σ_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)
. . .
p(D|n = n′) = Σ_{λ1,...,λn′} p(D|λ1, . . . , λn′) Π_{i=1}^{n′} p(λi)

In general, p(D|n) = Σ_λ p(D|λ, n) p(λ|n)
[Figure: the distributions p(D|n = 1), . . . , p(D|n = 5), plotted over D = 1, . . . , 20.]
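The evidence terms p(D|n) can be computed exactly by repeated convolution of the single-die distribution, since the sum of n independent dice has the n-fold convolution as its distribution. A small MATLAB/Octave sketch (ours, not from the slides); the cut-off nmax = 9 is our choice.

% p(D | n) for n fair dice by n-fold convolution, then p(n | D = 9) under a flat p(n)
die  = ones(1, 6) / 6;                 % distribution of a single die, support 1..6
nmax = 9;
ev   = zeros(1, nmax);                 % ev(n) = p(D = 9 | n)
pD   = die;                            % distribution of the sum of the first n dice
for n = 1:nmax
    if n > 1, pD = conv(pD, die); end  % adding a die convolves the distributions
    support = n:(6 * n);               % possible values of the sum of n dice
    idx = find(support == 9);
    if ~isempty(idx), ev(n) = pD(idx); end
end
post_n = ev / sum(ev)                  % p(n | D = 9), proportional to p(D = 9 | n)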
Another application of Bayes’ Theorem: “Model Selection”
[Figure: the posterior p(n|D = 9) over n = number of dice, n = 1, . . . , 9.]
• Complex models are more flexible but they spread their probability mass
• Bayesian inference inherently prefers “simpler models” – Occam’s razor
• Computational burden: We need to sum over all parameters λ
Probabilistic Inference
A huge spectrum of applications – all boil down to computation of
• expectations of functions under probability distributions: Integration

⟨f(x)⟩ = ∫_X dx p(x) f(x)        ⟨f(x)⟩ = Σ_{x∈X} p(x) f(x)

• modes of functions under probability distributions: Optimization

x* = argmax_{x∈X} p(x) f(x)

• any “mix” of the above, e.g.,

x* = argmax_{x∈X} p(x) = argmax_{x∈X} ∫ dz p(z) p(x|z)
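Monte Carlo methods approximate such integrals by averaging f over samples drawn from p(x): ⟨f(x)⟩ ≈ (1/N) Σ_i f(x^(i)) with x^(i) ∼ p(x). A minimal MATLAB/Octave sketch (ours), estimating ⟨x²⟩ under a standard Gaussian; the choice of f and p here is purely illustrative.

% Monte Carlo estimate of <f(x)> = int dx p(x) f(x), with p(x) = N(x; 0, 1), f(x) = x^2
N   = 100000;
x   = randn(N, 1);             % samples x^(i) ~ p(x)
f   = x .^ 2;                  % f evaluated at the samples
est = mean(f)                  % should be close to the exact value 1
err = std(f) / sqrt(N)         % Monte Carlo standard error, O(1/sqrt(N))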
Graphical Models
“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.” A.N. Whitehead
Graphical Models
• formal languages for specification of probability distributions and associated inference algorithms
• historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge
• today, a standard tool in machine learning, statistics and signal processing
Graphical Models
• provide graph based algorithms for derivations and computation
• pedagogical insight/motivation for model/algorithm construction
– Statistics:
“Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”
– Signal processing:
“Fast Fourier transform is an instance of sum-product algorithm on a factor graph”
– Computer Science:
“Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”
• Automated tools for code generation start to emerge, making the design/implement/test cycle shorter
Important types of Graphical Models
• Useful for Model Construction
– Directed Acyclic Graphs (DAG), Bayesian Networks
– Undirected Graphs, Markov Networks, Random Fields
– Influence diagrams
– ...
• Useful for Inference
– Factor Graphs
– Junction/Clique graphs
– Region graphs
– ...
Directed Acyclic Graphical (DAG) Models
Factor Graphs and Directed Graphical Models
• Each random variable is associated with a node in the graph,
• We draw an arrow A → B if A appears as a conditioning variable in p(B| . . . , A, . . . ), i.e., A ∈ parent(B)
• The edges tell us qualitatively about the factorization of the joint probability
• For N random variables x1, . . . , xN, the distribution admits the factorisation

p(x1, . . . , xN) = Π_{i=1}^{N} p(xi | parent(xi))
• Describes in a compact way an algorithm to “generate” the data – “Generative models”
DAG Example: Two dice
[DAG: λ → D ← y, with priors p(λ), p(y) and observation model p(D|λ, y).]

p(D, λ, y) = p(D|λ, y) p(λ) p(y)
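The factorisation directly describes how to “generate” data: sample each node given its parents, in topological order (ancestral sampling). A small MATLAB/Octave sketch (ours, not from the slides); the sample size is arbitrary.

% Ancestral sampling from the two-dice DAG: lambda -> D <- y
Nsamp  = 10000;
lambda = randi(6, Nsamp, 1);       % lambda ~ p(lambda)
y      = randi(6, Nsamp, 1);       % y      ~ p(y)
D      = lambda + y;               % D given lambda, y (deterministic observation model)
p_hat  = mean(D == 9)              % empirical estimate of p(D = 9), close to 1/9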
DAG with observations
[DAG: λ → D ← y, with D = 9 observed.]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)
Examples
Model        Factorization
Full         p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)
Markov(2)    p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)
Markov(1)    p(x1) p(x2|x1) p(x3|x2) p(x4|x3)
             p(x1) p(x2|x1) p(x3|x1) p(x4)
Factorized   p(x1) p(x2) p(x3) p(x4)

(The graph drawings of the Structure column are not reproduced here.)
Removing edges eliminates a term from the conditional probability factors.
Undirected Graphical Models
• Define a distribution by local compatibility functions φ(xα)

p(x) = (1/Z) Π_α φ(xα)

where α runs over cliques: fully connected subsets
• Markov Random Fields
Undirected Graphical Models
• Examples
[Two example graphs over x1, x2, x3, x4.]

p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)        p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)
Factor graphs
(Kschischang et al.)

• A bipartite graph. A powerful graphical representation of the inference problem
– Factor nodes: black squares. Factor potentials (local functions) defining the posterior.
– Variable nodes: white nodes. Define collections of random variables.
– Edges: denote membership. A variable node is connected to a factor node if a member variable is an argument of the local function.

[Factor graph for the two-dice example: factors p(λ), p(y) and p(D = 9|λ, y) attached to variables λ and y.]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y) = φ1(λ, y) φ2(λ) φ3(y)
Exercise
• For the following Graphical models, write down the factors of the joint distribution and plot an equivalent factor graph.
[Graphical models shown: Full, Markov(1), HMM, MIX, IFA, Factorized. The graph drawings are not reproduced here.]
Answer (Markov(1))
p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

[Factor graph: variable nodes x1, x2, x3, x4 connected to factor nodes p(x1), p(x2|x1), p(x3|x2), p(x4|x3).]

Grouping factors pairwise gives an equivalent factor graph with

φ(x1, x2) = p(x1) p(x2|x1),   φ(x2, x3) = p(x3|x2),   φ(x3, x4) = p(x4|x3)
Answer (IFA – Factorial)
p(h1) p(h2) Π_{i=1}^{4} p(xi|h1, h2)

[Factor graph: hidden variables h1, h2 connected to each observation factor; observed variables x1, x2, x3, x4.]
Answer (IFA – Factorial)
[Factor graph with separate variable nodes h1 and h2.]

• We can also cluster nodes together

[Factor graph with a single combined variable node (h1, h2) connected to x1, x2, x3, x4.]
Inference and Learning
• Data set
• Data set
D = {x1, . . . , xN}

• Model with parameter λ
p(D|λ)

• Maximum Likelihood (ML)
λ_ML = argmax_λ log p(D|λ)

• Predictive distribution
p(x_{N+1}|D) ≈ p(x_{N+1}|λ_ML)
Regularisation
• Prior
p(λ)

• Maximum a-posteriori (MAP): regularised maximum likelihood
λ_MAP = argmax_λ log p(D|λ) p(λ)

• Predictive distribution
p(x_{N+1}|D) ≈ p(x_{N+1}|λ_MAP)
Bayesian Learning
• We treat parameters on the same footing as all other variables
• We integrate over unknown parameters rather than using point estimates (remember the many-dice example)
– Avoids overfitting
– Natural setup for online adaptation
– Model selection
Bayesian Learning
• Predictive distribution

p(x_{N+1}|D) = ∫ dλ p(x_{N+1}|λ) p(λ|D)

[DAG: parameter λ with children x1, x2, . . . , xN, x_{N+1}.]
• Bayesian learning is just inference ...
Some Applications
Medical Expert Systems
[DAG of the “Asia” network. Causes: A (visit to Asia), S (smoking). Diseases: T (tuberculosis), L (lung cancer), B (bronchitis); E (either T or L). Symptoms: X (positive X-ray), D (dyspnoea).]
Medical Expert Systems
Visit to Asia? Smoking?
Tuberculosis? Lung Cancer? Bronchitis?
Either T or L?
Positive X Ray? Dyspnoea?
Medical Expert Systems
Prior marginal probabilities (no evidence entered):

Node              0        1
Visit to Asia?    99%      1%
Smoking?          50%      50%
Tuberculosis?     99%      1%
Lung Cancer?      94.5%    5.5%
Bronchitis?       55%      45%
Either T or L?    93.5%    6.5%
Positive X Ray?   89%      11%
Dyspnoea?         56.4%    43.6%
Medical Expert Systems
Marginals after observing a positive X-ray (Positive X Ray = 1):

Node              0        1
Visit to Asia?    98.7%    1.3%
Smoking?          31.2%    68.8%
Tuberculosis?     90.8%    9.2%
Lung Cancer?      51.1%    48.9%
Bronchitis?       49.4%    50.6%
Either T or L?    42.4%    57.6%
Positive X Ray?   0%       100%
Dyspnoea?         35.9%    64.1%
Medical Expert Systems
Marginals after additionally observing Smoking = 0 (non-smoker):

Node              0        1
Visit to Asia?    98.5%    1.5%
Smoking?          100%     0%
Tuberculosis?     85.2%    14.8%
Lung Cancer?      85.8%    14.2%
Bronchitis?       70%      30%
Either T or L?    71.1%    28.9%
Positive X Ray?   0%       100%
Dyspnoea?         56%      44%
Model Selection: Variable selection in Polynomial Regression
• Given D = {tj, x(tj)}_{j=1...J}, what is the order N of the polynomial?

x(t) = Σ_{i=0}^{N} s_{i+1} t^i + ε(t)

[Figure: noisy observations of a polynomial over t ∈ [−1, 1].]
Bayesian Variable Selection
[DAG: indicators r1, . . . , rW with priors C(ri; π); coefficients s1, . . . , sW with si|ri ∼ N(si; µ(ri), Σ(ri)); observation x ∼ N(x; C s_{1:W}, R).]

• Generalized Linear Model – columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable.
Regression
t = [t1 t2 . . . tJ]^⊤        C ≡ [t^0 t^1 . . . t^{W−1}]
>> C = fliplr(vander(0:4)) % Vandermonde matrix
1 0 0 0 0
1 1 1 1 1
1 2 4 8 16
1 3 9 27 81
1 4 16 64 256
ri ∼ C(ri; 0.5, 0.5)          ri ∈ {on, off}
si|ri ∼ N(si; 0, Σ(ri))        x|s_{1:W} ∼ N(x; C s_{1:W}, R)
Σ(ri = on) ≫ Σ(ri = off)
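The generative model can be simulated directly; with the Vandermonde basis this yields synthetic regression data in which only a few basis functions are “on”. A rough MATLAB/Octave sketch (ours, not from the slides); the observation points, variances and noise level are illustrative assumptions.

% Simulate the variable-selection regression model (illustrative values)
t         = (0:4)';                        % observation points (assumed)
C         = fliplr(vander(0:4));           % Vandermonde basis: columns 1, t, t^2, ...
W         = size(C, 2);
Sigma_on  = 10;  Sigma_off = 1e-4;         % Sigma(r = on) >> Sigma(r = off)
R         = 0.01;                          % observation noise variance (assumed)
r         = rand(W, 1) < 0.5;              % r_i ~ C(r_i; 0.5, 0.5), on/off indicators
Sigma     = Sigma_off * ones(W, 1);
Sigma(r)  = Sigma_on;
s         = sqrt(Sigma) .* randn(W, 1);    % s_i | r_i ~ N(0, Sigma(r_i))
x         = C * s + sqrt(R) * randn(length(t), 1)   % x | s ~ N(C s, R)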
Regression
To find the “active” basis functions we need to calculate

r*_{1:W} ≡ argmax_{r_{1:W}} p(r_{1:W}|x) = argmax_{r_{1:W}} ∫ ds_{1:W} p(x|s_{1:W}) p(s_{1:W}|r_{1:W}) p(r_{1:W})

Then, the reconstruction is given by

x̂(t) = ⟨ Σ_{i=0}^{W−1} s_{i+1} t^i ⟩_{p(s_{1:W}|x, r*_{1:W})} = Σ_{i=0}^{W−1} ⟨s_{i+1}⟩_{p(s_{i+1}|x, r*_{1:W})} t^i
Regression
[Figure: p(x, r_{1:W}) evaluated for every on/off configuration r_{1:W}, ordered from “all on” to “all off”.]
Regression
[Figure: data, the true function and the approximation over t ∈ [−1, 1].]
Clustering
Clustering
π                   Label probability
c1 c2 . . . cN      Labels ∈ {a, b}
x1 x2 . . . xN      Data points
µa µb               Cluster centers

(µa*, µb*, π*) = argmax_{µa, µb, π} Σ_{c_{1:N}} Π_{i=1}^{N} p(xi|µa, µb, ci) p(ci|π)
Computer vision / Cognitive Science
How many rectangles are there in this image?
[Image: a scene composed of overlapping rectangles.]
Computer vision / Cognitive Science
π1 π2 . . . πN      Label probabilities
c1 c2 . . . cN      Labels ∈ {a, b, . . .}
x1 x2 . . . xN      Pixel values
µa µb . . .         Rectangle colors

[Image: the same rectangle scene.]
Computer Vision
How many people are there in these images?
Visual Tracking
[Figure: image frames from the sequences used in the people-counting / visual tracking example.]
Navigation, Robotics
[Figure: navigation example – estimated trajectories and feature points; one panel is labelled with axes f, Lx, Ly.]
Navigation, Robotics
GPS?_t             GPS status
G_t                GPS reading
...                Other sensors (magnetic, pressure, etc.)
l_t                Linear acceleration sensor
ω_t                Gyroscope
E_{t−1}, E_t       Attitude variables
X_{t−1}, X_t       Linear kinematic variables
{ξ_{1:N_t}}_t      Set of feature points (camera frame)
{x_{1:M_t}}_t      Set of feature points (world coordinates)
ρ(x)               Global static map (intensity function)
Time series models and Inference, Terminology
Generic structure of dynamical system models
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
xk ∼ p(xk|xk−1)   Transition Model
yk ∼ p(yk|xk)     Observation Model
• x are the latent states
• y are the observations
• In a full Bayesian setting, x includes unknown model parameters
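As a concrete instance, a scalar linear-Gaussian state-space model (a Kalman filter model) can be simulated directly from the two conditionals. A minimal MATLAB/Octave sketch (ours, not from the slides); the parameter values are arbitrary.

% Simulate x_k ~ p(x_k | x_{k-1}) and y_k ~ p(y_k | x_k) for a scalar linear-Gaussian model
K  = 100;                      % number of time steps
A  = 0.9;  Q = 0.1;            % transition:  x_k = A x_{k-1} + N(0, Q)
Cc = 1.0;  R = 0.5;            % observation: y_k = Cc x_k    + N(0, R)
x  = zeros(K, 1);  y = zeros(K, 1);
xprev = 0;                     % initial state x_0
for k = 1:K
    x(k)  = A * xprev + sqrt(Q) * randn;   % transition model
    y(k)  = Cc * x(k) + sqrt(R) * randn;   % observation model
    xprev = x(k);
end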
Online Inference, Terminology
• Filtering: p(x_k|y_{1:k})
– Distribution of the current state given all past information
– Realtime/Online/Sequential Processing
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Online Inference, Terminology
• Prediction: p(y_{k:K}, x_{k:K}|y_{1:k−1})
– evaluation of possible future outcomes; like filtering without observations
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Tracking, Restoration
Offline Inference, Terminology
• Smoothing: p(x_{0:K}|y_{1:K}) – better estimate of past states, essential for learning
• Most likely trajectory – Viterbi path: argmax_{x_{0:K}} p(x_{0:K}|y_{1:K})
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Interpolation: p(yk, xk|y_{1:k−1}, y_{k+1:K}) – fill in lost observations given past and future
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Time Series Analysis
• Stationary
[Figure: a stationary time series.]

– What is the true state of the process given noisy data?
– Parameters?
– Markovian? Order?
Time Series Analysis
• Nonstationary, time varying variance – stochastic volatility
[Figure: time-varying variance v_k (top) and observations y_k (bottom); true values and VB estimates.]
Time Series Analysis
• Nonstationary, time varying intensity – nonhomogeneous Poisson Process
[Figure: time-varying intensity λ_k and arrival times c_k; true values and VB estimates.]
Time Series Analysis
• Piecewise constant
[Figure: a piecewise constant signal observed in noise.]
Time Series Analysis
• Piecewise linear
[Figure: a piecewise linear signal observed in noise.]

• Segmentation and changepoint detection
– What is the true state of the process given noisy data?
– Where are the changepoints?
– How many changepoints?
Audio Processing
[Figure: waveforms x_t of a speech signal and a piano signal.]

x = x1 . . . xt . . .
Audio Restoration
• During download or transmission, some samples of audio are lost
• Estimate missing samples given clean ones
[Figure: an audio waveform with missing samples.]
Examples: Audio Restoration
p(x_¬κ|x_κ) ∝ ∫ dH p(x_¬κ|H) p(x_κ|H) p(H)        H ≡ (parameters, hidden states)

[DAG: H with children x_¬κ (missing) and x_κ (observed).]
Probabilistic Phase Vocoder
(Cemgil and Godsill 2005)

[Model: for each band ν = 0 . . . W−1, a chain s^ν_0 → · · · → s^ν_k → · · · → s^ν_{K−1} with parameters A^ν, Q^ν; observations x_0, . . . , x_k, . . . , x_{K−1}.]

s^ν_k ∼ N(s^ν_k; A^ν s^ν_{k−1}, Q^ν)
A^ν ∼ N(A^ν; [cos(ω^ν) −sin(ω^ν); sin(ω^ν) cos(ω^ν)], Ψ)
Restoration
• Piano
– Signal with missing samples (37%)
– Reconstruction, 7.68 dB improvement
– Original
• Trumpet
– Signal with missing samples (37%)
– Reconstruction, 7.10 dB improvement
– Original
Pitch Tracking
Monophonic pitch tracking = online estimation (filtering) of p(rt, mt|y_{1:t}).

[Figure: an audio waveform (top) and the estimated pitch index over time (bottom).]
Pitch Tracking
r0 r1 . . . rT
m0 m1 . . . mT
s0 s1 . . . sT
y1 . . . yT
Monophonic transcription
• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TSALP)

[Figure: exact inference results on a signal (S).]
Tracking Pitch Variations
• Allow m to change with k.
[Figure: example signal with varying pitch.]
• Intractable, need to resort to approximate inference (Mixture Kalman Filter - Rao-Blackwellized Particle Filter)
Source Separation
[Model: sources s_{k,1} . . . s_{k,n} . . . s_{k,N}, observations x_{k,1} . . . x_{k,M}, for k = 1 . . . K, with mixing and noise parameters a_1, r_1, . . . , a_M, r_M.]

• Joint estimation of sources, channel noise and the mixing system: x_{k,1:M} ∼ N(x_{k,1:M}; A s_{k,1:N}, R)
Spectrogram
[Spectrograms (t/sec vs. f/Hz): Speech and Piano.]

• A linear expansion using a collection of basis functions φ(t; τ, ω) centered around time τ and frequency ω

x_t = Σ_{τ,ω} α(τ, ω) φ(t; τ, ω)

• The spectrogram displays log |α(τ, ω)|² or |α(τ, ω)|²
Source Separation
[Spectrograms (t/sec vs. f/Hz) of the sources – Speech, Piano, Guitar – and of the Mix.]
Reconstructions
[Spectrograms (t/sec vs. f/Hz) of the reconstructed sources: Speech, Piano, Guitar.]
Polyphonic Music Transcription
• from sound ...
[Spectrogram (t/sec vs. f/Hz) of the signal (S).]
• ... to score
Generative Models for Music
Score Expression
Piano-Roll
Signal
Hierarchical Modeling of Music
[Hierarchical graphical model relating the score, expression, piano-roll and signal layers.]
A few non-Bayesian applications where Monte Carlo is useful
Combinatorics
• Counting
Example: What is the probability that a solitaire laid out with 52 cards comes out successfully, given that all permutations have equal probability?

|A| = Σ_{x∈X} [x ∈ A]        [x ∈ A] ≡ 1 if x ∈ A, 0 if x ∉ A

p(x ∈ A) = |A| / |X| = ?        (here |X| = 52! ≈ 2^225)
Geometry
• Given a simplex S in N-dimensional space, S = {x : Ax ≤ b, x ∈ R^N}, find the volume |S|
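A standard Monte Carlo approach here is “hit or miss”: enclose S in a box of known volume, sample uniformly in the box, and count the fraction of samples satisfying Ax ≤ b. A MATLAB/Octave sketch (ours); the particular A, b and bounding box below are illustrative assumptions (a 2-D triangle with true area 0.5).

% Hit-or-miss Monte Carlo estimate of the volume of S = {x : A x <= b}
% Illustrative example: the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1
A = [-1 0; 0 -1; 1 1];  b = [0; 0; 1];
N = 100000;
X = rand(N, 2);                               % uniform samples in the unit box [0,1]^2
inside  = all(A * X' <= repmat(b, 1, N), 1);  % check A x <= b for every sample
vol_box = 1;                                  % volume of the bounding box
vol_est = vol_box * mean(inside)              % estimate of |S|, close to 0.5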
Rare Events
• Given a graph with random edge lengths xi ∼ p(xi), find the probability that the shortest path from A to B is larger than γ.

[Graph with terminal nodes A and B and edges x1, x2, x3, x4, x5.]
Rare Events
x1 x2 x3 x4 x5      Edge lengths
L                   ShortestPath(A, B)

Pr(L ≥ γ) = ∫ dx_{1:5} [L(x_{1:5}) ≥ γ] p(x_{1:5})
Rare Events
[Graph from A to B with mean edge lengths ⟨x1⟩ = 4, ⟨x2⟩ = 1, ⟨x3⟩ = 1, ⟨x4⟩ = 1, ⟨x5⟩ = 4.]

xi ∼ E(xi; ui) ≡ (1/ui) exp(−xi/ui)        ui = ⟨xi⟩ ≡ ∫ xi p(xi) dxi

[Histogram: Monte Carlo samples of Shortest-Path(A, B) versus count.]
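A crude Monte Carlo estimate simply samples the edge lengths and counts how often the shortest path exceeds γ (for very small probabilities one would switch to importance sampling, covered later). A MATLAB/Octave sketch (ours); since the slide does not give the exact topology, the set of A-to-B paths below and the threshold γ are assumptions.

% Crude Monte Carlo estimate of Pr(shortest path from A to B >= gamma)
u     = [4 1 1 1 4];                           % mean edge lengths <x_1>, ..., <x_5>
gamma = 6;                                     % threshold (illustrative)
N     = 100000;
X     = -log(rand(N, 5)) .* repmat(u, N, 1);   % x_i ~ Exponential with mean u_i
% Assumed A-to-B paths: (x1,x2), (x4,x5), (x1,x3,x5), (x4,x3,x2)
L = min([X(:,1)+X(:,2), X(:,4)+X(:,5), X(:,1)+X(:,3)+X(:,5), X(:,4)+X(:,3)+X(:,2)], [], 2);
p_hat = mean(L >= gamma)                       % Monte Carlo estimate of the probability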
Probability Models
Example: AR(1) model
[Figure: a realization of the AR(1) process.]

xk = A xk−1 + εk        k = 1 . . . K

εk is i.i.d., zero mean and normal with variance R.

Estimation problem:
Given x0, . . . , xK, determine the coefficient A and the variance R (both scalars).
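For this model the maximum likelihood estimates have a closed form (least squares for A, residual variance for R). A small MATLAB/Octave sketch (ours, not from the slides) that simulates data and recovers the parameters; the true values are arbitrary.

% Simulate an AR(1) process and estimate A and R by maximum likelihood
K      = 1000;
A_true = 0.9;  R_true = 0.01;
x      = zeros(K + 1, 1);                      % x(1) plays the role of x_0
for k = 2:K + 1
    x(k) = A_true * x(k - 1) + sqrt(R_true) * randn;
end
xp   = x(1:end - 1);  xc = x(2:end);           % (x_{k-1}, x_k) pairs
A_ml = (xp' * xc) / (xp' * xp)                 % ML / least-squares estimate of A
R_ml = mean((xc - A_ml * xp) .^ 2)             % ML estimate of the noise variance R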
AR(1) model, Generative Model notation
A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
xk|xk−1, A, R ∼ N(xk; A xk−1, R)        x0 = x̂0

[DAG: parameters A and R with children x1, . . . , xK; x0, x1, . . . , xK form a chain.]

Observed variables are shown with double circles
Example, Univariate Gaussian
The Gaussian distribution with mean m and covariance S has the form

N(x; m, S) = (2πS)^{−1/2} exp{−(1/2)(x − m)²/S}
           = exp{−(1/2)(x² + m² − 2xm)/S − (1/2) log(2πS)}
           = exp{ (m/S) x − (1/(2S)) x² − (1/2) log(2πS) − (1/(2S)) m² }
           = exp{ [m/S, −1/(2S)] · [x, x²]^⊤ − c(θ) }        θ = [m/S, −1/(2S)]^⊤,  ψ(x) = [x, x²]^⊤

Hence by matching coefficients we have

exp{ −(1/2) K x² + h x + g }   ⇔   S = K^{−1},   m = K^{−1} h
Example, Gaussian
The Multivariate Gaussian Distribution
µ is the mean and P is the covariance:
N(s; µ, P) = |2πP|^{−1/2} exp{−(1/2)(s − µ)^T P^{−1} (s − µ)}
           = exp{−(1/2) s^T P^{−1} s + µ^T P^{−1} s − (1/2) µ^T P^{−1} µ − (1/2) log |2πP|}

log N(s; µ, P) = −(1/2) s^T P^{−1} s + µ^T P^{−1} s + const
              = −(1/2) Tr P^{−1} s s^T + µ^T P^{−1} s + const
              =⁺ −(1/2) Tr P^{−1} s s^T + µ^T P^{−1} s

Notation: log f(x) =⁺ g(x) ⟺ f(x) ∝ exp(g(x)) ⟺ ∃ c ∈ R : f(x) = c exp(g(x))

log p(s) =⁺ −(1/2) Tr K s s^T + h^⊤ s   ⇒   p(s) = N(s; K^{−1} h, K^{−1})
Example, Inverse Gamma
The inverse Gamma distribution with shape a and scale b:

IG(r; a, b) = (1/Γ(a)) r^{−(a+1)} b^{−a} exp(−1/(b r))
            = exp{ −(a + 1) log r − (1/b)(1/r) − log Γ(a) − a log b }
            = exp{ [−(a + 1), −1/b] · [log r, 1/r]^⊤ − log Γ(a) − a log b }

Hence by matching coefficients, we have

exp{ α log r + β (1/r) + c }   ⇔   a = −α − 1,   b = −1/β
Example, Inverse Gamma
[Figure: inverse Gamma densities for (a = 1, b = 1), (a = 1, b = 0.5) and (a = 2, b = 1).]
Basic Distributions : Exponential Family
• Following distributions are used often as elementary building blocks:
– Gaussian
– Gamma, Inverse Gamma, (Exponential, Chi-square, Wishart) – Dirichlet
– Discrete (Categorical), Bernoulli, multinomial
• All of those distributions can be written as
p(x|θ) = exp{θ^⊤ ψ(x) − c(θ)}

c(θ) = log ∫ dx exp(θ^⊤ ψ(x))        log-partition function
θ                                     canonical parameters
ψ(x)                                  sufficient statistics
Conjugate priors: Posterior is in the same family as the prior.
Example: posterior inference for the varianceR of a zero mean Gaussian.
p(x|R) = N(x; 0, R)        p(R) = IG(R; a, b)

p(R|x) ∝ p(R) p(x|R)
        ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(x²/2)(1/R) − (1/2) log R }
        = exp{ [−(a + 1 + 1/2), −(1/b + x²/2)] · [log R, 1/R]^⊤ }
        ∝ IG(R; a + 1/2, 2/(x² + 2/b))

Like the prior, this is an inverse Gamma distribution.
Conjugate priors: Posterior is in the same family as the prior.
Example: posterior inference of the variance R from x1, . . . , xN.
R
x1 x2 . . . xN xN +1
p(R|x_{1:N}) ∝ p(R) Π_{i=1}^{N} p(xi|R)
            ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(1/2)(Σ_i xi²)(1/R) − (N/2) log R }
            = exp{ [−(a + 1 + N/2), −(1/b + (1/2) Σ_i xi²)] · [log R, 1/R]^⊤ }
            ∝ IG(R; a + N/2, 2/(Σ_i xi² + 2/b))

Sufficient statistics are additive.
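The update can be checked numerically: the posterior parameters depend on the data only through N and Σ_i xi². A small MATLAB/Octave sketch (ours, not from the slides); the prior values, the true variance and the sample size are illustrative assumptions.

% Conjugate posterior IG(R; a + N/2, 2/(sum(x.^2) + 2/b)) for a zero-mean Gaussian
a = 1;  b = 1;                       % prior p(R) = IG(R; a, b)
R_true = 2;
N = 100;
x = sqrt(R_true) * randn(N, 1);      % data x_i ~ N(0, R_true)
a_post = a + N / 2;                  % posterior shape
b_post = 2 / (sum(x .^ 2) + 2 / b);  % posterior scale, same parametrisation as above
post_mean_R = 1 / (b_post * (a_post - 1))   % posterior mean of R, close to R_true for large N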
Inverse Gamma, Σ_i xi² = 10, N = 10

[Figure: the posterior inverse Gamma density for Σ_i xi² = 10, N = 10.]
Inverse Gamma, Σ_i xi² = 100, N = 100

[Figure: the posterior inverse Gamma density for Σ_i xi² = 100, N = 100.]