An Introduction to
Graphical Models and Monte Carlo methods
A. Taylan Cemgil
Signal Processing and Communications Lab.
Birkbeck School of Economics, Mathematics and Statistics
June 19, 2007
Goals of this Tutorial
To Provide ...
• a basic understanding of underlying principles of probabilistic modeling and inference
• an introduction to Graphical models and associated concepts
• a succinct overview of (perhaps interesting) applications from engineering and computer science
– Statistical Signal Processing, Pattern Recognition
– Machine Learning, Artificial Intelligence
• an initial orientation in the broad literature of Monte Carlo methods
First Part, Basic Concepts and MCMC
• Introduction
– Bayes’ Theorem,
– Trivial toy example to clarify notation
• Graphical Models
– Bayesian Networks
– Undirected Graphical models, Markov Random Fields
– Factor graphs
• Maximum Likelihood and Bayesian Learning
• Some Applications
– (classical AI) Medical Expert systems, (Statistics) Variable selection, (Engineering-CS) Computer vision,
– Time Series – terminology and applications
– Audio processing
– Non Bayesian applications
• Probability Models
– Exponential family, Conjugacy
– Motivation for Approximate Inference
• Markov Chain Monte Carlo
– A Gaussian toy example
– The Gibbs sampler
– Sketch of Markov Chain theory
– Metropolis-Hastings, MCMC Transition Kernels
– Sketch of Convergence proofs for Metropolis-Hastings and the Gibbs sampler
– Optimisation versus Integration: Simulated annealing and iterative improvement
Second Part, Time Series Models and SMC
• Latent State-Space Models
– Hidden Markov Models (HMM)
– Kalman Filter Models
– Switching State Space models
– Changepoint models
• Inference in HMM
– Forward-Backward Algorithm
– Viterbi
– Exact inference in Graphical models by message passing
• Sequential Monte Carlo
– Importance Sampling
– Particle Filtering
• Final Remarks and Bibliography
Bayes’ Theorem
Thomas Bayes (1702-1761)
“What you know about a parameter λ after the data D arrive is what you knew before about λ and what the data D told you.”¹

p(λ|D) = p(D|λ) p(λ) / p(D)

Posterior = Likelihood × Prior / Evidence
¹ (Jaynes 2003, ed. by Bretthorst; MacKay 2003)
An application of Bayes’ Theorem: “Source Separation”
Given two fair dice with outcomes λ and y,
D = λ + y
What is λ when D = 9 ?
An application of Bayes’ Theorem: “Source Separation”
D = λ + y = 9
D = λ + y   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1       2       3       4       5       6       7
λ = 2       3       4       5       6       7       8
λ = 3       4       5       6       7       8       9
λ = 4       5       6       7       8       9       10
λ = 5       6       7       8       9       10      11
λ = 6       7       8       9       10      11      12

Bayes’ theorem “upgrades” p(λ) into p(λ|D).
But you have to provide an observation model: p(D|λ)
“Bureaucratic” derivation
Formally we write
p(λ) = C(λ; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(y) = C(y; [ 1/6 1/6 1/6 1/6 1/6 1/6 ])
p(D|λ, y) = δ(D − (λ + y))

• δ is the Kronecker delta function, denoting a degenerate (deterministic) distribution:
δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0
p(λ, y|D) = (1/p(D)) × p(D|λ, y) × p(y) p(λ)        Posterior = (1/Evidence) × Likelihood × Prior

p(λ|D) = Σ_y p(λ, y|D)                              Posterior Marginal

Prior
p(y) p(λ)

p(y) × p(λ)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 1/36 1/36 1/36 1/36 1/36 1/36
λ = 2 1/36 1/36 1/36 1/36 1/36 1/36
λ = 3 1/36 1/36 1/36 1/36 1/36 1/36
λ = 4 1/36 1/36 1/36 1/36 1/36 1/36
λ = 5 1/36 1/36 1/36 1/36 1/36 1/36
λ = 6 1/36 1/36 1/36 1/36 1/36 1/36
• A table with indices λ and y
• Each cell denotes the probability p(λ, y)
Likelihood
p(D = 9|λ, y)
p(D = 9|λ, y) y = 1 y = 2 y = 3 y = 4 y = 5 y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1
λ = 4 0 0 0 0 1 0
λ = 5 0 0 0 1 0 0
λ = 6 0 0 1 0 0 0
• A table with indices λ and y
• The likelihood is not a probability distribution, but a positive function.
Likelihood × Prior
φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)

φ_D(λ, y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/36
λ = 4 0 0 0 0 1/36 0
λ = 5 0 0 0 1/36 0 0
λ = 6 0 0 1/36 0 0 0
Evidence
p(D = 9) = Σ_{λ,y} p(D = 9|λ, y) p(λ) p(y)
         = 0 + 0 + · · · + 1/36 + 1/36 + 1/36 + 1/36 + 0 + · · · + 0
         = 1/9

p(D = 9|λ, y) p(λ) p(y)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/36
λ = 4 0 0 0 0 1/36 0
λ = 5 0 0 0 1/36 0 0
λ = 6 0 0 1/36 0 0 0
Posterior
p(λ, y|D = 9) = (1/p(D = 9)) p(D = 9|λ, y) p(λ) p(y)

p(λ, y|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1 0 0 0 0 0 0
λ = 2 0 0 0 0 0 0
λ = 3 0 0 0 0 0 1/4
λ = 4 0 0 0 0 1/4 0
λ = 5 0 0 0 1/4 0 0
λ = 6 0 0 1/4 0 0 0
1/4 = (1/36)/(1/9)
Marginal Posterior
p(λ|D) = Σ_y (1/p(D)) p(D|λ, y) p(λ) p(y)

        p(λ|D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1   0            0       0       0       0       0       0
λ = 2   0            0       0       0       0       0       0
λ = 3   1/4          0       0       0       0       0       1/4
λ = 4   1/4          0       0       0       0       1/4     0
λ = 5   1/4          0       0       0       1/4     0       0
λ = 6   1/4          0       0       1/4     0       0       0

(The first column is the marginal p(λ|D = 9); the remaining columns show the joint posterior p(λ, y|D = 9).)
The “proportional to” notation
p(λ|D = 9) ∝ p(λ, D = 9) = Σ_y p(D = 9|λ, y) p(λ) p(y)

        p(λ, D = 9)   y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
λ = 1   0             0       0       0       0       0       0
λ = 2   0             0       0       0       0       0       0
λ = 3   1/36          0       0       0       0       0       1/36
λ = 4   1/36          0       0       0       0       1/36    0
λ = 5   1/36          0       0       0       1/36    0       0
λ = 6   1/36          0       0       1/36    0       0       0

(The first column is p(λ, D = 9); the remaining columns show p(λ, y, D = 9). Normalisation can be deferred to the end.)
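The whole derivation above can be reproduced numerically by brute-force enumeration of the 36 outcome pairs. Below is a minimal MATLAB/Octave sketch (ours, not part of the original slides); the variable names are our own.

% Posterior p(lambda | D = 9) for two fair dice, by enumeration
prior      = ones(6, 6) / 36;         % p(lambda, y): each pair equally likely
[lam, y]   = ndgrid(1:6, 1:6);        % lam(i,j) = i, y(i,j) = j
lik        = double(lam + y == 9);    % p(D = 9 | lambda, y), a 0/1 table
phi        = lik .* prior;            % likelihood x prior, phi_D(lambda, y)
evidence   = sum(phi(:));             % p(D = 9) = 4/36 = 1/9
post_joint = phi / evidence;          % p(lambda, y | D = 9), entries 0 or 1/4
post_lam   = sum(post_joint, 2)       % marginal p(lambda | D = 9) = [0 0 1/4 1/4 1/4 1/4]'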
Exercise
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

1. Find the following quantities

• Marginals: p(x1), p(x2)
• Conditionals: p(x1|x2), p(x2|x1)
• Posterior: p(x1, x2 = 2), p(x1|x2 = 2)
• Evidence: p(x2 = 2)
• Normalisation constant: p({})
• Max: p(x1*) = max_{x1} p(x1|x2 = 1)
• Mode: x1* = argmax_{x1} p(x1|x2 = 1)
• Max-marginal: max_{x1} p(x1, x2)

2. Are x1 and x2 independent? (i.e., is p(x1, x2) = p(x1) p(x2)?)
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Marginals:

p(x1)
x1 = 1   0.6
x1 = 2   0.4

p(x2)   x2 = 1   x2 = 2
        0.4      0.6

• Conditionals:

p(x1|x2)   x2 = 1   x2 = 2
x1 = 1     0.75     0.5
x1 = 2     0.25     0.5

p(x2|x1)   x2 = 1   x2 = 2
x1 = 1     0.5      0.5
x1 = 2     0.25     0.75
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Posterior:

p(x1, x2 = 2)
x1 = 1   0.3
x1 = 2   0.3

p(x1|x2 = 2)
x1 = 1   0.5
x1 = 2   0.5

• Evidence:

p(x2 = 2) = Σ_{x1} p(x1, x2 = 2) = 0.6

• Normalisation constant:

p({}) = Σ_{x1} Σ_{x2} p(x1, x2) = 1
Answers
p(x1, x2)   x2 = 1   x2 = 2
x1 = 1      0.3      0.3
x1 = 2      0.1      0.3

• Max: (get the value)
max_{x1} p(x1|x2 = 1) = 0.75

• Mode: (get the index)
argmax_{x1} p(x1|x2 = 1) = 1

• Max-marginal: (get the “skyline”)
max_{x1} p(x1, x2)   x2 = 1   x2 = 2
                     0.3      0.3
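All of these answers can be checked mechanically from the 2×2 table. A small MATLAB/Octave sketch (ours, not from the slides):

% Joint table p(x1, x2); rows index x1, columns index x2
P      = [0.3 0.3; 0.1 0.3];
p_x1   = sum(P, 2)                        % marginal p(x1) = [0.6; 0.4]
p_x2   = sum(P, 1)                        % marginal p(x2) = [0.4 0.6]
p_x1_given_x2 = P ./ repmat(p_x2, 2, 1)   % conditional p(x1|x2), columns sum to 1
p_x2_given_x1 = P ./ repmat(p_x1, 1, 2)   % conditional p(x2|x1), rows sum to 1
evidence      = sum(P(:, 2))              % p(x2 = 2) = 0.6
posterior     = P(:, 2) / evidence        % p(x1 | x2 = 2) = [0.5; 0.5]
[mx, ix]      = max(p_x1_given_x2(:, 1))  % max 0.75 and mode x1 = 1 of p(x1 | x2 = 1)
max_marginal  = max(P, [], 1)             % max-marginal over x1: [0.3 0.3]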
Another application of Bayes’ Theorem: “Model Selection”
Given an unknown number of fair dice with outcomes λ1, λ2, . . . , λn,

D = Σ_{i=1}^{n} λi

How many dice are there when D = 9? Assume that any number n is equally likely.
Another application of Bayes’ Theorem: “Model Selection”
Given that all n are equally likely (i.e., p(n) is flat), we calculate (formally)

p(n|D = 9) = p(D = 9|n) p(n) / p(D = 9) ∝ p(D = 9|n)

p(D|n = 1) = Σ_{λ1} p(D|λ1) p(λ1)
p(D|n = 2) = Σ_{λ1} Σ_{λ2} p(D|λ1, λ2) p(λ1) p(λ2)
. . .
p(D|n = n′) = Σ_{λ1,...,λn′} p(D|λ1, . . . , λn′) Π_{i=1}^{n′} p(λi)

In general, p(D|n) = Σ_λ p(D|λ, n) p(λ|n)
[Figure: the distributions p(D|n = 1), . . . , p(D|n = 5), plotted over D = 1, . . . , 20.]
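The evidence terms p(D|n) can be computed exactly by repeated convolution of the single-die distribution, since the sum of n independent dice has the n-fold convolution as its distribution. A small MATLAB/Octave sketch (ours, not from the slides); the cut-off nmax = 9 is our choice.

% p(D | n) for n fair dice by n-fold convolution, then p(n | D = 9) under a flat p(n)
die  = ones(1, 6) / 6;                 % distribution of a single die, support 1..6
nmax = 9;
ev   = zeros(1, nmax);                 % ev(n) = p(D = 9 | n)
pD   = die;                            % distribution of the sum of the first n dice
for n = 1:nmax
    if n > 1, pD = conv(pD, die); end  % adding a die convolves the distributions
    support = n:(6 * n);               % possible values of the sum of n dice
    idx = find(support == 9);
    if ~isempty(idx), ev(n) = pD(idx); end
end
post_n = ev / sum(ev)                  % p(n | D = 9), proportional to p(D = 9 | n)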
Another application of Bayes’ Theorem: “Model Selection”
[Figure: the posterior p(n|D = 9) over n = number of dice, n = 1, . . . , 9.]
• Complex models are more flexible but they spread their probability mass
• Bayesian inference inherently prefers “simpler models” – Occam’s razor
• Computational burden: We need to sum over all parameters λ
Probabilistic Inference
A huge spectrum of applications – all boil down to computation of
• expectations of functions under probability distributions: Integration

⟨f(x)⟩ = ∫_X dx p(x) f(x)        ⟨f(x)⟩ = Σ_{x∈X} p(x) f(x)

• modes of functions under probability distributions: Optimization

x* = argmax_{x∈X} p(x) f(x)

• any “mix” of the above, e.g.,

x* = argmax_{x∈X} p(x) = argmax_{x∈X} ∫ dz p(z) p(x|z)
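Monte Carlo methods approximate such integrals by averaging f over samples drawn from p(x): ⟨f(x)⟩ ≈ (1/N) Σ_i f(x^(i)) with x^(i) ∼ p(x). A minimal MATLAB/Octave sketch (ours), estimating ⟨x²⟩ under a standard Gaussian; the choice of f and p here is purely illustrative.

% Monte Carlo estimate of <f(x)> = int dx p(x) f(x), with p(x) = N(x; 0, 1), f(x) = x^2
N   = 100000;
x   = randn(N, 1);             % samples x^(i) ~ p(x)
f   = x .^ 2;                  % f evaluated at the samples
est = mean(f)                  % should be close to the exact value 1
err = std(f) / sqrt(N)         % Monte Carlo standard error, O(1/sqrt(N))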
Graphical Models
“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.” A.N. Whitehead
Graphical Models
• formal languages for specification of probability distributions and associated inference algorithms
• historically, introduced in probabilistic expert systems (Pearl 1988) as a visual guide for representing expert knowledge
• today, a standard tool in machine learning, statistics and signal processing
Graphical Models
• provide graph based algorithms for derivations and computation
• pedagogical insight/motivation for model/algorithm construction
– Statistics:
“Kalman filter models and hidden Markov models (HMM) are equivalent up to parametrisation”
– Signal processing:
“Fast Fourier transform is an instance of sum-product algorithm on a factor graph”
– Computer Science:
“Backtracking in Prolog is equivalent to inference in Bayesian networks with deterministic tables”
• Automated tools for code generation start to emerge, making the design/implement/test cycle shorter
Important types of Graphical Models
• Useful for Model Construction
– Directed Acyclic Graphs (DAG), Bayesian Networks
– Undirected Graphs, Markov Networks, Random Fields
– Influence diagrams
– ...
• Useful for Inference
– Factor Graphs
– Junction/Clique graphs
– Region graphs
– ...
Directed Acyclic Graphical (DAG) Models
Factor Graphs and Directed Graphical Models
• Each random variable is associated with a node in the graph,
• We draw an arrow A → B if A appears as a conditioning variable in p(B| . . . , A, . . . ), i.e., A ∈ parent(B)
• The edges tell us qualitatively about the factorization of the joint probability
• For N random variables x1, . . . , xN, the distribution admits the factorisation

p(x1, . . . , xN) = Π_{i=1}^{N} p(xi | parent(xi))
• Describes in a compact way an algorithm to “generate” the data – “Generative models”
DAG Example: Two dice
[DAG: λ → D ← y, with priors p(λ), p(y) and observation model p(D|λ, y).]

p(D, λ, y) = p(D|λ, y) p(λ) p(y)
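The factorisation directly describes how to “generate” data: sample each node given its parents, in topological order (ancestral sampling). A small MATLAB/Octave sketch (ours, not from the slides); the sample size is arbitrary.

% Ancestral sampling from the two-dice DAG: lambda -> D <- y
Nsamp  = 10000;
lambda = randi(6, Nsamp, 1);       % lambda ~ p(lambda)
y      = randi(6, Nsamp, 1);       % y      ~ p(y)
D      = lambda + y;               % D given lambda, y (deterministic observation model)
p_hat  = mean(D == 9)              % empirical estimate of p(D = 9), close to 1/9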
DAG with observations
[DAG: λ → D ← y, with D = 9 observed.]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y)
Examples
Model        Factorization
Full         p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)
Markov(2)    p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x2, x3)
Markov(1)    p(x1) p(x2|x1) p(x3|x2) p(x4|x3)
             p(x1) p(x2|x1) p(x3|x1) p(x4)
Factorized   p(x1) p(x2) p(x3) p(x4)

(The graph drawings of the Structure column are not reproduced here.)
Removing edges eliminates a term from the conditional probability factors.
Undirected Graphical Models
• Define a distribution by local compatibility functions φ(xα)

p(x) = (1/Z) Π_α φ(xα)

where α runs over cliques: fully connected subsets
• Markov Random Fields
Undirected Graphical Models
• Examples
[Two example graphs over x1, x2, x3, x4.]

p(x) = (1/Z) φ(x1, x2) φ(x1, x3) φ(x2, x4) φ(x3, x4)        p(x) = (1/Z) φ(x1, x2, x3) φ(x2, x3, x4)
Factor graphs
(Kschischang et al.)

• A bipartite graph. A powerful graphical representation of the inference problem
– Factor nodes: black squares. Factor potentials (local functions) defining the posterior.
– Variable nodes: white nodes. Define collections of random variables.
– Edges: denote membership. A variable node is connected to a factor node if a member variable is an argument of the local function.

[Factor graph for the two-dice example: factors p(λ), p(y) and p(D = 9|λ, y) attached to variables λ and y.]

φ_D(λ, y) = p(D = 9|λ, y) p(λ) p(y) = φ1(λ, y) φ2(λ) φ3(y)
Exercise
• For the following Graphical models, write down the factors of the joint distribution and plot an equivalent factor graph.
[Graphical models shown: Full, Markov(1), HMM, MIX, IFA, Factorized. The graph drawings are not reproduced here.]
Answer (Markov(1))
p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)

[Factor graph: variable nodes x1, x2, x3, x4 connected to factor nodes p(x1), p(x2|x1), p(x3|x2), p(x4|x3).]

Grouping factors pairwise gives an equivalent factor graph with

φ(x1, x2) = p(x1) p(x2|x1),   φ(x2, x3) = p(x3|x2),   φ(x3, x4) = p(x4|x3)
Answer (IFA – Factorial)
p(h1) p(h2) Π_{i=1}^{4} p(xi|h1, h2)

[Factor graph: hidden variables h1, h2 connected to each observation factor; observed variables x1, x2, x3, x4.]
Answer (IFA – Factorial)
[Factor graph with separate variable nodes h1 and h2.]

• We can also cluster nodes together

[Factor graph with a single combined variable node (h1, h2) connected to x1, x2, x3, x4.]
Inference and Learning
• Data set
• Data set
D = {x1, . . . , xN}

• Model with parameter λ
p(D|λ)

• Maximum Likelihood (ML)
λ_ML = argmax_λ log p(D|λ)

• Predictive distribution
p(x_{N+1}|D) ≈ p(x_{N+1}|λ_ML)
Regularisation
• Prior
p(λ)

• Maximum a-posteriori (MAP): regularised maximum likelihood
λ_MAP = argmax_λ log p(D|λ) p(λ)

• Predictive distribution
p(x_{N+1}|D) ≈ p(x_{N+1}|λ_MAP)
Bayesian Learning
• We treat parameters on the same footing as all other variables
• We integrate over unknown parameters rather than using point estimates (remember the many-dice example)
– Avoids overfitting
– Natural setup for online adaptation
– Model selection
Bayesian Learning
• Predictive distribution

p(x_{N+1}|D) = ∫ dλ p(x_{N+1}|λ) p(λ|D)

[DAG: parameter λ with children x1, x2, . . . , xN, x_{N+1}.]
• Bayesian learning is just inference ...
Some Applications
Medical Expert Systems
[DAG of the “Asia” network. Causes: A (visit to Asia), S (smoking). Diseases: T (tuberculosis), L (lung cancer), B (bronchitis); E (either T or L). Symptoms: X (positive X-ray), D (dyspnoea).]
Medical Expert Systems
Visit to Asia? Smoking?
Tuberculosis? Lung Cancer? Bronchitis?
Either T or L?
Positive X Ray? Dyspnoea?
Medical Expert Systems
Prior marginal probabilities (no evidence entered):

Node              0        1
Visit to Asia?    99%      1%
Smoking?          50%      50%
Tuberculosis?     99%      1%
Lung Cancer?      94.5%    5.5%
Bronchitis?       55%      45%
Either T or L?    93.5%    6.5%
Positive X Ray?   89%      11%
Dyspnoea?         56.4%    43.6%
Medical Expert Systems
Marginals after observing a positive X-ray (Positive X Ray = 1):

Node              0        1
Visit to Asia?    98.7%    1.3%
Smoking?          31.2%    68.8%
Tuberculosis?     90.8%    9.2%
Lung Cancer?      51.1%    48.9%
Bronchitis?       49.4%    50.6%
Either T or L?    42.4%    57.6%
Positive X Ray?   0%       100%
Dyspnoea?         35.9%    64.1%
Medical Expert Systems
Marginals after additionally observing Smoking = 0 (non-smoker):

Node              0        1
Visit to Asia?    98.5%    1.5%
Smoking?          100%     0%
Tuberculosis?     85.2%    14.8%
Lung Cancer?      85.8%    14.2%
Bronchitis?       70%      30%
Either T or L?    71.1%    28.9%
Positive X Ray?   0%       100%
Dyspnoea?         56%      44%
Model Selection: Variable selection in Polynomial Regression
• Given D = {tj, x(tj)}_{j=1...J}, what is the order N of the polynomial?

x(t) = Σ_{i=0}^{N} s_{i+1} t^i + ε(t)

[Figure: noisy observations of a polynomial over t ∈ [−1, 1].]
Bayesian Variable Selection
[DAG: indicators r1, . . . , rW with priors C(ri; π); coefficients s1, . . . , sW with si|ri ∼ N(si; µ(ri), Σ(ri)); observation x ∼ N(x; C s_{1:W}, R).]

• Generalized Linear Model – columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable.
Regression
t = [t1 t2 . . . tJ]^⊤        C ≡ [t^0 t^1 . . . t^{W−1}]
>> C = fliplr(vander(0:4)) % Vandermonde matrix
1 0 0 0 0
1 1 1 1 1
1 2 4 8 16
1 3 9 27 81
1 4 16 64 256
ri ∼ C(ri; 0.5, 0.5)          ri ∈ {on, off}
si|ri ∼ N(si; 0, Σ(ri))        x|s_{1:W} ∼ N(x; C s_{1:W}, R)
Σ(ri = on) ≫ Σ(ri = off)
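The generative model can be simulated directly; with the Vandermonde basis this yields synthetic regression data in which only a few basis functions are “on”. A rough MATLAB/Octave sketch (ours, not from the slides); the observation points, variances and noise level are illustrative assumptions.

% Simulate the variable-selection regression model (illustrative values)
t         = (0:4)';                        % observation points (assumed)
C         = fliplr(vander(0:4));           % Vandermonde basis: columns 1, t, t^2, ...
W         = size(C, 2);
Sigma_on  = 10;  Sigma_off = 1e-4;         % Sigma(r = on) >> Sigma(r = off)
R         = 0.01;                          % observation noise variance (assumed)
r         = rand(W, 1) < 0.5;              % r_i ~ C(r_i; 0.5, 0.5), on/off indicators
Sigma     = Sigma_off * ones(W, 1);
Sigma(r)  = Sigma_on;
s         = sqrt(Sigma) .* randn(W, 1);    % s_i | r_i ~ N(0, Sigma(r_i))
x         = C * s + sqrt(R) * randn(length(t), 1)   % x | s ~ N(C s, R)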
Regression
To find the “active” basis functions we need to calculate

r*_{1:W} ≡ argmax_{r_{1:W}} p(r_{1:W}|x) = argmax_{r_{1:W}} ∫ ds_{1:W} p(x|s_{1:W}) p(s_{1:W}|r_{1:W}) p(r_{1:W})

Then, the reconstruction is given by

x̂(t) = ⟨ Σ_{i=0}^{W−1} s_{i+1} t^i ⟩_{p(s_{1:W}|x, r*_{1:W})} = Σ_{i=0}^{W−1} ⟨s_{i+1}⟩_{p(s_{i+1}|x, r*_{1:W})} t^i
Regression
[Figure: p(x, r_{1:W}) evaluated for every on/off configuration r_{1:W}, ordered from “all on” to “all off”.]
Regression
[Figure: data, the true function and the approximation over t ∈ [−1, 1].]
Clustering
Clustering
π                   Label probability
c1 c2 . . . cN      Labels ∈ {a, b}
x1 x2 . . . xN      Data points
µa µb               Cluster centers

(µa*, µb*, π*) = argmax_{µa, µb, π} Σ_{c_{1:N}} Π_{i=1}^{N} p(xi|µa, µb, ci) p(ci|π)
Computer vision / Cognitive Science
How many rectangles are there in this image?
[Image: a scene composed of overlapping rectangles.]
Computer vision / Cognitive Science
π1 π2 . . . πN      Label probabilities
c1 c2 . . . cN      Labels ∈ {a, b, . . .}
x1 x2 . . . xN      Pixel values
µa µb . . .         Rectangle colors

[Image: the same rectangle scene.]
Computer Vision
How many people are there in these images?
Visual Tracking
[Figure: image frames from the sequences used in the people-counting / visual tracking example.]
Navigation, Robotics
[Figure: navigation example – estimated trajectories and feature points; one panel is labelled with axes f, Lx, Ly.]
Navigation, Robotics
GPS?_t             GPS status
G_t                GPS reading
...                Other sensors (magnetic, pressure, etc.)
l_t                Linear acceleration sensor
ω_t                Gyroscope
E_{t−1}, E_t       Attitude variables
X_{t−1}, X_t       Linear kinematic variables
{ξ_{1:N_t}}_t      Set of feature points (camera frame)
{x_{1:M_t}}_t      Set of feature points (world coordinates)
ρ(x)               Global static map (intensity function)
Time series models and Inference, Terminology
Generic structure of dynamical system models
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
xk ∼ p(xk|xk−1)   Transition Model
yk ∼ p(yk|xk)     Observation Model
• x are the latent states
• y are the observations
• In a full Bayesian setting, x includes unknown model parameters
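As a concrete instance, a scalar linear-Gaussian state-space model (a Kalman filter model) can be simulated directly from the two conditionals. A minimal MATLAB/Octave sketch (ours, not from the slides); the parameter values are arbitrary.

% Simulate x_k ~ p(x_k | x_{k-1}) and y_k ~ p(y_k | x_k) for a scalar linear-Gaussian model
K  = 100;                      % number of time steps
A  = 0.9;  Q = 0.1;            % transition:  x_k = A x_{k-1} + N(0, Q)
Cc = 1.0;  R = 0.5;            % observation: y_k = Cc x_k    + N(0, R)
x  = zeros(K, 1);  y = zeros(K, 1);
xprev = 0;                     % initial state x_0
for k = 1:K
    x(k)  = A * xprev + sqrt(Q) * randn;   % transition model
    y(k)  = Cc * x(k) + sqrt(R) * randn;   % observation model
    xprev = x(k);
end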
Online Inference, Terminology
• Filtering: p(x_k|y_{1:k})
– Distribution of the current state given all past information
– Realtime/Online/Sequential Processing
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Online Inference, Terminology
• Prediction: p(y_{k:K}, x_{k:K}|y_{1:k−1})
– evaluation of possible future outcomes; like filtering without observations
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Tracking, Restoration
Offline Inference, Terminology
• Smoothing: p(x_{0:K}|y_{1:K}) – better estimate of past states, essential for learning
• Most likely trajectory – Viterbi path: argmax_{x_{0:K}} p(x_{0:K}|y_{1:K})
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Interpolation: p(yk, xk|y_{1:k−1}, y_{k+1:K}) – fill in lost observations given past and future
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Time Series Analysis
• Stationary
[Figure: a stationary time series.]

– What is the true state of the process given noisy data?
– Parameters?
– Markovian? Order?
Time Series Analysis
• Nonstationary, time varying variance – stochastic volatility
[Figure: time-varying variance v_k (top) and observations y_k (bottom); true values and VB estimates.]
Time Series Analysis
• Nonstationary, time varying intensity – nonhomogeneous Poisson Process
[Figure: time-varying intensity λ_k and arrival times c_k; true values and VB estimates.]
Time Series Analysis
• Piecewise constant
[Figure: a piecewise constant signal observed in noise.]
Time Series Analysis
• Piecewise linear
[Figure: a piecewise linear signal observed in noise.]

• Segmentation and changepoint detection
– What is the true state of the process given noisy data?
– Where are the changepoints?
– How many changepoints?
Audio Processing
[Figure: waveforms x_t of a speech signal and a piano signal.]

x = x1 . . . xt . . .
Audio Restoration
• During download or transmission, some samples of audio are lost
• Estimate missing samples given clean ones
[Figure: an audio waveform with missing samples.]
Examples: Audio Restoration
p(x_¬κ|x_κ) ∝ ∫ dH p(x_¬κ|H) p(x_κ|H) p(H)        H ≡ (parameters, hidden states)

[DAG: H with children x_¬κ (missing) and x_κ (observed).]
Probabilistic Phase Vocoder
(Cemgil and Godsill 2005)

[Model: for each band ν = 0 . . . W−1, a chain s^ν_0 → · · · → s^ν_k → · · · → s^ν_{K−1} with parameters A^ν, Q^ν; observations x_0, . . . , x_k, . . . , x_{K−1}.]

s^ν_k ∼ N(s^ν_k; A^ν s^ν_{k−1}, Q^ν)
A^ν ∼ N(A^ν; [cos(ω^ν) −sin(ω^ν); sin(ω^ν) cos(ω^ν)], Ψ)
Restoration
• Piano
– Signal with missing samples (37%)
– Reconstruction, 7.68 dB improvement
– Original
• Trumpet
– Signal with missing samples (37%)
– Reconstruction, 7.10 dB improvement
– Original
Pitch Tracking
Monophonic pitch tracking = online estimation (filtering) of p(rt, mt|y_{1:t}).

[Figure: an audio waveform (top) and the estimated pitch index over time (bottom).]
Pitch Tracking
r0 r1 . . . rT
m0 m1 . . . mT
s0 s1 . . . sT
y1 . . . yT
Monophonic transcription
• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TSALP)

[Figure: exact inference results on a signal (S).]
Tracking Pitch Variations
• Allow m to change with k.
[Figure: example signal with varying pitch.]
• Intractable, need to resort to approximate inference (Mixture Kalman Filter - Rao-Blackwellized Particle Filter)
Source Separation
[Model: sources s_{k,1} . . . s_{k,n} . . . s_{k,N}, observations x_{k,1} . . . x_{k,M}, for k = 1 . . . K, with mixing and noise parameters a_1, r_1, . . . , a_M, r_M.]

• Joint estimation of sources, channel noise and the mixing system: x_{k,1:M} ∼ N(x_{k,1:M}; A s_{k,1:N}, R)
Spectrogram
[Spectrograms (t/sec vs. f/Hz): Speech and Piano.]

• A linear expansion using a collection of basis functions φ(t; τ, ω) centered around time τ and frequency ω

x_t = Σ_{τ,ω} α(τ, ω) φ(t; τ, ω)

• The spectrogram displays log |α(τ, ω)|² or |α(τ, ω)|²
Source Separation
[Spectrograms (t/sec vs. f/Hz) of the sources – Speech, Piano, Guitar – and of the Mix.]
Reconstructions
[Spectrograms (t/sec vs. f/Hz) of the reconstructed sources: Speech, Piano, Guitar.]
Polyphonic Music Transcription
• from sound ...
[Spectrogram (t/sec vs. f/Hz) of the signal (S).]
• ... to score
Generative Models for Music
Score Expression
Piano-Roll
Signal
Hierarchical Modeling of Music
[Hierarchical graphical model relating the score, expression, piano-roll and signal layers.]
A few non-Bayesian applications where Monte Carlo is useful
Combinatorics
• Counting
Example: What is the probability that a solitaire laid out with 52 cards comes out successfully, given that all permutations have equal probability?

|A| = Σ_{x∈X} [x ∈ A]        [x ∈ A] ≡ 1 if x ∈ A, 0 if x ∉ A

p(x ∈ A) = |A| / |X| = ?        (here |X| = 52! ≈ 2^225)
Geometry
• Given a simplex S in N-dimensional space, S = {x : Ax ≤ b, x ∈ R^N}, find the volume |S|
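A standard Monte Carlo approach here is “hit or miss”: enclose S in a box of known volume, sample uniformly in the box, and count the fraction of samples satisfying Ax ≤ b. A MATLAB/Octave sketch (ours); the particular A, b and bounding box below are illustrative assumptions (a 2-D triangle with true area 0.5).

% Hit-or-miss Monte Carlo estimate of the volume of S = {x : A x <= b}
% Illustrative example: the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1
A = [-1 0; 0 -1; 1 1];  b = [0; 0; 1];
N = 100000;
X = rand(N, 2);                               % uniform samples in the unit box [0,1]^2
inside  = all(A * X' <= repmat(b, 1, N), 1);  % check A x <= b for every sample
vol_box = 1;                                  % volume of the bounding box
vol_est = vol_box * mean(inside)              % estimate of |S|, close to 0.5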
Rare Events
• Given a graph with random edge lengths xi ∼ p(xi), find the probability that the shortest path from A to B is larger than γ.

[Graph with terminal nodes A and B and edges x1, x2, x3, x4, x5.]
Rare Events
x1 x2 x3 x4 x5      Edge lengths
L                   ShortestPath(A, B)

Pr(L ≥ γ) = ∫ dx_{1:5} [L(x_{1:5}) ≥ γ] p(x_{1:5})
Rare Events
[Graph from A to B with mean edge lengths ⟨x1⟩ = 4, ⟨x2⟩ = 1, ⟨x3⟩ = 1, ⟨x4⟩ = 1, ⟨x5⟩ = 4.]

xi ∼ E(xi; ui) ≡ (1/ui) exp(−xi/ui)        ui = ⟨xi⟩ ≡ ∫ xi p(xi) dxi

[Histogram: Monte Carlo samples of Shortest-Path(A, B) versus count.]
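A crude Monte Carlo estimate simply samples the edge lengths and counts how often the shortest path exceeds γ (for very small probabilities one would switch to importance sampling, covered later). A MATLAB/Octave sketch (ours); since the slide does not give the exact topology, the set of A-to-B paths below and the threshold γ are assumptions.

% Crude Monte Carlo estimate of Pr(shortest path from A to B >= gamma)
u     = [4 1 1 1 4];                           % mean edge lengths <x_1>, ..., <x_5>
gamma = 6;                                     % threshold (illustrative)
N     = 100000;
X     = -log(rand(N, 5)) .* repmat(u, N, 1);   % x_i ~ Exponential with mean u_i
% Assumed A-to-B paths: (x1,x2), (x4,x5), (x1,x3,x5), (x4,x3,x2)
L = min([X(:,1)+X(:,2), X(:,4)+X(:,5), X(:,1)+X(:,3)+X(:,5), X(:,4)+X(:,3)+X(:,2)], [], 2);
p_hat = mean(L >= gamma)                       % Monte Carlo estimate of the probability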
Probability Models
Example: AR(1) model
[Figure: a realization of the AR(1) process.]

xk = A xk−1 + εk        k = 1 . . . K

εk is i.i.d., zero mean and normal with variance R.

Estimation problem:
Given x0, . . . , xK, determine the coefficient A and the variance R (both scalars).
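For this model the maximum likelihood estimates have a closed form (least squares for A, residual variance for R). A small MATLAB/Octave sketch (ours, not from the slides) that simulates data and recovers the parameters; the true values are arbitrary.

% Simulate an AR(1) process and estimate A and R by maximum likelihood
K      = 1000;
A_true = 0.9;  R_true = 0.01;
x      = zeros(K + 1, 1);                      % x(1) plays the role of x_0
for k = 2:K + 1
    x(k) = A_true * x(k - 1) + sqrt(R_true) * randn;
end
xp   = x(1:end - 1);  xc = x(2:end);           % (x_{k-1}, x_k) pairs
A_ml = (xp' * xc) / (xp' * xp)                 % ML / least-squares estimate of A
R_ml = mean((xc - A_ml * xp) .^ 2)             % ML estimate of the noise variance R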
AR(1) model, Generative Model notation
A ∼ N(A; 0, P)        R ∼ IG(R; ν, β/ν)
xk|xk−1, A, R ∼ N(xk; A xk−1, R)        x0 = x̂0

[DAG: parameters A and R with children x1, . . . , xK; x0, x1, . . . , xK form a chain.]

Observed variables are shown with double circles
Example, Univariate Gaussian
The Gaussian distribution with mean m and covariance S has the form

N(x; m, S) = (2πS)^{−1/2} exp{−(1/2)(x − m)²/S}
           = exp{−(1/2)(x² + m² − 2xm)/S − (1/2) log(2πS)}
           = exp{ (m/S) x − (1/(2S)) x² − (1/2) log(2πS) − (1/(2S)) m² }
           = exp{ [m/S, −1/(2S)] · [x, x²]^⊤ − c(θ) }        θ = [m/S, −1/(2S)]^⊤,  ψ(x) = [x, x²]^⊤

Hence by matching coefficients we have

exp{ −(1/2) K x² + h x + g }   ⇔   S = K^{−1},   m = K^{−1} h
Example, Gaussian
The Multivariate Gaussian Distribution
µ is the mean and P is the covariance:
N(s; µ, P) = |2πP|^{−1/2} exp{−(1/2)(s − µ)^T P^{−1} (s − µ)}
           = exp{−(1/2) s^T P^{−1} s + µ^T P^{−1} s − (1/2) µ^T P^{−1} µ − (1/2) log |2πP|}

log N(s; µ, P) = −(1/2) s^T P^{−1} s + µ^T P^{−1} s + const
              = −(1/2) Tr P^{−1} s s^T + µ^T P^{−1} s + const
              =⁺ −(1/2) Tr P^{−1} s s^T + µ^T P^{−1} s

Notation: log f(x) =⁺ g(x) ⟺ f(x) ∝ exp(g(x)) ⟺ ∃ c ∈ R : f(x) = c exp(g(x))

log p(s) =⁺ −(1/2) Tr K s s^T + h^⊤ s   ⇒   p(s) = N(s; K^{−1} h, K^{−1})
Example, Inverse Gamma
The inverse Gamma distribution with shape a and scale b:

IG(r; a, b) = (1/Γ(a)) r^{−(a+1)} b^{−a} exp(−1/(b r))
            = exp{ −(a + 1) log r − (1/b)(1/r) − log Γ(a) − a log b }
            = exp{ [−(a + 1), −1/b] · [log r, 1/r]^⊤ − log Γ(a) − a log b }

Hence by matching coefficients, we have

exp{ α log r + β (1/r) + c }   ⇔   a = −α − 1,   b = −1/β
Example, Inverse Gamma
[Figure: inverse Gamma densities for (a = 1, b = 1), (a = 1, b = 0.5) and (a = 2, b = 1).]
Basic Distributions : Exponential Family
• Following distributions are used often as elementary building blocks:
– Gaussian
– Gamma, Inverse Gamma, (Exponential, Chi-square, Wishart) – Dirichlet
– Discrete (Categorical), Bernoulli, multinomial
• All of those distributions can be written as
p(x|θ) = exp{θ^⊤ ψ(x) − c(θ)}

c(θ) = log ∫ dx exp(θ^⊤ ψ(x))        log-partition function
θ                                     canonical parameters
ψ(x)                                  sufficient statistics
Conjugate priors: Posterior is in the same family as the prior.
Example: posterior inference for the varianceR of a zero mean Gaussian.
p(x|R) = N(x; 0, R)        p(R) = IG(R; a, b)

p(R|x) ∝ p(R) p(x|R)
        ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(x²/2)(1/R) − (1/2) log R }
        = exp{ [−(a + 1 + 1/2), −(1/b + x²/2)] · [log R, 1/R]^⊤ }
        ∝ IG(R; a + 1/2, 2/(x² + 2/b))

Like the prior, this is an inverse Gamma distribution.
Conjugate priors: Posterior is in the same family as the prior.
Example: posterior inference of the variance R from x1, . . . , xN.
R
x1 x2 . . . xN xN +1
p(R|x_{1:N}) ∝ p(R) Π_{i=1}^{N} p(xi|R)
            ∝ exp{ −(a + 1) log R − (1/b)(1/R) } exp{ −(1/2)(Σ_i xi²)(1/R) − (N/2) log R }
            = exp{ [−(a + 1 + N/2), −(1/b + (1/2) Σ_i xi²)] · [log R, 1/R]^⊤ }
            ∝ IG(R; a + N/2, 2/(Σ_i xi² + 2/b))

Sufficient statistics are additive.
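The update can be checked numerically: the posterior parameters depend on the data only through N and Σ_i xi². A small MATLAB/Octave sketch (ours, not from the slides); the prior values, the true variance and the sample size are illustrative assumptions.

% Conjugate posterior IG(R; a + N/2, 2/(sum(x.^2) + 2/b)) for a zero-mean Gaussian
a = 1;  b = 1;                       % prior p(R) = IG(R; a, b)
R_true = 2;
N = 100;
x = sqrt(R_true) * randn(N, 1);      % data x_i ~ N(0, R_true)
a_post = a + N / 2;                  % posterior shape
b_post = 2 / (sum(x .^ 2) + 2 / b);  % posterior scale, same parametrisation as above
post_mean_R = 1 / (b_post * (a_post - 1))   % posterior mean of R, close to R_true for large N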
Inverse Gamma, Σ_i xi² = 10, N = 10

[Figure: the posterior inverse Gamma density for Σ_i xi² = 10, N = 10.]
Inverse Gamma, Σ_i xi² = 100, N = 100

[Figure: the posterior inverse Gamma density for Σ_i xi² = 100, N = 100.]