Introduction to Sequential Monte Carlo and
Inference in (switching) state space models
A. Taylan Cemgil
Signal Processing and Communications Lab.
30 Nov 2006
Abstract
This talk will be an introduction to Sequential Monte Carlo methods for filtering and smoothing in time series models. After reviewing basic concepts such as importance sampling, resampling, and Rao-Blackwellization, I will illustrate how these ideas can be applied to inference in switching state space models (using the mixture Kalman filter) and changepoint models (where exact inference is possible). The material is based on the following papers:
• A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing, vol. 10, no. 3, pp. 197-208, 2000.
• R. Chen and J. S. Liu, "Mixture Kalman filters," J. R. Statist. Soc. B, vol. 62, 2000.
• P. Fearnhead, "Exact and efficient Bayesian inference for multiple changepoint problems," Tech. Rep., Dept. of Math. and Stat., Lancaster University, 2003.
Outline
• Time Series Models and Inference
• Importance Sampling
• Resampling
• Rao-Blackwellization
• Putting it all together, Sequential Monte Carlo
• Conditionally Gaussian Switching State Space Models
– Mixture Kalman Filter
– Change-point models
Time series models and Inference, Terminology
In signal processing, applied physics, and machine learning, many phenomena are modelled by dynamical models:
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
xk ∼ p(xk|xk−1)  Transition model
yk ∼ p(yk|xk)  Observation model
• x are the latent states
• y are the observations
• In a full Bayesian setting, x includes unknown model parameters
Time series models and applications
• Hidden Markov Models
• (Time varying) AR, ARMA, MA models
• Linear Dynamical Systems, Kalman Filter models
• Switching state space models
• Dynamic Bayesian networks
• Nonlinear Stochastic Dynamical Systems
Online Inference, Terminology
• Filtering: p(xk|y1:k)
– distribution of the current state given all past observations
– realtime/online/sequential processing
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Potentially confusing misnomer:
– more general than "digital filtering" (convolution) in DSP, but algorithmically related for some models (KFM)
Online Inference, Terminology
• Prediction: p(yk:K, xk:K|y1:k−1)
– evaluation of possible future outcomes; like filtering without observations
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Tracking, Restoration
Offline Inference, Terminology
• Smoothing: p(x0:K|y1:K)
– better estimates of past states using all available data; essential for learning
• Most likely trajectory (Viterbi path): arg maxx0:K p(x0:K|y1:K)
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
• Interpolation p(yk, xk|y1:k−1, yk+1:K)
fill in lost observations given past and future
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Deterministic Linear Dynamical Systems
• The latent states sk and observations yk are continuous
• The transition and observation models are linear
• Examples
– A deterministic dynamical system with two state variables
– A particle moving on the real line; a perfect metronome

sk = (phasek, periodk)⊤ = [1 1; 0 1] sk−1 = A sk−1
yk = phasek = [1 0] sk = C sk

Kalman Filter Models, Stochastic Dynamical Systems
• We allow random (unknown) accelerations and observation error
sk = [1 1; 0 1] sk−1 + εk = A sk−1 + εk
yk = [1 0] sk + νk = C sk + νk

Tracking
s0 s1 . . . sk−1 sk . . . sK
y1 . . . yk−1 yk . . . yK
• In generative model notation
sk ∼ N(sk; A sk−1, Q)
yk ∼ N(yk; C sk, R)
• Tracking = estimating the latent state of the system = Kalman filtering
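As an illustration, the Kalman predict/update recursion behind this tracking step can be sketched in a few lines of NumPy. This is a generic sketch, not code from the talk; the matrices are the metronome model A = [1 1; 0 1], C = [1 0] from the slides, with illustrative noise levels of my choosing.

```python
import numpy as np

def kalman_step(m, P, y, A, Q, C, R):
    """One predict/update step for s_k ~ N(A s_{k-1}, Q), y_k ~ N(C s_k, R)."""
    # Predict: p(s_k | y_{1:k-1})
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: condition on the new observation y_k
    S = C @ P_pred @ C.T + R              # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = P_pred - K @ C @ P_pred
    return m_new, P_new

# Metronome model: phase advances by the (hidden) period each step
A = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.01 * np.eye(2)        # small random accelerations (illustrative)
C = np.array([[1.0, 0.0]])  # only the phase is observed
R = np.array([[0.1]])       # observation noise (illustrative)

m, P = np.zeros(2), np.eye(2)
for y in [1.0, 2.1, 2.9, 4.0]:   # noisy phase readings, roughly unit period
    m, P = kalman_step(m, P, np.array([y]), A, Q, C, R)
```

After a few observations the period component of the posterior mean settles near 1, as expected for these readings.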
Kalman Filtering and Smoothing (two filter formulation)
[Factor graph: chain x1 → x2 → x3 → x4 with transition factors p(xk|xk−1) and observation factors p(yk|xk)]
• Forward Pass
p(y1:K) = ∫xK p(yK|xK) ∫xK−1 p(xK|xK−1) · · · ∫x2 p(x3|x2) p(y2|x2) ∫x1 p(x2|x1) p(y1|x1) p(x1)

with the forward (filtering) recursion
α1|0 = p(x1),  αk = p(yk|xk) αk|k−1,  αk+1|k = ∫xk p(xk+1|xk) αk
• Backward Pass
p(y1:K) = ∫x1 p(x1) p(y1|x1) · · · ∫xK−1 p(xK−1|xK−2) p(yK−1|xK−1) ∫xK p(xK|xK−1) p(yK|xK) · 1

with the backward recursion
βK = 1,  βk−1 = ∫xk p(xk|xk−1) p(yk|xk) βk
• Smoothing: p(xk|y1:K) ∝ αkβk
α1|0 = p(x1)
α1|1 = p(y1|x1) p(x1)
α2|1 = ∫ dx1 p(x2|x1) p(y1|x1) p(x1) ∝ p(x2|y1)
α2|2 = p(y2|x2) p(x2|y1)
· · ·
α5|5 ∝ p(x5|y1:5)

[Figure: evolution of the filtering density over (phase, period) after each predict and update step, together with the predictive density p(yk|y1:k−1)]
Importance Sampling (IS)
Consider a probability distribution with a (possibly unknown) normalisation constant

p(x) = (1/Z) φ(x),  Z = ∫ dx φ(x)

IS: estimate expectations (or features) of p(x) by a weighted sample:

⟨f(x)⟩p(x) = ∫ dx f(x) p(x) ≈ Σi=1..N w̃(i) f(x(i))
Importance Sampling (cont.)
• Change of measure with the weight function W(x) ≡ φ(x)/q(x):

⟨f(x)⟩p(x) = (1/Z) ∫ dx f(x) (φ(x)/q(x)) q(x) ≡ (1/Z) ⟨f(x) W(x)⟩q(x)

• If Z is unknown, as is often the case in Bayesian inference,

Z = ∫ dx φ(x) = ∫ dx (φ(x)/q(x)) q(x) = ⟨W(x)⟩q(x)

⟨f(x)⟩p(x) = ⟨f(x) W(x)⟩q(x) / ⟨W(x)⟩q(x)
Importance Sampling (cont.)
• Draw i = 1, . . . , N independent samples from q: x(i) ∼ q(x)
• Calculate the importance weights

W(i) = W(x(i)) = φ(x(i))/q(x(i))

• Approximate the normalising constant

Z = ⟨W(x)⟩q(x) ≈ (1/N) Σi=1..N W(i)

• The desired expectation is approximated by

⟨f(x)⟩p(x) = ⟨f(x) W(x)⟩q(x) / ⟨W(x)⟩q(x) ≈ Σi W(i) f(x(i)) / Σi W(i) ≡ Σi=1..N w̃(i) f(x(i))

Here w̃(i) = W(i) / Σj=1..N W(j) are the normalised importance weights.
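The recipe above can be checked numerically. A minimal sketch, assuming an unnormalised Gaussian target φ and a broad Gaussian proposal q (both chosen here purely for illustration; the estimator never uses the target's normaliser):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target: phi(x) ∝ N(x; 2, 1); Z is "unknown" to the estimator
phi = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)

# Broad Gaussian proposal q(x) = N(x; 0, 3^2)
q_sigma = 3.0
x = rng.normal(0.0, q_sigma, size=100_000)   # x^(i) ~ q
q_pdf = np.exp(-0.5 * (x / q_sigma) ** 2) / (q_sigma * np.sqrt(2 * np.pi))

W = phi(x) / q_pdf            # importance weights W^(i)
w_tilde = W / W.sum()         # normalised weights
est_mean = np.sum(w_tilde * x)   # <x>_p ≈ sum_i w~^(i) x^(i)
```

The self-normalised estimate recovers the target mean (here 2) without ever computing Z.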
Importance Sampling (cont.)
[Figure: target φ(x), proposal q(x), and weight function W(x) = φ(x)/q(x)]
Resampling
• Importance sampling computes an approximation with weighted delta functions:

p(x) ≈ Σi W̃(i) δ(x − x(i))

• In this representation most of the W̃(i) will be very close to zero, and the representation may be dominated by a few large weights.
• Resampling draws a set of new "particles":

x(j)new ∼ Σi W̃(i) δ(x − x(i)),  p(x) ≈ (1/N) Σj δ(x − x(j)new)

• Since we sample from a degenerate distribution, the particle locations stay unchanged. We merely duplicate (, triplicate, . . . ) or discard particles according to their weights.
• This process is also called "selection" or "survival of the fittest" in various fields (genetic algorithms, AI, . . . ).
Resampling
[Figure: φ(x), q(x), W(x), and the resampled particle locations xnew]

x(j)new ∼ Σi W̃(i) δ(x − x(i))
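A minimal sketch of this multinomial resampling step (the example particles and weights are made up for illustration):

```python
import numpy as np

def resample(particles, w_tilde, rng):
    """Multinomial resampling: draw N new particles i.i.d. from
    sum_i w~^(i) delta(x - x^(i)); weights are reset to 1/N afterwards."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=w_tilde)   # duplicate / discard by weight
    return particles[idx]

rng = np.random.default_rng(1)
x = np.array([-1.0, 0.0, 1.0, 2.0])
w = np.array([0.01, 0.02, 0.02, 0.95])       # one dominant weight
x_new = resample(x, w, rng)
```

As the slide notes, no new locations are created: every entry of `x_new` is one of the original particles, and the dominant one is duplicated many times.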
Examples of Proposal Distributions
p(x|y) ∝ p(y|x) p(x)
• Prior as the proposal: q(x) = p(x)

W(x) = p(y|x) p(x) / p(x) = p(y|x)
Examples of Proposal Distributions
p(x|y) ∝ p(y|x) p(x)
• Likelihood as the proposal: q(x) = p(y|x) / ∫ dx p(y|x) = p(y|x)/c(y)

W(x) = p(y|x) p(x) / (p(y|x)/c(y)) = p(x) c(y) ∝ p(x)
• Interesting when sensors are very accurate and dim(y) ≫ dim(x).
Since there are many proposals, is there a “best” proposal distribution?
Optimal Proposal Distribution
p(x|y) ∝ p(y|x) p(x)
Task: estimate ⟨f(x)⟩p(x|y)
• IS constructs the estimator I(f) = ⟨f(x) W(x)⟩q(x)
• Minimise the variance of the estimator:

⟨(f(x)W(x) − ⟨f(x)W(x)⟩)²⟩q(x) = ⟨f²(x) W²(x)⟩q(x) − ⟨f(x) W(x)⟩²q(x)  (1)
= ⟨f²(x) W²(x)⟩q(x) − ⟨f(x)⟩²p(x)  (2)
= ⟨f²(x) W²(x)⟩q(x) − I²(f)  (3)

• Minimise the first term, since it is the only one that depends on q
Optimal Proposal Distribution
• By Jensen's inequality, the first term is lower bounded:

⟨f²(x) W²(x)⟩q(x) ≥ ⟨|f(x)| W(x)⟩²q(x) = ( ∫ |f(x)| p(x|y) dx )²

• We will look for a distribution q* that attains this lower bound. Take

q*(x) = |f(x)| p(x|y) / ∫ |f(x′)| p(x′|y) dx′
Optimal Proposal Distribution (cont.)
• The weight function for this particular proposal q* is

W*(x) = p(x|y)/q*(x) = ( ∫ |f(x′)| p(x′|y) dx′ ) / |f(x)|

• We show that q* attains the lower bound:

⟨f²(x) W*²(x)⟩q*(x) = ⟨ f²(x) ( ∫ |f(x′)| p(x′|y) dx′ )² / |f(x)|² ⟩q*(x)
= ( ∫ |f(x′)| p(x′|y) dx′ )²
= ⟨|f(x)|⟩²p(x|y)
= ⟨|f(x)| W*(x)⟩²q*(x)

• ⇒ There are proposal distributions q* that are even "better" than the exact posterior!
A link to alpha divergences
The α-divergence between two distributions is defined as

Dα(p||q) ≡ (1/(β(1−β))) ( 1 − ∫ dx p(x)^β q(x)^(1−β) )

where β = (1+α)/2 and p and q are two probability distributions.
• limβ→0 Dα(p||q) = KL(q||p)
• limβ→1 Dα(p||q) = KL(p||q)
• β = 2 (α = 3):

D3(p||q) = (1/2) ∫ dx p(x)² q(x)^(−1) − 1/2 = (1/2) ⟨W(x)²⟩q(x) − 1/2

The best q (in a constrained family) is typically a heavy-tailed approximation to p.
Examples of Proposal Distributions
p(x1:2|y1:2) ∝ p(y1|x1) p(x1) p(y2|x2) p(x2|x1)
Task: obtain samples from the posterior p(x1:2|y1:2) = (1/Zy) φ(x1:2)
• Prior as the proposal: q(x1:2) = p(x1) p(x2|x1)

W(x1:2) = φ(x1:2)/q(x1:2) = p(y1|x1) p(y2|x2)

• We sample from the prior as follows:

x1(i) ∼ p(x1),  x2(i) ∼ p(x2|x1 = x1(i)),  W(x(i)) = p(y1|x1(i)) p(y2|x2(i))
Examples of Proposal Distributions
φ(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1)
• State prediction as the proposal: q(x1:2) = p(x1|y1) p(x2|x1)

W(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1) / ( p(x1|y1) p(x2|x1) ) = p(y1) p(y2|x2)

• We sample from the proposal and compute the weight:

x1(i) ∼ p(x1|y1),  x2(i) ∼ p(x2|x1 = x1(i)),  W(x(i)) = p(y1) p(y2|x2(i))

• Note that this weight does not depend on x1
Examples of Proposal Distributions
φ(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1)
• Filtering distribution as the proposal: q(x1:2) = p(x1|y1) p(x2|x1, y2)

W(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1) / ( p(x1|y1) p(x2|x1, y2) ) = p(y1) p(y2|x1)

• We sample from the proposal and compute the weight:

x1(i) ∼ p(x1|y1),  x2(i) ∼ p(x2|x1 = x1(i), y2),  W(x(i)) = p(y1) p(y2|x1(i))

• Note that this weight does not depend on x2
Examples of Proposal Distributions
φ(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1)
• Exact posterior as the proposal: q(x1:2) = p(x1|y1, y2) p(x2|x1, y2)

W(x1:2) = p(y1|x1) p(x1) p(y2|x2) p(x2|x1) / ( p(x1|y1, y2) p(x2|x1, y2) ) = p(y1) p(y2|y1)

• Note that this weight is constant, i.e.

⟨W(x1:2)²⟩ − ⟨W(x1:2)⟩² = 0
Variance reduction
q(x)                          W(x) = φ(x)/q(x)
p(x1) p(x2|x1)                p(y1|x1) p(y2|x2)
p(x1|y1) p(x2|x1)             p(y1) p(y2|x2)
p(x1|y1) p(x2|x1, y2)         p(y1) p(y2|x1)
p(x1|y1, y2) p(x2|x1, y2)     p(y1) p(y2|y1)

More accurate proposals
• gradually decrease the variance,
• but take more time to compute.
Rao Blackwellization
• Suppose x = (r, s), with

φy(s, r) = p(y|s, r) p(s|r) p(r) ≡ φy(s|r) φy(r)

• and suppose that for a given f(r, s) it is possible to compute the following integral (conditioned on r):

f̄(r) ≡ ∫ ds f(s, r) φy(s|r) = ⟨f(s, r)⟩φy(s|r)
Rao Blackwellization
• Consider two procedures for estimating I(f) = ⟨f(r, s)⟩φy(r,s)
1. q = q(r, s):

I(f) = ∫ ds dr f(r, s) (φy(r, s)/q(r, s)) q(r, s) ≈ (1/N) Σi f(r(i), s(i)) W(r(i), s(i)) = IN(f)

2. qRB = q(r), with f̄(r) ≡ ⟨f(s, r)⟩φy(s|r):

I(f) = ∫ dr ds f(s, r) φy(s|r) φy(r) = ∫ dr ⟨f(s, r)⟩φy(s|r) φy(r) = ∫ dr f̄(r) (φy(r)/q(r)) q(r)
≈ (1/N) Σi f̄(r(i)) φy(r(i))/q(r(i)) ≡ (1/N) Σi f̄(r(i)) W(r(i)) = INRB(f)

• The Rao-Blackwell theorem says

⟨(I(f) − INRB(f))²⟩ ≤ ⟨(I(f) − IN(f))²⟩
Rao Blackwellization Example
Suppose the posterior φy(s, r) = φy(r) φy(s|r) = φy(r) N(s; µr, Σr) is a conditional Gaussian.
• IN(f) (with q(s, r) = q(r) φy(s|r)):
– Sample r(i) ∼ q(r),  s(i) ∼ N(s; µr(i), Σr(i))
– Evaluate IN(f) = (1/N) Σi W(s(i), r(i)) f(s(i), r(i))
Rao Blackwellization Example
Suppose the posterior φy(s, r) = φy(r) N(s; µr, Σr) is a conditional Gaussian.
• INRB(f):
– Sample r(i) ∼ q(r)
– Evaluate f̄(r(i)) = ∫ ds f(s, r(i)) N(s; µr(i), Σr(i))
– Evaluate INRB(f) = (1/N) Σi (φy(r(i))/q(r(i))) f̄(r(i)) = (1/N) Σi W(r(i)) f̄(r(i))
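The two estimators can be compared on a toy conditionally Gaussian target. A sketch with made-up parameters (two regimes, f(s, r) = s, so the conditional mean f̄(r) = µr is available in closed form):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conditionally Gaussian posterior: phi_y(r) over two regimes,
# phi_y(s|r) = N(s; mu_r, sigma_r^2); target feature f(s, r) = s
phi_r = np.array([0.3, 0.7])     # regime weights (illustrative)
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])
# Exact answer: E[s] = 0.3*(-1) + 0.7*2 = 1.1

N = 50_000
q_r = np.array([0.5, 0.5])       # uniform proposal over r
r = rng.choice(2, size=N, p=q_r)
W = phi_r[r] / q_r[r]            # the weight depends only on r

# Plain IS: also sample s | r and average f = s
s = rng.normal(mu[r], sigma[r])
est_plain = np.sum(W * s) / np.sum(W)

# Rao-Blackwellised: replace f by its conditional mean f_bar(r) = mu_r
est_rb = np.sum(W * mu[r]) / np.sum(W)
```

Both estimates converge to 1.1, but the Rao-Blackwellised one integrates out s analytically and therefore has lower variance, as the theorem guarantees.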
Sequential Importance Sampling, Particle Filtering
Apply importance sampling to the SSM to obtain samples from the posterior p(x0:K|y1:K):

p(x0:K|y1:K) = (1/p(y1:K)) p(y1:K|x0:K) p(x0:K) ≡ (1/Zy) φ(x0:K)  (4)

Key idea: sequential construction of the proposal distribution q, possibly using the available observations y1:k, i.e.

q(x0:K|y1:K) = q(x0) Πk=1..K q(xk|x0:k−1, y1:k)
Sequential Importance Sampling
Due to the sequential nature of the model and the proposal, the importance weight function Wk ≡ W(x0:k) admits a recursive computation:

Wk = φ(x0:k)/q(x0:k|y1:k) = ( p(yk|xk) p(xk|xk−1) / q(xk|x0:k−1, y1:k) ) × φ(x0:k−1)/q(x0:k−1|y1:k−1)  (5)
= ( p(yk|xk) p(xk|xk−1) / q(xk|x0:k−1, y1:k) ) Wk−1 ≡ uk|0:k−1 Wk−1  (6)

Suppose we had an approximation to the posterior (in the sense ⟨f(x)⟩φ ≈ Σi Wk−1(i) f(x0:k−1(i))):

φ(x0:k−1) ≈ Σi Wk−1(i) δ(x0:k−1 − x0:k−1(i))

xk(i) ∼ q(xk|x0:k−1(i), y1:k)  Extend trajectory
Wk(i) = uk|0:k−1(i) Wk−1(i)  Update weight

φ(x0:k) ≈ Σi Wk(i) δ(x0:k − x0:k(i))
Example
• Prior as the proposal density:

q(xk|x0:k−1, y1:k) = p(xk|xk−1)

• The weight is given by

xk(i) ∼ p(xk|xk−1(i))  Extend trajectory
Wk(i) = uk|0:k−1(i) Wk−1(i) = ( p(yk|xk(i)) p(xk(i)|xk−1(i)) / p(xk(i)|xk−1(i)) ) Wk−1(i) = p(yk|xk(i)) Wk−1(i)  Update weight

• However, this scheme alone will not work, since we blindly sample from the prior. But . . .
Example (cont.)
• Perhaps surprisingly, interleaving importance sampling steps with (occasional) resampling steps makes the approach work quite well!

xk(i) ∼ p(xk|xk−1(i))  Extend trajectory
Wk(i) = p(yk|xk(i)) Wk−1(i)  Update weight
W̃k(i) = Wk(i)/Z̃k  Normalise (Z̃k ≡ Σi′ Wk(i′))
x0:k,new(j) ∼ Σi=1..N W̃k(i) δ(x0:k − x0:k(i))  Resample, j = 1 . . . N

• This results in a new representation:

φ(x0:k) ≈ (1/N) Σj Z̃k δ(x0:k − x0:k,new(j))

x0:k(i) ← x0:k,new(j),  Wk(i) ← Z̃k/N
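Putting the extend/weight/resample steps together gives the bootstrap particle filter. A self-contained sketch on a made-up scalar linear-Gaussian model (chosen only so the code stays short; for this model the Kalman filter would of course be exact):

```python
import numpy as np

rng = np.random.default_rng(3)

# Model: x_k ~ N(a x_{k-1}, q),  y_k ~ N(x_k, r)   (illustrative parameters)
a, q, r = 0.9, 0.5, 0.5
N, T = 5000, 30

# Simulate a short observation sequence
x_true = np.zeros(T)
y = np.zeros(T)
for k in range(1, T):
    x_true[k] = a * x_true[k - 1] + rng.normal(0, np.sqrt(q))
    y[k] = x_true[k] + rng.normal(0, np.sqrt(r))

# Bootstrap particle filter
x = rng.normal(0, 1, N)                          # particles for x_0
for k in range(1, T):
    x = a * x + rng.normal(0, np.sqrt(q), N)     # extend: sample from the prior
    logW = -0.5 * (y[k] - x) ** 2 / r            # weight: p(y_k|x_k) up to a constant
    W = np.exp(logW - logW.max())
    w_tilde = W / W.sum()                        # normalise
    x = x[rng.choice(N, size=N, p=w_tilde)]      # resample every step

filt_mean = x.mean()                             # estimate of E[x_T | y_{1:T}]
```

Resampling at every step is the simplest choice; in practice one often resamples only when the effective sample size drops, as the "optional but recommended" remark later suggests.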
Optimal proposal distribution
• The algorithm in the previous example is known as the bootstrap particle filter, or Sequential Importance Sampling/Resampling (SIS/SIR).
• Can we come up with a better proposal in a sequential setting?
– We are not allowed to move previously sampled points x0:k−1(i) (because in many applications we cannot even store them).
– "Better" in the sense of minimising the variance of the weight function Wk(x) (remember the optimality story in Eq. (3) and set f(x) = 1).
• The answer turns out to be the one-step filtering distribution

q(xk|x0:k−1, y1:k) = p(xk|xk−1, yk)  (7)
Optimal proposal distribution (cont.)
• The weight is given by

xk(i) ∼ p(xk|xk−1(i), yk)  Extend trajectory
Wk(i) = uk|0:k−1(i) Wk−1(i)  Update weight

uk|0:k−1(i) = ( p(yk|xk(i)) p(xk(i)|xk−1(i)) / p(xk(i)|xk−1(i), yk) ) × ( p(yk|xk−1(i)) / p(yk|xk−1(i)) )
= ( p(yk, xk(i)|xk−1(i)) / p(xk(i), yk|xk−1(i)) ) p(yk|xk−1(i)) = p(yk|xk−1(i))
A Generic Particle Filter
1. Generation:
Compute the proposal distribution q(xk|x0:k−1(i), y1:k).
Generate offspring for i = 1 . . . N:

x̂k(i) ∼ q(xk|x0:k−1(i), y1:k)

2. Evaluate the importance weights:

Wk(i) = ( p(yk|x̂k(i)) p(x̂k(i)|xk−1(i)) / q(x̂k(i)|x0:k−1(i), y1:k) ) Wk−1(i),  x0:k(i) = (x̂k(i), x0:k−1(i))

3. Resampling (optional but recommended):
Normalise the weights: W̃k(i) = Wk(i)/Z̃k,  Z̃k ≡ Σj Wk(j)
Resample: x0:k,new(j) ∼ Σi=1..N W̃k(i) δ(x0:k − x0:k(i)),  j = 1 . . . N
Reset: x0:k(i) ← x0:k,new(j),  Wk(i) ← Z̃k/N
Switching State space models - Segmentation -
Changepoint detection
Different names for the same model
• Jump Markov Linear System
• Switching Kalman filter models
• Conditionally Gaussian switching state space models
Segmentation and Changepoint detection
• Complicated processes can be modeled by using simple processes with occasional regime switches
– Piecewise constant
[Figure: noisy piecewise-constant signal with several regime switches]
Segmentation and Changepoint detection
– Piecewise linear
[Figure: noisy piecewise-linear signal with several regime switches]
• What is the true state of the process given noisy data ?
• Where are the changepoints ?
• How many changepoints ?
Conditionally Gaussian Changepoint Model
rk ∼ p(rk|rk−1)  Changepoint flags, rk ∈ {new, reg}
θk ∼ [rk = reg] f(θk|θk−1) + [rk = new] π(θk)  Latent state (transition vs. reinitialisation)
yk ∼ p(yk|θk)  Observations
r1 r2 r3 r4 r5
θ0 θ1 θ2 θ3 θ4 θ5
y1 y2 y3 y4 y5
Example: Piecewise constant signal
[Figure: noisy piecewise-constant signal]

θ0 ∼ N(µ, P)
rk|rk−1 ∼ p(rk|rk−1)
θk|θk−1, rk ∼ [rk = 0] δ(θk − θk−1)  (reg)  +  [rk = 1] N(m, V)  (new)
yk|θk ∼ N(θk, R)
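Sampling from this generative model is straightforward; a sketch with illustrative parameter values (the changepoint probability, reinitialisation mean/variance, and noise level below are all made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Piecewise-constant changepoint model:
#   r_k = 1 ("new") with prob p_cp, else r_k = 0 ("reg")
#   theta_k = theta_{k-1} if reg;  theta_k ~ N(m, V) if new
#   y_k ~ N(theta_k, R)
p_cp, m, V, R = 0.05, 0.0, 25.0, 0.5
K = 100

theta = np.zeros(K)
y = np.zeros(K)
r = np.zeros(K, dtype=int)
theta_prev = rng.normal(m, np.sqrt(V))   # theta_0, here drawn from N(m, V)
for k in range(K):
    r[k] = rng.random() < p_cp
    theta[k] = rng.normal(m, np.sqrt(V)) if r[k] else theta_prev
    y[k] = theta[k] + rng.normal(0, np.sqrt(R))
    theta_prev = theta[k]
```

Between changepoints the latent level is exactly constant, which is what makes the exact inference discussed later tractable.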
A More General Model
Each segment is modelled by a linear dynamical system
θ0 ∼ N(µ, P)
rk ∼ p(rk|rk−1)
θk ∼ [rk = 0] N(Aθk−1, Q)  (reg)  +  [rk = 1] N(m, V)  (new)
yk ∼ N(Cθk, R)
Example: piecewise linear regimes
[Figure: noisy piecewise-linear signal]

A = [1 1; 0 1],  Q = [0 0; 0 0],  C = [1 0]

k                        1           2           3           4            5
rk ∼ p(rk|rk−1)          1           0           0           1            0
θk                    (0.0, 1.0)  (1.0, 1.0)  (2.0, 1.0)  (8.0, −0.5)  (7.5, −0.5)
yk ∼ N(Cθk, R)          0.1        0.97        2.03        8.1          7.3
Conditionally Gaussian Switching State Space Model
rk ∼ p(rk|rk−1)  Regime label
θk ∼ p(θk|θk−1, rk)  Latent state
yk ∼ p(yk|θk)  Observations
r1 r2 r3 r4 r5
θ0 θ1 θ2 θ3 θ4 θ5
y1 y2 y3 y4 y5
• . . . one can model outliers or sensor failures by adding links from r to y
Example: piecewise quadratic splines (with mirror slopes)
[Figure: piecewise quadratic spline signal y and its second state component θ(2)]

rk ∼ p(rk|rk−1)  Regime label
θk ∼ N(θk; A(rk−1→rk) θk−1, Q(rk−1→rk))  Latent state
yk ∼ N(yk; Cθk, R)  Observations
Example: piecewise quadratic splines (with mirror slopes)
[Figure: piecewise quadratic spline signal y and its second state component θ(2)]

A1→1 = [1 ∆ ∆²/2; 0 1 ∆; 0 0 1]
A1→2 = A2→2 = [1 ∆ −∆²/2; 0 1 −∆; 0 0 1]
A2→1 = [1 ∆ 0; 0 1 0; 0 0 0]
Q1→1 = Q1→2 = Q2→2 = 0,  Q2→1 = [0 0 0; 0 0 0; 0 0 σ²]
C = [1 0 0]
Example: Audio Signal Analysis
rk|rk−1 ∼ p(rk|rk−1)
θk|θk−1, rk ∼ [rk = 0] N(Aθk−1, Q)  (reg)  +  [rk = 1] N(0, S)  (new)
yk|θk ∼ N(Cθk, R)

A = blockdiag(Gω, G2ω, . . . , GHω),  Gω = ρk [cos(ω) −sin(ω); sin(ω) cos(ω)]

where 0 < ρk < 1 is a damping factor and C = [1 0 1 0 . . . 1 0] is a projection matrix.
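The block-diagonal structure of A is easy to construct programmatically. A sketch (the function names are mine, not from the talk, and ρ is fixed here for simplicity):

```python
import numpy as np

def rotation_block(omega, rho):
    """Damped 2x2 rotation G_omega = rho * [[cos w, -sin w], [sin w, cos w]]."""
    c, s = np.cos(omega), np.sin(omega)
    return rho * np.array([[c, -s], [s, c]])

def harmonic_transition(omega, H, rho=0.99):
    """Block-diagonal A with rotations at omega, 2*omega, ..., H*omega."""
    A = np.zeros((2 * H, 2 * H))
    for h in range(1, H + 1):
        A[2 * (h - 1):2 * h, 2 * (h - 1):2 * h] = rotation_block(h * omega, rho)
    return A

H = 3                                   # number of harmonics (illustrative)
A = harmonic_transition(2 * np.pi * 0.01, H)
C = np.tile([1.0, 0.0], H)              # projection: sum one component per harmonic
```

Each 2x2 block rotates one harmonic's phase per time step while the damping ρ < 1 shrinks its amplitude, so a "reg" step propagates a decaying harmonic signal.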
Audio Signal Analysis
[Figure: regime indicators rk and latent frequencies xk versus time k]

Application to music transcription

[Figure: piano-roll transcription of an audio signal; Cemgil et al. 2006, IEEE TASLP]
Factorial Changepoint model
r0,ν ∼ C(r0,ν; π0,ν),  θ0,ν ∼ N(θ0,ν; µν, Pν)
rk,ν|rk−1,ν ∼ C(rk,ν; πν(rk−1,ν))  Changepoint indicator
θk,ν|θk−1,ν ∼ N(θk,ν; Aν(rk,ν) θk−1,ν, Qν(rk,ν))  Latent state
yk|θk,1:W ∼ N(yk; Ck θk,1:W, R)  Observation

[Graphical model: W parallel changepoint chains (r0:K,ν, θ0:K,ν), ν = 1 . . . W, jointly generating the observations yk]
Application: Analysis of Polyphonic Audio
[Figure: time–frequency representation of polyphonic audio]

• Each latent changepoint process ν = 1 . . . W corresponds to a "piano key".
The indicators r1:W,1:K encode a latent "piano roll".
Sequential Inference Problems
• Filtering: p(θk|y1:k) ∝ Σr1:k ∫ dθ0:k−1 p(y1:k|θ0:k) p(θ0:k|r1:k) p(r1:k)
• Viterbi path (e.g. Raphael 2001):

(r1:k, θ1:k)* = argmax_{r1:k, θ1:k} p(y1:k|θ0:k) p(θ0:k|r1:k) p(r1:k)

• Best segmentation (MMAP):

r1:k* = argmax_{r1:k} ∫ dθ0:k p(y1:k|θ0:k) p(θ0:k|r1:k) p(r1:k)

– Each configuration of r1:K encodes one of 2^K possible models, i.e. one segmentation.
• All problems are similar, but MMAP is usually harder because max and ∫ do not commute.
Exact Inference in switching state space models
• In general, exact inference is intractable (NP-hard)
– Conditional Gaussians are not closed under marginalisation
⇒ Unlike HMMs or KFMs, summing over rk does not simplify the filtering density
⇒ The number of Gaussian kernels needed to represent the exact filtering density p(rk, θk|y1:k) grows exponentially

[Figure: mixture of Gaussian kernels and their log-weights]
Sequential Inference
• Filtering: Mixture Kalman Filter (Rao-Blackwellised PF) (Chen and Liu 2000)
• MMAP: breadth-first search with greedy or randomised pruning; multi-hypothesis tracker (MHT)
Mixture Kalman Filter
• Particles are conditional Gaussians:

ψk(i)(θk; r1:k(i)) ≡ W(r1:k(i)) exp( −(1/2) θk⊤ K(i) θk + θk⊤ h(i) + g(i) )

• Use Rao-Blackwellisation to compute the optimal proposal distribution (the marginal filtering density) by Kalman predict and update steps:

q(rk|y1:k) ∝ ∫ dθk p(yk|θk) ∫ dθk−1 p(θk|θk−1, rk) p(rk|rk−1(i)) ψk−1(i)(θk−1; r1:k−1(i))
Exact Inference for Changepoint detection?
• Exact inference is achievable in polynomial time/space
– Intuition: when a changepoint occurs, the old state vector is reinitialised
⇒ The number of Gaussian kernels grows only polynomially (see, e.g., Barry and Hartigan 1992; Digalakis et al. 1993; Ó Ruanaidh and Fitzgerald 1996; Gustafsson 2000; Fearnhead 2003)
r1 = 1 r2 = 0 r3 = 0 r4 = 1 r5 = 0
θ0 θ1 θ2 θ3 θ4 θ5
y1 y2 y3 y4 y5
– The same structure can be exploited for the MMAP problem
⇒ Trajectories r1:k(i) that are dominated in terms of the conditional evidence p(y1:k, r1:k(i)) can be discarded without destroying optimality