Generative models for audio and music processing

(1)


A. Taylan Cemgil

Signal Processing and Communications Lab.

26 June 2007

Microsoft Research, Cambridge

(2)

Motivation

• Audio: Core information processing modality. Central to the scientific understanding of human hearing abilities.

• Broad spectrum of applications

– Speech processing, Sound localisation,

– Hearing aids, Auditory Scene Analysis,

– Music information retrieval, Music performance systems

• Realistic generative modelling appears to be more feasible than, for example, video.

⇒ Combined with modern approaches to inference in dynamical systems,

we may make significant progress in this area.

(3)

Traditional Analysis

• Digital filtering theory, Transform methods/Fourier techniques

• Deterministic model based, System identification

• Procedural – no clear distinction between “what” and “how”

(4)

Statistical Approaches

• Probabilistic

• Hierarchical signal models to incorporate prior knowledge/inspiration from various sources:

– Physics

(acoustics, physical models, ...)

– Studies of human cognition and perception

(masking, psychoacoustics, ...)

– Musicology

(musical constructs, harmony, tempo, form ...)

– Neuroscience

(efficient auditory representations, ...)

• Consistent framework for developing inference algorithms

• Practical obstacles such as memory requirement, computation time

(5)

Generative Models for audition

• Computer audition ⇔ inverse synthesis

$$p(\text{Structure} \mid \text{Observations}) \propto p(\text{Observations} \mid \text{Structure})\, p(\text{Structure})$$

– restoration, interpolation

– source separation,

– transcription,

– localisation,

– identification,

– coding, compression

– resynthesis, cross synthesis

– ...

(6)

Audio Restoration/Interpolation

• Estimate missing samples given observed ones

• Restoration, concatenative expressive speech synthesis, ...


(7)

Source Separation

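[Figure: graphical model with sources $s_{k,1} \dots s_{k,n} \dots s_{k,N}$ and observations $x_{k,1} \dots x_{k,M}$ in a plate $k = 1 \dots K$, and mixing parameters $a_1, r_1, \dots, a_M, r_M$]

• Joint estimation of sources, channel noise and mixing system

$$x_{k,1:M} \sim \mathcal{N}(x_{k,1:M};\, A s_{k,1:N},\, R)$$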

(8)

Polyphonic Music Transcription

• from sound ...

[Figure: spectrogram of a music recording; axes t/sec and f/Hz]

• ... to score

(9)

Polyphonic Music Transcription

• Extracting a human readable representation from music data

– Auditory scene analysis task with a lot of structure (both natural and man-made)

– Reminiscent of the “cocktail party” problem in speech recognition or “object recognition” in computer vision

– Varies in modelling effort and computational complexity

∗ duets, simple piano music (easier)

∗ symphonies, contemporary music (hard)

• Applications

– Music Education, Interactive Music Performance

– Musicology, Music Perception and Cognition Research

– Music Information Retrieval: content-based querying and retrieval, music recommendation and playlist generation, automatic classification, music summarisation, annotation

(10)

Modelling and Computational issues

• Hierarchical

– Signal level: pitch, onsets, timbre

– Symbolic level: melody, motives, harmony, chords, tonality, rhythm, beat, tempo, articulation, instrumentation, voice ...

– Cognitive level: expression, genre, form, style, mood, emotion

• Uncertainty

– Parameter Learning: Which pitch, rhythm, tempo, meter, time signature ... ?

– Model Selection: How many notes, harmonics, onsets, sections ... ?

(11)

Generative Models for Music

(12)

Generative Models for Music

Score Expression

Piano-Roll

Signal

(13)

Hierarchical Modeling of Music

[Figure: dynamic graphical model unrolled over time steps $1, 2, \dots, t$, with global chains $M$, $v_t$, $k_t$, $h_t$, $m_t$, per-voice chains $g_{j,t}$, $r_{j,t}$, $n_{j,t}$, $x_{j,t}$, $y_{j,t}$, and observations $y_t$]

(14)

Research Questions

• What kinds of prior knowledge and modeling techniques are useful?

• How can we speed up inference so that these models become competitive with more traditional approaches?

(15)

Outline

• Audio Signal, Time frequency representations, Spectrogram

• Transform Domain

– Gamma Chains and Fields

∗ Denoising

∗ Source Separation

∗ Score Following, transcription,

∗ Tempo, rhythm, meter estimation

(16)

Outline

• Time Domain/State Space

– Kalman Filter: Probabilistic Phase Vocoder

∗ Interpolation/Restoration

– Switching State space models, Changepoint models

∗ Pitch tracking, Onset/offset detection

– Factorial Changepoint Model

∗ Bayesian model selection

∗ Polyphonic pitch tracking

• Conclusions and Final Remarks

(17)

Collaborators

• Nick Whiteley, Cambridge

• Paul Peeling, Cambridge

• Onur Dikmen, Boğaziçi, Istanbul

• Simon Godsill, Cambridge

• Cedric Fevotte, ENST, Paris Telecom

• David Barber, UCL

• Bert Kappen, Nijmegen, The Netherlands

(18)

Audio Signal

[Figure: waveforms $x_t$ against $t$ for two recordings: (Speech) and (Piano)]

$$x = x_1 \dots x_t \dots$$

(19)

Signal Models for Audio

• “Time Domain”

– State space modeling,

– Conditional Linear Dynamical Systems, Gaussian processes (e.g. AR, ARMA)

– Flexible, Physically realistic,

– Analysis down to sample precision (if required)

– Computationally quite heavy

• “Frequency Domain”

– Models on (orthogonal) transform coefficients, Energy compaction

– Practical, can make use of fast transforms (FFT, MDCT, ...)

– Inherent limitations (analysis windows, frequency resolution)

(20)

Spectrogram

[Figure: spectrograms of two recordings, (Speech) and (Piano); axes t/sec and f/Hz]

• A linear expansion using a collection of basis functions $\phi_k(t)$ centered around time-frequency atom $k = k(\tau, \omega)$ at time $\tau$ and frequency $\omega$

$$x(t) = \sum_k s_k \phi_k(t)$$

• Popular transforms for audio: STFT, MDCT

• Spectrogram displays $2 \log |s_k|$ or $|s_k|^2$ (of STFT)
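The expansion and the spectrogram display can be illustrated with a minimal NumPy sketch. It assumes an STFT with a Hann window; the frame length, hop size, test tone and the names `stft` and `log_power_spectrogram` are illustrative choices, not taken from the slides.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window.

    Returns an array of shape (num_frames, frame_len // 2 + 1); its entries
    play the role of the expansion coefficients s_k.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=1)

def log_power_spectrogram(x, frame_len=256, hop=128, floor=1e-12):
    """log |s_k|^2, with a small floor to avoid log(0)."""
    s = stft(x, frame_len, hop)
    return np.log(np.abs(s) ** 2 + floor)

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz for one second.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000.0 * t)

S = log_power_spectrogram(x)
peak_bin = int(np.argmax(S.mean(axis=0)))   # strongest frequency bin
peak_hz = peak_bin * fs / 256               # bin index converted to Hz
```

With these settings the bin spacing is 8000/256 = 31.25 Hz, so the tone falls exactly on one bin.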

(21)

Models for time-frequency Energy distributions


• Mask models

(Roweis 2001, Reyes-Gomez, Jojic, Ellis 2005 ...)

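$$X_{\nu,\tau} = [r_{\nu,\tau} = 0]\, S^{(0)}_{\nu,\tau} + [r_{\nu,\tau} = 1]\, S^{(1)}_{\nu,\tau}$$

Spectrogram = Mask × Source$^{0}$ + (1 − Mask) × Source$^{1}$

[Figure: a spectrogram shown as the sum of its masked source components]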

– ... however, sources do overlap in time and frequency

(23)

Prior structures on time-frequency Energy distributions

• Main Idea: Spectrogram is a point estimate of the energy at a time-frequency atom $k(\nu, \tau)$. We place a suitable prior on the variance of transform coefficients $s_k$

– The inverse Gamma distribution is a natural candidate as a prior

• There is significant structure on a spectrogram

– Need to introduce correlations among variances of harmonically and temporally related time-frequency atoms

$$p(s \mid v)\, p(v) = \Big( \prod_k p(s_k \mid v_k) \Big)\, p(v)$$

– Gabor regression (Wolfe and Godsill 2005): v discrete

(24)

Example, one channel source separation

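[Figure: graphical model with variances $v_{k,1} \dots v_{k,N}$, sources $s_{k,1} \dots s_{k,N}$ and observation $x_k$ in a plate $k = 1 : K$]

$$s_{k,n} \mid v_{k,n} \sim \mathcal{N}(s_{k,n};\, 0,\, v_{k,n})$$

$$x_k \mid s_{k,1:N} = \sum_{n=1}^{N} s_{k,n}$$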

• Straightforward application of Bayes’ theorem yields

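$$p(s_n \mid x, v_{1:N}) = \mathcal{N}\big(s_n;\, \kappa_n x,\, v_n (1 - \kappa_n)\big), \qquad \kappa_n = \frac{v_n}{\sum_{n'} v_{n'}} \quad \text{or} \quad \frac{\langle 1/v_n \rangle^{-1}}{\sum_{n'} \langle 1/v_{n'} \rangle^{-1}}$$

• Each source coefficient $s_n$ gets a fraction $\kappa_n$ of the observation $x$

• Model additive in variances

$$\int ds_{1:N}\; p(x \mid s_{1:N})\, p(s_{1:N} \mid v_{1:N})\, p(v_{1:N}) = \mathcal{N}\Big(x;\, 0,\, \sum_{n=1}^{N} v_n\Big)\, p(v_{1:N})$$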

• Unconditional marginal p(x) is heavy tailed (e.g. Student-t)
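For known variances, the posterior split of one observation among the sources can be sketched as follows. This is a minimal illustration under the Gaussian model of this slide; the function name `separate_one_channel` and the synthetic variances are assumptions made for the example.

```python
import numpy as np

def separate_one_channel(x, v):
    """Posterior of N zero-mean Gaussian sources that sum to x.

    x : observations, shape (K,)
    v : source variances, shape (K, N)
    Returns the fractions kappa, the posterior means kappa * x and the
    posterior variances v * (1 - kappa), each of shape (K, N).
    """
    kappa = v / v.sum(axis=1, keepdims=True)
    s_mean = kappa * x[:, None]
    s_var = v * (1.0 - kappa)
    return kappa, s_mean, s_var

# Synthetic example: K coefficients, N sources with random variances.
rng = np.random.default_rng(0)
K, N = 5, 3
v = rng.gamma(2.0, 1.0, size=(K, N))
s = rng.normal(0.0, np.sqrt(v))      # true sources
x = s.sum(axis=1)                    # single-channel mixture

kappa, s_mean, s_var = separate_one_channel(x, v)
```

The posterior means always add up to the observation, so the observed energy is redistributed among the sources rather than created or lost.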

(26)

Gamma Distribution

• Gamma Distribution with shape a and scale z

$$\mathcal{G}(\lambda; a, z) \equiv \exp\big( (a - 1) \log \lambda - z^{-1} \lambda + a \log z^{-1} - \log \Gamma(a) \big)$$

– Conjugate prior for Gaussian precision, Poisson intensity and Inverse Gamma scale

– Sufficient statistics $\langle \lambda \rangle_{\mathcal{G}} = a z$ and $\langle \log \lambda \rangle_{\mathcal{G}} = \Psi(a) - \log z^{-1}$

• Inverse Gamma Distribution

[Figure: density curves for a=1, b=1; a=1, b=0.5; a=2, b=1]
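The Gamma density and its two expected sufficient statistics can be checked numerically. The sketch below uses the shape/scale parametrisation stated on this slide; the grid integration, the finite-difference digamma and the values a = 3, z = 0.5 are illustrative assumptions.

```python
import math
import numpy as np

def gamma_logpdf(lam, a, z):
    """log G(lam; a, z) with shape a and scale z."""
    return ((a - 1.0) * np.log(lam) - lam / z
            - a * math.log(z) - math.lgamma(a))

def digamma_fd(a, h=1e-5):
    """Finite-difference approximation to Psi(a) = d/da log Gamma(a)."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

a, z = 3.0, 0.5
lam = np.linspace(1e-6, 40.0, 400001)     # grid covering almost all the mass
p = np.exp(gamma_logpdf(lam, a, z))
dlam = lam[1] - lam[0]

total = np.sum(p) * dlam                  # should be close to 1
mean_lam = np.sum(lam * p) * dlam         # compare with a * z
mean_log = np.sum(np.log(lam) * p) * dlam # compare with Psi(a) + log z
```

Note that $\Psi(a) - \log z^{-1}$ is the same as $\Psi(a) + \log z$, which is the form compared against here.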

(28)

Gamma Chains

We define an Inverse Gamma-Markov chain for k = 1 . . . K as follows

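$$v_k \mid z_k \sim IG(v_k;\, a,\, z_k / a)$$

$$z_{k+1} \mid v_k \sim IG(z_{k+1};\, a_z,\, v_k / a_z)$$

[Figure: chain $z_1 \cdots v_{k-1} - z_k - v_k - z_{k+1} \cdots$ with alternating edge weights $a_z$ and $a$]

$$p(z, v; a) \propto \psi(b_z^{-1}, a_z z_1^{-1}) \prod_k \phi(v_k^{-1}; a + a_z)\, \phi(z_k^{-1}; a + a_z)\, \psi(a z_k^{-1}, v_k^{-1})\, \psi(a_z v_k^{-1}, z_{k+1}^{-1})$$

$$\phi(\xi; \alpha) = \exp((\alpha + 1) \log \xi) \quad \text{(Singletons)} \qquad \psi(\xi, \eta) = \exp(-\xi \eta) \quad \text{(Pairwise)}$$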
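Forward sampling from the chain can be sketched as below. The sketch assumes that $IG(v; a, b)$ means $1/v \sim \mathcal{G}(a, b)$ with scale $b$, which agrees with the pairwise potentials above; the fixed start value `z1` and the function names are hypothetical.

```python
import numpy as np

def sample_ig(rng, a, b):
    """Draw v ~ IG(v; a, b), assuming 1/v ~ Gamma(shape=a, scale=b)."""
    return 1.0 / rng.gamma(a, b)

def sample_ig_chain(K, a, a_z, z1=1.0, seed=0):
    """Sample z_1, v_1, z_2, v_2, ..., z_K, v_K from the chain."""
    rng = np.random.default_rng(seed)
    z = np.empty(K)
    v = np.empty(K)
    z[0] = z1                                  # assumed fixed start value
    for k in range(K):
        v[k] = sample_ig(rng, a, z[k] / a)     # v_k | z_k
        if k + 1 < K:
            z[k + 1] = sample_ig(rng, a_z, v[k] / a_z)  # z_{k+1} | v_k
    return z, v

z, v = sample_ig_chain(K=2000, a=10.0, a_z=10.0)
log_v = np.log(v)
# Correlation between consecutive log-variances; positive for coupled chains.
lag1_corr = np.corrcoef(log_v[:-1], log_v[1:])[0, 1]
```

Each step inverts twice (v given z, then z given v), which is why consecutive variances end up positively correlated even though each individual link is negatively correlated.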

(29)

The need for auxiliary variables z

• We want conjugacy (full conditional $p(v_k \mid v_{k-1}, v_{k+1})$ is IG) and positive correlation between $v_k$ and $v_{k-1}$

• Conjugate, but no positive correlation

$$v_k \mid v_{k-1} \sim IG(v_k;\, a,\, v_{k-1} / a)$$

• Positive correlation but not conjugate

$$v_k \mid v_{k-1} \sim IG(v_k;\, a,\, (v_{k-1} a)^{-1})$$

(30)

IG − IG Distribution

(Bernardo-Smith)

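• Scale Mixture of Inverse Gamma

$$p(v_k \mid v_{k-1}) = \int dz_k\, IG(v_k; a, z_k / a)\, IG(z_k; a_z, v_{k-1} / a_z) = \frac{\Gamma(a + a_z)}{\Gamma(a_z)\, \Gamma(a)} \, \frac{(a_z v_{k-1}^{-1})^{a_z}\, (a v_k^{-1})^{a}}{(a_z v_{k-1}^{-1} + a v_k^{-1})^{a_z + a}} \, v_k^{-1}$$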

[Figure: four panels of $p(v_k \mid v_{k-1})$ with axes $\log_{10} v_{k-1}$ and $\log_{10} v_k$ from −5 to 5, for $(a, a_z) = (0.1, 0.1)$, $(2, 2)$, $(0.1, 10)$ and $(3, 0.3)$]

• Small $a$ and $a_z$ ⇒ weak coupling.

• $a_z / a < 1$: systematic drift towards small variances ⇒ potentially useful for damping effects in audio

(31)

Example: Nonstationary Gaussian Process

• Time varying variance – Stochastic Volatility Model

$$v_{1:K} \sim \text{Inverse Gamma Chain}(v_{1:K}; a, a_z) \qquad y_k \sim \mathcal{N}(y_k; 0, v_k)$$

[Figure: variances $v_k$ (True and VB estimate) and observations $y_k$ against $k$, for $k$ = 0 to 1000]

(32)

Example: Nonhomogeneous Poisson Process

• Time varying intensity (L : length of a small interval where the intensity is constant)

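$$\lambda_{1:K} \sim \text{Gamma Chain}(\lambda_{1:K}; a, a_z)$$

$$c_k \mid \lambda_k \sim \mathcal{PO}(c_k; \lambda_k L) \propto \exp(c_k \log \lambda_k - L \lambda_k)$$

[Figure: intensity $\lambda_k$ (True and VB estimate) and counts $c_k$ against arrival time]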

(33)

Inference : Structured Mean Field, Variational Bayes

(MacKay 1995, Attias 1999, Wiegerinck 2000, Ghahramani and Beal 2000, Winn and Bishop 2005)

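Approximate P with a tractable distribution $Q = \prod_\alpha Q_\alpha$

$$KL(Q \| P) = \langle \log Q \rangle_Q - \Big\langle \log \frac{1}{Z_x} \phi(S) \Big\rangle_Q$$

where $\langle f(x) \rangle_{p(x)} \equiv \int dx\, p(x) f(x)$. Using KL ≥ 0, we obtain a lower bound on the evidence

$$\log Z_x \ge \langle \log \phi(S) \rangle_Q - \langle \log Q \rangle_Q$$

This leads to a set of fixed point equations

$$Q_\alpha \propto \exp\big( \langle \log \phi(S) \rangle_{Q_{\neg\alpha}} \big)$$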

Depending upon the initial starting point Q⁽⁰⁾, the iteration converges to a local maximum of the lower bound.
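As a toy illustration of such fixed point equations, consider a factorised approximation $Q(x_1) Q(x_2)$ to a correlated bivariate Gaussian. This textbook example is not from the slides; for it, each factor is Gaussian with precision $\Lambda_{ii}$ and only the means need iterating. The target parameters below are arbitrary.

```python
import numpy as np

def mean_field_gaussian(mu, Lam, num_iters=50):
    """Fixed-point updates for Q(x1) Q(x2) approximating N(mu, inv(Lam)).

    Each update sets one factor to exp <log p>_{other factor}; for a
    Gaussian target this is a Gaussian with precision Lam[i, i] whose
    mean depends on the current mean of the other factor.
    """
    m = np.zeros(2)                       # initial means, Q^(0)
    for _ in range(num_iters):
        m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
        m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])
    variances = 1.0 / np.diag(Lam)        # variances of the two factors
    return m, variances

mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)                # precision matrix of the target

m, q_var = mean_field_gaussian(mu, Lam)
```

In this example the means converge to the true mean, while the factor variances come out smaller than the true marginal variances, a known property of minimising KL(Q||P) with a factorised Q.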

(34)

Gibbs Sampler (with grouping)

• MCMC: Construct a Markov chain with stationary distribution as the desired posterior P

• Gibbs sampler: We visit each block $\alpha \in C$ and sample from full conditionals $C_\alpha \sim p(C_\alpha \mid C_{\neg\alpha})$

• For a graphical model, full conditionals are functions of the variables of the Markov Blanket

• Algorithmically very similar to VB
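A grouped Gibbs sampler for an inverse Gamma chain with Gaussian observations $y_k \sim \mathcal{N}(0, v_k)$ might look as follows. This is a sketch under assumptions: $IG(v; a, b)$ is taken to mean $1/v \sim \mathcal{G}(a, b)$ with scale $b$, the chain is started from a fixed hypothetical value `v0`, and the hyperparameters and synthetic data are illustrative. Given all $v$ the $z_k$ are conditionally independent and vice versa, so each group is sampled in one vectorised step.

```python
import numpy as np

def gibbs_variance_chain(y, a=2.0, a_z=2.0, v0=1.0,
                         num_sweeps=300, burn_in=100, seed=0):
    """Posterior mean of v_{1:K} estimated by blocked Gibbs sampling."""
    rng = np.random.default_rng(seed)
    K = len(y)
    v = np.ones(K)
    v_sum = np.zeros(K)
    for sweep in range(num_sweeps):
        # Block 1: z_k | v_{k-1}, v_k, where
        # 1/z_k ~ Gamma(shape a + a_z, rate a_z / v_{k-1} + a / v_k).
        v_prev = np.concatenate(([v0], v[:-1]))
        rate_z = a_z / v_prev + a / v
        z = 1.0 / rng.gamma(a + a_z, 1.0 / rate_z)
        # Block 2: v_k | z_k, z_{k+1}, y_k, where
        # 1/v_k ~ Gamma(shape a + a_z + 1/2,
        #               rate a / z_k + a_z / z_{k+1} + y_k^2 / 2).
        shape_v = np.full(K, a + 0.5)
        rate_v = a / z + 0.5 * y ** 2
        shape_v[:-1] += a_z              # the last v_K has no z_{K+1}
        rate_v[:-1] += a_z / z[1:]
        v = 1.0 / rng.gamma(shape_v, 1.0 / rate_v)
        if sweep >= burn_in:
            v_sum += v
    return v_sum / (num_sweeps - burn_in)

# Synthetic data: quiet first half, loud second half.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(0.0, 10.0, 100)])
v_mean = gibbs_variance_chain(y)
```

Replacing each sampling step by the corresponding expectation would turn the same loop into a variational update, which is the sense in which the two algorithms are similar.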

(35)

Inverse Gamma Markov Random Field

• A conditional Markov Random Field

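• IGMRF on $\xi = \{\xi_i\}_{i \in V}$, given

– an undirected graph with vertex set $V$ and undirected edge set $E$

– a set of connection weights $a = \{a_{i,j}\}_{(i,j) \in E}$ for $i, j \in V$ and $i \neq j$

$$p(\xi; a) = \frac{1}{Z_a} \prod_{i \in V} \phi\Big(\xi_i^{-1}; \sum_j a_{i,j}\Big) \prod_{(i,j) \in E} \psi\big(\xi_i^{-1}, (a_{i,j}/2)\, \xi_j^{-1}\big)$$

$$\phi(\xi; \alpha) = \exp((\alpha + 1) \log \xi) \quad \text{(Singleton)}$$

$$\psi(\xi, \eta) = \exp(-\xi \eta) \quad \text{(Pairwise)}$$

• Gamma field: $\phi(\xi_i; \sum_j a_{i,j})$, $\psi(\xi_i, (a_{i,j}/2)\, \xi_j)$ but with $\phi(\xi; \alpha) = \exp((\alpha - 1) \log \xi)$.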

(36)

Possible Model Topologies

(37)

Harmonic-Transient Decomposition

• Horizontal : Tie across time : harmonic continuity

• Vertical : Tie across frequency : transients, pulse like sounds

(38)

Harmonic-Transient Decomposition

[Figure: spectrograms over Time ($\tau$) and Frequency Bin ($\nu$): $X_{org}$ (Original), $S_{hor}$ (Hor), $S_{ver}$ (Vert)]

(39)

Denoising - Piano

[Figure: spectrograms $X$ (Noisy) and $X_{org}$ (Original), followed by denoised reconstructions for the (Hor), (Ver), (Band) and (Grid) topologies]

(40)

Denoising - Speech

[Figure: spectrograms $X$ (Noisy) and $X_{org}$ (Original), followed by denoised reconstructions for the (Hor), (Ver), (Band) and (Grid) topologies]

(41)

Source Separation

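[Figure: graphical model with sources $s_{k,1} \dots s_{k,n} \dots s_{k,N}$ and observations $x_{k,1} \dots x_{k,M}$ in a plate $k = 1 \dots K$, and mixing parameters $a_1, r_1, \dots, a_M, r_M$]

• Joint estimation of sources, channel noise and mixing system

$$x_{k,1:M} \sim \mathcal{N}(x_{k,1:M};\, A s_{k,1:N},\, R)$$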

(42)

Multichannel Source Separation

• Hierarchical Prior Model (Fevotte and Godsill 2005, Cemgil et al. 2006)

$$\lambda_n \sim \mathcal{G}(\lambda_n; a_\lambda, b_\lambda), \qquad n = 1 \dots N$$

$$v_{k,n} \sim IG(v_{k,n}; \nu/2, 2/(\nu \lambda_n))$$

$$s_{k,n} \sim \mathcal{N}(s_{k,n}; 0, v_{k,n})$$

$$x_{k,m} \sim \mathcal{N}(x_{k,m}; a_m^\top s_{k,1:N}, r_m), \qquad k = 1 \dots K, \quad m = 1 \dots M$$

$$a_m \sim \mathcal{N}(a_m; \cdots), \qquad r_m \sim IG(r_m; \cdots)$$

(43)

Equivalent Gamma MRF

• A tree for each source

• $\lambda_n$ can be interpreted as the overall energy of source $n$

(44)

Factor Graph

(Kschischang et al. 2001, Frey and Jojic 2005)

• Many approximation algorithms (LBP, VMP, EP, EC, Gibbs Sampling...) can be formulated as message passing schemata

[Figure: factor graph over $\lambda_1 \dots \lambda_N$, $v_{k,1} \dots v_{k,N}$ and $s_{k,1:N}$ (plate $k = 1 \dots K$), with mixing parameters $a_1, r_1, \dots, a_M, r_M$]

Factor Graph: Useful for visualising the inference problem and the approximating distribution $Q = \prod_\alpha Q_\alpha$

Factor nodes: Black squares. Factor potentials defining the posterior P

Variable nodes: Circles. “factors” of the approximating distribution Q

Edges: denote membership. A variable is connected to a factor if it is a variable of the local function

(45)

Source Separation

[Figure: spectrograms of the three sources (Speech), (Piano), (Guitar) and of the mixture (Mix); axes t/sec and f/Hz]

(46)

Reconstructions

[Figure: spectrograms of the reconstructed sources (Speech), (Piano), (Guitar); axes t/sec and f/Hz]

(47)

Multimodality

• Typically underdetermined (Channels < Sources) ⇒ Multimodal posterior

(48)

Annealing, Bridging, Overrelaxation, Tempering

• Define a sequence of inverse temperatures $\tau_1, \tau_2, \dots, \tau_T$ (an annealing schedule)

• Sample from/approximate

$$P_{\tau_i} \propto P^{\tau_i}$$

[Figure: two panels over $(s_1, s_2)$ showing the prior, the exact posterior and the factorized MF approximation; the second panel also marks $R_1$, $R_2$ and $R_\tau$]

(49)

Multimodality

Observations:

[Figure: observations over channel $m$ and index $k$; traces of the parameters $a$, $\lambda$ and $r$ against Epoch (0 to 2000)]

(50)

Reconstructions

[Figure: reconstructed sources $n = 1, 2, 3$ over index $k$ at Epoch = 500, 1000, 1500 and 2000, compared with the Original]

Posterior surface is multimodal, each mode corresponding to a viable separation

(51)

Mixing System, Best Solution

[Figure: scatter plot of $x_1$ against $x_2$ with the estimated mixing directions]

(52)

Mixing System, stable but suboptimal solutions

[Figure: two scatter plots of $x_1$ against $x_2$ with the estimated mixing directions]

(53)

Reconstruction Error

[Figure: histogram of the reconstruction error]

The histogram for 100 independent runs.

(54)

Training Gamma Chains and Fields (with Onur Dikmen)

• Chains and trees: simple, since $\log Z_a$ has a known expression

– By variational bound maximisation

• Fields: work in progress; the partition function $\log Z_a$ is unknown (or at least seems to be quite complicated)

• Approximate the derivative

– Contrastive Divergence (Hinton)

• Cancel out the $Z_a / Z_{a'}$ term in the acceptance probability

– Metropolis-Hastings with “artificial data” (Moller, Pettitt, Reeves, Berthelsen)

• Power EP

(55)

Further Applications in Music Processing

(with Paul Peeling)

• Score-Performance matching

• Musical Score guided source separation

• Transcription

• Chord Detection

(56)

Score-Performance matching

• Given a musical score, associate note events with the audio

[Figure: musical score aligned with the audio waveform $x_t$ against $t$]

(57)

Score-Performance matching - Graphical Model

[Figure: graphical model with chains $t_1 \dots t_K$, $r_1 \dots r_K$, $\lambda_1 \dots \lambda_K$, and $v_{\nu,1} \dots v_{\nu,K}$, $s_{\nu,1} \dots s_{\nu,K}$ in a plate $\nu = 1, \dots, W$; inset: spectral template $\log \sigma_\nu$ against Frequency $\nu$ / Hz for a given $r_k$]

$$v_{\nu,k} \sim IG\big(v_{\nu,k};\, a,\, 1/(a \lambda_k \sigma_\nu(r_k))\big)$$

(58)

Score-Performance matching - Signal model

[Figure: spectral template $\log \sigma_\nu$ against Frequency $\nu$ / Hz]

(59)

Score-Performance matching

[Figure: Spectrogram Data (Frequency / Hz against Time / s) and MIDI Data (MIDI note against Score position)]

(60)

Monophonic Transcription

[Figure: three panels against Time / s: $\log p(r_\tau \mid s_\tau)$ over MIDI note number; $\log \lambda$, computed as $\sum_i \tilde{w}^{(i)}_\tau \tilde{\lambda}^{(i)}_\tau$; MDCT of audio over Frequency / Hz (source: Daniel-Ben Pienaar)]

(61)

Tempo, Rhythm, Meter analysis (Cemgil 2000; Whiteley, Cemgil, Godsill 2006)

Example: A Performed Onset Sequence

1.18 0.59 0.29 0.34 0.44 0.34 0.39 0.6 0.63 0.3 0.28 0.3 0.35 1.19

Very accurate but too complex

Simple but a very poor description of the rhythm

Desired quantization balances accuracy and simplicity


(62)

Bar Pointer Model

[Figure: bar pointer state space with position $m_k$ (1 to 8) and velocity $n_k$ (1 to 3); rhythmic pattern functions $\mu_k$ against $m_k$ for 3/4 and 4/4 time (Triplet Rhythm, Duplet Rhythm); graphical model with chains $n_k$, $\theta_k$, $m_k$, $r_k$, $\lambda_k$ and observations $y_k$]

(63)

Filtering

[Figure: observed data $y_k$ and filtering densities $\log p(m_k \mid y_{1:k})$, $\log p(n_k \mid y_{1:k})$ (quarter notes per min.) and $p(r_k \mid y_{1:k})$ (Triplets/Duplets) against frame index $k$]

(64)

Smoothing

[Figure: observed data $y_k$ and smoothing densities $\log p(m_k \mid y_{1:K})$, $\log p(n_k \mid y_{1:K})$ (quarter notes per min.) and $p(r_k \mid y_{1:K})$ (Triplets/Duplets) against frame index $k$]

(65)

Time Domain Modeling

(66)

Sinusoidal Modeling

• Sound is primarily about oscillations and resonance

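• Cascade of second order systems

• Audio signals can often be compactly represented by sinusoids

$$\text{(real)} \quad y_n = \sum_{k=1}^{p} \alpha_k e^{-\gamma_k n} \cos(\omega_k n + \phi_k)$$

$$\text{(complex)} \quad y_n = \sum_{k=1}^{p} c_k \big(e^{-\gamma_k + j \omega_k}\big)^n$$

$$y = F(\gamma_{1:p}, \omega_{1:p})\, c$$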
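Because the signal is linear in the amplitudes $c$ once the dampings and frequencies are fixed, $c$ can be recovered by least squares. The sketch below assumes known $\gamma_{1:p}$ and $\omega_{1:p}$ and noise-free data; the function name `damped_sinusoid_basis` and the parameter values are illustrative.

```python
import numpy as np

def damped_sinusoid_basis(num_samples, gammas, omegas):
    """Matrix F with entries F[n, k] = (exp(-gamma_k + j omega_k)) ** n."""
    n = np.arange(num_samples)[:, None]
    poles = np.exp(-np.asarray(gammas) + 1j * np.asarray(omegas))
    return poles[None, :] ** n

gammas = np.array([0.01, 0.02])          # damping coefficients
omegas = np.array([0.3, 1.1])            # angular frequencies (rad/sample)
c_true = np.array([1.0 + 0.5j, -0.7 + 0.2j])

F = damped_sinusoid_basis(200, gammas, omegas)
y = F @ c_true                           # complex signal y = F c

# With gamma and omega known, the amplitudes follow from least squares.
c_hat, *_ = np.linalg.lstsq(F, y, rcond=None)
```

Estimating the dampings and frequencies themselves is the harder, nonlinear part of the problem and is not attempted here.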

(67)

State space Parametrisation

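$$x_{n+1} = \underbrace{\begin{pmatrix} e^{-\gamma_1 + j\omega_1} & & \\ & \ddots & \\ & & e^{-\gamma_p + j\omega_p} \end{pmatrix}}_{A} x_n \qquad x_0 = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix}$$

$$y_n = \underbrace{\begin{pmatrix} 1 & 1 & \dots & 1 & 1 \end{pmatrix}}_{C} x_n$$

[Figure: graphical model with state chain $x_0, x_1, \dots, x_{k-1}, x_k, \dots, x_K$ and observations $y_1, \dots, y_{k-1}, y_k, \dots, y_K$]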
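The equivalence between this state space recursion and the sum of damped complex exponentials can be checked numerically. In the sketch below the output is read out starting at $n = 0$, which is an indexing assumption, and the parameter values are arbitrary.

```python
import numpy as np

def simulate_diagonal_state_space(gammas, omegas, c, num_samples):
    """Run x_{n+1} = A x_n, y_n = C x_n with diagonal A and x_0 = c."""
    A = np.diag(np.exp(-np.asarray(gammas) + 1j * np.asarray(omegas)))
    C = np.ones(len(c))
    x = np.asarray(c, dtype=complex)
    y = np.empty(num_samples, dtype=complex)
    for n in range(num_samples):
        y[n] = C @ x          # observation: sum of the state entries
        x = A @ x             # each entry is rotated and damped
    return y

gammas = [0.01, 0.05]
omegas = [0.4, 1.3]
c = [1.0 + 0.0j, 0.5 - 0.5j]

y = simulate_diagonal_state_space(gammas, omegas, c, 50)

# Direct evaluation of y_n = sum_k c_k (exp(-gamma_k + j omega_k))^n.
n = np.arange(50)
y_direct = sum(ck * np.exp((-g + 1j * w) * n)
               for ck, g, w in zip(c, gammas, omegas))
```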

(68)

Diagonal Parametrisation for strictly real y

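$$x^{dr}_0 = \begin{pmatrix} c_1 \\ c_1^{*} \\ \vdots \\ c_p \\ c_p^{*} \end{pmatrix} \qquad x^{dr}_{n+1} = \underbrace{\begin{pmatrix} e^{j\omega_1} & & & & \\ & e^{-j\omega_1} & & & \\ & & \ddots & & \\ & & & e^{j\omega_p} & \\ & & & & e^{-j\omega_p} \end{pmatrix}}_{A_{dr}} x^{dr}_n$$

$$y_n = \underbrace{\begin{pmatrix} 1 & 1 & \dots & 1 & 1 \end{pmatrix}}_{C_{dr}} x^{dr}_n$$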

(69)

Alternative realisations – Reparametrisations

• A linear system has infinitely many realisations, each corresponding to a state space representation

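$$\bar{x}_{n+1} = \bar{A} \bar{x}_n \qquad y_n = \bar{C} \bar{x}_n$$

with

$$\bar{x}_n \equiv T x_n \qquad \bar{A} \equiv T A T^{-1} \qquad \bar{C} \equiv C T^{-1}$$

$$T x_{n+1} = T A T^{-1} T x_n \qquad y_n = C T^{-1} T x_n$$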

• There are many parametrisations, but some are more suitable for extracting relevant information or a compact representation

(70)

Transformation from Diagonal canonical form to rotation matrix

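$$\cos(\theta) = \frac{e^{j\theta} + e^{-j\theta}}{2} \qquad \sin(\theta) = \frac{e^{j\theta} - e^{-j\theta}}{2j}$$

$$T A T^{-1} = \bar{A}: \quad \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -j & j \end{pmatrix} \begin{pmatrix} e^{-\gamma + j\omega} & 0 \\ 0 & e^{-\gamma - j\omega} \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & j \\ 1 & -j \end{pmatrix} = e^{-\gamma} \begin{pmatrix} \cos(\omega) & -\sin(\omega) \\ \sin(\omega) & \cos(\omega) \end{pmatrix}$$

$$C T^{-1} = \bar{C}: \quad \begin{pmatrix} 1 & 1 \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & j \\ 1 & -j \end{pmatrix} = \sqrt{2} \begin{pmatrix} 1 & 0 \end{pmatrix}$$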
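The change of basis can be verified numerically for one conjugate pole pair. The values of $\gamma$ and $\omega$ below are arbitrary; with the normalised $T$ used here, the transformed observation row is proportional to $(1\ 0)$ with factor $\sqrt{2}$.

```python
import numpy as np

gamma, omega = 0.05, 0.7

# Diagonal (complex) form for one conjugate pole pair.
A = np.diag([np.exp(-gamma + 1j * omega), np.exp(-gamma - 1j * omega)])
C = np.array([1.0, 1.0])

# Similarity transform and its inverse.
T = np.array([[1.0, 1.0], [-1j, 1j]]) / np.sqrt(2.0)
T_inv = np.array([[1.0, 1j], [1.0, -1j]]) / np.sqrt(2.0)

A_bar = T @ A @ T_inv        # should be a damped rotation matrix
C_bar = C @ T_inv            # should select the first state component

rotation = np.exp(-gamma) * np.array([[np.cos(omega), -np.sin(omega)],
                                      [np.sin(omega), np.cos(omega)]])
```

The transformed system has real-valued dynamics, which is convenient when the observed signal is strictly real.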

(71)

State Space Parametrisation

(72)

Audio Restoration/Interpolation

• Estimate missing samples given observed ones

• Restoration, concatenative expressive speech synthesis, ...


(73)

Audio Interpolation

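$$p(x_{\neg\kappa} \mid x_\kappa) \propto \int dH\; p(x_{\neg\kappa} \mid H)\, p(x_\kappa \mid H)\, p(H) \qquad H \equiv \text{(parameters, hidden states)}$$

[Figure: graphical model in which $H$ generates $x_{\neg\kappa}$ (Missing) and $x_\kappa$ (Observed)]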

(74)

Probabilistic Phase Vocoder

(Cemgil and Godsill 2005)

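[Figure: graphical model with parameters $A_\nu$, $Q_\nu$ and state chains $s^\nu_0 \cdots s^\nu_k \cdots s^\nu_{K-1}$ in a plate $\nu = 0 \dots W - 1$, and observations $x_0 \dots x_k \dots x_{K-1}$]

$$s^\nu_k \sim \mathcal{N}(s^\nu_k;\, A_\nu s^\nu_{k-1},\, Q_\nu) \qquad A_\nu \sim \mathcal{N}\left(A_\nu;\, \begin{pmatrix} \cos(\omega_\nu) & -\sin(\omega_\nu) \\ \sin(\omega_\nu) & \cos(\omega_\nu) \end{pmatrix},\, \Psi\right)$$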

(75)

Exact Inference

1. Infer the posterior distribution

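$$p(S, \Theta \mid x_\kappa) = \frac{1}{Z_x}\, p(x_\kappa \mid S, \Theta)\, p(S \mid \Theta)\, p(\Theta) \equiv \frac{1}{Z_x}\, \phi(S, \Theta) \equiv P$$

$$S = s^{0:W-1}_{0:K-1} \qquad \Theta = (A, Q) \qquad Z_x = p(x_\kappa)$$

2. Then, compute the predictive distribution

$$p(x_{\neg\kappa} \mid x_\kappa) = \int dS\, d\Theta\; p(x_{\neg\kappa} \mid S, \Theta)\, p(S, \Theta \mid x_\kappa)$$

Exact evaluation of P is intractable.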

(76)

Approximating distribution Q

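$$Q \equiv \prod_{\alpha \in C} q(s^\alpha_{0:K-1})\, q(\Theta_\alpha) \equiv \prod_{\alpha \in C} Q_\alpha(S_\alpha)\, Q_\alpha(\Theta_\alpha)$$

where $C = \{\nu_1, \dots, \nu_N\}$ is a set of disjoint clusters of frequency bands such that $\nu_i \cap \nu_j = \emptyset$ for $i \neq j$ and $\bigcup_i \nu_i = \{0, \dots, W - 1\}$.

$$q(s^\alpha_{0:K-1}) \equiv q(s^\alpha_0) \prod_{k=1}^{K-1} q(s^\alpha_k \mid s^\alpha_{k-1}) \qquad q(\Theta_\alpha) \equiv q(A_\alpha)\, q(Q_\alpha)$$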

(77)

Inference: Structured Variational Bayes

[Figure: factor graph of the approximation for a cluster $\alpha \in C$, with factors $q(A_\alpha)$, $q(Q_\alpha)$, $\prod_k q(s^\alpha_k \mid s^\alpha_{k-1})$ over the chain $\cdots s^\alpha_{k-1}, s^\alpha_k, s^\alpha_{k+1} \cdots$, and $q(x_k)$]

• Intuitive algorithm:

– Subtract from the observed signal x the prediction of the frequency bands in ¬α.

– Compute a fit for α to this residual and iterate.

• For fixed A, Q, this is equivalent to Gauss-Seidel, an iterative method for solving linear systems of equations
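The residual-fitting idea can be illustrated with a simplified, deterministic analogue: a signal is modelled as a sum of sinusoidal bands, and each band is refitted by least squares to the residual left by the others. This is a block Gauss-Seidel sweep on a least-squares problem, not the probabilistic phase vocoder itself; the band frequencies and the names `backfit` and `band` are hypothetical.

```python
import numpy as np

def backfit(x, bases, num_sweeps=20):
    """Fit x as a sum of components, one block of coefficients at a time."""
    coeffs = [np.zeros(B.shape[1]) for B in bases]
    preds = [np.zeros_like(x) for _ in bases]
    errors = []
    for _ in range(num_sweeps):
        for i, B in enumerate(bases):
            # Subtract the prediction of all other bands from the signal.
            residual = x - (sum(preds) - preds[i])
            # Fit band i to this residual.
            coeffs[i], *_ = np.linalg.lstsq(B, residual, rcond=None)
            preds[i] = B @ coeffs[i]
        errors.append(float(np.sum((x - sum(preds)) ** 2)))
    return coeffs, errors

n = np.arange(400)

def band(omega):
    """Cosine and sine columns at angular frequency omega."""
    return np.column_stack([np.cos(omega * n), np.sin(omega * n)])

bases = [band(0.30), band(0.35), band(1.20)]
true_coeffs = [np.array([1.0, 0.5]),
               np.array([-0.8, 0.3]),
               np.array([0.2, -1.0])]
x = sum(B @ c for B, c in zip(bases, true_coeffs))

coeffs, errors = backfit(x, bases)
```

Each block update minimises the squared error exactly with the other blocks held fixed, so the error cannot increase from one sweep to the next.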

(78)

Restoration

• Piano

– Signal with missing samples (37%)

– Reconstruction, 7.68 dB improvement

– Original

• Trumpet

– Signal with missing samples (37%)

– Reconstruction, 7.10 dB improvement

– Original

(79)

Hierarchical Factorial Models

• Each component models a latent process

• The observations are projections

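[Figure: graphical model with chains $r^\nu_0 \cdots r^\nu_k \cdots r^\nu_K$ and $\theta^\nu_0 \cdots \theta^\nu_k \cdots \theta^\nu_K$ in a plate $\nu = 1 \dots W$, and observations $y_k, \dots, y_K$]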

• Generalises Source-filter models

(80)

Harmonic model with changepoints

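$$r_k \mid r_{k-1} \sim p(r_k \mid r_{k-1}) \qquad r_k \in \{0, 1\}$$

$$\theta_k \mid \theta_{k-1}, r_k \sim \underbrace{[r_k = 0]\, \mathcal{N}(A \theta_{k-1}, Q)}_{\text{reg}} + \underbrace{[r_k = 1]\, \mathcal{N}(0, S)}_{\text{new}}$$

$$y_k \mid \theta_k \sim \mathcal{N}(C \theta_k, R)$$

$$A = \begin{pmatrix} G_\omega & & & \\ & G_\omega^2 & & \\ & & \ddots & \\ & & & G_\omega^H \end{pmatrix}^{N} \qquad G_\omega = \rho_k \begin{pmatrix} \cos(\omega) & -\sin(\omega) \\ \sin(\omega) & \cos(\omega) \end{pmatrix}$$

damping factor $0 < \rho_k < 1$, frame length $N$ and damped sinusoidal basis matrix $C$ of size $N \times 2H$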

(81)

Harmonic model with changepoints

[Figure: changepoint indicators $r_k$, harmonic frequencies and the signal $x_k$ against $k$]

• Each changepoint denotes the onset of a new audio event

(82)

Inference

• Filtering

$$p(r_k, \theta_k \mid y_{1:k}) \propto \sum_{r_{1:k-1}} \int d\theta_{1:k-1}\; p(y_{1:k} \mid \theta_{1:k})\, p(\theta_{1:k} \mid r_{1:k})\, p(r_{1:k})$$

• Smoothing, Marginal MAP – MMAP

$$r^{*}_{1:K} = \arg\max_{r_{1:K}} \int d\theta_{1:K}\; p(y_{1:K} \mid \theta_{1:K})\, p(\theta_{1:K} \mid r_{1:K})\, p(r_{1:K})$$

– Each configuration of $r_{1:K}$ encodes one of the $2^K$ possible models, i.e., segmentations.

– MMAP is equivalent to Bayesian Model Selection

• All problems are similar, but MMAP is usually harder because $\max$ and $\int$ do not commute

(83)

Exact Inference in switching state space models is intractable

• In general, exact inference is NP hard

– Conditional Gaussians are not closed under marginalization

⇒ Unlike HMMs or KFMs, summing over $r_k$ does not simplify the filtering density

⇒ Number of Gaussian kernels to represent exact filtering density $p(r_k, \theta_k \mid y_{1:k})$ increases exponentially