Generative models for audio and music processing
A. Taylan Cemgil
Signal Processing and Communications Lab.
26 June 2007
Microsoft Research, Cambridge
Motivation
• Audio: Core information processing modality. Central to the scientific understanding of human hearing abilities.
• Broad spectrum of applications
– Speech processing, Sound localisation, – Hearing aids, Auditory Scene Analysis,
– Music information retrieval, Music performance systems
• Realistic generative modelling appears to be more feasible than, for example, video.
⇒ Combined with modern approaches to inference in dynamical systems,
we may make significant progress in this area.
Traditional Analysis
• Digital filtering theory, Transform methods/Fourier techniques
• Deterministic model based, System identification
• Procedural – no clear distinction between “what” and “how”
Statistical Approaches
• Probabilistic
• Hierarchical signal models to incorporate prior knowledge/inspiration from various sources:
– Physics
(acoustics, physical models, ...)– Studies of human cognition and perception
(masking, psychoacoustics, ...)– Musicology
(musical constructs, harmony, tempo, form ...)– Neuroscience
(efficient auditory representations, ...)• Consistent framework for developing inference algorithms
• Practical obstacles such as memory requirement, computation time
Generative Models for audition
• Computer audition ⇔ inverse synthesis
p(Structure |Observations) ∝ p(Observations|Structure)p(Structure) – restoration, interpolation
– source separation, – transcription,
– localisation, – identification,
– coding, compression
– resynthesis, cross synthesis
– ...
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
0 50 100 150 200 250 300 350 400 450 500
0
Source Separation
sk,1 . . . sk,n . . . sk,N
xk,1 . . . xk,M
k = 1 . . . K
a1 r1 . . . aM rM
• Joint estimation Sources, Channel noise and mixing system xk,1:M ∼ N (xk,1:M; Ask,1:N, R)
Polyphonic Music Transcription
• from sound ...
t/sec
f/Hz
0 1 2 3 4 5 6 7 8
0 1000 2000 3000 4000 5000
0 10 20
(S)
• ... to score
Polyphonic Music Transcription
• Extracting a human readable representation from music data
– Auditory scene analysis task with a lot of structure (both natural and man-made) – Reminiscent to the “cocktail party” problem in speech recognition or “object
recognition” in computer vision
– Varies in modelling effort and computational complexity
∗ duets, simple piano music (easier)
∗ symphonies, contemporary music (hard)
• Applications
– Music Education, Interactive Music Performance
– Musicology, Music Perception and Cognition Research – Music Information Retrieval
content-based querying and retrieval, music recommendation and playlist generation, automatic classification, music summarisation, annotation
Modelling and Computational issues
• Hierarchical – Signal level
pitch, onsets, timbre – Symbolic level
melody, motives, harmony, chords, tonality, rhythm, beat, tempo, articulation, instrumentation, voice ...
– Cognitive level
expression, genre, form, style, mood, emotion
• Uncertainty
– Parameter Learning
Which pitch, rhythm, tempo, meter, time signature ... ? – Model Selection
How many notes, harmonics, onsets, sections ... ?
Generative Models for Music
Generative Models for Music
Score Expression
Piano-Roll
Signal
Hierarchical Modeling of Music
M
1
2
:::
t
v1 v2
:::
vt
k
1
k
2
::: k
t
h
1
h
2
::: h
t
1 2 ::: t
m1 m2 ::: mt
g
j;1
g
j;2
::: g
j;t
rj;1 rj;2
:::
rj;t
n
j;1
n
j;2
::: n
j;t
x
j;1
x
j;2
::: x
j;t
yj;1 yj;2
:::
yj;t
y1 y2
:::
yt
Research Questions
• What kinds of prior knowledge and modeling techniques are useful?
• How can we speed up inference to make them competitive with more
traditional approaches?
Outline
• Audio Signal, Time frequency representations, Spectrogram
• Transform Domain
– Gamma Chains and Fields
∗ Denoising
∗ Source Separation
∗ Score Following, transcription,
∗ Tempo, rhythm, meter estimation
Outline
• Time Domain/State Space
– Kalman Filter: Probabilistic Phase Vocoder
∗ Interpolation/Restoration
– Switching State space models, Changepoint models
∗ Pitch tracking, Onset/offset detection – Factorial Changepoint Model
∗ Bayesian model selection
∗ Polyphonic pitch tracking
• Conclusions and Final Remarks
Colaborators
• Nick Whiteley, Cambridge
• Paul Peeling, Cambridge
• Onur Dikmen, Bo˘gazi¸ci, Istanbul
• Simon Godsill, Cambridge
• Cedric Fevotte, ENST, Paris Telecom
• David Barber, UCL
• Bert Kappen, Nijmegen, The Netherlands
Audio Signal
x t
t (Speech)
t
x t
(Piano)
x = x1 . . . xt . . .
Signal Models for Audio
• “Time Domain”
– State space modeling,
– Conditional Linear Dynamical Systems, Gaussian processes (e.g.
AR, ARMA)
– Flexible, Physically realistic,
– Analysis down to sample precision (if required) – Computationally quite heavy
• “Frequency Domain”
– Models on (orthogonal) transform coefficients, Energy compaction – Practical, can make use of fast transforms (FFT, MDCT, ...)
– Inherent limitations (analysis windows, frequency resolution)
Spectrogram
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
5 10 15 20 25 30
(Speech)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
0 10 20 30
(Piano)
• A linear expansion using a collection of basis functions φk(t) centered around time-frequency atom k = k(τ, ω) at time τ and frequency ω
x(t) = X
k
skφk(t)
• Popular transforms for audio: STFT, MDCT
• Spectrogram displays 2 log |sk| or |sk|2 (of STFT)
Models for time-frequency Energy distributions
• Non-Negative Matrix factorisation
(Sha, Saul, Lee 2002, Smaragdis, Brown 2003, Abdallah, Plumbley 2004, Virtanen 2005 ...)X
ν,τ= W
ν,jS
j,τSpectrogram = Spectral Templates × Excitations
= ×
– ... however, spectrograms are not additive (a
2+ b
26= (a + b)
2)
Models for time-frequency Energy distributions
• Mask models
(Roweis 2001, Reyes-Gomez, Jojic, Ellis 2005 ...)X
ν,τ= [r
ν,τ= 0]S
ν,τ(0)+ [r
ν,τ= 1]S
ν,τ(1)Spectrogram = Mask × Source
0+ (1 − Mask) × Source
1= + +
– ... however, sources do overlap in time and frequency
Prior structures on time-frequency Energy distributions
• Main Idea: Spectrogram is a point estimate of the energy at a time-frequency atom k(ν, τ ). We place a suitable prior on the variance of transform coefficients s
k– The inverse Gamma distribution is a natural candidate as a prior
• There is significant structure on a spectrogram
– Need to introduce correlations among variances of harmonically and temporally related time-frequency atoms
p(s |v)p(v) = Y
k
p(s
k|v
k)
!
p(v)
– Gabor regression (Wolfe and Godsill 2005): v discrete
Example, one channel source separation
vk,1 . . . vk,N
sk,1 . . . sk,N
xk
k = 1 : K
sk,n|vk,n ∼ N (sk,n; 0; vk,n)
xk|sk,1:N = PN
n=1 sk,n
• Straightforward application of Bayes’ theorem yields
p(s
k,n|v
k,1:N, x
k) = N (s
k,n; κ
k,nx
k, v
k,n(1 − κ
k,n)) κ
k,n= v
k,n/ X
n′
v
k,n′ (Responsibilities)Example, one channel source separation
p(s
n|v
1:N, x) = N (s
n; κ
nx, v
n(1 − κ
n)) κ
n= v
n/ X
n′
v
n′0
or h1/vni−1 /
X
n′
h1/vn′i−1
1
A
• Each source coefficient sn gets a fraction κn of the observation x
• Model additive in variances Z
dsp(x|s1:N)p(s1:N|v1:N)p(v1:N) = N (x; 0,
XN j=n
vn)p(v1:N)
• Unconditional marginal p(x) is heavy tailed (e.g. Student-t)
Gamma Distribution
• Gamma Distribution with shape a and scale z
G(λ; a, z) ≡ exp((a − 1) log λ − z
−1λ + a log z
−1− log Γ(a))
– Conjugate prior for Gaussian precision, Poisson intensity and Inverse Gamma scale
– Sufficient statistics hλiG = az and hlog λiIG = Ψ(a) − log z−1
• Inverse Gamma Distribution
IG(v; a, z) ≡ exp((a + 1) log v
−1− z
−1v
−1+ a log z
−1− log Γ(a))
– Conjugate prior for Gaussian variance and Gamma scale – Sufficient statistics
v−1
IG = az and
log v−1
IG = Ψ(a) − log z−1
• λ ∼ G(λ; a, z) ⇔ λ
−1= v ∼ IG(v; a, z)
Inverse Gamma Distribution
0 1 2 3 4 5
0 0.2 0.4 0.6 0.8 1 1.2 1.4
a=1 b=1
a=1 b=0.5 a=2 b=1
Gamma Chains
We define an Inverse Gamma-Markov chain for k = 1 . . . K as follows
v
k|z
k∼ IG(v
k; a, z
k/a)
z
k+1|v
k∼ IG(z
k+1; a
z, v
k/a
z)
z1 · · · vk−1 az zk a vk az zk+1 · · ·
p(z, v; a) ∝ ψ(b−1z , azz1−1)Y
k
φ(vk−1; a + az)φ(zk−1; a + az)ψ(azk−1, vk−1)ψ(azvk−1, zk+1−1 )
φ(ξ; α) = exp((α + 1) log ξ) ψ(ξ, η) = exp(−ξη)
(Singletons) (Pairwise)
The need for auxillary variables z
• We want conjugacy (full conditional p(v
k|v
k−1, v
k+1) is IG) and positive correlation beetween v
kand v
k−1• Conjugate, but no positive correlation
v
k|v
k−1∼ IG(v
k; a, v
k−1/a)
• Positive correlation but not conjugate
v
k|v
k−1∼ IG(v
k; a, (v
k−1a)
−1)
IG − IG Distribution
(Bernardo-Smith)• Scale Mixture of Inverse Gamma p(vk|vk−1) =
Z
dzkIG(vk; a, zk/a)IG(zk; az, vk−1/az)
= Γ(a + az) Γ(az)Γ(a)
(azvk−1−1 )az(avk−1)a
(azvk−1−1 + avk−1)(az+a)vk−1
a = 0.1 a
z = 0.1
log10 v
k
log 10 v k−1
−5 0 5
−5 0 5
a = 2 a
z = 2
log10 v
k
log 10 v k−1
−5 0 5
−5 0 5
a = 0.1 a
z = 10
log10 v
k
log 10 v k−1
−5 0 5
−5 0 5
a = 3 a
z = 0.3
log10 v
k
log 10 v k−1
−5 0 5
−5 0 5
• Small a and az ⇒ weak coupling.
• az/a < 1 Systematic drift towards small variances ⇒ Potentially useful for damping effects in audio
Example: Nonstationary Gaussian Process
• Time varying variance – Stochastic Volatility Model
v1:K ∼ Inverse Gamma Chain(v1:K; a, az) yk ∼ N (yk; 0, vk)
0 200 400 600 800 1000
0 5 10 15 20
v k
0 200 400 600 800 1000
−10
−5 0 5 10
y k
k
True VB
Example: Nonhomogeneous Poisson Process
• Time varying intensity (L : length of a small interval where the intensity is constant)
λ1:K ∼ Gamma Chain(λ1:K; a, az)
ck|λk ∼ PO(ck; λkL) ∝ exp (ck log λk − Lλk)
0 50 100 150
λ k
0 0.2 0.4 0.6 0.8 1
0 0.2 0.4 0.6 0.8 1
c k
Arrival time
True VB
Inference : Structured Mean Field, Variational Bayes
(MacKay 1995, Attias 1999, Wiegerinck 2000, Ghahramani and Beal 2000, Winn and Bishop 2005)
Approximate P with a tractable distribution Q = Q
α Qα KL(Q||P) = hlog QiQ −
log 1
Zxφ(S)
Q
hf(x)ip(x) ≡ R
dxp(x)f (x). Using KL ≥ 0, we obtain a lower bound on the evidence log Zx ≥ hlogφ(S)iQ − hlog QiQ
This leads to a set of fixed point equations
Qα ∝ exp hlog φ(S)iQ¬α
Depending upon the initial starting point Q(0), the iteration converges to a local maximum of the lower bound.
Gibbs Sampler (with grouping)
• MCMC: Construct a markov chain with stationary distribution as the desired posterior P
• Gibbs sampler: We visit each block α ∈ C and sample from full conditionals Cα ∼ p(Cα|C¬α)
• For a graphical model, full conditionals are functions of the variables of the Markov Blanket
• Algorithmically very similar to VB
Inverse Gamma Markov Random Field
• A conditional Markov Random Field
• IGMRF on ξ = {ξi}i∈V, Given
– an Undirected graph with vertex set V and undirected edge set E – A set of connection weights a = {ai,j}(i,j)∈E for i, j ∈ V and i 6= j
p(ξ; a) = 1 Za
Y
i∈V
φ(ξi−1;X
j
ai,j) Y
(i,j)∈E
ψ(ξi−1, (ai,j/2)ξj−1)
φ(ξ; α) = exp((α + 1) log ξ) (Singleton)
ψ(ξ, η) = exp(−ξη) (Pairwise)
• Gamma field φ(ξi;P
j ai,j), ψ(ξi, (ai,j/2)ξj) but with φ(ξ; α) = exp((α− 1) log ξ).
Possible Model Topologies
Harmonic-Transient Decomposition
• Horizontal : Tie across time : harmonic continuity
• Vertical : Tie across frequency : transients, pulse like sounds
Harmonic-Transient Decomposition
Time (τ)
Frequency Bin (ν)
Xorg S
hor S
ver
(Original) (Hor) (Vert)
Denoising - Piano
X
20 40 60 80 100 120 50
100 150 200 250 300 350 400 450 500
−16
−14
−12
−10
−8
−6
−4
−2 0
Xorg
20 40 60 80 100 120 50
100 150 200 250 300 350 400 450 500
−16
−14
−12
−10
−8
−6
−4
−2 0
(Noisy) (Original)
(Hor) (Ver) (Band) (Grid)
Denoising - Speech
X
20 40 60 80 100 120 50
100 150 200 250 300 350 400 450 500
−14
−12
−10
−8
−6
−4
−2 0
Xorg
20 40 60 80 100 120 50
100 150 200 250 300 350 400 450 500
−14
−12
−10
−8
−6
−4
−2 0
(Noisy) (Original)
(Hor) (Ver) (Band) (Grid)
Source Separation
sk,1 . . . sk,n . . . sk,N
xk,1 . . . xk,M
k = 1 . . . K
a1 r1 . . . aM rM
• Joint estimation Sources, Channel noise and mixing system xk,1:M ∼ N (xk,1:M; Ask,1:N, R)
Multichannel Source Separation
• Hierarchical Prior Model (Fevotte and Godsill 2005, Cemgil et. al. 2006)
λ1 . . . λn . . . λN ∼ G(λn; aλ, bλ)
vk,1 . . . vk,n · · · vk,N ∼ IG(vk,n; ν/2, 2/(νλn))
sk,1 . . . sk,n . . . sk,N ∼ N (sk,n; 0, vk,n)
xk,1 . . . xk,M
k = 1 . . . K
∼ N (xk,m; a⊤msk,1:N, rm)
a1 r1 . . . aM
∼ N (am;· · · ) rM
∼ IG(rm;· · · )
Equivalent Gamma MRF
• A tree for each source
• λn can be interpreted as the over energy of source n
Factor Graph
(Kschischang et. al. 2001, Frey and Jojic 2005)• Many approximation algorithms (LBP, VMP, EP, EC, Gibbs Sampling...) can be formulated as message passing schemata
λ1 . . . λn . . . λN
. . . . . .
vk,1 . . . vk,n · · · vk,N
. . . · · ·
sk,1:N
. . .
k = 1 . . . K
a1 r1 . . . aM rM
Factor Graph: Useful for visualising the inference problem and the approximating distribution Q = Qα Qα
Factor nodes: Black squares. Factor potentials defining the posterior P
Variable nodes: Circles. “factors” of the approximating distribution Q
Edges: denote membership. A variable is connected to a factor if it is a variable of the local function
Source Separation
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
5 10 15 20 25 30
(Speech)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
0 10 20 30
(Piano)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
0 5 10 15 20 25
(Guitar)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
10 15 20 25
(Mix)
Reconstructions
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
10 15 20 25 30
(Speech)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
0 10 20 30
(Piano)
t/sec
f/Hz
0 200 400 600 800 1000 1200
0 2000 4000 6000 8000 10000
5 10 15 20 25
(Guitar)
Multimodality
• Typically underdetermined (Channels < Sources) ⇒ Multimodal posterior
Annealing, Bridging, Overrelaxation, Tempering
• Define a sequence of inverse temparatures τ1, τ2, . . . , τT (an annealing schedule)
• Sample from/approximate
Pτi ∝ Pτi
s1
s 2
prior
exact posterior
factorized MF
s1
s 2
prior
exact posterior
factorized MF R1
R2
Rτ
Multimodality
Observations:
m
k
20 40 60 80 100 120 140 160 180 200
1
2 −1000
−500 0 500
0 500 1000 1500 2000
−0.8024 0.8295 2.0375
a
0 500 1000 1500 2000
7.2408 25.1398 36.2295
λ
0 500 1000 1500 2000
0.5451 1.8648
r
Epoch
Reconstructions
n
Epoch = 500 1
2 3
n
Epoch = 1000 1
2 3
n
Epoch = 1500 1
2 3
n
Epoch = 2000 1
2 3
n
k Original
20 40 60 80 100 120 140 160 180 200
1 2 3
Posterior surface is multimodal, each mode corresponding to a viable separation
Mixing System, Best Solution
−20 −15 −10 −5 0 5 10 15 20
−20
−15
−10
−5 0 5 10 15 20
x1
x 2
Mixing System, stable but suboptimal solutions
−20 −15 −10 −5 0 5 10 15 20
−20
−15
−10
−5 0 5 10 15 20
x1 x2
−20 −15 −10 −5 0 5 10 15 20
−20
−15
−10
−5 0 5 10 15 20
x1 x2
Reconstruction Error
60 80 100 120 140 160
0 10 20 30 40 50
Reconstruction Error
The histogram for 100 independent runs.
Training Gamma Chains and Fields (with Onur Dikmen)
• Chains and trees: simple since log Za has a known expression – By variational Bound maximisation
• Fields:– work in progress – partition function log Za is unknown (? – at least seems to be quite complicated)
• Approximate the derivative
– Contrastive Divergence (Hinton)
• Cancel out Za/Za′ term in acceptance probability
– Metropolis-Hastings with “artificial data” (Moller, Pettitt, Reeves, Berthelsen)
• Power EP
Further Applications in Music Processing
(with Paul Peeling)• Score-Performance matching
• Musical Score guided source separation
• Transcription
• Chord Detection
Score-Performance matching
• Given a musical score, associate note events with the audio
4
t
x t
Score-Performance matching - Graphical Model
ν = 1, . . . , W
t1 t2 . . . tK
r1 r2 . . . rK
λ1 λ2 . . . λK
vν,1 vν,2 . . . vν,K
sν,1 sν,2 . . . sν,K
6 7 8 1 2 3 4 5
rk
0 500 1000 1500 2000 2500 3000 3500 4000
−12
−10
−8
−6
−4
−2 0
Frequency ν / Hz log σν
v ∼ IG(v ; a, 1/(aλσ (r )))
Score-Performance matching - Signal model
0 500 1000 1500 2000 2500 3000 3500 4000
−12
−10
−8
−6
−4
−2 0
Frequency ν / Hz log σν
Score-Performance matching
Spectrogram Data
Time / s
Frequency / Hz
0 2 4 6 8 10 12 14
0 1000 2000 3000 4000
50 100 150 200 250 300 350 400 450
55 60 65 70 75 80 85
MIDI Data
Score position
MIDI note
Monophonic Transcription
log p(rτ|sτ)
MIDInotenumber
Time / s
1 2 3 4
60 65 70 75 80
1 2 3 4
−10
−5 0 5 10
P
iw˜τ(i)λ˜(i)τ
Time / s
logλ
MDCT of audio (source: Daniel-Ben Pienaar)
Time / s
Frequency/Hz
1 2 3 4
0 500 1000 1500 2000 2500 3000 3500 4000
Tempo, Rhythm, Meter analysis (
Cemgil 2000; Whiteley, Cemgil, Godsill 2006) Example: A Performed Onset Sequence1.18 0.59 0.29 0.34 0.44 0.34 0.39 0.6 0.63 0.3 0.28 0.3 0.35 1.19
Very accurate but too complex
Simple but a very poor description of the rhythm
Desired quantization balances accuracy and simplicity
3
Bar Pointer Model
| | |
3 • • • • • • • •
nk 2 • • • • • • • •
1 • • • • • • • •
1 2 3 4 5 6 7 8
mk
3/4 time 4/4 time
0 100 200 300 400 500 600 700 800 900 1000
0 2 4
mk
µ k
Triplet Rhythm
0 100 200 300 400 500 600 700 800 900 1000
0 2 4
mk
µ k
Duplet Rhythm
n0 n1 n2 n3
θ0 θ1 θ2 θ3
m0 m1 m2 m3
r0 r1 r2 r3
λ1 λ2 λ3
y1 y2 y3
Filtering
0 50 100 150 200 250 300 350 400 450
0 1 2
yk
Observed Data
m k
log p(mk|y1:k)
50 100 150 200 250 300 350 400 450
800 600 400 200
−10
−5 0
Quarter notes per min.
log p(n k|y
1:k)
50 100 150 200 250 300 350 400 450
180 120
60 −4
−2 0
p(rk|y 1:k)
Frame Index, k
50 100 150 200 250 300 350 400 450
Triplets
Duplets
0 0.2 0.4 0.6 0.8
Smoothing
0 50 100 150 200 250 300 350 400 450
0 1 2
yk
Observed Data
m k
log p(m k|y
1:K)
50 100 150 200 250 300 350 400 450
800 600 400
200 −10
−5 0
Quarter notes per min.
log p(n k|y
1:K)
50 100 150 200 250 300 350 400 450
180 120 60
−10
−5 0
p(rk|y 1:K)
Frame Index, k
50 100 150 200 250 300 350 400 450
Triplets
Duplets 0.2
0.4 0.6 0.8
Time Domain Modeling
Sinusoidal Modeling
• Sound is primarily about oscillations and resonance
• Cascade of second order sytems
• Audio signals can often be compactly represented by sinusoidals
(real) yn =
Xp k=1
αke−γkn cos(ωkn + φk)
(complex) yn =
Xp k=1
ck(e−γk+jωk)n
y = F (γ1:p, ω1:p)c
State space Parametrisation
xn+1 =
e−γ1+jω1
. ..
e−γp+jωp
| {z }
A
xn x0 =
c1 c2 ...
cp
yn = 1 1 . . . 1 1
| {z }
C
xn
x0 x1 . . . xk−1 xk . . . xK
y1 . . . yk−1 yk . . . yK
Diagonal Parametrisation for strictly real y
xdr0 =
0
B
B
B
B
B
c1 c∗1 ...
c∗p cp
1
C
C
C
C
C
A
xdrn+1 =
ejω1
e−jω1
. ..
ejωp
e−jωp
| {z }
A
drxdrn
yn = 1 1 . . . 1 1
| {z }
C
drxdrn
Alternative realisations – Reparametrisations
• A linear system has infinitely many realisations, each corresponding to a state space representation
¯
xn+1 = A¯¯xn yn = C ¯¯xn with
¯
xn ≡ T xn A¯ ≡ T AT−1 C¯ ≡ CT−1
T xn+1 = T AT−1T xn yn = CT−1T xn
• There are many parametrisations, but some are more suitable for extracting relevant information or a compact representation
Transformation from Diagonal canonical form to rotation matrix
cos(θ) = ejθ + e−jθ
2 sin(θ) = ejθ − e−jθ 2j
T AT−1 = A¯
√1 2
1 1
−j j
e−γ+jω 0
0 e−γ−jω
1√ 2
1 j 1 −j
= e−γ
cos(ω) − sin(ω) sin(ω) cos(ω)
CT−1 = C¯ 1 1 1
√2
1 j 1 −j
= 1 0
State Space Parametrisation
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
0 50 100 150 200 250 300 350 400 450 500
0
Audio Interpolation
p(x¬κ|xκ) ∝ Z
dHp(x¬κ|H)p(xκ|H)p(H) H ≡ (parameters, hidden states)
H
x¬κ xκ
Missing Observed
0
Probabilistic Phase Vocoder
(Cemgil and Godsill 2005)Aν Qν
sν0 · · · sνk · · · sνK−1
ν = 0 . . . W − 1
x0 xk xK−1
sνk ∼ N (sνk; Aνsνk−1, Qν) Aν ∼ N
Aν;
cos(ων) − sin(ων) sin(ων) cos(ων)
, Ψ
Exact Inference
1. Infer the posterior distribution
p(S, Θ|xκ) = 1
Zxp(xκ|S, Θ)p(S|Θ)p(Θ)
≡ 1
Zxφ(S, Θ) ≡ P
S = s0:W −10:K−1 Θ = (A, Q) Zx = p(xκ)
2. Then, compute the predictive distribution p(x¬κ|xκ) =
Z
dSdΘp(x¬κ|S, Θ)p(S, Θ|xκ) Exact evaluation of P is intractable.
Approximating distribution Q
Q ≡ Y
α∈C
q(sα0:K−1)q(Θα) ≡ Y
α∈C
Qα(Sα)Qα(Θα)
where C = {ν1, . . . , νN} is a set of disjoint clusters of frequency bands such that νi ∩ νj = ∅ for i 6= j and S
i νi = {0, . . . , W − 1}.
q(sα0:K−1) ≡ q(sα0) YK k=1
q(sαk|sαk−1) q(Θα) ≡ q(Aα)q(Qα)
Inference: Structured Variational Bayes
Aα q(Aα) Qα q(Qα)
· · · sαk−1 sαk sαk+1 · · ·
α∈ C
Q
kq(sαk|sαk−1)
xk q(xk)
• Intuitive algorithm:
– Substract from the observed signal x the prediction of the frequency bands in ¬α.
– Compute a fit for α to this residual and iterate.
• For fixed A, Q, this is equivalent to Gauss-Seidel, an iterative method for solving linear systems of
Restoration
• Piano
– Signal with missing samples (37%) – Reconstruction, 7.68 dB improvement – Original
• Trumpet
– Signal with missing samples (37%)
– Reconstruction, 7.10 dB improvement
– Original
Hierarchical Factorial Models
• Each component models a latent process
• The observations are projections
rν0 · · · rνk · · · rνK
θν0 · · · θνk · · · θνK
ν = 1 . . . W
yk yK
• Generalises Source-filter models
Harmonic model with changepoints
rk|rk−1 ∼ p(rk|rk−1) rk ∈ {0, 1}
θk|θk−1, rk ∼ [rk = 0]N (Aθk−1, Q)
| {z }
reg
+ [rk = 1]N (0, S)
| {z }
new
yk|θk ∼ N (Cθk, R)
A =
Gω
G2ω
. ..
GHω
N
Gω = ρk
cos(ω) − sin(ω) sin(ω) cos(ω)
damping factor 0 < ρk < 1, framelength N and damped sinusoidal basis matrix C of size N × 2H
Harmonic model with changepoints
r kfrequency
k k
x k
(S)
• Each changepoint denotes the onset of a new audio event
Inference
• Filtering
p(rk, θk|y1:k) ∝ X
r1:k−1
Z
dθ1:k−1p(y1:k|θ1:k)p(θ1:k|r1:k)p(r1:k)
• Smoothing, Marginal MAP – MMAP r∗1:K = arg max
r1:K
Z
θ1:Kp(y1:K|θ1:K)p(θ1:K|r1:K)p(r1:K)
– Each configuration of r1:K encodes one of the possible 2K possible models, i.e., segmentation.
– MMAP is equivalent to Bayesian Model Selection
• All problems are similar, but MMAP is usually harder because max and R
do not commute
Exact Inference in switching state space models is intractable
• In general, exact inference is NP hard
– Conditional Gaussians are not closed under marginalization
⇒ Unlike HMM’s or KFM’s, summing over rk does not simplify the filtering density
⇒ Number of Gaussian kernels to represent exact filtering density p(rk, θk|y1:k) increases exponentially
−7.9036 6.6343
0.76292
−10.3422
−10.1982
−2.393
−2.7957
−0.4593