Advances in Models for Acoustic Processing
David Barber and Taylan Cemgil
Signal Processing and Communications Lab.
9 Dec 2006
Outline
• Acoustic Modeling and applications
• Parameter estimation and Inference
– Subspace methods, Variational, Monte Carlo
• Issues
Acoustic Modeling
[Figure: spectrograms, time (sec) vs. frequency (Hz), of a speech excerpt and a piano excerpt]
Probabilistic Models
• Once a realistic model is constructed, many related tasks can be cast as posterior inference problems

p(Structure|Observations) ∝ p(Observations|Structure) p(Structure)

– analysis
– localisation
– restoration
– transcription
– source separation
– identification
– coding
– resynthesis, cross synthesis
Source Separation
[Figure: graphical model — sources s_{1,t} … s_{n,t}, observations x_{1,t} … x_{m,t} for t = 1…T, with channel/mixing parameters a_1, r_1 … a_m, r_m]
• Joint estimation of the sources, the channel noise and the mixing system
• Typically underdetermined (channels < sources) ⇒ multimodal posterior
Source Separation

[Figure: spectrogram, time (sec) vs. frequency (Hz), of the mixture Speech + Piano + Guitar]
Audio Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
Audio Interpolation
p(x_¬κ | x_κ) ∝ ∫ dH p(x_¬κ | H) p(x_κ | H) p(H),   H ≡ (parameters, hidden states)

[Figure: graphical model — H generating the missing samples x_¬κ and the observed samples x_κ; waveform with the missing section marked]
Application: Analysis of Polyphonic Audio
[Figure: latent indicators over frequency index ν vs. frame index k, with the observed signal x_k below]
• Each latent process ν = 1 . . . W corresponds to a “voice”. Indicators r1:W,1:K encode a latent “piano roll”
Tempo, Rhythm, Meter analysis
[Figure: observed data y_k; posterior log p(m_k | y_{1:K}) over score position m_k; posterior log p(n_k | y_{1:K}) over tempo n_k (quarter notes per min., 60–180); posterior p(r_k | y_{1:K}) over rhythmic subdivision (duplets vs. triplets); all plotted against frame index k]
Hierarchical Modeling
[Figure: hierarchical generative model — score and expression (music notation), the corresponding piano roll θ, and the resulting audio signal; the middle panel compares true and estimated ω_t over time]
Hierarchical Modeling
[Figure: dynamic Bayesian network for the hierarchical model — a global variable M; coupled chains v_t, k_t, h_t, m_t over time slices t; per-voice chains g_{j,t}, r_{j,t}, n_{j,t}, x_{j,t}, y_{j,t} for each voice j; and the observed mixture y_1, y_2, …, y_t]
Time Series Modeling
• Sound is primarily about oscillations and resonance
• Cascade of second order systems
• Audio signals can often be compactly represented by sinusoids

(real)    y_n = Σ_{k=1}^{p} α_k e^{−γ_k n} cos(ω_k n + φ_k)

(complex) y_n = Σ_{k=1}^{p} c_k (e^{−γ_k + jω_k})^n

y = F(γ_{1:p}, ω_{1:p}) c
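The real and complex forms above are equivalent: with c_k = (α_k/2) e^{jφ_k} and the complex poles taken in conjugate pairs (here, twice the real part stands in for adding the conjugate pole), the complex form reproduces the real one. A minimal numerical check, with all parameter values illustrative:

```python
import numpy as np

p = 2                                   # number of components
alpha = np.array([1.0, 0.5])            # amplitudes
gamma = np.array([0.01, 0.02])          # damping factors
omega = np.array([0.3, 0.7])            # angular frequencies (rad/sample)
phi = np.array([0.0, np.pi / 4])        # phases
n = np.arange(200)

# Real form: y_n = sum_k alpha_k e^{-gamma_k n} cos(omega_k n + phi_k)
y_real = sum(alpha[k] * np.exp(-gamma[k] * n) * np.cos(omega[k] * n + phi[k])
             for k in range(p))

# Complex form: y_n = sum_k c_k (e^{-gamma_k + j omega_k})^n plus its
# conjugate, with c_k = (alpha_k / 2) e^{j phi_k}
c = 0.5 * alpha * np.exp(1j * phi)
poles = np.exp(-gamma + 1j * omega)
y_complex = sum(2 * np.real(c[k] * poles[k] ** n) for k in range(p))

assert np.allclose(y_real, y_complex)
```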
State space Parametrisation
x_{n+1} = diag(e^{−γ_1+jω_1}, …, e^{−γ_p+jω_p}) x_n ≡ A x_n,   x_0 = (c_1, c_2, …, c_p)^⊤

y_n = (1 1 … 1 1) x_n ≡ C x_n

[Figure: state space model — chain x_0 → x_1 → … → x_K with observations y_1, …, y_K]
State Space Parametrisation
Classical System identification approach
• The state space representation implies

x_{n+1} = A x_n,   y_n = C x_n   ⇒   y_n = C A^n x_0

• Therefore, for arbitrary L and M, we can write the Hankel matrix

Y ≡ [ y_0 y_1 … y_M ; y_1 y_2 … y_{M+1} ; ⋮ ; y_L y_{L+1} … y_{L+M} ]
  = [ C ; CA ; ⋮ ; CA^L ] [ x_0  A x_0  …  A^M x_0 ]
  ≡ Γ_{L+1} Ω_{M+1}

Identification via matrix factorisation

1. Given the “impulse response” Hankel matrix Y (Ho and Kalman 1966, Rao and Arun 1992, Viberg 1995), compute a matrix factorisation Y = Γ̄_{L+1} Ω̄_{M+1} (typically via SVD)
2. Read off C and x_0 from the factors Γ̄_{L+1} and Ω̄_{M+1}
3. Compute the transition matrix by exploiting shift invariance:

[ CA ; CA^2 ; ⋮ ; CA^L ] = [ C ; CA ; ⋮ ; CA^{L−1} ] A   ⇒   A = Γ†_{1:L} Γ_{2:L+1}

Matrix factorisation ideas have led to useful methods (N4SID, NMF, MMMF, …)
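Steps 1–3 can be sketched numerically on a toy noiseless signal (a single damped sinusoid; parameter values are illustrative): the SVD of the Hankel matrix exposes the numerical rank and an extended observability matrix up to a similarity transform, and the shifted least-squares solve recovers the pole frequency.

```python
import numpy as np

# Toy signal: one damped sinusoid, i.e. the output of a 2nd-order system
gamma, omega = 0.01, 0.4
n = np.arange(100)
y = np.exp(-gamma * n) * np.cos(omega * n)

# Hankel matrix Y with Y[i, j] = y[i + j]
L, M = 20, 60
Y = np.array([[y[i + j] for j in range(M + 1)] for i in range(L + 1)])

U, s, Vt = np.linalg.svd(Y)
order = int(np.sum(s > 1e-8 * s[0]))    # numerical rank gives the model order
Gamma = U[:, :order] * s[:order]        # extended observability, up to a similarity

# Shift invariance: Gamma[1:] = Gamma[:-1] @ A, so solve with a pseudoinverse.
# The similarity transform cancels in the eigenvalues, which are the poles.
A = np.linalg.pinv(Gamma[:-1]) @ Gamma[1:]
poles = np.linalg.eigvals(A)
est_omega = np.abs(np.angle(poles[0]))  # |angle| of either conjugate pole

assert order == 2
assert abs(est_omega - omega) < 1e-6
```

With noisy data the rank decision and the factorisation become the delicate part, which is where methods such as N4SID come in.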
Pros and Cons
• Uses well understood algorithms from numerical linear algebra ⇒ often quite fast and numerically stable
• Model selection can be based on numerical rank analysis, inspection of singular values, etc.
• Handling of uncertainty and nonstationarity is not very transparent
• Prior knowledge is hard to incorporate
Hierarchical Factorial Models
• Each component models a latent process
• The observations are projections
[Figure: factorial model — per-component chains r^ν_{0:K} and θ^ν_{0:K}, ν = 1…W, jointly generating the observations y_{1:K}]
• Generalises Source-filter models
Harmonic model with changepoints
r_k | r_{k−1} ∼ p(r_k | r_{k−1})
θ_k | θ_{k−1}, r_k ∼ [r_k = 0] N(A θ_{k−1}, Q)   (reg)   +   [r_k = 1] N(0, S)   (new)
y_k | θ_k ∼ N(C θ_k, R)

A = blkdiag(G_ω, G_{2ω}, …, G_{Hω})^N,   G_ω = ρ_k [ cos ω  −sin ω ; sin ω  cos ω ]

with damping factor 0 < ρ_k < 1, frame length N and a damped sinusoidal basis matrix C of size N × 2H
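A generative draw from this model can be sketched as follows. All numerical values (H, ω, ρ, noise scales, changepoint probability) are illustrative assumptions, and the frame-to-frame transition is taken to advance the damped oscillators by the frame length N:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters, not the values used in the talk
H, omega, rho, N, K = 4, 2 * np.pi * 0.01, 0.999, 64, 10
p_change, q_std, s_std, r_std = 0.1, 0.01, 1.0, 0.05

def G(w):
    # damped rotation G_w = rho [[cos w, -sin w], [sin w, cos w]]
    return rho * np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])

# Transition: block diagonal in the harmonics, advanced N samples per frame
blk = np.zeros((2 * H, 2 * H))
for h in range(H):
    blk[2*h:2*h+2, 2*h:2*h+2] = G((h + 1) * omega)
A = np.linalg.matrix_power(blk, N)

# Damped sinusoidal basis C (N x 2H): row n reads out each harmonic state
# rotated n samples into the frame
n = np.arange(N)
C = np.zeros((N, 2 * H))
for h in range(H):
    C[:, 2*h] = rho**n * np.cos((h + 1) * omega * n)
    C[:, 2*h + 1] = -rho**n * np.sin((h + 1) * omega * n)

theta = s_std * rng.normal(size=2 * H)            # theta_0 from the prior
frames, onsets = [], []
for k in range(K):
    r = rng.random() < p_change                   # changepoint indicator r_k
    theta = (s_std * rng.normal(size=2 * H) if r                  # "new" onset
             else A @ theta + q_std * rng.normal(size=2 * H))     # "reg" regime
    frames.append(C @ theta + r_std * rng.normal(size=N))  # y_k ~ N(C theta_k, R)
    onsets.append(r)

assert len(frames) == K and frames[0].shape == (N,)
```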
Harmonic model with changepoints
[Figure: changepoint indicators r_k and latent frequency content over frame index k, with the signal x_k below]
• Each changepoint denotes the onset of a new audio event
Monophonic transcription
• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TASLP)
Exact inference is possible
Factorial Changepoint model
r_{0,ν} ∼ C(r_{0,ν}; π_{0,ν})
θ_{0,ν} ∼ N(θ_{0,ν}; μ_ν, P_ν)
r_{k,ν} | r_{k−1,ν} ∼ C(r_{k,ν}; π_ν(r_{k−1,ν}))                    (changepoint indicator)
θ_{k,ν} | θ_{k−1,ν} ∼ N(θ_{k,ν}; A_ν(r_k) θ_{k−1,ν}, Q_ν(r_k))       (latent state)
y_k | θ_{k,1:W} ∼ N(y_k; C_k θ_{k,1:W}, R)                           (observation)

[Figure: factorial changepoint model — chains r^ν_{0:K} and θ^ν_{0:K} for ν = 1…W, jointly generating y_{1:K}]
Application: Analysis of Polyphonic Audio
[Figure: latent changepoint indicators over frequency index ν vs. frame index k, with the observed signal x_k below]
• Each latent changepoint process ν = 1 . . . W corresponds to a “piano key”.
Indicators r1:W,1:K encode a latent “piano roll”
Single time slice - Bayesian Variable Selection
r_i ∼ C(r_i; π_on, π_off)
s_i | r_i ∼ [r_i = on] N(s_i; 0, Σ) + [r_i ≠ on] δ(s_i)
x | s_{1:W} ∼ N(x; C s_{1:W}, R),   C ≡ [ C_1 … C_i … C_W ]

[Figure: graphical model — indicators r_1 … r_W, sources s_1 … s_W, observation x]

• Generalised linear model: the columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable
• Sparsity by construction (Olshausen and Millman, Attias, …)
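For small W the posterior can still be computed exactly: integrating out s_{1:W} leaves a Gaussian marginal likelihood per indicator configuration, and the 2^W configurations can be enumerated. A sketch under illustrative parameter choices (isotropic Σ = σ² I, isotropic R):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

W, D = 6, 20
C = rng.normal(size=(D, W))               # columns = basis vectors
sigma2, r_noise, pi_on = 4.0, 0.1, 0.3    # illustrative Sigma, R, prior

# Generate data from a sparse configuration
r_true = np.array([1, 0, 0, 1, 0, 0])
s = np.where(r_true, rng.normal(scale=np.sqrt(sigma2), size=W), 0.0)
x = C @ s + r_noise * rng.normal(size=D)

def log_marginal(r):
    """log p(x | r): x ~ N(0, sigma2 C_on C_on^T + R) after integrating out s."""
    on = np.flatnonzero(r)
    cov = sigma2 * C[:, on] @ C[:, on].T + r_noise**2 * np.eye(D)
    sign, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (logdet + x @ np.linalg.solve(cov, x))

configs = list(product([0, 1], repeat=W))
logp = np.array([log_marginal(r)
                 + np.sum(np.where(r, np.log(pi_on), np.log(1 - pi_on)))
                 for r in configs])
post = np.exp(logp - logp.max())
post /= post.sum()                        # exact posterior over all 2^W configs
r_map = configs[int(np.argmax(post))]
assert len(configs) == 2**W
```

The cost grows as 2^W, which is exactly why the approximate schemes on the following slides are needed.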
Chord detection example
[Figure: chord detection example — signal waveform and detected frequencies between 0 and 3π/4]
Inference : Iterative Improvement
r*_{1:W} = argmax_{r_{1:W}} ∫ ds_{1:W} p(y | s_{1:W}) p(s_{1:W} | r_{1:W}) p(r_{1:W})

iteration   r_1 … r_M                                           log p(y_{1:T}, r_{1:M})
1     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦   −1220638254
2     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦   −665073975
3     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ •   −311983860
4     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ • ◦ •   −162334351
5     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ • ◦ •   −43419569
6     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −1633593
7     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ ◦ • ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −14336
8     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ • • ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −5766
9     ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ • • ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −5210
10    ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −4664
True  ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • • ◦ ◦ • ◦ ◦ ◦ ◦ • ◦ • ◦ •   −4664
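The iterative improvement above can be sketched as a greedy single-flip search: starting from the empty configuration, repeatedly flip whichever indicator most increases the objective. The toy score below (a least-squares fit plus a sparsity penalty) stands in for the marginal log-posterior log p(y, r); all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

W = 8
r_true = rng.random(W) < 0.3
C = rng.normal(size=(30, W))
y = C @ r_true.astype(float) + 0.05 * rng.normal(size=30)

def score(r):
    # Stand-in for log p(y, r): data fit plus a sparsity penalty on "on" bits
    resid = y - C @ r.astype(float)
    return -0.5 * (resid @ resid) / 0.05**2 - 2.0 * r.sum()

r = np.zeros(W, dtype=bool)               # start from the all-off configuration
best = score(r)
improved = True
while improved:                           # stop at a local maximum
    improved = False
    for i in range(W):                    # try flipping each indicator in turn
        cand = r.copy()
        cand[i] = not cand[i]
        s = score(cand)
        if s > best:
            r, best, improved = cand, s, True

assert score(r) >= score(np.zeros(W, dtype=bool))
```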
Inference : MCMC/Gibbs sampler
• MCMC: construct a Markov chain whose stationary distribution is the desired posterior P
• Gibbs sampler: cycle through the variables ν = 1 … W and sample each from its full conditional

r_ν^{(t+1)} ∼ p(r_ν | r_1^{(t+1)}, r_2^{(t+1)}, …, r_{ν−1}^{(t+1)}, r_{ν+1}^{(t)}, …, r_W^{(t)})
• Rao-Blackwellisation: Conditioned on r1:W, the latent variables s1:W can be integrated over analytically.
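A minimal sketch of the Gibbs sweep over binary indicators. The quadratic log-posterior here is a toy stand-in; in the model of the talk it would be log p(r, y) with s_{1:W} integrated out analytically (the Rao-Blackwellised target).

```python
import numpy as np

rng = np.random.default_rng(3)

W = 5
J = rng.normal(size=(W, W))
J = 0.5 * (J + J.T)                       # arbitrary symmetric couplings

def log_post(r):
    # Toy unnormalised log p(r | y); a stand-in for the Rao-Blackwellised target
    return r @ J @ r

r = (rng.random(W) < 0.5).astype(float)   # random initial configuration
for sweep in range(100):
    for nu in range(W):                   # cycle through nu = 1..W
        r0, r1 = r.copy(), r.copy()
        r0[nu], r1[nu] = 0.0, 1.0
        # Full conditional p(r_nu = 1 | r_{-nu}) from two joint evaluations
        p1 = 1.0 / (1.0 + np.exp(log_post(r0) - log_post(r1)))
        r[nu] = float(rng.random() < p1)

assert set(np.unique(r)) <= {0.0, 1.0}
```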
Variational Bayes – Structured mean field
• VB: approximate a complicated distribution P with a simpler, tractable one Q, in the sense of

Q* = argmin_Q KL(Q || P)

• KL is the Kullback–Leibler divergence

KL(Q || P) ≡ ⟨log Q⟩_Q − ⟨log P⟩_Q ≥ 0

• If Q obeys the factorisation Q = Π_ν Q_ν, the solution is given by the fixed point

Q_ν ∝ exp ⟨log P⟩_{Q_¬ν}
• Leads to powerful generalisations of the Expectation Maximisation (EM) algorithm (Hinton and Neal 1998, Attias 2000)
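For a toy quadratic log-posterior over binary indicators (an illustrative stand-in, not the talk's model), the fixed point Q_ν ∝ exp⟨log P⟩_{Q_¬ν} reduces to sigmoid updates on the means m_ν = ⟨r_ν⟩, which sketches the mean-field iteration:

```python
import numpy as np

rng = np.random.default_rng(4)

W = 5
J = rng.normal(size=(W, W))
J = 0.5 * (J + J.T)
np.fill_diagonal(J, 0.0)                  # zero self-couplings
b = rng.normal(size=W)

# Target: log P(r) = r^T J r + b^T r over binary r; Q factorises over r_nu
m = np.full(W, 0.5)                       # initial mean-field means <r_nu>
for it in range(200):
    for nu in range(W):
        # <log P>_{Q_not_nu} as a function of r_nu is linear:
        # r_nu * (2 J[nu] . m + b[nu]), so Q_nu is Bernoulli with a sigmoid mean
        m[nu] = 1.0 / (1.0 + np.exp(-(2 * J[nu] @ m + b[nu])))

assert np.all((m > 0) & (m < 1))
```

Each update is deterministic and monotone in the variational objective, in contrast to the stochastic Gibbs moves on the previous slide.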
MCMC versus Variational Bayes (VB)
• Each configuration of r1:W corresponds to a corner of a W dimensional hypercube
[Figure: W-dimensional hypercube with the configurations of r_{1:W} at its corners]
• MCMC moves along the edges stochastically
• VB moves inside the hypercube deterministically
Sequential Inference
• Filtering: Mixture Kalman Filter (Rao-Blackwellized PF) (Chen and Liu 2001)
• MMAP: Breadth-first search algorithm with greedy or randomised pruning, multi-hypothesis tracker (MHT)
• For each hypothesis, there are 2W possible branches at each timeslice
⇒ Need a fast proposal to find promising branches without exhaustive evaluation
Music Processing challenges
• Computational modeling of human listening and music performance abilities
– complex and nonstationary temporal structure, on both the physical-signal and cognitive-symbolic level
– Applications: interactive music performance, musicology, music information retrieval, education
• Analysis
– identification of individual sound events (notes, kicks)
– invariant characteristics (timbre)
– extraction of higher-level structure (tempo, harmony, rhythm)
– attributes that are not well defined (expression, mood, genre)
• Synthesis
– design of sound synthesis models, abstract or physical
– performance rendering: generation of a physically, perceptually or artistically feasible control policy
Issues
• What types of modelling approaches are useful for acoustic processing (e.g. hierarchical, generative, discriminative)?
• What classes of inference algorithms are suitable for these potentially large and hybrid models of sound?
• How can we improve the quality and speed of inference?
• Can efficient online algorithms be developed?
• How can we learn efficient auditory codes based on independence assumptions about the generating processes?
• What can biology and cognitive science tell us about acoustic representations and processing (and vice versa)?