
(1)

Advances in Models for Acoustic Processing

David Barber and Taylan Cemgil

Signal Processing and Communications Lab.

9 Dec 2006

(2)

Outline

• Acoustic Modeling and applications

• Parameter estimation and inference

– Subspace methods, Variational, Monte Carlo

• Issues

(3)

Acoustic Modeling

[Figure: spectrogram (f/Hz vs. t/sec) of a speech signal]

(Speech)

[Figure: spectrogram (f/Hz vs. t/sec) of a piano recording]

(Piano)

(4)

Probabilistic Models

• Once a realistic model is constructed, many related tasks can be cast as posterior inference problems

p(Structure|Observations) ∝ p(Observations|Structure) p(Structure)

– analysis,
– localisation,
– restoration,
– transcription,
– source separation,
– identification,
– coding,
– resynthesis, cross synthesis

(5)

Source Separation

[Graphical model: sources s¹_t, s²_t, …, sⁿ_t mix into observations x¹_t, …, x^m_t for t = 1 … T, with per-channel mixing parameters a_1 … a_m and noise parameters r_1 … r_m]

• Joint estimation of the sources, the channel noise and the mixing system

• Typically underdetermined (channels < sources) ⇒ multimodal posterior

(6)

Source Separation

(7)

Source Separation

[Figure: spectrogram (f/Hz vs. t/sec) of the mixture]

(Speech + Piano + Guitar)

(8)

Audio Interpolation

• Estimate missing samples given observed ones

• Restoration, concatenative expressive speech synthesis, ...

[Figure: waveform with a segment of missing samples]

(9)

Audio Interpolation

p(x_¬κ | x_κ) ∝ ∫ dH p(x_¬κ | H) p(x_κ | H) p(H),    H ≡ (parameters, hidden states)

[Graphical model: latent H generates both the missing samples x_¬κ and the observed samples x_κ]

[Figure: waveform with the missing segment interpolated]

(10)

Application: Analysis of Polyphonic Audio

[Figure: latent indicators over frequency ν and frame index k, aligned with the observed signal x_k]

• Each latent process ν = 1 … W corresponds to a “voice”. The indicators r_{1:W,1:K} encode a latent “piano roll”

(11)

Tempo, Rhythm, Meter analysis

[Figure, four panels over frame index k:
– observed data y_k;
– log p(m_k | y_{1:K});
– log p(n_k | y_{1:K}), tempo in quarter notes per min.;
– p(r_k | y_{1:K}), rhythmic subdivision (duplets vs. triplets)]

(12)

Hierarchical Modeling

Score → Expression → Piano Roll → Audio Signal

[Figure: a notated score; the true vs. estimated expressive tempo trajectory ω_t; the corresponding piano roll θ; and the resulting audio signal]

(13)

Hierarchical Modeling

[Graphical model: a hierarchical dynamic Bayesian network. Global chains (M_t, v_t, k_t, h_t, m_t) evolve over time t = 1, 2, …; for each voice j, chains g_{j,t}, r_{j,t}, n_{j,t}, x_{j,t} generate voice-level signals y_{j,t}, which combine into the observed signal y_t]

(14)

Time Series Modeling

• Sound is primarily about oscillations and resonance

• Cascade of second order systems

• Audio signals can often be compactly represented by sinusoids

(real)    y_n = Σ_{k=1}^p α_k e^{−γ_k n} cos(ω_k n + φ_k)

(complex) y_n = Σ_{k=1}^p c_k (e^{−γ_k + jω_k})^n

y = F(γ_{1:p}, ω_{1:p}) c
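The real and complex forms above are equivalent, with α_k = |c_k| and φ_k = arg c_k. A minimal NumPy sketch of the matrix form y = F(γ_{1:p}, ω_{1:p}) c, with illustrative parameter values:

```python
import numpy as np

# Sketch of the damped-sinusoid signal model. The pole/amplitude values
# below are illustrative, not taken from the slides.

def basis(gammas, omegas, N):
    """Columns are damped complex exponentials (e^{-gamma + j*omega})^n."""
    n = np.arange(N)[:, None]                                      # (N, 1)
    poles = np.exp(-np.asarray(gammas) + 1j * np.asarray(omegas))  # (p,)
    return poles[None, :] ** n                                     # (N, p)

N = 256
gammas = [0.01, 0.02]
omegas = [0.2, 0.5]
c = np.array([1.0 + 0.0j, 0.5 - 0.2j])

F = basis(gammas, omegas, N)
y = (F @ c).real          # real part of the complex form

# The same signal as a direct sum of damped cosines (the "real" form),
# with alpha_k = |c_k| and phi_k = arg(c_k):
n = np.arange(N)
y_direct = sum(
    abs(ck) * np.exp(-g * n) * np.cos(w * n + np.angle(ck))
    for ck, g, w in zip(c, gammas, omegas)
)
assert np.allclose(y, y_direct)
```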

(15)

State space Parametrisation

x_{n+1} = diag(e^{−γ_1 + jω_1}, …, e^{−γ_p + jω_p}) x_n ≡ A x_n,    x_0 = (c_1, c_2, …, c_p)ᵀ

y_n = (1 1 … 1) x_n ≡ C x_n

[Graphical model: a linear state-space chain x_0 → x_1 → … → x_K, with observation y_k emitted from each state x_k]
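A quick NumPy check (illustrative pole and amplitude values) that iterating the diagonal state-space recursion reproduces the closed-form sum of damped complex exponentials:

```python
import numpy as np

# Sketch: diagonal state-space parametrisation of the damped-sinusoid model.
# A holds the complex poles, C sums the modes, x0 holds the amplitudes.

gammas = np.array([0.01, 0.02])
omegas = np.array([0.2, 0.5])
c = np.array([1.0 + 0.0j, 0.5 - 0.2j])

A = np.diag(np.exp(-gammas + 1j * omegas))   # one pole per mode
C = np.ones((1, len(c)))                     # observation sums the modes
x = c.copy()                                 # x_0 = (c_1, ..., c_p)^T

N = 100
y = np.empty(N, dtype=complex)
for n in range(N):
    y[n] = (C @ x).item()                    # y_n = C x_n
    x = A @ x                                # x_{n+1} = A x_n

# Agrees with the closed form y_n = sum_k c_k (e^{-gamma_k + j omega_k})^n:
n = np.arange(N)
y_closed = (np.exp(-gammas + 1j * omegas)[None, :] ** n[:, None]) @ c
assert np.allclose(y, y_closed)
```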

(16)

State Space Parametrisation

(17)

Classical System identification approach

• The state space representation implies

x_{n+1} = A x_n,    y_n = C x_n    ⇒    y_n = C Aⁿ x_0

• Therefore we can write, for arbitrary L and M, the Hankel matrix

    | y_0  y_1      …  y_M     |       | C    |
Y = | y_1  y_2      …  y_{M+1} |   =   | CA   |  ( x_0  A x_0  …  A^M x_0 )
    | ⋮    ⋮        ⋱  ⋮       |       | ⋮    |
    | y_L  y_{L+1}  …  y_{L+M} |       | CA^L |

with the left factor Γ_{L+1} (extended observability matrix) and the right factor Ω_{M+1} = (x_0, A x_0, …, A^M x_0)

(18)

Identification via matrix factorisation

1. Given the “impulse response” Hankel matrix Y (Ho and Kalman 1966, Rao and Arun 1992, Viberg 1995), compute a matrix factorisation (typically via SVD)

Y = Γ̄_{L+1} Ω̄_{M+1},    Γ̄_{L+1} = (C; CA; …; CA^L),    Ω̄_{M+1} = (x_0, A x_0, …, A^M x_0)

2. Read off C and x_0 from the factors Γ̄_{L+1} and Ω̄_{M+1}

3. Compute the transition matrix by exploiting shift invariance:

(CA; CA²; …; CA^L) = (C; CA; …; CA^{L−1}) A    ⇒    A = Γ†_{1:L} Γ_{2:L+1}

where Γ† denotes the pseudoinverse. Matrix factorisation ideas have led to useful methods (N4SID, NMF, MMMF, …)
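The three steps can be sketched in a few lines of NumPy. This is a simplified Ho-Kalman-style illustration on a synthetic two-mode signal (pole values and model order are chosen for the example, not taken from the slides):

```python
import numpy as np

# Sketch of subspace identification from a Hankel matrix via SVD.
# Synthesize a noiseless test signal y_n = C A^n x0 from a known system.
poles = np.exp(-np.array([0.01, 0.03]) + 1j * np.array([0.3, 0.7]))
A_true = np.diag(poles)
C_true = np.ones((1, 2))
x0 = np.array([1.0, 0.5 - 0.2j])
y = np.array([(C_true @ np.linalg.matrix_power(A_true, n) @ x0).item()
              for n in range(40)])

# Step 1: build the Hankel matrix and factorise via SVD.
L, M = 20, 19
Y = np.array([y[i:i + M + 1] for i in range(L + 1)])   # (L+1) x (M+1)
U, s, Vh = np.linalg.svd(Y)
p = 2                                                  # model order from numerical rank
Gamma = U[:, :p] * np.sqrt(s[:p])    # extended observability, up to a similarity transform

# Steps 2-3: shift invariance gives Gamma[1:] = Gamma[:-1] @ A_hat.
A_hat = np.linalg.pinv(Gamma[:-1]) @ Gamma[1:]

# The estimated poles match the true ones (up to ordering), since
# eigenvalues are invariant under the similarity transform.
est = np.sort_complex(np.linalg.eigvals(A_hat))
true = np.sort_complex(poles)
assert np.allclose(est, true, atol=1e-6)
```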

(19)

Pros and Cons

• Uses well understood algorithms from numerical linear algebra ⇒ often quite fast and numerically stable

• Model selection can be based on numerical rank analysis, inspection of singular values, etc.

• Handling of uncertainty and nonstationarity is not very transparent

• Prior knowledge is hard to incorporate

(20)

Hierarchical Factorial Models

• Each component models a latent process

• The observations are projections

[Graphical model: W parallel chains, ν = 1 … W; each has indicators r_0^ν, …, r_K^ν and states θ_0^ν, …, θ_K^ν, jointly generating the observations y_1 … y_K]

• Generalises source-filter models

(21)

Harmonic model with changepoints

r_k | r_{k−1} ∼ p(r_k | r_{k−1})

θ_k | θ_{k−1}, r_k ∼ [r_k = 0] N(A θ_{k−1}, Q)  (regular)  +  [r_k = 1] N(0, S)  (new)

y_k | θ_k ∼ N(C θ_k, R)

A = diag(G_ω, G_{2ω}, …, G_{Hω}),    G_ω = ρ_k ( cos ω  −sin ω ; sin ω  cos ω )

with damping factor 0 < ρ_k < 1, frame length N and damped sinusoidal basis matrix C of size N × 2H
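The block-diagonal transition matrix A can be sketched as follows (ρ, ω and H values are illustrative); iterating it yields a damped oscillation at each harmonic h·ω:

```python
import numpy as np

def rotation(omega, rho):
    """Damped planar rotation G_omega = rho * [[cos, -sin], [sin, cos]]."""
    c, s = np.cos(omega), np.sin(omega)
    return rho * np.array([[c, -s], [s, c]])

def harmonic_transition(omega, rho, H):
    """Block-diagonal A = diag(G_omega, G_{2*omega}, ..., G_{H*omega})."""
    A = np.zeros((2 * H, 2 * H))
    for h in range(1, H + 1):
        A[2*(h-1):2*h, 2*(h-1):2*h] = rotation(h * omega, rho)
    return A

omega, rho, H = 0.25, 0.99, 4
A = harmonic_transition(omega, rho, H)

# Iterating A on a unit state gives a damped sinusoid in each 2-d block;
# the first coordinate of block h oscillates at frequency h*omega.
x = np.tile([1.0, 0.0], H)
traj = [x[0]]
for _ in range(200):
    x = A @ x
    traj.append(x[0])
traj = np.array(traj)
# The amplitude decays like rho**n:
assert abs(traj[200]) <= rho ** 200 + 1e-9
```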

(22)

Harmonic model with changepoints

[Figure: changepoint indicators r_k and latent states over frequency and frame index k, aligned with the signal x_k]

• Each changepoint denotes the onset of a new audio event

(23)

Monophonic transcription

• Detecting onsets, offsets and pitch (Cemgil et al. 2006, IEEE TSALP)

[Figure: a monophonic recording with detected onsets, offsets and pitch]

• Exact inference is possible

(24)

Factorial Changepoint model

r_{0,ν} ∼ C(r_{0,ν}; π_{0,ν})    θ_{0,ν} ∼ N(θ_{0,ν}; μ_ν, P_ν)

r_{k,ν} | r_{k−1,ν} ∼ C(r_{k,ν}; π_ν(r_{k−1,ν}))    Changepoint indicator

θ_{k,ν} | θ_{k−1,ν} ∼ N(θ_{k,ν}; A_ν(r_{k,ν}) θ_{k−1,ν}, Q_ν(r_{k,ν}))    Latent state

y_k | θ_{k,1:W} ∼ N(y_k; C_k θ_{k,1:W}, R)    Observation

[Graphical model: W parallel changepoint chains, ν = 1 … W, with indicators r_k^ν and states θ_k^ν over k = 0 … K, jointly generating the observations y_1 … y_K]
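A single chain of this model can be simulated in a few lines. The sketch below uses scalar states and illustrative transition probabilities and noise levels (the slides use vector harmonic states):

```python
import numpy as np

# Sketch: simulating one changepoint chain (a single voice nu) of the
# factorial model. All numeric values are illustrative.

rng = np.random.default_rng(2)
K = 200
p_new = 0.02                 # probability of a changepoint per frame
rho = 0.99                   # damping of the "regular" dynamics

r = np.zeros(K, dtype=int)
theta = np.zeros(K)
theta[0] = rng.normal(0.0, 1.0)
for k in range(1, K):
    r[k] = int(rng.random() < p_new)
    if r[k] == 1:            # changepoint: reinitialise from the prior
        theta[k] = rng.normal(0.0, 1.0)
    else:                    # regular: damped evolution with small noise
        theta[k] = rho * theta[k - 1] + rng.normal(0.0, 0.01)
y = theta + rng.normal(0.0, 0.05, size=K)   # noisy observation

print("changepoints at frames:", np.flatnonzero(r))
```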

(25)

Application: Analysis of Polyphonic Audio

[Figure: latent piano-roll indicators over frequency ν and frame index k, aligned with the observed signal x_k]

• Each latent changepoint process ν = 1 … W corresponds to a “piano key”. The indicators r_{1:W,1:K} encode a latent “piano roll”

(26)

Single time slice - Bayesian Variable Selection

r_i ∼ C(r_i; π_on, π_off)

s_i | r_i ∼ [r_i = on] N(s_i; 0, Σ) + [r_i ≠ on] δ(s_i)

x | s_{1:W} ∼ N(x; C s_{1:W}, R),    C = [ C_1 … C_i … C_W ]

[Graphical model: indicators r_1 … r_W select sources s_1 … s_W, which generate the observation x]

• Generalized linear model – the columns of C are the basis vectors

• The exact posterior is a mixture of 2^W Gaussians

• When W is large, computation of posterior features becomes intractable

• Sparsity by construction (Olshausen and Millman, Attias, …)
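For small W, the 2^W-term posterior over indicators can be computed exactly by enumeration, since s_{1:W} integrates out in closed form. A sketch (basis, prior variance, noise level and π_on are all illustrative):

```python
import numpy as np
from itertools import product

# Sketch: exact indicator posterior in the variable-selection model
# by enumerating all 2^W on/off configurations (small W only).

rng = np.random.default_rng(0)
W, D = 4, 16
C = rng.standard_normal((D, W))      # columns C_i are the basis vectors
sigma2 = 4.0                         # prior variance of an active source
R = 0.1 * np.eye(D)                  # observation noise covariance
pi_on = 0.3

# Generate data with sources 0 and 2 active.
on_true = [0, 2]
x = C[:, on_true] @ (np.sqrt(sigma2) * rng.standard_normal(len(on_true)))
x = x + np.sqrt(0.1) * rng.standard_normal(D)

log_post = {}
for r in product([0, 1], repeat=W):
    on = [i for i, ri in enumerate(r) if ri]
    # With s integrated out: x ~ N(0, sigma2 * C_on C_on^T + R)
    Cov = sigma2 * C[:, on] @ C[:, on].T + R if on else R
    sign, logdet = np.linalg.slogdet(Cov)
    loglik = -0.5 * (logdet + x @ np.linalg.solve(Cov, x) + D * np.log(2 * np.pi))
    logprior = sum(np.log(pi_on if ri else 1 - pi_on) for ri in r)
    log_post[r] = loglik + logprior

best = max(log_post, key=log_post.get)
print("MAP configuration:", best)    # one of the 2^W = 16 mixture components
```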

(27)

Chord detection example

[Figure: a chord waveform and its estimated sinusoidal components over the frequency range 0 … 3π/4]

(28)

Inference : Iterative Improvement

r*_{1:W} = argmax_{r_{1:W}} ∫ ds_{1:W} p(y | s_{1:W}) p(s_{1:W} | r_{1:W}) p(r_{1:W})

iteration  r_1 … r_W                 log p(y_{1:T}, r_{1:W})
1          ◦◦◦◦◦◦◦•◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦  −1220638254
2          ◦◦◦◦◦◦◦•◦◦◦◦◦◦•◦◦◦◦◦◦◦◦◦  −665073975
3          ◦◦◦◦◦◦◦•◦◦◦◦◦◦•◦◦◦◦◦◦◦◦•  −311983860
4          ◦◦◦◦◦◦◦•◦◦◦◦◦◦•◦◦◦◦◦◦•◦•  −162334351
5          ◦◦◦◦◦◦◦••◦◦◦◦◦•◦◦◦◦◦◦•◦•  −43419569
6          ◦◦◦◦◦◦◦••◦◦◦◦◦•◦◦◦◦•◦•◦•  −1633593
7          ◦◦◦◦◦◦◦••◦◦•◦◦•◦◦◦◦•◦•◦•  −14336
8          ◦◦◦◦◦◦◦••◦••◦◦•◦◦◦◦•◦•◦•  −5766
9          ◦◦◦◦◦◦◦•◦◦••◦◦•◦◦◦◦•◦•◦•  −5210
10         ◦◦◦◦◦◦◦◦◦◦••◦◦•◦◦◦◦•◦•◦•  −4664
True       ◦◦◦◦◦◦◦◦◦◦••◦◦•◦◦◦◦•◦•◦•  −4664
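The search pattern in the table is greedy single-bit improvement: at each step, flip the one indicator that most increases the objective. A toy sketch (the objective below is a stand-in for the marginal log-likelihood, not the model from the slides):

```python
import numpy as np

# Sketch of greedy iterative improvement over binary indicator
# configurations: flip the single best bit until no flip helps.

def greedy_improve(score, W):
    r = np.zeros(W, dtype=int)
    best = score(r)
    while True:
        flips = []
        for i in range(W):
            r2 = r.copy()
            r2[i] ^= 1
            flips.append((score(r2), i))
        s, i = max(flips)
        if s <= best:          # no single flip improves: local optimum
            return r, best
        r[i] ^= 1
        best = s

# Toy separable objective with a unique optimum at a known configuration.
target = np.array([0, 1, 1, 0, 1, 0, 0, 1])
score = lambda r: -float(np.sum((r - target) ** 2))

r_hat, s_hat = greedy_improve(score, len(target))
assert (r_hat == target).all() and s_hat == 0.0
```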

(29)

Inference : MCMC/Gibbs sampler

• MCMC: construct a Markov chain whose stationary distribution is the desired posterior P

• Gibbs sampler: cycle through the variables ν = 1 … W and sample each from its full conditional

r_ν^(t+1) ∼ p(r_ν | r_1^(t+1), r_2^(t+1), …, r_{ν−1}^(t+1), r_{ν+1}^(t), …, r_W^(t))

• Rao-Blackwellisation: conditioned on r_{1:W}, the latent variables s_{1:W} can be integrated out analytically
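A minimal Gibbs sweep over binary indicators looks as follows. The target here is a toy chain distribution (the coupling J is illustrative); in the slides the full conditional would be the Rao-Blackwellised p(r_ν | r_¬ν, y):

```python
import numpy as np

# Sketch of a Gibbs sampler over binary indicators r_1..r_W.

rng = np.random.default_rng(1)
W = 6
J = 0.8                      # toy coupling strength, illustrative

def log_p(r):
    # Unnormalised log-probability: neighbouring indicators like to agree.
    return J * sum(1.0 if r[i] == r[i + 1] else -1.0 for i in range(W - 1))

def gibbs_sweep(r):
    for nu in range(W):
        r0, r1 = r.copy(), r.copy()
        r0[nu], r1[nu] = 0, 1
        # p(r_nu = 1 | r_{-nu}) from the ratio of unnormalised probabilities
        p1 = 1.0 / (1.0 + np.exp(log_p(r0) - log_p(r1)))
        r[nu] = int(rng.random() < p1)
    return r

r = rng.integers(0, 2, size=W)
samples = []
for _ in range(2000):
    r = gibbs_sweep(r)
    samples.append(r.copy())
samples = np.asarray(samples)
# Under this symmetric toy target, each indicator is on about half the time.
print(samples.mean(axis=0))
```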

(30)

Variational Bayes – Structured mean field

• VB: Approximate a complicated distribution P with a simpler, tractable one Q in the sense of

Q* = argmin_Q KL(Q || P)

• KL is the Kullback-Leibler divergence

KL(Q || P) ≡ ⟨log Q⟩_Q − ⟨log P⟩_Q ≥ 0

• If Q obeys the factorisation Q = Π_ν Q_ν, the solution is given by the fixed point

Q_ν ∝ exp ⟨log P⟩_{Q_¬ν}

• Leads to powerful generalisations of the Expectation Maximisation (EM) algorithm (Hinton and Neal 1998, Attias 2000)

(31)

MCMC versus Variational Bayes (VB)

• Each configuration of r1:W corresponds to a corner of a W dimensional hypercube

[Figure: a hypercube whose corners are the configurations of r_{1:W}]

• MCMC moves along the edges stochastically

• VB moves inside the hypercube deterministically

(32)

Sequential Inference

• Filtering: Mixture Kalman Filter (Rao-Blackwellized PF) (Chen and Liu 2001)

• MMAP: Breadth-first search algorithm with greedy or randomised pruning, multi-hypothesis tracker (MHT)

• For each hypothesis, there are 2W possible branches at each timeslice

⇒ Need a fast proposal to find promising branches without exhaustive evaluation

(33)

Music Processing challenges

• Computational modeling of human listening and music performance abilities

– complex and nonstationary temporal structure, on both the physical-signal and the cognitive-symbolic level
– Applications: interactive music performance, musicology, music information retrieval, education

• Analysis

– identification of individual sound events - notes, kicks
– invariant characteristics - timbre
– extraction of higher structure information - tempo, harmony, rhythm
– attributes that are not well defined - expression, mood, genre

• Synthesis

– design of sound synthesis models - abstract or physical
– performance rendering: generation of a physically, perceptually or artistically feasible control policy

(34)

Issues

• What types of modelling approaches are useful for acoustic processing (e.g. hierarchical, generative, discriminative)?

• What classes of inference algorithms are suitable for these potentially large and hybrid models of sound?

• How can we improve the quality and speed of inference?

• Can efficient online algorithms be developed?

• How can we learn efficient auditory codes based on independence assumptions about the generating processes?

• What can biology and cognitive science tell us about acoustic representations and processing? (and vice versa)
