Hierarchical Bayesian Models for Audio and Music Signal Processing
A. Taylan Cemgil
Signal Processing and Communications Lab.
8 December 2007
NIPS 07 Workshop on Music
Collaborators
• Onur Dikmen, Boğaziçi, Istanbul
• Paul Peeling, Cambridge
• Nick Whiteley, Cambridge
• Simon Godsill, Cambridge
• Cedric Fevotte, ENST, Paris Telecom
• David Barber, UCL, London
• Bert Kappen, Nijmegen, The Netherlands
Statistical Approaches
• Probabilistic
• Hierarchical signal models to incorporate prior knowledge/inspiration from various sources:
– Physics (acoustics, physical models, ...)
– Studies of human cognition and perception (masking, psychoacoustics, ...)
– Musicology (musical constructs, harmony, tempo, form ...)
• Consistent framework for developing inference algorithms
• Contrast with traditional/procedural approaches, where there is no clear distinction between “what” and “how”
• Need to overcome computational obstacles (time, memory)
Generative Models for audition
• Computer audition ⇔ inverse synthesis via Bayesian inference
p(Structure|Observations) ∝ p(Observations|Structure)p(Structure)
Goal: Develop flexible prior structures for modelling nonstationary sources
∗ source separation, transcription
∗ restoration, interpolation, localisation, identification
Bayesian Source Separation
• Joint estimation of Sources, given Observations
• Source Model, v: parameters of the source prior over s_{k,1} ... s_{k,N}
• Observation Model, λ: channel noise, mixing system; observations x_{k,1} ... x_{k,M}, k = 1 ... K

p(Src|Obs) ∝ ∫ dλ dv p(Obs|Src, λ) p(Src|v) p(v)
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
Polyphonic Music Transcription
• from sound ...
(Figure: spectrogram, f/Hz against t/sec)
• ... to score
Modelling and Computational issues
• Hierarchical
  – Signal level: pitch, onsets, timbre
  – Symbolic level: melody, motives, harmony, chords, tonality, rhythm, beat, tempo, articulation, instrumentation, voice ...
  – Cognitive level: expression, genre, form, style, mood, emotion
• Uncertainty
  – Parameter Learning: which pitch, rhythm, tempo, meter, time signature ... ?
  – Model Selection: how many notes, harmonics, onsets, sections ... ?
Generative Models for Music
(Figure: generative hierarchy — Score, Expression, Piano-Roll)
Hierarchical Modeling of Music
(Graphical model over time slices 1 ... t: global variable M; chains v_1 ... v_t, k_1 ... k_t, h_1 ... h_t, m_1 ... m_t; per-voice chains g_{j,1} ... g_{j,t}, r_{j,1} ... r_{j,t}, n_{j,1} ... n_{j,t}, x_{j,1} ... x_{j,t}, y_{j,1} ... y_{j,t}; observations y_1 ... y_t)
Modelling levels
• Physical - acoustical
• Time domain – state space, dynamical models
• Transform domain – Fourier representations, Generalised Linear model
• Feature Based
Research Questions:
What kinds of prior knowledge and modelling techniques are useful?
How can we do efficient inference ?
Signal Models for Audio
• Time domain – state space, dynamical models
  – Conditional Linear Dynamical Systems, Gaussian processes (e.g. AR, ARMA), switching state space models
  – Flexible, physically realistic
  – Analysis down to sample precision; computationally quite heavy
• Transform domain – Fourier representations, Generalised Linear model
  – Models on (orthogonal) transform coefficients; energy compaction
  – Practical, can make use of fast transforms (FFT, MDCT, ...)
  – Inherent limitations (analysis windows, frequency resolution)
Sinusoidal Modeling
• Sound is primarily about oscillations and resonance
• Cascade of second order systems
• Audio signals can often be compactly represented by sinusoids

(real)    y_n = Σ_{k=1}^{p} α_k e^{−γ_k n} cos(ω_k n + φ_k)

(complex) y_n = Σ_{k=1}^{p} c_k (e^{−γ_k + jω_k})^n

y = F(γ_{1:p}, ω_{1:p}) c
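The equivalence of the real and complex forms above can be checked numerically. A minimal sketch (parameter values are illustrative, not from the talk):

```python
import numpy as np

# Synthesise the damped sinusoidal model in both its real and complex forms
# and verify they agree: α_k cos(ω_k n + φ_k) e^{−γ_k n} = Re[c_k ρ_k^n]
# with c_k = α_k e^{jφ_k} and pole ρ_k = e^{−γ_k + jω_k}.
p, n = 3, np.arange(1000)
rng = np.random.default_rng(0)
gamma = rng.uniform(1e-3, 5e-3, p)        # damping coefficients γ_k
omega = rng.uniform(0.05, 0.5, p)         # angular frequencies ω_k
alpha = rng.uniform(0.5, 2.0, p)          # amplitudes α_k
phi = rng.uniform(-np.pi, np.pi, p)       # phases φ_k

# real form: y_n = Σ_k α_k exp(−γ_k n) cos(ω_k n + φ_k)
y_real = sum(a * np.exp(-g * n) * np.cos(w * n + f)
             for a, g, w, f in zip(alpha, gamma, omega, phi))

# complex form: y_n = Re Σ_k c_k (exp(−γ_k + jω_k))^n
c = alpha * np.exp(1j * phi)
poles = np.exp(-gamma + 1j * omega)
y_cplx = np.real(sum(ck * pk ** n for ck, pk in zip(c, poles)))

assert np.allclose(y_real, y_cplx)
```

The complex form is what the state space parametrisation on the next slide propagates one pole at a time.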
State space Parametrisation

x_{n+1} = diag(e^{−γ_1 + jω_1}, ..., e^{−γ_p + jω_p}) x_n ≡ A x_n

x_0 = (c_1, c_2, ..., c_p)^T

y_n = (1 1 ... 1 1) x_n ≡ C x_n

(Graphical model: state chain x_0, x_1, ..., x_{k−1}, x_k, ..., x_K with observations y_1, ..., y_{k−1}, y_k, ..., y_K)
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
Audio Interpolation
p(x_{¬κ} | x_κ) ∝ ∫ dH p(x_{¬κ}|H) p(x_κ|H) p(H),    H ≡ (parameters, hidden states)

(Figure: x_{¬κ} missing, x_κ observed)
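As a toy illustration of this conditioning (not the talk's state-space model), one can place a joint Gaussian prior on the whole signal and read off the conditional mean of a missing block given the observed samples. The squared-exponential covariance and all sizes below are assumptions for the sketch:

```python
import numpy as np

# Fill a missing gap x_miss with E[x_miss | x_obs] under a joint Gaussian
# prior. Here the prior is a smooth GP covariance; the talk uses richer
# hierarchical state-space priors, but the conditioning step is the same.
rng = np.random.default_rng(1)
n = 200
t = np.arange(n)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 15.0) ** 2) + 1e-6 * np.eye(n)

x = rng.multivariate_normal(np.zeros(n), K)    # ground-truth draw from the prior
miss = np.zeros(n, dtype=bool)
miss[80:120] = True                            # a contiguous missing gap

Koo = K[~miss][:, ~miss]
Kmo = K[miss][:, ~miss]
# E[x_miss | x_obs] = K_mo K_oo^{-1} x_obs
x_hat = Kmo @ np.linalg.solve(Koo, x[~miss])

err = np.mean((x_hat - x[miss]) ** 2) / np.var(x)
print(f"relative reconstruction error: {err:.3f}")
```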
Probabilistic Phase Vocoder (Cemgil and Godsill 2005)
(Graphical model: for each band ν = 0 ... W−1, a chain s_0^ν ... s_k^ν ... s_{K−1}^ν with transition matrix A^ν and noise covariance Q^ν, observed through x_0 ... x_k ... x_{K−1})

s_k^ν ∼ N(s_k^ν; A^ν s_{k−1}^ν, Q^ν)

A^ν ∼ N(A^ν; [cos(ω_ν) −sin(ω_ν); sin(ω_ν) cos(ω_ν)], Ψ)
Inference: Structured Variational Bayes
(Factorised approximation: q(A^α), q(Q^α), chain factors Π_k q(s_k^α | s_{k−1}^α) for α ∈ C, and q(x_k))

• Intuitive algorithm:
  – Subtract from the observed signal x the prediction of the frequency bands in ¬α.
Restoration
• Piano
  – Signal with missing samples (37%)
  – Reconstruction, 7.68 dB improvement
  – Original
• Trumpet
  – Signal with missing samples (37%)
  – Reconstruction, 7.10 dB improvement
  – Original
Hierarchical Factorial Models
• Each component models a latent process
• The observations are projections
(Graphical model: for each ν = 1 ... W, changepoint chain r_0^ν ... r_k^ν ... r_K^ν driving state chain θ_0^ν ... θ_k^ν ... θ_K^ν; observations y_k ... y_K)

• Generalises source-filter models
Harmonic model with changepoints
r_k | r_{k−1} ∼ p(r_k | r_{k−1}),    r_k ∈ {0, 1}

θ_k | θ_{k−1}, r_k ∼ [r_k = 0] N(A θ_{k−1}, Q)   (reg)
                   + [r_k = 1] N(0, S)           (new)

y_k | θ_k ∼ N(C θ_k, R)

A = diag(G_ω, G_ω^2, ..., G_ω^H),    G_ω = ρ_k [cos(ω) −sin(ω); sin(ω) cos(ω)]

with damping factor 0 < ρ_k < 1, frame length N, and damped sinusoidal basis matrix C of size N × 2H
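The generative model above is easy to simulate. A minimal sketch, with a single observed sample per frame (C is 1 × 2H here; in the talk C is an N × 2H damped sinusoidal basis) and all numeric settings illustrative:

```python
import numpy as np

# Simulate the harmonic changepoint model: a block-diagonal rotation A acts
# on H harmonic pairs; with probability p_cp the state is reinitialised
# ("new"), otherwise it evolves by the damped rotation ("reg").
rng = np.random.default_rng(2)
H, omega, rho, K = 4, 2 * np.pi * 0.01, 0.999, 200

G = rho * np.array([[np.cos(omega), -np.sin(omega)],
                    [np.sin(omega),  np.cos(omega)]])
A = np.zeros((2 * H, 2 * H))
for h in range(1, H + 1):                      # A = diag(G, G^2, ..., G^H)
    A[2*h-2:2*h, 2*h-2:2*h] = np.linalg.matrix_power(G, h)

p_cp, Q, S, R = 0.02, 1e-6, 1.0, 1e-2          # changepoint prob., variances
C = np.tile([1.0, 0.0], H)                     # read the cosine phase of each partial

theta = rng.normal(0.0, np.sqrt(S), 2 * H)
y, r = np.zeros(K), np.zeros(K, dtype=int)
for k in range(K):
    r[k] = int(rng.random() < p_cp)
    if r[k] == 1:                              # "new": state reinitialised
        theta = rng.normal(0.0, np.sqrt(S), 2 * H)
    else:                                      # "reg": damped harmonic rotation
        theta = A @ theta + rng.normal(0.0, np.sqrt(Q), 2 * H)
    y[k] = C @ theta + rng.normal(0.0, np.sqrt(R))
print("changepoint frames:", np.flatnonzero(r))
```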
Exact Inference in switching state space models is intractable
• In general, exact inference is NP hard
– Conditional Gaussians are not closed under marginalization
⇒ Unlike HMMs or KFMs, summing over r_k does not simplify the filtering density
⇒ The number of Gaussian kernels needed to represent the exact filtering density p(r_k, θ_k | y_{1:k}) increases exponentially
Exact Inference for Changepoint detection
• Exact inference is achievable in polynomial time/space
– Intuition: when a changepoint occurs, the state vector θ is reinitialised
⇒ The number of Gaussian kernels grows only polynomially (see, e.g., Barry and Hartigan 1992, Digalakis et al. 1993, Ó Ruanaidh and Fitzgerald 1996, Gustafsson 2000, Fearnhead 2003, Zoeter and Heskes 2006)

(Example trajectory: r_1 = 1, r_2 = 0, r_3 = 0, r_4 = 1, r_5 = 0, with states θ_0 ... θ_5 and observations y_1 ... y_5)

• The same structure can be exploited for the MMAP problem arg max_{r_{1:k}} p(r_{1:k} | y_{1:k})
⇒ Trajectories r_{1:k}^(i) which are dominated in terms of conditional evidence p(y_{1:k}, r_{1:k}^(i)) can be discarded without destroying optimality
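The polynomial-growth intuition can be demonstrated on a scalar reset model (a deliberately simplified stand-in for the harmonic model; all parameters are illustrative). The exact filtering density is a Gaussian mixture with one component per run length since the last changepoint, so the mixture grows linearly rather than exponentially:

```python
import math
import numpy as np

# Exact filtering for a scalar reset model:
#   r_k ~ Bernoulli(p);  θ_k = a θ_{k−1} + N(0, q) if r_k = 0, else θ_k ~ N(0, S)
#   y_k = θ_k + N(0, R)
# Each mixture component corresponds to one run length (time since reset).

def loglik(y, m, v):                      # log N(y; m, v)
    return -0.5 * (math.log(2 * math.pi * v) + (y - m) ** 2 / v)

def filter_step(components, y, p=0.05, a=0.99, q=0.01, S=1.0, R=0.1):
    new = [(math.log(p), 0.0, S)]         # reset component (run length 0)
    for lw, m, v in components:           # propagate surviving components
        new.append((lw + math.log(1 - p), a * m, a * a * v + q))
    out = []
    for lw, m, v in new:                  # Kalman measurement update per component
        lw += loglik(y, m, v + R)
        gain = v / (v + R)
        out.append((lw, m + gain * (y - m), (1 - gain) * v))
    lmax = max(lw for lw, _, _ in out)    # renormalise log-weights
    lse = lmax + math.log(sum(math.exp(lw - lmax) for lw, _, _ in out))
    return [(lw - lse, m, v) for lw, m, v in out]

rng = np.random.default_rng(3)
y = rng.normal(size=50)
comps = [(0.0, 0.0, 1.0)]
for yk in y:
    comps = filter_step(comps, yk)
print("mixture components after 50 steps:", len(comps))   # → 51
```

Pruning dominated components (as in the MMAP remark above) would keep this count bounded in practice.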
Monophonic model (Cemgil et al. 2006)
• We introduce a pitch label indicator m
• At each time k, the process can be in one of the {“mute”, “sound”} × M states.
(Graphical model: chains r_0 r_1 ... r_T, m_0 m_1 ... m_T, s_0 s_1 ... s_T, with observations y_1 ... y_T)

Monophonic Pitch Tracking
Monophonic Pitch Tracking = Online estimation (filtering) of p(r k , m k |y 1:k ).
• If pitch is constant exact inference is possible
Transcription
• Detecting onsets, offsets and pitch to sample precision (Cemgil et al. 2006, IEEE TASLP)
Tracking Pitch Variations
• Allow m to change with k.
• Intractable; need to resort to approximate inference (Mixture Kalman Filter – Rao-Blackwellized Particle Filter)
Factorial Generative models for Analysis of Polyphonic Audio
(Figure: latent changepoint processes per frequency ν, observed signal x_k)
• Each latent changepoint process ν = 1 . . . W corresponds to a “piano key”.
Single time slice - Bayesian Variable Selection
r_i ∼ C(r_i; π_on, π_off)

s_i | r_i ∼ [r_i = on] N(s_i; 0, Σ) + [r_i ≠ on] δ(s_i)

x | s_{1:W} ∼ N(x; C s_{1:W}, R),    C ≡ [C_1 ... C_i ... C_W]

(Graphical model: indicators r_1 ... r_W, sources s_1 ... s_W, observation x)

• Generalized Linear Model – columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable.
• Sparsity by construction (Olshausen and Millman, Attias, ...)
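For small W the 2^W posterior can be enumerated directly, which makes the intractability argument concrete. A sketch with illustrative isotropic priors (Σ = σ_s² I, R = σ_x² I) and sizes:

```python
import itertools
import numpy as np

# Single-slice Bayesian variable selection: enumerate all 2^W on/off
# configurations r, score each by its marginal likelihood p(x | r) with the
# sources integrated out, and form the posterior over configurations.
rng = np.random.default_rng(4)
W, D = 4, 16                          # number of sources, observation dimension
sigma_s2, sigma_x2, pi_on = 1.0, 0.01, 0.5
C = rng.normal(size=(D, W))           # columns C_i are the basis vectors

r_true = np.array([1, 0, 1, 0])
s_true = r_true * rng.normal(0.0, np.sqrt(sigma_s2), W)
x = C @ s_true + rng.normal(0.0, np.sqrt(sigma_x2), D)

def log_marginal(r):
    # x | r ~ N(0, σ_s² C_r C_r^T + σ_x² I)
    Cr = C[:, np.array(r, dtype=bool)]
    Sigma = sigma_s2 * Cr @ Cr.T + sigma_x2 * np.eye(D)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + x @ np.linalg.solve(Sigma, x))

configs = list(itertools.product([0, 1], repeat=W))
logp = np.array([log_marginal(r)
                 + sum(r) * np.log(pi_on)
                 + (W - sum(r)) * np.log(1 - pi_on) for r in configs])
post = np.exp(logp - logp.max())
post /= post.sum()
print("MAP configuration:", configs[int(post.argmax())])
```

The cost of this enumeration doubles with every added source, which is exactly why the factorial models that follow need approximate inference.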
Factorial Switching State space model
r_{0,ν} ∼ C(r_{0,ν}; π_{0,ν})        θ_{0,ν} ∼ N(θ_{0,ν}; μ_ν, P_ν)

r_{k,ν} | r_{k−1,ν} ∼ C(r_{k,ν}; π_ν(r_{k−1,ν}))                 (Changepoint indicator)

θ_{k,ν} | θ_{k−1,ν} ∼ N(θ_{k,ν}; A_ν(r_k) θ_{k−1,ν}, Q_ν(r_k))   (Latent state)

y_k | θ_{k,1:W} ∼ N(y_k; C_k θ_{k,1:W}, R)                       (Observation)

(Graphical model: for each ν = 1 ... W, chains r_0^ν ... r_k^ν ... r_K^ν and s_0^ν ... s_k^ν ... s_K^ν)
Synthetic Data
Technical Difficulties
• Inference is quite heavy
• Vanilla Kalman filtering methods are not stable – computations with large matrices
  – Need advanced techniques from linear algebra
  – Interesting links to subspace methods
• Hyperparameter learning is necessary
Modelling levels
• Physical - acoustical
• Time domain – state space, dynamical models
• Transform domain – Fourier representations, Generalised Linear model
• Feature Based
Spectrogram
• Basis functions φ_k(t), centered around time-frequency atom k = k(ν, τ) = (Frequency, Time), such as STFT or MDCT.

x(t) = Σ_k s_k φ_k(t)
(Figure: spectrograms, f/Hz against t/sec — Speech and Piano)
Models for time-frequency Energy distributions
• Non-Negative Matrix factorisation (Sha, Saul, Lee 2002; Smaragdis, Brown 2003; Virtanen 2003; Abdallah, Plumbley 2004)

X_{ν,τ} = Σ_j W_{ν,j} S_{j,τ}

Spectrogram = Spectral Templates × Excitations

– ... however, spectrograms are not additive (a² + b² ≠ (a + b)²)
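The factorisation X ≈ W S can be fitted with the standard KL multiplicative updates of Lee and Seung (this is the generic NMF recipe, not the talk's probabilistic formulation; data and sizes below are synthetic):

```python
import numpy as np

# KL-divergence NMF via multiplicative updates, factorising a nonnegative
# "spectrogram" X into spectral templates W and excitations S.
rng = np.random.default_rng(5)
F, T, J = 30, 40, 3                       # freq bins, frames, templates
W_true = rng.gamma(2.0, 1.0, (F, J))
S_true = rng.gamma(2.0, 1.0, (J, T))
X = W_true @ S_true                       # exactly rank-J synthetic data

W = rng.gamma(1.0, 1.0, (F, J))
S = rng.gamma(1.0, 1.0, (J, T))
eps = 1e-12
ones = np.ones((F, T))
for _ in range(200):
    # multiplicative updates preserve nonnegativity by construction
    W *= ((X / (W @ S + eps)) @ S.T) / (ones @ S.T + eps)
    S *= (W.T @ (X / (W @ S + eps))) / (W.T @ ones + eps)

# generalised KL divergence D(X || WS)
kl = np.sum(X * np.log(X / (W @ S + eps)) - X + W @ S)
print(f"KL divergence after 200 updates: {kl:.4f}")
```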
Models for time-frequency Energy distributions
• Mask models (Roweis 2001; Reyes-Gomez, Jojic, Ellis 2005; ...)

X_{ν,τ} = [r_{ν,τ} = 0] S_{ν,τ}(0) + [r_{ν,τ} = 1] S_{ν,τ}(1)

Spectrogram = Mask × Source 0 + (1 − Mask) × Source 1
Prior structures on time-frequency Energy distributions
• Main Idea: Spectrogram is a point estimate of the energy at a time-frequency atom k(ν, τ ).
• We place a suitable prior on the variance of transform coefficients s k and tie the prior variances across harmonically and temporally related time-frequency atoms
p(s|v) p(v) = ( Π_k p(s_k | v_k) ) p(v)
One channel source separation, Gaussian source model
(Graphical model: variances v_{k,1} ... v_{k,N}, sources s_{k,1} ... s_{k,N}, observation x_k, for k = 1 ... K)

s_{k,n} | v_{k,n} ∼ N(s_{k,n}; 0, v_{k,n})
x_k | s_{k,1:N} = Σ_{n=1}^{N} s_{k,n}

• Straightforward application of Bayes’ theorem yields

p(s_{k,n} | v_{k,1:N}, x_k) = N(s_{k,n}; κ_{k,n} x_k, v_{k,n}(1 − κ_{k,n})),    κ_{k,n} = v_{k,n} / Σ_{n′} v_{k,n′}    (Responsibilities)
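The posterior above has a simple reading: each source receives a share of the observation proportional to its responsibility. A minimal numerical sketch (sizes and variances are illustrative):

```python
import numpy as np

# Gaussian one-channel separation: given prior variances v_{k,n} and the
# observed sum x_k = Σ_n s_{k,n}, compute the posterior mean and variance of
# each source. The responsibilities κ_{k,n} sum to one per frame, so the
# posterior means sum back to the observation exactly.
rng = np.random.default_rng(6)
N, K = 3, 5
v = rng.gamma(2.0, 1.0, (K, N))            # prior variances v_{k,n}
s = rng.normal(0.0, np.sqrt(v))            # sources s_{k,n} ~ N(0, v_{k,n})
x = s.sum(axis=1)                          # observed mixtures x_k

kappa = v / v.sum(axis=1, keepdims=True)   # responsibilities κ_{k,n}
post_mean = kappa * x[:, None]             # E[s_{k,n} | x_k] = κ_{k,n} x_k
post_var = v * (1.0 - kappa)               # Var[s_{k,n} | x_k]

assert np.allclose(post_mean.sum(axis=1), x)
```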
One channel source separation, Poisson source model
(Graphical model as above, with Poisson sources)

s_{k,n} | v_{k,n} ∼ PO(s_{k,n}; v_{k,n})
x_k | s_{k,1:N} = Σ_{n=1}^{N} s_{k,n}

• This is the generative model for NMF when we write

v_{k(ν,τ),n} = t_{ν,n} × e_{τ,n}    (Template × Excitation)
Gamma G(x; a, b) and Inverse Gamma IG(x; a, b)
(Figure: densities p(x) — Gamma with a ∈ {0.9, 1, 1.3, 2}, b = 1; Inverse Gamma with (a, b) ∈ {(1, 1), (1, 0.5), (2, 1)})

G(x; a, z) ≡ exp((a − 1) log x − z^{−1} x + a log z^{−1} − log Γ(a))

IG(x; a, z) ≡ exp((a + 1) log x^{−1} − z^{−1} x^{−1} + a log z^{−1} − log Γ(a))
Gamma Chains
We define an inverse Gamma-Markov chain for k = 1 . . . K as follows
v_k | z_k ∼ IG(v_k; a, z_k/a)

z_{k+1} | v_k ∼ IG(z_{k+1}; a_z, v_k/a_z)

(Graphical model: alternating chain z_1, v_1, z_2, v_2, ... with couplings a and a_z)

• Variance variables v are priors for sources
• Auxiliary variables z are needed for conjugacy and positive correlation
• Shape parameters a and a_z describe coupling strength and drift of the chain
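Drawing a realisation of the chain is straightforward. Note one parameterisation assumption: reading off the IG density two slides back, IG(x; a, z) corresponds to a standard inverse gamma with shape a and scale 1/z, so IG(a, z_k/a) can be sampled as the reciprocal of a Gamma(a, z_k/a) draw:

```python
import numpy as np

# Sample the inverse Gamma-Markov chain: v_k | z_k ~ IG(a, z_k/a),
# z_{k+1} | v_k ~ IG(a_z, v_k/a_z). Larger shape parameters give a more
# slowly drifting chain (compare the "typical draws" slides).
rng = np.random.default_rng(7)

def sample_chain(a, a_z, K=1000, z0=1.0):
    v = np.empty(K)
    z = z0
    for k in range(K):
        v[k] = 1.0 / rng.gamma(a, z / a)        # v_k | z_k
        z = 1.0 / rng.gamma(a_z, v[k] / a_z)    # z_{k+1} | v_k
    return v

for a_z in (4, 10, 40):
    v = sample_chain(a=10, a_z=a_z)
    print(f"a_z = {a_z:2d}: std of increments of log v = "
          f"{np.diff(np.log(v)).std():.2f}")
```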
Gamma Chains, typical draws
(Figure: typical draws of log v_k over k = 1 ... 1000 for a = 10 and a_z ∈ {10, 4, 40})
Gamma Chains with changepoints, typical draws
(Figure: typical draws of log v_k with changepoints, for a = 10 and a_z ∈ {10, 4, 40})
Gamma Chains
• The joint can be written as a product of singleton and pairwise potentials of the form

ψ_{k,k} = exp(−a z_k^{−1} v_k^{−1})    (Pairwise)

φ_{z_k} = exp((a_z + a + 1) log z_k^{−1})    (Singletons)
Gamma Fields
• The joint can be written as product of singleton and pairwise potentials
ψ_{i,j} = exp(−a_{i,j} ξ_i^{−1} ξ_j^{−1})    (Pairwise)

φ_i = exp((Σ_j a_{i,j} + 1) log ξ_i^{−1})    (Singletons)
Possible Model Topologies
Approximate Inference
• Stochastic
– Markov Chain Monte Carlo, Gibbs sampler – Sequential Monte Carlo, Particle Filtering
• Deterministic
– Variational Bayes
In all these, conjugacy helps.
VB or Gibbs

(Factor graph: ψ_{0,1}, z_1, ψ_{1,1}, v_1, ψ_{1,2}, z_2, ψ_{2,2}, v_2, ..., with likelihoods p(y_1|v_1), p(y_2|v_2))

• VB

q^(τ)(v_k) ← exp(φ_k + ⟨log ψ_{k,k} + log ψ_{k,k+1}⟩_{q^(τ)(z_k) q^(τ)(z_{k+1})})

• Gibbs

v_k^(τ) ∼ p(v_k | z_k, z_{k+1}) ∝ ψ_{k,k}(z_k^(τ)) ψ_{k,k+1}(z_{k+1}^(τ))
VB or Gibbs

(Factor graph as above)

• VB

q^(τ)(z_k) ← exp(φ_k + ⟨log ψ_{k,k−1} + log ψ_{k,k}⟩_{q^(τ)(v_{k−1}) q^(τ)(v_k)})

• Gibbs

z_k^(τ) ∼ p(z_k | v_{k−1}, v_k) ∝ ψ_{k,k−1}(v_{k−1}^(τ)) ψ_{k,k}(v_k^(τ))
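Conjugacy makes the Gibbs sweep particularly simple: each full conditional is again inverse gamma. The shape/scale algebra below is derived from the chain densities on the "Gamma Chains" slide (so treat it as an assumption), and the data term is omitted to keep the sketch short:

```python
import numpy as np

# Gibbs sweep over the inverse Gamma-Markov chain prior. By conjugacy:
#   v_k | z_k, z_{k+1} ~ IG(a + a_z,  a/z_k + a_z/z_{k+1})   (shape, scale)
#   z_k | v_{k-1}, v_k ~ IG(a_z + a,  a_z/v_{k-1} + a/v_k)
# where IG(shape, scale) is the standard inverse gamma.
rng = np.random.default_rng(8)
a, a_z, K = 10.0, 10.0, 100

def inv_gamma(shape, scale):
    return 1.0 / rng.gamma(shape, 1.0 / scale)

v = np.ones(K)
z = np.ones(K)          # z[0] (= z_1) held fixed as the chain's initial condition
for sweep in range(50):
    for k in range(K):
        shape, scale = a, a / z[k]          # prior term from z_k
        if k + 1 < K:                       # coupling to z_{k+1}, when stored
            shape, scale = shape + a_z, scale + a_z / z[k + 1]
        v[k] = inv_gamma(shape, scale)
    for k in range(1, K):
        z[k] = inv_gamma(a_z + a, a_z / v[k - 1] + a / v[k])
print("log v range after 50 sweeps:",
      np.log(v).min().round(2), np.log(v).max().round(2))
```

With a Gaussian observation term p(y_k | v_k) attached, the v-conditionals pick up an extra shape 1/2 and scale y_k²/2, which is exactly why the auxiliary z variables were introduced.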
Denoising - Speech (VB)
• Additive Gaussian noise with unknown variance
• Inference : Variational Bayes
(Figure: spectrograms of noisy input X, original Xorg, and reconstructions — Xh SNR 19.98, Xv SNR 20.79, Xb SNR 19.68, Xg SNR 19.97)
Denoising – Music
(Figure: spectrograms — Original, Noisy, PF SNR 8.53, Gibbs SNR 8.66, VB SNR 2.08)

“Tristram (Matt Uelmen)” + ∼0 dB white noise
Single Channel Source Separation (with Onur Dikmen)
• Source 1: Horizontal : Tie across time : harmonic continuity
• Source 2: Vertical : Tie across frequency : transients, percussive sounds
(Figure: Xorg, S_hor, S_ver over frequency bin ν)
Single Channel Source Separation with IGMCs
E-guitar “Matte Kudasai (King Crimson)” + Drums “Territory (Sepultura)” = Mix
                      ŝ_1                       ŝ_2
              SDR     SIR     SAR       SDR     SIR     SAR
VB           -4.74   -3.28    5.67     -1.58   15.46   -1.37
Gibbs        -4.5    -2.62    4.57      1.05   12.46    1.61
Gibbs EM     -4.23   -2.42    4.82      1.34   13.13    1.85
Pre-trained  -4.04   -3.15    8.13      3.56   11.44    4.64
Oracle        6.14   17.16    6.58     12.66   19.95   13.6
• Oracle: We use the square of the source coefficient as the latent variance estimate
• Pre-trained: We use the best coupling parameters a_z and a, trained on isolated sources
Single Channel Source Separation with IGMCs
“Vandringar I Vilsenhet (Anglagard)” + “Moby Dick (Led Zeppelin)” = Mix
                      ŝ_1                       ŝ_2
              SDR     SIR     SAR       SDR     SIR     SAR
VB           -7.8    -6.22    4.53     -2.35   18.4    -2.25
Gibbs        -8.46   -7.53    6.93     -4.04   14.59   -3.83
Gibbs EM     -7.74   -6.19    4.62     -1.14   16.62   -0.97
Pre-trained  -6.4    -5.39    6.95      3.8    16.39    4.14
Oracle       12.1    32.9    12.14     21.13   33.89   21.37
Harmonic-Transient Decomposition
(Figure: X_org, S_hor, S_ver over time τ and frequency bin ν — Original, Horizontal, Vertical decompositions)
Chord Detection - Signal model (with Paul Peeling)
(Figure: log σ_ν spectral template for the chord model)
Chord Detection
(Figure: MDCT of piano chord {41, 48, 51, 56} over time τ/s and frequency/Hz; log Σ_ν v_{ν,j,τ} over MIDI note j)
Multichannel Source Separation
• Hierarchical Prior Model (Fevotte and Godsill 2005, Cemgil et al. 2006)
λ_n ∼ G(λ_n; a_λ, b_λ),    n = 1 ... N

v_{k,n} ∼ IG(v_{k,n}; ν/2, 2/(ν λ_n))

s_{k,n} ∼ N(s_{k,n}; 0, v_{k,n})

x_{k,m} ∼ N(x_{k,m}; a_m^⊤ s_{k,1:N}, r_m),    m = 1 ... M, k = 1 ... K
Equivalent Gamma MRF
• A tree for each source
• λ n can be interpreted as the overall “volume” of source n
Source Separation
(Figure: spectrograms of the three sources, f/Hz against t/sec — Speech, Piano, Guitar)
Reconstructions
(Figure: reconstructed spectrograms, f/Hz against t/sec — Speech, Piano, Guitar)
Multimodality
Annealing, Bridging, Overrelaxation, Tempering

(Figure: MCMC traces of a, λ and r over 2000 epochs)
Tempo tracking and score performance matching
• Given expressive music data (onsets/detections/spectral features/...)
  – Determine the position of a performance on a score
  – Determine where a human listener would clap her hands
  – Create a quantized/human readable score
  – ...
• Online-Realtime or Offline-Batch
• All of these problems can be mapped to inference problems in a HMM
Bar position Pointer (Whiteley, Cemgil, Godsill 2006)
(Figure: state lattice — tempo level n_k ∈ {1, 2, 3} against bar position m_k ∈ {1 ... 8}, shown for 3/4 and 4/4 time)

• Each dot denotes a state x = (m, n) (Score Position – Tempo level)
• Directed arcs denote state transitions with positive probability
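A transition kernel in the spirit of this slide can be assembled directly: the pointer advances by its tempo level each frame (modulo the bar length) while the tempo level performs a lazy random walk. Sizes and probabilities below are illustrative, not the talk's:

```python
import numpy as np

# Bar-position/tempo pointer HMM sketch: state x = (m, n) with bar position
# m in {0..M-1} and tempo level n in {1..N}.
M, N = 8, 3
p_stay = 0.9                                  # tempo-level persistence

def idx(m, n):                                # flatten (m, n) -> state index
    return m * N + (n - 1)

P = np.zeros((M * N, M * N))
for m in range(M):
    for n in range(1, N + 1):
        m2 = (m + n) % M                      # advance pointer by tempo n
        for n2 in (n - 1, n, n + 1):          # lazy random walk on tempo
            if not 1 <= n2 <= N:
                continue
            p = p_stay if n2 == n else (1 - p_stay) / 2
            P[idx(m, n), idx(m2, n2)] += p

P /= P.sum(axis=1, keepdims=True)             # renormalise at tempo boundaries
assert np.allclose(P.sum(axis=1), 1.0)
```

Combined with the Poisson observation model a few slides on, standard HMM filtering over this matrix yields the p(x_k | y_{1:k}) densities shown next.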
Bar position Pointer - transition model
(Figure: transition kernel p(x_2 | x_1) over bar position and tempo level)
Bar position Pointer - filtering, k = 1 ... 10

(Figure sequence: filtering densities p(x_1), p(x_2), p(x_3), p(x_4), p(x_5 | y_{1:5}) with y_5 = 0, and p(x_10), over bar position and tempo level)
Bar position Pointer - observation model (Poisson)
• Observation model p(y_k | x_k): Poisson intensity µ_k as a function of bar position m_k

(Figure: intensity profiles µ_k for a triplet rhythm and a duplet rhythm)
Tempo, Rhythm, Meter analysis
Bar Pointer Model (Whiteley, Cemgil, Godsill 2006)
(Graphical model: chains n_0 ... n_3, θ_0 ... θ_3, m_0 ... m_3, r_0 ... r_3, λ_1 ... λ_3, observations y_1 ... y_3)
Filtering
(Figure: observed data y_k; filtering densities log p(m_k | y_{1:k}), log p(n_k | y_{1:k}) in quarter notes per minute, and p(r_k | y_{1:k}) over triplets/duplets, against frame index k)
Smoothing
(Figure: as above, with smoothed densities log p(m_k | y_{1:K}), log p(n_k | y_{1:K}), and p(r_k | y_{1:K}))
Time Signature
(Figure: observed waveform; smoothed densities log p(m_k | z_{1:K}), log p(n_k | z_{1:K}), and time-signature posterior p(θ_k | z_{1:K}) over 4/4 vs 3/4, against frame index k)
Score-Performance matching (ISMIR 2007)
• Given a musical score, associate note events with the audio
Score-Performance matching - Graphical Model
(Graphical model: for ν = 1 ... W, score-position chain t_1 ... t_K, indicators r_1 ... r_K, gains λ_1 ... λ_K, variances v_{ν,1} ... v_{ν,K} and coefficients s_{ν,1} ... s_{ν,K})

v_{ν,τ} ∼ IG(v_{ν,τ}; a, 1/(a λ σ_ν(r_τ)))
Score-Performance matching - Signal model
(Figure: log σ_ν spectral template per score event)
Score-Performance matching
(Figure: spectrogram data over time/s and frequency/Hz; MIDI data, MIDI note against score position)
Online (filtering) or Offline (smoothing) processing is possible
Transcription
(Figure: log p(r_τ | s_τ), MIDI note number against time/s)