Factorisation based models for analysis of musical audio

(1)

Variations on a theme:

Factorisation based models for analysis of musical audio

A. Taylan Cemgil

Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey

17 Dec 2011, NIPS Workshops, Granada

(2)

Acknowledgements

• Umut Simsekli, Kenan Yılmaz, Orhan Sönmez, Barıs Kurt, Boğaziçi

• Cumhur Erkut, Antti Jylhä, Aalto (Helsinki)

• Onur Dikmen, Cédric Févotte (CNRS, Telecom ParisTech)

• Tuomas Virtanen (Tampere)

• Evrim Acar (Copenhagen)

• Funding for our work

– TUBITAK, BAP, CNRS

(3)

Outline

• Introduction, Motivations

• Matrix Factorisation

• Tensors

• Probabilistic Latent Tensor Factorisation

• Example models

• Inference framework

• Coupled Tensor Factorisations

• Applications

• Results and Conclusions

(4)

Statistical Approach

• Machine Listening ⇔ inverse synthesis via Bayesian inference

\[ p(\text{Structure} \mid \text{Audio}) \propto p(\text{Audio} \mid \text{Structure})\, p(\text{Structure}) \]

– Hierarchical signal models to incorporate prior knowledge

– Consistent framework for developing inference algorithms

• Contrast to traditional/procedural approaches, where there is no clear distinction between “what” and “how”

(5)

Superposition

• Signals from sound sources are mixed

– Denoising

– Separation

– Dereverberation

• Feature extraction is hard:

\[ \mathrm{Feature}\Big(\sum_i s_i\Big) \neq \sum_i \mathrm{Feature}(s_i) \]

piano + piccolo + cymbals

(6)

Polyphonic Music Transcription

• from sound ...

[Figure: spectrogram of the recording, time t/sec against frequency f/Hz]

• ... to score

(7)

Computational issues

• Parameter Estimation

Which pitch, rhythm, tempo, meter, time signature ... ?

• Model Selection

How many notes, onsets, sections ... ?

• Online/Offline inference

(8)

Signal Models for Audio

• Time domain – state space, dynamical models

– Conditional Linear Dynamical Systems, Gaussian processes (e.g. AR, ARMA), switching state space models

– Flexible, physically realistic, analysis down to sample precision

• Transform domain – Fourier representations, Generalised Linear model

– Models on (orthogonal) transform coefficients, energy compaction

– Practical, can make use of fast transforms (FFT, MDCT, ...)

• Feature based

– MFCCs, Chroma, ...

(9)

Matrix Factorisation

• An Inverse problem: estimate Z₁, Z₂ given X.

\[ X \approx Z_1 Z_2, \qquad X(\nu,\tau) \approx \sum_i Z_1(\nu,i)\, Z_2(i,\tau) \]

[Figure: block diagram of the matrix product X ≈ X̂ = Z₁Z₂]

(10)

Matrix Factorisation


• Many well known algorithms can be cast as matrix factorisation problems

– Clustering: Z₁ is arbitrary, columns of Z₂ are unit vectors

– NMF: Z₁, Z₂ are nonnegative (Paatero and Tapper, 1994; Lee and Seung 1999, 2000)

– PCA, Latent Semantic Indexing, Latent Dirichlet Allocation ...

• Minimise a suitable error function

\[ (Z_1, Z_2)^{*} = \arg\min_{Z_1, Z_2} D(X \,\|\, Z_1 Z_2) \]
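As an illustration of the clustering case, the sketch below writes Lloyd's k-means as a factorisation X ≈ Z₁Z₂ under a squared Euclidean error, with the columns of Z₁ as centroids and each column of Z₂ a unit (one-hot) vector. The function name `kmeans_as_mf` and its parameters are hypothetical, chosen only for this example.

```python
import numpy as np

def kmeans_as_mf(X, K, n_iter=20, seed=0):
    """Lloyd's algorithm written as X ~ Z1 @ Z2 (hypothetical helper).

    X  : (F, T) data, one data point per column
    Z1 : (F, K) centroids (arbitrary real values)
    Z2 : (K, T) assignments, every column is a unit vector
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    # Initialise the centroids with K distinct data points.
    Z1 = X[:, rng.choice(T, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance between every centroid and every data point: (K, T).
        d2 = ((X[:, None, :] - Z1[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        Z2 = np.zeros((K, T))
        Z2[labels, np.arange(T)] = 1.0
        # Least-squares update of Z1 given Z2: the mean of each cluster.
        counts = Z2.sum(axis=1)
        nonempty = counts > 0
        Z1[:, nonempty] = (X @ Z2.T)[:, nonempty] / counts[nonempty]
    return Z1, Z2
```

Replacing the one-hot constraint on Z₂ by nonnegativity of both factors gives NMF; only the constraint set and the error function D change.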

(11)

NMF in Acoustic and Music modeling

• We seek a factorisation of the spectrogram (Smaragdis and Brown 2003; Sha, Saul, Lee; Virtanen; Abdallah and Plumbley; Schmidt and Olsson; Fevotte)

X ≈ Z₁ × Z₂

Spectrogram ≈ Templates × Excitations

[Figure: spectrogram decomposed into spectral templates Z₁ (frequency axis) and excitations Z₂ (intensity over time)]

(12)

Underlying Generative Model

• Rank one

\[ X(\nu,\tau) \sim p\big(\cdot\,; Z_1(\nu)\, Z_2(\tau)\big) \]

• Higher rank (composite structure)

\[ S(\nu,\tau,i) \sim p\big(\cdot\,; Z_1(\nu,i)\, Z_2(\tau,i)\big), \qquad X(\nu,\tau) = \sum_i S(\nu,\tau,i) \]
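A minimal sketch of sampling from the composite model, assuming that p is a Poisson distribution (the slide leaves p generic). With this choice the superposition X is again Poisson with intensity Σ_i Z₁(ν, i)Z₂(τ, i), which is the model behind KL-based NMF. The function name `sample_composite` and the Gamma-distributed factors are illustrative assumptions.

```python
import numpy as np

def sample_composite(Z1, Z2, rng):
    """Draw latent sources S(nu, tau, i) and their superposition X(nu, tau).

    Z1 : (V, I) templates, Z2 : (T, I) excitations, as indexed on the slide.
    Assumes p(.; lambda) is Poisson with mean lambda.
    """
    rates = np.einsum("vi,ti->vti", Z1, Z2)  # Z1(nu, i) * Z2(tau, i)
    S = rng.poisson(rates)                   # one nonnegative integer per (nu, tau, i)
    X = S.sum(axis=2)                        # X(nu, tau) = sum_i S(nu, tau, i)
    return S, X

rng = np.random.default_rng(0)
Z1 = rng.gamma(2.0, 1.0, size=(4, 3))  # arbitrary positive factors for the demo
Z2 = rng.gamma(2.0, 1.0, size=(5, 3))
S, X = sample_composite(Z1, Z2, rng)
```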

(13)

Music Analysis

[Figure: log |MDCT| coefficients (ν/frequency bin against τ/frame) ≈ estimated scale parameters of the template prior (ν/frequency index against i/key index) × excitations (pitch against τ/frame index)]

One needs to incorporate a lot of prior knowledge to arrive at the “desired factorisation”

• Provide Priors (Spectral continuity, Gamma chains, HMM’s)

• Provide Side Information

(14)

Side Information for guiding the factorisation

Observed tensors:

– X1(f, t): Audio Spectrum

– X2(f, p): Isolated Notes

Hidden tensors:

– D(f, i): Spectral Templates

– E(i, t): Excitations of X1

– F(i, p): Excitations of X2

[X₁, X₂] ≈ D[E, F ]

Many other extensions for audio (Smaragdis, Raj; Morup, Schmidt; Vincent; Virtanen; Fevotte; Coyle, FitzGerald; Bertin, Liutkus, Badeau, Richard)

(15)

Factorisation based audio models

• Need highly structured (and complicated) models

• A unifying and practical framework inspired by graphical models:

– Probabilistic Latent Tensor Factorisation

– Generalised Coupled Tensor Factorisation

(16)

Main Research Questions

• Understand several popular models in Audio and Music processing, invent new ones

Incorporation of prior knowledge via hierarchical modeling

• A general framework for derivation of decomposition algorithms

Inspiration from probabilistic graphical models, factor graphs, message passing

• Understand the statistical interpretation of Matrix/Tensor factorisation.

Certain Error criteria lead to hierarchical probabilistic models

• Model Selection, sparsity

Bayesian Model Selection via Variational free energy minimisation (VB) or MCMC (not here)

(17)

Tensors

• Tensor ≡ Multidimensional Array (X(i, j, k, . . .))

Disclaimer: not a tensor field

(18)

Tensor Factorisations

Kolda and Bader; Cichocki, Zdunek, Amari

• PARAFAC (parallel factors, Harshman 1970), CANDECOMP (canonical decomposition, Carroll and Chang 1970)

\[ X(\nu,\xi,\tau) \approx \sum_i T(\nu,i)\, V(\tau,i)\, W(\xi,i) \]

• Three-mode FA, Higher order SVD (Tucker, 1966; De Lathauwer et al., 2000)

\[ X(\nu,\xi,\tau) \approx \sum_{i,j,k} G(i,j,k)\, T(\nu,i)\, V(\tau,j)\, W(\xi,k) \]

• N-mode generalisation of Tucker and dozens of variations

\[ X(\nu_1,\nu_2,\dots,\nu_N) \approx \sum_{i_1,i_2,\dots,i_N} G(i_1,i_2,\dots,i_N)\, V_1(\nu_1,i_1)\, V_2(\nu_2,i_2) \cdots V_N(\nu_N,i_N) \]

In fact, tensor factorisations are quite useful even if the data are not multiway.
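The two three-way models map directly onto `numpy.einsum`. The sketch below evaluates both and illustrates that PARAFAC is the special case of Tucker with a superdiagonal core; the function names and the array sizes are assumptions made for the example.

```python
import numpy as np

def parafac(T, V, W):
    """X(nu, xi, tau) = sum_i T(nu, i) V(tau, i) W(xi, i)."""
    return np.einsum("ni,ti,xi->nxt", T, V, W)

def tucker(G, T, V, W):
    """X(nu, xi, tau) = sum_{i,j,k} G(i, j, k) T(nu, i) V(tau, j) W(xi, k)."""
    return np.einsum("ijk,ni,tj,xk->nxt", G, T, V, W)

rng = np.random.default_rng(0)
I = 3
T = rng.random((4, I))   # nu  x i
V = rng.random((6, I))   # tau x i
W = rng.random((5, I))   # xi  x i

# Superdiagonal core: G(i, j, k) = 1 if i == j == k, else 0.
G = np.zeros((I, I, I))
G[np.arange(I), np.arange(I), np.arange(I)] = 1.0

X_parafac = parafac(T, V, W)
X_tucker = tucker(G, T, V, W)
```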

(19)

Example 1: Deconvolution as (latent) tensor factorisation

• X: Observed Signal

• Z₁: Original Signal

• Z₂: Filter impulse response

\[ X(i) \approx \hat{X}(i) = \sum_t Z_1(t)\, Z_2(\overbrace{i - t}^{d}) \]

\[ = \sum_t \sum_d Z_1(t)\, Z_2(d)\, \delta(d - i + t) \]

\[ = \sum_{t,d} Z_1(t)\, Z_2(d)\, Z_3(d, i, t) \]
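The identity can be checked numerically: with a fixed indicator tensor Z₃(d, i, t) = δ(d − i + t), the triple contraction reproduces an ordinary convolution. The helper name `shift_tensor` and the signal lengths are assumptions for this sketch.

```python
import numpy as np

def shift_tensor(D, I, T):
    """Z3(d, i, t) = 1 if d == i - t, else 0 (a fixed, non-learned factor)."""
    d = np.arange(D)[:, None, None]
    i = np.arange(I)[None, :, None]
    t = np.arange(T)[None, None, :]
    return (d == i - t).astype(float)

rng = np.random.default_rng(0)
z1 = rng.random(6)                     # original signal Z1(t)
z2 = rng.random(3)                     # filter impulse response Z2(d)
Z3 = shift_tensor(len(z2), len(z1) + len(z2) - 1, len(z1))

x_tensor = np.einsum("t,d,dit->i", z1, z2, Z3)  # sum_{t,d} Z1 Z2 Z3
x_conv = np.convolve(z1, z2)                    # reference: full convolution
```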

(20)

Example 1: Hierarchical modeling (1)

Assume that the original signal can be confined into a subspace

• U : Original Signal

• Z₂: Filter impulse response

\[ X(i) \approx \hat{X}(i) = \sum_{t,d} \overbrace{U(t)}^{\sum_r Z_4(t,r) Z_1(r)}\, Z_2(d)\, Z_3(d, i, t) \]

\[ = \sum_t \sum_d \sum_r Z_1(r)\, Z_2(d)\, Z_3(d, i, t)\, Z_4(t, r) \]

(21)

Example 1: Hierarchical modeling (1)

\[ U(t) = \sum_r Z_4(t, r)\, Z_1(r) \]

[Figure: the signal U(t) shown as the product of the basis matrix Z₄(t, r) and the coefficient vector Z₁(r)]

(22)

Example 1: Hierarchical modeling (2)

Assume the filter can also be confined into a subspace

• Z₁: Expansion coefficients of the original signal

• H: Filter Impulse response

\[ X(i) \approx \hat{X}(i) = \sum_t \sum_d \sum_r Z_1(r)\, \overbrace{H(d)}^{\sum_q Z_5(d,q) Z_2(q)}\, Z_3(d, i, t)\, Z_4(t, r) \]

\[ = \sum_t \sum_d \sum_r \sum_q Z_1(r)\, Z_2(q)\, Z_3(d, i, t)\, Z_4(t, r)\, Z_5(d, q) \]

This process may continue until we run out of letters in the alphabet

(23)

Deconvolution

[Figure: top row: synthetically blurred image, original image, original filter; bottom row: z1 and z2 convolved, estimated z1, estimated z2]

(24)

Example 2: Nonnegative Matrix deconvolution (NMFD)

(Smaragdis, 2004)

\[ \hat{X}(f, t) = \sum_{\tau,i} W(f, i, \tau)\, H(i, \overbrace{t - \tau}^{d}) \]

\[ = \sum_{\tau,i,d} W(f, i, \tau)\, H(i, d)\, Z(d, t, \tau) \]

X: Spectrogram, W : Basis, H: Weights

(25)

Excitation Filter Models

• (NMF2D) Nonnegative Matrix 2-D deconvolution (Schmidt and Mørup 2008)

\[ \hat{X}(f, t) = \sum_{i,\phi,\tau,\nu,d} D(\nu, \tau, i)\, E(\phi, d, i)\, Z_1(\nu, f, \phi)\, Z_2(d, t, \tau) \]

• (SF-SSNTF) Source Filter Sinusoidal Shifted Nonnegative Tensor Factorisation (FitzGerald, Cranitch, Coyle 2008)

\[ \hat{X}(c, t, f) = \sum_{i,p,r,\tau,d} G(c, i)\, H(f, i)\, N(f, p, r)\, W(p, i, \tau)\, E(r, i, d)\, Z(d, t, \tau) \]

– mimics physically inspired source-filter models in the spectral domain

– harmonic excitation multiplied by spectral envelope

• Excitation filter model (Klapuri and Virtanen)

\[ \hat{X}(t, k) = \sum_{i,j,n} G(n, t)\, E(n, k)\, C(i, j)\, A(j, k)\, Z(i, n, t) \]

(26)

Probabilistic Latent Tensor Factorisation

\[ X(v_0) \approx \hat{X}(v_0) = \sum_{\bar{v}_0} \prod_{\alpha} Z_\alpha(v_\alpha) \]

\[ Z^{*}_{1:|\alpha|} = \arg\min_{Z} D\big(X(v_0) \,\|\, \hat{X}(v_0)\big) \]

v ∈ V: all model indices

v₀ ∈ V₀: indices of observation X

v_α ∈ V_α: indices of factor Z_α, for α = 1 . . . |α|

v̄₀ ∈ V̄₀ = V \ V₀

v̄_α ∈ V̄_α = V \ V_α, for α = 1 . . . |α|

(27)

Example: NMF

Nonnegative Matrix factorisation

\[ X(v_0) \approx \hat{X}(v_0) = \sum_{\bar{v}_0} \prod_{\alpha} Z_\alpha(v_\alpha) \]

\[ X(f, t) \approx \hat{X}(f, t) = \sum_i Z_1(f, i)\, Z_2(i, t) \]

v ∈ V = {f, t, i}: all model indices

v₀ ∈ V₀ = {f, t}: indices of observation X

v₁ ∈ V₁ = {f, i}: indices of factor Z₁

v₂ ∈ V₂ = {i, t}: indices of factor Z₂

v̄₀ ∈ V̄₀ = {i}, v̄₁ ∈ V̄₁ = {t}, v̄₂ ∈ V̄₂ = {f}
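The index-set notation lends itself to a mechanical treatment. The sketch below derives the latent and complement sets for the NMF example and builds an `numpy.einsum` subscript string for evaluating X̂; the helper names `index_sets` and `model_spec` are hypothetical and not part of any toolbox mentioned in these slides.

```python
import numpy as np

def index_sets(V, V0, factors):
    """Return the latent set V \\ V0 and the complement V \\ V_alpha of each factor."""
    V = set(V)
    latent = V - set(V0)
    complements = {name: V - set(idx) for name, idx in factors.items()}
    return latent, complements

def model_spec(V0, factors):
    """Build an einsum subscript string such as 'fi,it->ft' from the index sets."""
    return ",".join("".join(idx) for idx in factors.values()) + "->" + "".join(V0)

V = ("f", "t", "i")
V0 = ("f", "t")
factors = {"Z1": ("f", "i"), "Z2": ("i", "t")}

latent, complements = index_sets(V, V0, factors)
spec = model_spec(V0, factors)

# Evaluate the model: Xhat(f, t) = sum_i Z1(f, i) Z2(i, t).
rng = np.random.default_rng(0)
Z1 = rng.random((4, 2))
Z2 = rng.random((2, 5))
Xhat = np.einsum(spec, Z1, Z2)
```

The same two helpers apply unchanged to models with more factors, as long as each index is a single letter.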

(28)

Factor Graph Representation for TF Models

• Multivariate Probability densities have a Factor graph representation

• Tensor Factorisation models can be represented similarly

– Clique Potentials ↔ Factors

– Random variables ↔ Indices

• Example: NMF

V = {f, i, t}, V₀ = {f, t}, V₁ = {f, i}, V₂ = {i, t}

\[ X(f, t) \approx \sum_i Z_1(f, i)\, Z_2(i, t) \]

[Figure: factor graph with index nodes f, i, t and factor nodes Z₁, Z₂]

(29)

Factor Graph Representation for TF Models

Model      V                            Observed V₀   Latent V̄₀         Factors
NMF        {f, t, i}                    {f, t}        {i}                {f, i} {i, t}
NMFD       {f, t, τ, i, d}              {f, t}        {τ, i, d}          {f, τ, i} {d, i} {d, t, τ}
NMF2D      {f, t, ν, τ, i, φ, d}        {f, t}        {ν, τ, i, φ, d}    {ν, τ, i} {φ, d, i} {ν, f, φ} {d, t, τ}
SF-SSNTF   {c, t, f, i, p, r, τ, d}     {c, t, f}     {i, p, r, τ, d}    {c, i} {f, i} {f, p, r} {p, i, τ} {r, i, d} {d, t, τ}

[Figure: factor graphs of the four models NMF, NMFD, NMF2D and SF-SSNTF]

(30)

Bayesian Networks for MF

[Figure: Bayesian network for NMF with hyperparameters Θ_v and Θ_t, excitations v_{i,τ}, templates t_{ν,i}, latent sources s_{ν,i,τ} and observations x_{ν,τ}; τ = 1 . . . K, plates over i = 1 . . . I and ν = 1 . . . W]

• A Directed acyclic graph, not to be confused with our notation

• Nodes denote Random variables, Rectangles denote plates (repeat nodes inside)

• However, fairly complicated even for NMF → not very insightful

(31)

Update Rules for Non-Negative GTF

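From the GLM one can derive multiplicative update rules (MUR) (Yilmaz et al., NIPS 2011).

\[ Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha\big(M \circ W(\hat{X}) \circ X\big)}{\Delta_\alpha\big(M \circ W(\hat{X}) \circ \hat{X}\big)} \qquad \text{s.t. } Z_\alpha(v_\alpha) > 0 \]

Here W is the inverse variance function, i.e. W(X̂) = 1/v(X̂), where for the Gaussian, Poisson, Exponential and Inverse Gaussian distributions we have simply

\[ W(\hat{X}) = \hat{X}^{-p} \quad \text{with } p \in \{0, 1, 2, 3\}. \]

\[ \Delta_\alpha(Q) = \Big[ \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'}) \Big] \qquad (1) \]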

(32)

PLTF: Iterative Maximum Likelihood

• Specialize to β-divergences (Yilmaz and Cemgil, 2010)

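\[ Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha\big(M \circ X \circ \hat{X}^{\beta-2}\big)}{\Delta_\alpha\big(M \circ \hat{X}^{\beta-1}\big)} \]

D(·, ·)   IS   KL   Euclidean
β         0    1    2

• M: mask tensor (M(v₀) = 1 if X(v₀) is observed, 0 otherwise)

• Evaluating ∆ is equivalent to computing marginal potentials ⇒ via message passing on a factor graph

\[ \Delta_\alpha(Q) \equiv \Big[ \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'}) \Big] \]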
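For the two-factor NMF model the ∆ operators reduce to matrix products, ∆₁(Q) = Q Z₂ᵀ and ∆₂(Q) = Z₁ᵀ Q, so the masked β-divergence update can be sketched in a few lines. This is a minimal illustration under that assumption, not the toolbox implementation; the function name `beta_nmf`, the initialisation and the small constant `eps` (added to avoid division by zero) are choices made for the example.

```python
import numpy as np

def beta_nmf(X, M, rank, beta=1.0, n_iter=100, seed=0, eps=1e-12):
    """Masked multiplicative updates for X ~ Z1 @ Z2.

    beta = 0: Itakura-Saito, beta = 1: KL, beta = 2: Euclidean.
    M has the shape of X, with 1 for observed and 0 for missing entries.
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    Z1 = rng.random((F, rank)) + 0.1
    Z2 = rng.random((rank, T)) + 0.1
    for _ in range(n_iter):
        Xhat = Z1 @ Z2 + eps
        num = (M * X * Xhat ** (beta - 2)) @ Z2.T   # Delta_1(M o X o Xhat^(beta-2))
        den = (M * Xhat ** (beta - 1)) @ Z2.T       # Delta_1(M o Xhat^(beta-1))
        Z1 *= num / (den + eps)
        Xhat = Z1 @ Z2 + eps
        num = Z1.T @ (M * X * Xhat ** (beta - 2))   # Delta_2(...)
        den = Z1.T @ (M * Xhat ** (beta - 1))
        Z2 *= num / (den + eps)
    return Z1, Z2

def masked_kl(X, Xhat, M, eps=1e-12):
    """KL divergence summed over the observed entries only."""
    return float(np.sum(M * (X * np.log((X + eps) / (Xhat + eps)) - X + Xhat)))
```

Because X enters only through M ∘ X, the values stored at missing entries have no influence on the estimate; the product Z₁Z₂ evaluated at those entries serves as the reconstruction.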





(33)

General Update Rule for GTF

By dropping the non-negativity requirement we obtain the following update equation:

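\[ Z_\alpha \leftarrow Z_\alpha + \frac{2}{\lambda_{\alpha \setminus 0}}\, \frac{\Delta_\alpha\big(W \circ (X - \hat{X})\big)}{\Delta^{2}_\alpha(W)} \qquad \text{with } \lambda_{\alpha \setminus 0} = |v_\alpha \cap \bar{v}_0| \]

\[ \Delta^{\varepsilon}_\alpha(Q) = \Big[ \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})^{\varepsilon} \Big] \]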

(34)

NMF

[Figure: factor graph with index nodes ν, i, τ and factor nodes Z₁, Z₂]

\[ \hat{X}(\nu, \tau) \leftarrow \sum_i Z_1(\nu, i)\, Z_2(i, \tau) \]

(35)

NMF

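[Figure: factor graph with index nodes ν, i, τ; the factors M ∘ X/X̂, Z₂ and M are used to update Z₁]

\[ Z_1 \leftarrow Z_1 \circ \frac{\Delta_1(M \circ X / \hat{X})}{\Delta_1(M)} \]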

(36)

NMF

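[Figure: factor graph with index nodes ν, i, τ; the factors M ∘ X/X̂, Z₁ and M are used to update Z₂]

\[ Z_2 \leftarrow Z_2 \circ \frac{\Delta_2(M \circ X / \hat{X})}{\Delta_2(M)} \]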

(37)

MAP estimation (for the KL case)

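• Conjugate prior for a Poisson: Gamma, G(x; a, b) = b^a x^{a−1} exp(−bx)/Γ(a)

\[ Z_\alpha \sim \mathcal{G}(Z_\alpha; A_\alpha, B_\alpha / A_\alpha) \]

• Update Rule

\[ Z_\alpha \leftarrow \frac{(A_\alpha - 1) + Z_\alpha \circ \Delta_\alpha(M \circ X / \hat{X})}{A_\alpha / B_\alpha + \Delta_\alpha(M)} \qquad (2) \]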

(38)

Deriving the update equations

• Matrix factorisations as a Generalised Linear Model

• Consider a MF model

g( ˆX) = Z₁Z₂

where Z₁, Z₂ and g( ˆX) are matrices of compatible sizes.

• Use vec(AXB) = (B^⊤ ⊗ A) vec X to obtain

vec(g( ˆX)) = (I_|j| ⊗ Z₁) vec(Z₂) ≡ Lz

• We can compute a factorisation using the general GLM update equation by alternating between Z₁ and Z₂

• Readily generalised to arbitrary tensors
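A quick numerical check of the vectorisation step, assuming the usual column-stacking definition of vec. The second part builds the design matrix L = I ⊗ Z₁ that turns the factorisation into a linear model in vec(Z₂); the array sizes are arbitrary.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector (column-major order)."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)

# Special case used on the slide: vec(Z1 Z2) = (I kron Z1) vec(Z2) = L z
Z1 = rng.standard_normal((3, 4))
Z2 = rng.standard_normal((4, 5))
L = np.kron(np.eye(Z2.shape[1]), Z1)
z = vec(Z2)
```

Note that L is block diagonal and grows quickly with the data size, so it is useful for deriving the update rather than for computing it.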

(39)

Extensions to NMFD and NMF2D

• Convolutive model (I)

\[ E(\phi, d, i) = \sum_{k,l} B(k, l)\, C(k, \overbrace{d - l}^{\alpha}, \phi, i) \qquad \text{(I)} \quad (3) \]

• Basis spline model (II)

\[ E(\phi, d, i) = \sum_k B(k, d)\, C(k, \phi, i) \qquad \text{(II)} \quad (4) \]

φ: note index, i: source index, d: local time index

(40)

Extensions to NMFD and NMF2D

[Figure: factor graphs of the extended models: (a) NMFD+I, (b) NMFD+II, (c) NMF2D+I, (d) NMF2D+II]

(41)

Application: Missing audio restoration

• 50 short mono audio examples sampled at 44.1 kHz (from FitzGerald, Cranitch, Coyle 2008).

• Compute a spectrogram with windows of 1024 samples and no overlap.

• Randomly remove blocks of 10 consecutive time frames, i.e. gaps of approx. 250 ms.

• 20 per cent of each audio file is removed in this way, with long gaps.

Table 1: Evaluation of the models on missing audio restoration

             SNR                     MSE
Model        IS      KL     EUC     IS      KL     EUC
NMFD          2.99   4.74   5.05     4.43   2.91   2.68
SF-SSNTF     −0.28   5.09   5.06    15.00   2.57   2.59
NMFD + I      3.01   6.00   6.91     5.89   2.23   1.68
NMFD + II     5.00   5.79   5.80     2.74   2.20   2.17

(42)

Application: Source Separation

• 50 short mono audio examples sampled at 44.1kHz (from FitzGerald, Cranitch, Coyle 2008∗)

• Mix pairs of examples

• Operate on Constant-Q magnitude (computed via Schoerkhuber and Klapuri)

• Using KL cost

• Reconstruct sources using the estimated magnitudes and phase of the mixture

Model        SDR      SIR       SAR
NMF2D        6.10     19.00     7.50
NMF2D + I    6.19     19.84     6.84
SF-SSNTF∗    ≈ 8.00   ≈ 24.00   ≈ 8.00

(43)

Coupled Tensor Factorisations

Example Problem

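\[ X_1^{i,j,k} \approx \hat{X}_1^{i,j,k} = \sum_r A^{i,r} B^{j,r} C^{k,r} \]

\[ X_2^{j,p} \approx \hat{X}_2^{j,p} = \sum_r B^{j,r} D^{p,r} \]

\[ X_3^{j,q} \approx \hat{X}_3^{j,q} = \sum_r B^{j,r} E^{q,r} \]

[Figure: bipartite graph linking the factors A, B, C, D, E to the observed tensors X₁, X₂, X₃]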

(44)

Coupled Tensor Factorisations

• Factorise multiple observed tensors simultaneously: X_ν for ν = 1 . . . |ν|.

• Each observed tensor X_ν now has a corresponding index set V_0,ν and a particular configuration will be denoted by v_0,ν ≡ u_ν

• We define a |ν| × |α| coupling matrix R where

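\[ R^{\nu,\alpha} = \begin{cases} 1 & X_\nu \text{ and } Z_\alpha \text{ connected} \\ 0 & \text{otherwise} \end{cases} \qquad \hat{X}_\nu(u_\nu) = \sum_{\bar{u}_\nu} \prod_\alpha Z_\alpha(v_\alpha)^{R^{\nu,\alpha}} \qquad (5) \]

Example

\[ X_1^{i,j,k} \approx \sum_r Z_1^{i,r} Z_2^{j,r} Z_3^{k,r} \qquad X_2^{j,p} \approx \sum_r Z_2^{j,r} Z_4^{p,r} \qquad X_3^{j,q} \approx \sum_r Z_2^{j,r} Z_5^{q,r} \qquad (6) \]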

(45)

Update rules for Coupled Tensor Factorisations

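\[ \Delta^{\varepsilon}_{\alpha,\nu}(Q) = \Big[ \sum_{u_\nu \cap \bar{v}_\alpha} Q(u_\nu) \prod_{\alpha' \neq \alpha} \big( Z_{\alpha'}(v_{\alpha'})^{R^{\nu,\alpha'}} \big)^{\varepsilon} \Big] \qquad (7) \]

• Update for Nonnegative CTF

\[ Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(W_\nu \circ X_\nu)}{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(W_\nu \circ \hat{X}_\nu)} \qquad (8) \]

• In the special case of the Tweedie family, i.e. for distributions whose precision is W_ν = X̂_ν^{−p}, the update is

\[ Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(\hat{X}_\nu^{-p} \circ X_\nu)}{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(\hat{X}_\nu^{1-p})} \]

(9)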
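A minimal sketch of the coupled update for the KL case (p = 1), where the numerator uses X_ν / X̂_ν and the denominator uses a tensor of ones. Factors and observations are described by single-letter index strings and contracted with `numpy.einsum`. The function names, the data layout and the constant `eps` are assumptions for this example, and all entries of every X_ν are taken as observed.

```python
import numpy as np

def gctf_kl(Xs, outs, spec, R, shapes, n_iter=100, seed=0, eps=1e-12):
    """Coupled multiplicative updates, KL case.

    Xs     : list of observed arrays X_nu
    outs   : index string of each X_nu, e.g. ["ijk", "jp", "jq"]
    spec   : index string of each factor Z_alpha, e.g. ["ir", "jr", ...]
    R      : coupling matrix, R[nu][alpha] = 1 if Z_alpha appears in X_nu
    shapes : dict mapping an index letter to its cardinality
    """
    rng = np.random.default_rng(seed)
    Z = [rng.random([shapes[c] for c in s]) + 0.1 for s in spec]
    n_obs, n_fac = len(Xs), len(spec)

    def xhat(v):
        used = [a for a in range(n_fac) if R[v][a]]
        sub = ",".join(spec[a] for a in used) + "->" + outs[v]
        return np.einsum(sub, *[Z[a] for a in used]) + eps

    for _ in range(n_iter):
        for a in range(n_fac):
            num = np.zeros_like(Z[a])
            den = np.zeros_like(Z[a])
            for v in range(n_obs):
                if not R[v][a]:
                    continue
                rest = [b for b in range(n_fac) if R[v][b] and b != a]
                sub = ",".join([outs[v]] + [spec[b] for b in rest]) + "->" + spec[a]
                others = [Z[b] for b in rest]
                num += np.einsum(sub, Xs[v] / xhat(v), *others)     # Delta(Xhat^-1 o X)
                den += np.einsum(sub, np.ones_like(Xs[v]), *others)  # Delta(Xhat^0)
            Z[a] *= num / (den + eps)
    return Z, [xhat(v) for v in range(n_obs)]

def total_kl(Xs, Xhats, eps=1e-12):
    """Sum of the KL divergences over all observed tensors."""
    return float(sum(np.sum(X * np.log((X + eps) / (Xh + eps)) - X + Xh)
                     for X, Xh in zip(Xs, Xhats)))
```

For the example problem one would call it with `spec = ["ir", "jr", "kr", "pr", "qr"]`, `outs = ["ijk", "jp", "jq"]` and `R = [[1, 1, 1, 0, 0], [0, 1, 0, 1, 0], [0, 1, 0, 0, 1]]`; the shared factor with indices `jr` then receives contributions from all three observed tensors.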

(46)

Update rules for Coupled Tensor Factorisations

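\[ \Delta^{\varepsilon}_{\alpha,\nu}(Q) = \Big[ \sum_{u_\nu \cap \bar{v}_\alpha} Q(u_\nu) \prod_{\alpha' \neq \alpha} \big( Z_{\alpha'}(v_{\alpha'})^{R^{\nu,\alpha'}} \big)^{\varepsilon} \Big] \qquad (10) \]

• General Update for CTF

\[ Z_\alpha \leftarrow Z_\alpha + \frac{2}{\lambda_{\alpha \setminus 0}}\, \frac{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}\big(W_\nu \circ (X_\nu - \hat{X}_\nu)\big)}{\sum_\nu R^{\nu,\alpha}\, \Delta^{2}_{\alpha,\nu}(W_\nu)} \qquad (11) \]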

(47)

GCTF Application: Audio restoration

[Figure: three spectrograms: ground truth, observed spectrogram, reconstructed spectrogram]

Sampling rate 11025 Hz, frame length 93 msec; 50 missing chunks (average length 0.23 sec, max. length 1.07 sec); side information: Bach chorales; SNR: 4.38 dB

(48)

GCTF Application: Score aided Audio restoration

Observed tensors:

– X1(f, t): Audio with Missing Parts

– X2(i, n): MIDI file

– X3(f, p): Isolated Notes

Hidden tensors:

– D(f, i): Spectral Templates

– E(i, t): Excitations of X1

– B(i, τ, k): Chord Templates

– C(k, d): Excitations of E

– F(i, p): Excitations of X3

– G(k, m): Excitations of X2

(49)

GCTF Application: Score aided Audio restoration

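\[ \hat{X}_1(f, t) = \sum_{i,\tau,k,d} D(f, i)\, B(i, \tau, k)\, C(k, d)\, Z(d, t, \tau) \qquad \text{Test file} \quad (12) \]

\[ \hat{X}_2(i, n) = \sum_{\tau,k,m} B(i, \tau, k)\, G(k, m)\, Y(m, n, \tau) \qquad \text{MIDI file} \quad (13) \]

\[ \hat{X}_3(f, p) = \sum_i D(f, i)\, F(i, p)\, T(i, p) \qquad \text{Merged training files} \quad (14) \]

With the factors ordered as D, B, C, Z, G, Y, F, T, the coupling matrix is

\[ R = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix} \quad \text{with} \quad \begin{aligned} \hat{X}_1 &= \sum D^1 B^1 C^1 Z^1 G^0 Y^0 F^0 T^0 \\ \hat{X}_2 &= \sum D^0 B^1 C^0 Z^0 G^1 Y^1 F^0 T^0 \\ \hat{X}_3 &= \sum D^1 B^0 C^0 Z^0 G^0 Y^0 F^1 T^1 \end{aligned} \qquad (15) \]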

(50)

GCTF Application: Score aided Audio restoration

[Figure: panels X3 (Isolated Recordings), X2 (Transcription Data, notes against time), X1, X1hat (Restored) and Ground Truth, each as frequency (Hz) against time (sec); panel Performance: reconstruction SNR and initial SNR (dB) against missing data percentage (%)]

Figure 1: Observed matrices. X₁: spectrum of the piano performance (missing data, 70%, are shown in white); X₂: a piano roll obtained from a musical score of the piece; X₃: spectra of 88 isolated notes from a piano.

(51)

Summary and Future Work

• Cover a broad class of models and topologies using a graphical model formalism

• Generalised Linear models for ML estimation in TF models

• Full Bayesian treatment (inference, model selection) is possible via MCMC or Variational Inference

• Automatic code generation à la WinBUGS or Infer.NET, given a model specification

• Fast Computations on a GPU

• Prior structures, Online inference

• Applications!!

(52)

A Toolbox based on GPU computation

Probabilistic Latent Tensor Factorization Matlab Toolkit

http://www.cmpe.boun.edu.tr/pilab/pilabfiles/pltftoolbox/

(53)

References http://www.cmpe.boun.edu.tr/~cemgil

• Simsekli, U., Cemgil, A. T. and Yilmaz, K., Score Guided Audio Restoration via Generalised Coupled Tensor Factorisation, Submitted 2011

• Yilmaz, K., Cemgil, A. T. and Simsekli, U., Generalised Coupled Tensor Factorisation, NIPS, 2011

• Cemgil, Simsekli and Subakan, Probabilistic Latent Tensor factorisation framework for audio modeling, IEEE WASPAA, 2011

• Yilmaz, K. and Cemgil, A. T. Algorithms for Probabilistic Latent Tensor Factorization, Signal Processing, Elsevier, 2011

– Longer version of Yilmaz, K. and Cemgil, A. T. Probabilistic Latent Tensor factorisation, Proc. of ICA/LVA, 2010

• C. Févotte and A. T. Cemgil Nonnegative matrix factorisations as probabilistic inference in composite models, Eusipco 2009, Glasgow

• A. T. Cemgil. Bayesian inference in non-negative matrix factorisation models. Computational Intelligence and Neuroscience, 2009

(54)

Change Point Models

• Real time pitch/event detection with NMF style models

• Tempo tracking and real time interaction

[Figure: HMM and CPM compared side by side: templates log t_{ν,i}, inferred states r_τ (attack, sustain, release), volumes v_τ and observed spectra log x_{ν,τ}]

(55)

HMM versus Change Point Model

[Figure: two graphical models over time slices τ − 1 and τ. HMM: variables r_τ, v_τ and observations x_{ν,τ}. Change point model: the same variables plus a change point indicator c_τ]

(56)

Detection Results (HMM versus CP)

[Figure: precision (%), recall (%) and latency (ms) as functions of lag (ms), for the HMM and the CPM]

(57)

Realtime Interaction

• Combine Perception with Control

[Figure: system block diagram with the components Claves, Lego Robot, Loud Speaker, Perception, MIDI Synthesizer and Controller, and the signals Tempo, Bar Position, Motor Feedback and Motor Command]

(58)

Tempo Tracker, Graphical Model

[Figure: graphical model over time slices τ − 1 and τ with variables n_τ, m_τ, r_τ, v_τ and observations x_{ν,τ}]

(59)

Tempo Tracker

[Figure: tempo tracker output against time (sec): tempo (BPM), bar position, acoustic events and audio spectra x_{ν,τ} (frequency against time); spectral templates t_{ν,i} (frequency against acoustic event)]

(60)

Controller

[Figure: controller trajectory in the (∆m_τ, ∆n_τ) plane for τ = 0, 1, 2, 3, 4]

(61)

Controller

[Figure: four panels against time (sec): bar position, tempo (BPM), robot’s position (0 to 2π) and robot’s velocity]

(62)

Results

[Figure: bar position against time (seconds), robot’s position versus tracker’s position; tempo (beat/min) against time (seconds), robot speed versus tracker speed]

(63)

References (Realtime Tracking and Interaction)

• U. Simsekli, O. Sonmez, B. Kurt, A. T. Cemgil, Combined Perception and Control for Timing in Robotic Music Performances, EURASIP Journal on Audio, Speech, and Music Processing, 2011

• U. Simsekli, A. T. Cemgil, Probabilistic Models for Real-Time Acoustic Event Detection with Application to Pitch Tracking, Journal of New Music Research, 2011

• U. Simsekli, A. Jylha, C. Erkut and A. T. Cemgil, Real-Time Recognition of Percussive Sounds by a Model-Based Method, EURASIP Journal on Advances in Signal Processing, 2011.