Variations on a theme:
Factorisation based models for analysis of musical audio
A. Taylan Cemgil
Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
17 Dec 2011, NIPS Workshops, Granada
Acknowledgements
• Umut Simsekli, Kenan Yılmaz, Orhan Sönmez, Barıs Kurt (Boğaziçi)
• Cumhur Erkut, Antti Jylhä, Aalto (Helsinki)
• Onur Dikmen, Cédric Févotte (CNRS, Telecom ParisTech)
• Tuomas Virtanen (Tampere)
• Evrim Acar (Copenhagen)
• Funding for our work
– TUBITAK, BAP, CNRS
Outline
• Introduction, Motivations
• Matrix Factorisation
• Tensors
• Probabilistic Latent Tensor Factorisation
• Example models
• Inference framework
• Coupled Tensor Factorisations
• Applications
• Results and Conclusions
Statistical Approach
• Machine Listening ⇔ inverse synthesis via Bayesian inference
$p(\text{Structure} \mid \text{Audio}) \propto p(\text{Audio} \mid \text{Structure})\, p(\text{Structure})$
– Hierarchical signal models to incorporate prior knowledge
– Consistent framework for developing inference algorithms
• Contrast with traditional/procedural approaches, which draw no clear distinction between “what” and “how”
Superposition
• Signals from sound sources are mixed
– Denoising
– Separation
– Dereverberation
• Feature extraction is hard: $\text{Feature}\left(\sum_i s_i\right) \neq \sum_i \text{Feature}(s_i)$
piano + piccolo + cymbals
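The non-additivity of features can be checked numerically. The sketch below is only an illustration: it uses random signals as hypothetical stand-ins for the three instruments and takes the magnitude spectrum as the feature (both are assumptions, not choices made on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sources: random signals standing in for piano, piccolo, cymbals.
sources = rng.standard_normal((3, 256))
mixture = sources.sum(axis=0)

def magnitude_spectrum(signal):
    """Magnitude of the real FFT: a simple nonlinear feature."""
    return np.abs(np.fft.rfft(signal))

feature_of_sum = magnitude_spectrum(mixture)
sum_of_features = sum(magnitude_spectrum(s) for s in sources)

# The complex FFT itself is linear, so it distributes over the mixture ...
linear_ok = np.allclose(np.fft.rfft(mixture),
                        np.fft.rfft(sources, axis=1).sum(axis=0))
# ... but the magnitude discards phase, so the feature is not additive.
additive = np.allclose(feature_of_sum, sum_of_features)
print(linear_ok, additive)
```

By the triangle inequality the magnitude of the mixture never exceeds the sum of the source magnitudes, which is one motivation for the additive nonnegative models that follow.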
Polyphonic Music Transcription
• from sound ...
[Figure: spectrogram of the recording, t/sec against f/Hz]
• ... to score
Computational issues
• Parameter Estimation
Which pitch, rhythm, tempo, meter, time signature ... ?
• Model Selection
How many notes, onsets, sections ... ?
• Online/Offline inference
Signal Models for Audio
• Time domain – state space, dynamical models
– Conditional Linear Dynamical Systems, Gaussian processes (e.g.
AR, ARMA), switching state space models
– Flexible, Physically realistic, Analysis down to sample precision
• Transform domain – Fourier representations, Generalised Linear model
– Models on (orthogonal) transform coefficients, Energy compaction
– Practical, can make use of fast transforms (FFT, MDCT, ...)
• Feature based
– MFCCs, Chroma, ...
Matrix Factorisation
• An Inverse problem: estimate Z1, Z2 given X.
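$X \approx Z_1 Z_2, \qquad X(\nu, \tau) \approx \sum_i Z_1(\nu, i)\, Z_2(i, \tau)$
[Figure: the masked matrix X ◦ M with entries (i, j) and its approximation $\hat{X} = Z_1 Z_2$, where Z1 has entries (i, k) and Z2 has entries (k, j)]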
• Many well-known algorithms can be cast as matrix factorisation problems
– Clustering: Z1 is arbitrary, columns of Z2 are unit vectors
– NMF: Z1, Z2 are nonnegative (Paatero and Tapper, 1994; Lee and Seung 1999, 2000)
– PCA, Latent Semantic Indexing, Latent Dirichlet Allocation ...
• Minimise a suitable error function
$(Z_1, Z_2)^* = \arg\min_{Z_1, Z_2} D(X \,\|\, Z_1 Z_2)$
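As an illustration of the clustering case, the following sketch writes k-means as a factorisation X ≈ Z1 Z2 under a squared Euclidean error, with Z1 holding the centroids and each column of Z2 a unit vector. The toy data, the initialisation and the number of iterations are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 40 two-dimensional points in two well separated groups (columns of X).
X = np.hstack([rng.normal(0.0, 0.1, (2, 20)), rng.normal(5.0, 0.1, (2, 20))])

K = 2
Z1 = X[:, [0, -1]].copy()  # initial centroids, one point from each group (illustrative)
for _ in range(10):
    # Assignment step: squared distance from every point to every centroid (K x N).
    d2 = ((X[:, None, :] - Z1[:, :, None]) ** 2).sum(axis=0)
    labels = d2.argmin(axis=0)
    Z2 = np.eye(K)[:, labels]  # K x N, each column is a unit vector
    # Update step: least squares for Z1 given Z2 reduces to the cluster means.
    Z1 = (X @ Z2.T) / Z2.sum(axis=1)

error = np.sum((X - Z1 @ Z2) ** 2)  # D(X || Z1 Z2) for the Euclidean case
```

Replacing the unit-vector constraint on Z2 by nonnegativity of both factors gives NMF.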
NMF in Acoustic and Music modeling
• We seek a factorisation of the spectrogram (Smaragdis and Brown 2003; Sha, Saul, Lee;
Virtanen; Abdallah and Plumbley; Schmidt and Olsson; Fevotte)
X ≈ Z1 × Z2
Spectrogram ≈ Templates × Excitations
[Figure: spectrogram X with the templates Z1 along frequency and the excitations Z2 as intensity over time]
Underlying Generative Model
• Rank one
$X(\nu, \tau) \sim p(\cdot\,; Z_1(\nu)\, Z_2(\tau))$
• Higher rank (composite structure)
$S(\nu, \tau, i) \sim p(\cdot\,; Z_1(\nu, i)\, Z_2(\tau, i)), \qquad X(\nu, \tau) = \sum_i S(\nu, \tau, i)$
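A minimal sampling sketch of the composite model, assuming that p is a Poisson distribution (the choice that corresponds to the KL case) and using arbitrary small dimensions and Gamma-distributed factors. The latent sources S are drawn first and the observation X is their sum over i.

```python
import numpy as np

rng = np.random.default_rng(2)
n_freq, n_frames, n_comp = 6, 8, 3          # hypothetical sizes
Z1 = rng.gamma(2.0, 1.0, (n_freq, n_comp))   # templates Z1(nu, i)
Z2 = rng.gamma(2.0, 1.0, (n_frames, n_comp)) # excitations Z2(tau, i)

# Intensity of every latent source: Z1(nu, i) * Z2(tau, i)
rates = np.einsum('fi,ti->fti', Z1, Z2)
S = rng.poisson(rates)          # S(nu, tau, i) ~ Poisson(Z1(nu, i) Z2(tau, i))
X = S.sum(axis=2)               # X(nu, tau) = sum_i S(nu, tau, i)

# A sum of independent Poisson variables is Poisson with the summed intensity,
# so marginally X(nu, tau) ~ Poisson(sum_i Z1(nu, i) Z2(tau, i)).
X_mean = rates.sum(axis=2)
```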
Music Analysis
[Figure: Log |MDCT| coefficients (ν/frequency bin against τ/frame) ≈ estimated scale parameters of the template prior (ν/frequency index against i/key index) × excitations (pitch against τ/frame index)]
One needs to incorporate a lot of prior knowledge to arrive at the “desired factorisation”
• Provide Priors (Spectral continuity, Gamma chains, HMM’s)
• Provide Side Information
Side Information for guiding the factorisation
Observed tensors: X1(f, t) (Audio Spectrum), X2(f, p) (Isolated Notes)
Hidden tensors: D(f, i) (Spectral Templates), E(i, t) (Excitations of X1), F(i, p) (Excitations of X2)
$[X_1, X_2] \approx D\,[E, F]$
Many other extensions for audio (Smaragdis, Raj; Morup, Schmidt; Vincent; Virtanen; Fevotte; Coyle, FitzGerald; Bertin, Liutkus, Badeau, Richard)
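The shared-template model [X1, X2] ≈ D[E, F] can be sketched as ordinary KL-NMF on the concatenated data, so that the isolated notes X2 help to pin down the templates D. The synthetic data, the sizes and the variable name `F_exc` (for the slide's F) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n_freq, n_comp, n_frames, n_notes = 20, 4, 30, 4   # hypothetical sizes
D_true = rng.gamma(1.0, 1.0, (n_freq, n_comp))
X1 = D_true @ rng.gamma(1.0, 1.0, (n_comp, n_frames))   # "audio spectrum"
X2 = D_true @ np.diag(rng.gamma(2.0, 1.0, n_comp))      # "isolated notes"

X = np.hstack([X1, X2])                         # [X1, X2]
D = rng.random((n_freq, n_comp)) + 0.1          # shared templates
H = rng.random((n_comp, n_frames + n_notes)) + 0.1   # [E, F]
eps = 1e-12

def kl(X, Xh):
    return np.sum(X * np.log((X + eps) / (Xh + eps)) - X + Xh)

costs = []
for _ in range(200):
    # Multiplicative KL updates (Lee and Seung); D sees both X1 and X2.
    D *= ((X / (D @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
    H *= (D.T @ (X / (D @ H + eps))) / (D.sum(axis=0)[:, None] + eps)
    costs.append(kl(X, D @ H))

E, F_exc = H[:, :n_frames], H[:, n_frames:]     # excitations of X1 and of X2
```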
Factorisation based audio models
• Need highly structured (and complicated) models
• A unifying and practical framework inspired by graphical models:
– Probabilistic Latent Tensor Factorisation
– Generalised Coupled Tensor Factorisation
Main Research Questions
• Understand several popular models in Audio and Music processing, invent new ones
Incorporation of prior knowledge via hierarchical modeling
• A general framework for derivation of decomposition algorithms
Inspiration from probabilistic graphical models, factor graphs, message passing
• Understand the statistical interpretation of Matrix/Tensor factorisation.
Certain Error criteria lead to hierarchical probabilistic models
• Model Selection, sparsity
Bayesian Model Selection via Variational free energy minimisation (VB) or MCMC (not here)
Tensors
• Tensor ≡ Multidimensional Array (X(i, j, k, . . .))
Disclaimer: not a tensor field
Tensor Factorisations
(Kolda and Bader; Cichocki, Zdunek, Amari)
• PARAFAC (parallel factors; Harshman 1970), CANDECOMP (canonical decomposition; Carroll and Chang 1970)
$X(\nu, \xi, \tau) \approx \sum_i T(\nu, i)\, V(\tau, i)\, W(\xi, i)$
• Three-mode FA, Higher order SVD (Tucker, 1966; De Lathauwer et al., 2000)
$X(\nu, \xi, \tau) \approx \sum_{i,j,k} G(i, j, k)\, T(\nu, i)\, V(\tau, j)\, W(\xi, k)$
• N-Mode generalisation of Tucker and dozens of variations
$X(\nu_1, \nu_2, \ldots, \nu_N) \approx \sum_{i_1, i_2, \ldots, i_N} G(i_1, i_2, \ldots, i_N)\, V_1(\nu_1, i_1)\, V_2(\nu_2, i_2) \cdots V_N(\nu_N, i_N)$
Actually Tensor factorisations are quite useful even if data are not multiway.
Example 1: Deconvolution as (latent) tensor factorisation
• X: Observed Signal
• Z1: Original Signal
• Z2: Filter impulse response
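$X(i) \approx \hat{X}(i) = \sum_t Z_1(t)\, Z_2(\overbrace{i - t}^{d})$
$= \sum_t \sum_d Z_1(t)\, Z_2(d)\, \delta(d - i + t)$
$= \sum_{t,d} Z_1(t)\, Z_2(d)\, Z_3(d, i, t), \qquad Z_3(d, i, t) = \delta(d - i + t)$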
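The identity can be verified directly: building the fixed indicator tensor Z3(d, i, t) = δ(d − i + t) and contracting it with Z1 and Z2 reproduces an ordinary convolution. The signal lengths below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n_t, n_d = 12, 4                      # hypothetical signal and filter lengths
Z1 = rng.standard_normal(n_t)         # original signal Z1(t)
Z2 = rng.standard_normal(n_d)         # filter impulse response Z2(d)
n_i = n_t + n_d - 1                   # length of the full convolution

# Z3(d, i, t) = 1 where d - i + t = 0, and 0 elsewhere
d, i, t = np.meshgrid(np.arange(n_d), np.arange(n_i), np.arange(n_t), indexing='ij')
Z3 = (d - i + t == 0).astype(float)

# X_hat(i) = sum_{t,d} Z1(t) Z2(d) Z3(d, i, t)
X_hat = np.einsum('t,d,dit->i', Z1, Z2, Z3)
matches_convolution = np.allclose(X_hat, np.convolve(Z1, Z2))
```

Z3 is known and never updated; only Z1 and Z2 are latent, which is what turns blind deconvolution into a (latent) tensor factorisation problem.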
Example 1: Hierarchical modeling (1)
Assume that the original signal can be confined to a subspace
• X: Observed Signal
• U: Original Signal
• Z2: Filter impulse response
$X(i) \approx \hat{X}(i) = \sum_{t,d} \overbrace{U(t)}^{\sum_r Z_4(t, r)\, Z_1(r)}\, Z_2(d)\, Z_3(d, i, t)$
$= \sum_t \sum_d \sum_r Z_1(r)\, Z_2(d)\, Z_3(d, i, t)\, Z_4(t, r)$
Example 1: Hierarchical modeling (1)
$U(t) = \sum_r Z_4(t, r)\, Z_1(r)$
[Figure: the signal U shown as the product of the basis matrix Z4 and the coefficient vector Z1]
Example 1: Hierarchical modeling (2)
Assume the filter can also be confined to a subspace
• X: Observed Signal
• Z1: Original signal's expansion coefficients
• H: Filter impulse response
$X(i) \approx \hat{X}(i) = \sum_t \sum_d \sum_r Z_1(r)\, \overbrace{H(d)}^{\sum_q Z_5(d, q)\, Z_2(q)}\, Z_3(d, i, t)\, Z_4(t, r)$
$= \sum_t \sum_d \sum_r \sum_q Z_1(r)\, Z_2(q)\, Z_3(d, i, t)\, Z_4(t, r)\, Z_5(d, q)$
This process may continue until we run out of letters in the alphabet
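A small numerical check of the two-level hierarchy, assuming random bases Z4 (for the signal) and Z5 (for the filter) of arbitrary size: the five-factor contraction equals the convolution of U = Z4 Z1 with H = Z5 Z2.

```python
import numpy as np

rng = np.random.default_rng(5)
n_t, n_d, n_r, n_q = 12, 4, 3, 2      # hypothetical sizes
Z1 = rng.standard_normal(n_r)         # expansion coefficients of the signal
Z2 = rng.standard_normal(n_q)         # expansion coefficients of the filter
Z4 = rng.standard_normal((n_t, n_r))  # signal basis Z4(t, r)
Z5 = rng.standard_normal((n_d, n_q))  # filter basis Z5(d, q)

n_i = n_t + n_d - 1
d, i, t = np.meshgrid(np.arange(n_d), np.arange(n_i), np.arange(n_t), indexing='ij')
Z3 = (d - i + t == 0).astype(float)   # Z3(d, i, t) = delta(d - i + t)

# X_hat(i) = sum_{t,d,r,q} Z1(r) Z2(q) Z3(d,i,t) Z4(t,r) Z5(d,q)
X_hat = np.einsum('r,q,dit,tr,dq->i', Z1, Z2, Z3, Z4, Z5)

U = Z4 @ Z1                           # U(t) = sum_r Z4(t, r) Z1(r)
H = Z5 @ Z2                           # H(d) = sum_q Z5(d, q) Z2(q)
matches_convolution = np.allclose(X_hat, np.convolve(U, H))
```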
Deconvolution
[Figure: synthetically blurred image, original image and original filter, shown alongside the estimates: z1 and z2 convolved, z1, and z2]
Ex2: Nonnegative Matrix deconvolution (NMFD)
(Smaragdis, 2004)
$\hat{X}(f, t) = \sum_{\tau, i} W(f, i, \tau)\, H(i, \overbrace{t - \tau}^{d})$
$= \sum_{\tau, i, d} W(f, i, \tau)\, H(i, d)\, Z(d, t, \tau)$
X: Spectrogram, W: Basis, H: Weights
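The same trick works for NMFD. In this sketch the shift tensor is taken to be Z(d, t, τ) = δ(d − t + τ), terms with t − τ < 0 are dropped, and the sizes are arbitrary; these conventions are assumptions of the example rather than details given on the slide.

```python
import numpy as np

rng = np.random.default_rng(6)
n_f, n_i, n_t, n_tau = 5, 2, 10, 3     # hypothetical sizes
W = rng.random((n_f, n_i, n_tau))      # basis W(f, i, tau)
H = rng.random((n_i, n_t))             # weights H(i, d)

# Z(d, t, tau) = 1 where d = t - tau, and 0 elsewhere
d, t, tau = np.meshgrid(np.arange(n_t), np.arange(n_t), np.arange(n_tau), indexing='ij')
Z = (d == t - tau).astype(float)

# X_hat(f, t) = sum_{tau, i, d} W(f, i, tau) H(i, d) Z(d, t, tau)
X_hat = np.einsum('fik,id,dtk->ft', W, H, Z)

# Reference: the usual convolutive form sum_{tau, i} W(f, i, tau) H(i, t - tau)
X_ref = np.zeros((n_f, n_t))
for k in range(n_tau):
    X_ref[:, k:] += W[:, :, k] @ H[:, :n_t - k]
same = np.allclose(X_hat, X_ref)
```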
Excitation Filter Models
• (NMF2D) Nonnegative Matrix 2-D deconvolution (Schmidt and Mørup 2008)
$\hat{X}(f, t) = \sum_{i, \phi, \tau, \nu, d} D(\nu, \tau, i)\, E(\phi, d, i)\, Z_1(\nu, f, \phi)\, Z_2(d, t, \tau)$
• (SF-SSNTF) Source Filter Sinusoidal Shifted Nonnegative Tensor Factorisation (FitzGerald, Cranitch, Coyle 2008)
$\hat{X}(c, t, f) = \sum_{i, p, r, \tau, d} G(c, i)\, H(f, i)\, N(f, p, r)\, W(p, i, \tau)\, E(r, i, d)\, Z(d, t, \tau)$
– mimics physically inspired source-filter models in the spectral domain
– harmonic excitation multiplied by spectral envelope
• Excitation filter model (Klapuri and Virtanen)
$\hat{X}(t, k) = \sum_{i, j, n} G(n, t)\, E(n, k)\, C(i, j)\, A(j, k)\, Z(i, n, t)$
Probabilistic Latent Tensor Factorisation
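$X(v_0) \approx \hat{X}(v_0) = \sum_{\bar{v}_0} \prod_\alpha Z_\alpha(v_\alpha)$
$Z_{1:|\alpha|}^* = \arg\min_Z D(X(v_0) \,\|\, \hat{X}(v_0))$
$v \in V$: all model indices,
$v_0 \in V_0$: indices of observation X,
$v_\alpha \in V_\alpha$: indices of factor $Z_\alpha$ for $\alpha = 1 \ldots |\alpha|$,
$\bar{v}_0 \in \bar{V}_0 = V \setminus V_0$,
$\bar{v}_\alpha \in \bar{V}_\alpha = V \setminus V_\alpha$, $\alpha = 1 \ldots |\alpha|$.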
Example: NMF
Nonnegative Matrix factorisation
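$X(v_0) \approx \hat{X}(v_0) = \sum_{\bar{v}_0} \prod_\alpha Z_\alpha(v_\alpha)$
$X(f, t) \approx \hat{X}(f, t) = \sum_i Z_1(f, i)\, Z_2(i, t)$
$v \in V = \{f, t, i\}$: all model indices,
$v_0 \in V_0 = \{f, t\}$: indices of observation X,
$v_1 \in V_1 = \{f, i\}$: indices of factor $Z_1$,
$v_2 \in V_2 = \{i, t\}$: indices of factor $Z_2$,
$\bar{v}_0 \in \bar{V}_0 = \{i\}$, $\bar{v}_1 \in \bar{V}_1 = \{t\}$, $\bar{v}_2 \in \bar{V}_2 = \{f\}$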
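The index-set notation maps naturally onto an Einstein-summation specification. The helper `pltf_model` below is a hypothetical illustration (it is not part of any toolbox mentioned in these slides): each factor is declared with its index set Vα, and every index outside V0 is summed out.

```python
import numpy as np

def pltf_model(factors, observed):
    """X_hat(v0) = sum over all indices not in `observed` of prod_alpha Z_alpha(v_alpha).

    factors  : list of (index_string, array) pairs, e.g. ('fi', Z1)
    observed : index string of the observation, e.g. 'ft'
    """
    spec = ','.join(idx for idx, _ in factors) + '->' + observed
    return np.einsum(spec, *(Z for _, Z in factors))

rng = np.random.default_rng(7)
Z1 = rng.random((4, 3))               # V1 = {f, i}
Z2 = rng.random((3, 5))               # V2 = {i, t}
factors = [('fi', Z1), ('it', Z2)]
V0 = 'ft'                             # V0 = {f, t}

X_hat = pltf_model(factors, V0)

V = set().union(*(set(idx) for idx, _ in factors))   # all model indices
latent = V - set(V0)                                  # bar V0 = V \ V0
```

For NMF the latent set is {i} and the contraction reduces to the matrix product Z1 Z2; richer models only change the list of index strings.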
Factor Graph Representation for TF Models
• Multivariate Probability densities have a Factor graph representation
• Tensor Factorisation models can be represented similarly
– Clique Potentials ↔ Factors
– Random variables ↔ Indices
• Example: NMF
V = {f, i, t}, V0 = {f, t}, V1 = {f, i}, V2 = {i, t}
$X(f, t) \approx \sum_i Z_1(f, i)\, Z_2(i, t)$
[Figure: factor graph with factor nodes Z1, Z2 and index nodes f, i, t]
Factor Graph Representation for TF Models
Model      V                          Observed V0   Latent V̄0          Factors
NMF        {f, t, i}                  {f, t}        {i}                {f, i}, {i, t}
NMFD       {f, t, τ, i, d}            {f, t}        {τ, i, d}          {f, τ, i}, {d, i}, {d, t, τ}
NMF2D      {f, t, ν, τ, i, φ, d}      {f, t}        {ν, τ, i, φ, d}    {ν, τ, i}, {φ, d, i}, {ν, f, φ}, {d, t, τ}
SF-SSNTF   {c, t, f, i, p, r, τ, d}   {c, t, f}     {i, p, r, τ, d}    {c, i}, {f, i}, {f, p, r}, {p, i, τ}, {r, i, d}, {d, t, τ}
[Figure: factor graphs of NMF (factors D, E), NMFD (D, E, Z), NMF2D (D, E, Z1, Z2) and SF-SSNTF (G, H, N, W, E, Z)]
Bayesian Networks for MF
[Figure: Bayesian network for NMF with nodes Θv, v_{i,τ}, Θt, t_{ν,i}, s_{ν,i,τ} and x_{ν,τ} for τ = 1 . . . K; plates over i = 1 . . . I and ν = 1 . . . W]
• A Directed acyclic graph, not to be confused with our notation
• Nodes denote Random variables, Rectangles denote plates (repeat nodes inside)
• However, fairly complicated even for NMF → not very insightful
Update Rules for Non-Negative GTF
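From the GLM one can derive multiplicative update rules (MUR) (Yilmaz et al., NIPS 2011).
$Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha(M \circ W(\hat{X}) \circ X)}{\Delta_\alpha(M \circ W(\hat{X}) \circ \hat{X})} \quad \text{s.t. } Z_\alpha(v_\alpha) > 0$
Here W is the inverse variance function, i.e. $W(\hat{X}) = 1/v(\hat{X})$, where for the Gaussian, Poisson, Exponential and Inverse Gaussian distributions we have simply
$W(\hat{X}) = \hat{X}^{-p}$ with $p \in \{0, 1, 2, 3\}$ respectively.
$\Delta_\alpha(Q) = \left[ \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'}) \right] \quad (1)$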
PLTF: Iterative Maximum Likelihood
• Specialize to β-divergences (Yilmaz and Cemgil, 2010)
$Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha(M \circ X \circ \hat{X}^{\beta-2})}{\Delta_\alpha(M \circ \hat{X}^{\beta-1})}$
D(·, ·):   IS   KL   Euclidean
β:         0    1    2
• M: mask tensor (M(v0) = 1 if X(v0) is observed, 0 otherwise)
• Evaluating ∆ is equivalent to computing marginal potentials ⇒ via message passing on a factor graph
$\Delta_\alpha(Q) \equiv \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})$
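A compact sketch of the β-divergence update, written generically with `numpy.einsum` and run on NMF with a random mask. The helper names `model`, `delta` and `mur_step`, the synthetic data and the small constant `eps` are assumptions of this illustration. It evaluates ∆ by brute-force contraction and does not implement message passing.

```python
import numpy as np

def model(factors, obs):
    """X_hat(v0): product of all factors, summed over every index not in obs."""
    spec = ','.join(idx for idx, _ in factors) + '->' + obs
    return np.einsum(spec, *(Z for _, Z in factors))

def delta(a, Q, factors, obs):
    """Delta_a(Q): multiply Q(v0) by all factors except a, sum out indices not in V_a."""
    others = [(idx, Z) for k, (idx, Z) in enumerate(factors) if k != a]
    spec = ','.join([obs] + [idx for idx, _ in others]) + '->' + factors[a][0]
    return np.einsum(spec, Q, *(Z for _, Z in others))

def mur_step(factors, X, M, obs, beta, eps=1e-12):
    """One sweep of multiplicative updates over all factors."""
    for a in range(len(factors)):
        Xh = model(factors, obs) + eps
        num = delta(a, M * X * Xh ** (beta - 2), factors, obs)
        den = delta(a, M * Xh ** (beta - 1), factors, obs)
        idx, Z = factors[a]
        factors[a] = (idx, Z * num / (den + eps))
    return factors

rng = np.random.default_rng(8)
X = rng.gamma(1.0, 1.0, (8, 3)) @ rng.gamma(1.0, 1.0, (3, 12))
M = (rng.random(X.shape) < 0.8).astype(float)       # 1 = observed, 0 = missing
factors = [('fi', rng.random((8, 3)) + 0.1), ('it', rng.random((3, 12)) + 0.1)]

def masked_kl(X, Xh, M):
    return np.sum(M * (X * np.log(X / Xh) - X + Xh))

costs = []
for _ in range(100):
    factors = mur_step(factors, X, M, 'ft', beta=1.0)   # beta = 1: KL
    costs.append(masked_kl(X, model(factors, 'ft'), M))
```

For models with fixed indicator tensors (such as Z in NMFD) one would simply skip the update of those factors.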
General Update Rule for GTF
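By dropping the non-negativity requirement we obtain the following update equation:
$Z_\alpha \leftarrow Z_\alpha + \frac{2}{\lambda_{\alpha \backslash 0}} \frac{\Delta_\alpha(W \circ (X - \hat{X}))}{\Delta_\alpha^2(W)}$ with $\lambda_{\alpha \backslash 0} = |v_\alpha \cap \bar{v}_0|$
$\Delta_\alpha^\varepsilon(Q) = \left[ \sum_{\bar{v}_\alpha} Q(v_0) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})^\varepsilon \right]$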
NMF
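[Figure: factor graph with factor nodes Z1, Z2 and index nodes ν, i, τ; the messages M ◦ X/X̂ and M are passed to the factor being updated]
Compute the model output:
$\hat{X}(\nu, \tau) \leftarrow \sum_i Z_1(\nu, i)\, Z_2(i, \tau)$
Update Z1 with Z2 fixed:
$Z_1 \leftarrow Z_1 \circ \frac{\Delta_1(M \circ X / \hat{X})}{\Delta_1(M)}$
Update Z2 with Z1 fixed:
$Z_2 \leftarrow Z_2 \circ \frac{\Delta_2(M \circ X / \hat{X})}{\Delta_2(M)}$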
MAP estimation (for the KL case)
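• Conjugate prior for the Poisson: Gamma, $G(x; a, b) = b^a x^{a-1} \exp(-bx) / \Gamma(a)$
$Z_\alpha \sim G(Z_\alpha; A_\alpha, A_\alpha / B_\alpha)$ (shape $A_\alpha$, rate $A_\alpha / B_\alpha$, so that the prior mean is $B_\alpha$)
• Update Rule
$Z_\alpha \leftarrow \frac{(A_\alpha - 1) + Z_\alpha \circ \Delta_\alpha(M \circ X / \hat{X})}{A_\alpha / B_\alpha + \Delta_\alpha(M)} \quad (2)$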
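A minimal sketch of the MAP update for NMF under the KL (Poisson) cost. It assumes a Gamma prior with shape A and rate A/B on every entry of both factors, which is the reading consistent with the A/B term in the denominator of the update; the hyperparameter values and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n_f, n_i, n_t = 8, 3, 12
X = rng.poisson(rng.gamma(2.0, 1.0, (n_f, n_i)) @ rng.gamma(2.0, 1.0, (n_i, n_t))).astype(float)
M = np.ones_like(X)            # everything observed in this toy example
A, B = 2.0, 1.0                # hypothetical hyperparameters (shape A, prior mean B)
Z1 = rng.random((n_f, n_i)) + 0.5
Z2 = rng.random((n_i, n_t)) + 0.5
eps = 1e-12

def log_posterior(Z1, Z2):
    """Poisson log-likelihood plus Gamma log-prior, up to additive constants."""
    Xh = Z1 @ Z2 + eps
    loglik = np.sum(M * (X * np.log(Xh) - Xh))
    logprior = sum(np.sum((A - 1) * np.log(Z + eps) - (A / B) * Z) for Z in (Z1, Z2))
    return loglik + logprior

trace = [log_posterior(Z1, Z2)]
for _ in range(100):
    ratio = M * X / (Z1 @ Z2 + eps)
    # Delta_1(Q) = Q @ Z2.T, so Delta_1(M) = M @ Z2.T
    Z1 = ((A - 1) + Z1 * (ratio @ Z2.T)) / (A / B + M @ Z2.T)
    ratio = M * X / (Z1 @ Z2 + eps)
    # Delta_2(Q) = Z1.T @ Q, so Delta_2(M) = Z1.T @ M
    Z2 = ((A - 1) + Z2 * (Z1.T @ ratio)) / (A / B + Z1.T @ M)
    trace.append(log_posterior(Z1, Z2))
```

With A = 1 the prior terms vanish from the numerator and only the rate A/B remains as a regulariser in the denominator.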
Deriving the update equations
• Matrix factorisations as a Generalised Linear Model
• Consider a MF model
g( ˆX) = Z1Z2
where Z1, Z2 and g( ˆX) are matrices of compatible sizes.
• Use vec(AXB) = (B⊤ ⊗ A) vec X to obtain
vec(g( ˆX)) = (I|j| ⊗ Z1) vec(Z2) ≡ Lz
• We can compute a factorisation using the general GLM update equation by alternating between Z1 and Z2
• Readily generalised to arbitrary tensors
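Both vectorisation identities can be checked numerically. The sketch assumes the usual column-major (column stacking) definition of vec and arbitrary small matrix sizes.

```python
import numpy as np

rng = np.random.default_rng(10)

def vec(A):
    """Stack the columns of A into one long vector (column-major order)."""
    return A.reshape(-1, order='F')

# vec(A X B) = (B^T kron A) vec(X)
A = rng.random((3, 4))
Xm = rng.random((4, 5))
Bm = rng.random((5, 2))
identity_ok = np.allclose(vec(A @ Xm @ Bm), np.kron(Bm.T, A) @ vec(Xm))

# Special case used above: vec(Z1 Z2) = (I_{|j|} kron Z1) vec(Z2) = L z
Z1 = rng.random((3, 2))
Z2 = rng.random((2, 6))            # |j| = 6 columns
L = np.kron(np.eye(6), Z1)         # design matrix of the GLM when Z1 is fixed
z = vec(Z2)
glm_ok = np.allclose(vec(Z1 @ Z2), L @ z)
```

With Z1 fixed the model is linear in z, so a GLM update applies; swapping the roles of Z1 and Z2 gives the other half of the alternation.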
Extensions to NMFD and NMF2D
• Convolutive model (I)
$E(\phi, d, i) = \sum_{k, l} B(k, l)\, C(k, \overbrace{d - l}^{\alpha}, \phi, i) \quad (3)$
• Basis spline model (II)
$E(\phi, d, i) = \sum_k B(k, d)\, C(k, \phi, i) \quad (4)$
φ: note index, i: source index, d: local time index
Extensions to NMFD and NMF2D
[Figure: factor graphs of the extended models: (a) NMFD+I, (b) NMFD+II, (c) NMF2D+I, (d) NMF2D+II]
Application: Missing audio restoration
• 50 short mono audio examples sampled at 44.1kHz (from FitzGerald, Cranitch, Coyle 2008).
• Compute a spectrogram with 1024-sample windows and no overlap.
• Randomly remove blocks of 10 consecutive time frames, giving gaps of approx. 250ms.
• In total, 20 per cent of each audio file is removed in such long gaps.
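The missing-data pattern can be sketched as a mask M over the spectrogram. The number of frames and the alignment of gaps to a fixed grid of 10-frame slots are simplifying assumptions of this example; the original experiment may have placed the gaps differently.

```python
import numpy as np

rng = np.random.default_rng(11)
sr, win = 44100, 1024              # sampling rate (Hz) and window length (samples)
gap_frames = 10                    # frames removed per gap
gap_sec = gap_frames * win / sr    # about 0.23 s per gap with non-overlapping windows

n_bins, n_frames = win // 2 + 1, 500     # hypothetical spectrogram size
n_slots = n_frames // gap_frames         # candidate gap positions on a fixed grid
n_gaps = round(0.2 * n_slots)            # remove 20 per cent of the frames

starts = rng.choice(n_slots, size=n_gaps, replace=False) * gap_frames
M = np.ones((n_bins, n_frames))          # 1 = observed, 0 = missing
for s in starts:
    M[:, s:s + gap_frames] = 0.0

missing_fraction = 1.0 - M.mean()
```

Neighbouring slots may both be selected, in which case two gaps merge into a longer one.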
Table 1: Evaluation of the models on missing audio restoration
             SNR                     MSE
             IS      KL     EUC     IS      KL     EUC
NMFD         2.99    4.74   5.05    4.43    2.91   2.68
SF-SSNTF     −0.28   5.09   5.06    15.00   2.57   2.59
NMFD + I     3.01    6.00   6.91    5.89    2.23   1.68
NMFD + II    5.00    5.79   5.80    2.74    2.20   2.17
Application: Source Separation
• 50 short mono audio examples sampled at 44.1kHz (from FitzGerald, Cranitch, Coyle 2008∗)
• Mix pairs of examples
• Operate on Constant-Q magnitude (computed via Schoerkhuber and Klapuri)
• Using KL cost
• Reconstruct sources using the estimated magnitudes and phase of the mixture
Model        SDR       SIR        SAR
NMF2D        6.10      19.00      7.50
NMF2D + I    6.19      19.84      6.84
SF-SSNTF∗    ≈ 8.00    ≈ 24.00    ≈ 8.00
Coupled Tensor Factorisations
Example Problem
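$X_1^{i,j,k} \approx \hat{X}_1^{i,j,k} = \sum_r A^{i,r} B^{j,r} C^{k,r}$
$X_2^{j,p} \approx \hat{X}_2^{j,p} = \sum_r B^{j,r} D^{p,r}$
$X_3^{j,q} \approx \hat{X}_3^{j,q} = \sum_r B^{j,r} E^{q,r}$
[Figure: factor graph with the factors A, B, C, D, E connected to the observed tensors X1, X2, X3; B is shared by all three]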
Coupled Tensor Factorisations
• Factorise multiple observed tensors simultaneously: Xν for ν = 1 . . . |ν|.
• Each observed tensor Xν now has a corresponding index set V0,ν and a particular configuration will be denoted by v0,ν ≡ uν
• We define a |ν| × |α| coupling matrix R where
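$R^{\nu,\alpha} = \begin{cases} 1 & X_\nu \text{ and } Z_\alpha \text{ connected} \\ 0 & \text{otherwise} \end{cases}$
$\hat{X}_\nu(u_\nu) = \sum_{\bar{u}_\nu} \prod_\alpha Z_\alpha(v_\alpha)^{R^{\nu,\alpha}} \quad (5)$
Example
$X_1^{i,j,k} \approx \sum_r Z_1^{i,r} Z_2^{j,r} Z_3^{k,r}, \qquad X_2^{j,p} \approx \sum_r Z_2^{j,r} Z_4^{p,r}, \qquad X_3^{j,q} \approx \sum_r Z_2^{j,r} Z_5^{q,r} \quad (6)$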
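The coupling matrix R for this example can be written down and used to evaluate all three model outputs. The dimension sizes and the helper `model` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(12)
dims = {'i': 4, 'j': 5, 'k': 3, 'p': 6, 'q': 2, 'r': 2}   # hypothetical sizes
factor_idx = ['ir', 'jr', 'kr', 'pr', 'qr']               # index sets of Z1 .. Z5
obs_idx = ['ijk', 'jp', 'jq']                             # index sets of X1 .. X3

# R[nu, alpha] = 1 if X_nu and Z_alpha are connected
R = np.array([[1, 1, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1]])

Z = [rng.random(tuple(dims[c] for c in idx)) for idx in factor_idx]

def model(nu):
    """X_hat_nu: product of the factors with R[nu, alpha] = 1, latent indices summed out."""
    used = [a for a in range(len(Z)) if R[nu, a] == 1]
    spec = ','.join(factor_idx[a] for a in used) + '->' + obs_idx[nu]
    return np.einsum(spec, *(Z[a] for a in used))

X_hat = [model(nu) for nu in range(3)]
```

Raising a factor to the power 0 in equation (5) simply removes it from the product, which is what the selection of `used` factors does here.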
Update rules for Coupled Tensor Factorisations
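$\Delta_{\alpha,\nu}^\varepsilon(Q) = \left[ \sum_{u_\nu \cap \bar{v}_\alpha} Q(u_\nu) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})^{R^{\nu,\alpha'} \varepsilon} \right] \quad (7)$
• Update for Nonnegative CTF
$Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}(W_\nu \circ X_\nu)}{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}(W_\nu \circ \hat{X}_\nu)} \quad (8)$
• In the special case of the Tweedie family, i.e. for distributions whose precision is $W_\nu = \hat{X}_\nu^{-p}$, the update is
$Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}(\hat{X}_\nu^{-p} \circ X_\nu)}{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}(\hat{X}_\nu^{1-p})} \quad (9)$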
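A sketch of the nonnegative coupled update for the three-tensor example, using p = 1 (the KL case) and synthetic data generated from random nonnegative factors. The sizes, helper names and iteration count are assumptions; ∆ is evaluated by direct contraction.

```python
import numpy as np

rng = np.random.default_rng(13)
dims = {'i': 4, 'j': 5, 'k': 3, 'p': 6, 'q': 2, 'r': 2}   # hypothetical sizes
factor_idx = ['ir', 'jr', 'kr', 'pr', 'qr']               # Z1 .. Z5
obs_idx = ['ijk', 'jp', 'jq']                             # X1 .. X3
R = np.array([[1, 1, 1, 0, 0], [0, 1, 0, 1, 0], [0, 1, 0, 0, 1]])

def model(Z, nu):
    used = [a for a in range(len(Z)) if R[nu, a]]
    spec = ','.join(factor_idx[a] for a in used) + '->' + obs_idx[nu]
    return np.einsum(spec, *(Z[a] for a in used))

def delta(Z, a, nu, Q):
    """Delta_{a,nu}(Q): contract Q with the other factors of X_nu onto the indices of Z_a."""
    others = [b for b in range(len(Z)) if R[nu, b] and b != a]
    spec = ','.join([obs_idx[nu]] + [factor_idx[b] for b in others]) + '->' + factor_idx[a]
    return np.einsum(spec, Q, *(Z[b] for b in others))

Z_true = [rng.gamma(2.0, 1.0, tuple(dims[c] for c in idx)) for idx in factor_idx]
X = [model(Z_true, nu) for nu in range(3)]
Z = [rng.random(z.shape) + 0.1 for z in Z_true]

def total_kl(Z):
    out = 0.0
    for nu in range(3):
        Xh = model(Z, nu)
        out += np.sum(X[nu] * np.log(X[nu] / Xh) - X[nu] + Xh)
    return out

p = 1.0                                   # Tweedie index for the Poisson / KL case
costs = [total_kl(Z)]
for _ in range(200):
    for a in range(len(Z)):
        num = np.zeros_like(Z[a])
        den = np.zeros_like(Z[a])
        for nu in range(3):
            if R[nu, a]:                  # only coupled tensors contribute
                Xh = model(Z, nu)
                num += delta(Z, a, nu, Xh ** (-p) * X[nu])
                den += delta(Z, a, nu, Xh ** (1 - p))
        Z[a] = Z[a] * num / den
    costs.append(total_kl(Z))
```

The shared factor Z2 accumulates contributions from all three observed tensors, whereas Z1, Z3, Z4 and Z5 each see only one.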
Update rules for Coupled Tensor Factorisations
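$\Delta_{\alpha,\nu}^\varepsilon(Q) = \left[ \sum_{u_\nu \cap \bar{v}_\alpha} Q(u_\nu) \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})^{R^{\nu,\alpha'} \varepsilon} \right] \quad (10)$
• General Update for CTF
$Z_\alpha \leftarrow Z_\alpha + \frac{2}{\lambda_{\alpha \backslash 0}} \frac{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}(W_\nu \circ (X_\nu - \hat{X}_\nu))}{\sum_\nu R^{\nu,\alpha} \Delta_{\alpha,\nu}^2(W_\nu)} \quad (11)$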
GCTF Application: Audio restoration
[Figure: three spectrogram panels: Ground Truth, Observed Spectrogram, Reconstructed Spectrogram]
Sampling rate 11025 Hz, frame length 93 msec; 50 missing chunks (average length 0.23 sec., max. length 1.07 sec.); with side information: Bach chorales; SNR: 4.38 dB
GCTF Application: Score aided Audio restoration
Observed tensors: X1(f, t) (Audio with Missing Parts), X2(i, n) (MIDI file), X3(f, p) (Isolated Notes)
Hidden tensors: D(f, i) (Spectral Templates), E(i, t) (Excitations of X1), B(i, τ, k) (Chord Templates), C(k, d) (Excitations of E), G(k, m) (Excitations of X2), F(i, p) (Excitations of X3)
GCTF Application: Score aided Audio restoration
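$\hat{X}_1(f, t) = \sum_{i, \tau, k, d} D(f, i)\, B(i, \tau, k)\, C(k, d)\, Z(d, t, \tau)$   Test file (12)
$\hat{X}_2(i, n) = \sum_{\tau, k, m} B(i, \tau, k)\, G(k, m)\, Y(m, n, \tau)$   MIDI file (13)
$\hat{X}_3(f, p) = \sum_i D(f, i)\, F(i, p)\, T(i, p)$   Merged training files (14)
With the factors ordered as D, B, C, Z, G, Y, F, T, the coupling matrix is
$R = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$
with
$\hat{X}_1 = \sum D^1 B^1 C^1 Z^1 G^0 Y^0 F^0 T^0$
$\hat{X}_2 = \sum D^0 B^1 C^0 Z^0 G^1 Y^1 F^0 T^0$
$\hat{X}_3 = \sum D^1 B^0 C^0 Z^0 G^0 Y^0 F^1 T^1 \quad (15)$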
GCTF Application: Score aided Audio restoration
[Figure: spectrograms (Time (sec) against Frequency (Hz)) of X3 (Isolated Recordings), X1, X1hat (Restored) and Ground Truth; piano roll (Time (sec) against Notes) of X2 (Transcription Data); Performance plot of SNR (dB) against Missing Data Percentage (%), comparing the reconstruction SNR with the initial SNR]
Figure 1: Observed matrices. X1: spectrum of the piano performance (missing data (70%) are shown in white); X2: a piano roll obtained from a musical score of the piece; X3: spectra of 88 isolated notes from a piano.
Summary and Future Work
• Covers a broad class of models and topologies using a graphical model formalism
• Generalised Linear models for ML estimation in TF models
• Full Bayesian treatment (inference, model selection) is possible via MCMC or Variational Inference
• Automatic code generation à la WinBUGS or Infer.NET, given a model specification
• Fast Computations on a GPU
• Prior structures, Online inference
• Applications!!
A Toolbox based on GPU computation
Probabilistic Latent Tensor Factorization Matlab Toolkit
http://www.cmpe.boun.edu.tr/pilab/pilabfiles/pltftoolbox/
References http://www.cmpe.boun.edu.tr/~cemgil
• Simsekli, U., Cemgil, A. T. and Yilmaz, K., Score Guided Audio Restoration via Generalised Coupled Tensor Factorisation, Submitted 2011
• Yilmaz, K., Cemgil, A. T. and Simsekli, U., Generalised Coupled Tensor Factorisation, NIPS, 2011
• Cemgil, Simsekli and Subakan, Probabilistic Latent Tensor factorisation framework for audio modeling, IEEE WASPAA, 2011
• Yilmaz, K. and Cemgil, A. T., Algorithms for Probabilistic Latent Tensor Factorization, Signal Processing, Elsevier, 2011
– Longer version of Yilmaz, K. and Cemgil, A. T., Probabilistic Latent Tensor factorisation, Proc. of ICA/LVA, 2010
• C. Févotte and A. T. Cemgil, Nonnegative matrix factorisations as probabilistic inference in composite models, Eusipco 2009, Glasgow
• A. T. Cemgil. Bayesian inference in non-negative matrix factorisation models. Computational Intelligence and Neuroscience, 2009
Change Point Models
• Real time pitch/event detection with NMF style models
• Tempo tracking and real time interaction
[Figure: HMM and CPM compared on the same example: log t_{ν,i}, the state r_τ (attack, sustain, release), v_τ and log x_{ν,τ}]
HMM versus Change Point Model
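[Figure: graphical models over the time slices τ − 1 and τ. HMM: variables r_τ, v_τ and observations x_{ν,τ}; change point model: the same variables with an additional variable c_τ]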
Detection Results (HMM versus CP)
[Figure: Precision (%), Recall (%) and Latency (ms) against Lag (ms) for HMM and CPM]
Realtime Interaction
• Combine Perception with Control
[Figure: block diagram with Claves, Lego Robot, Loud Speaker, Perception, Midi Synthesizer and Controller; signals: Tempo, Bar Position, Motor Feedback, Motor Command]
Tempo Tracker, Graphical Model
[Figure: graphical model over the time slices τ − 1 and τ with variables n_τ, m_τ, r_τ, v_τ and observations x_{ν,τ}]
Tempo Tracker
[Figure: Tempo (BPM), Bar Position and Acoustic Events over time; Audio Spectra x_{ν,τ} (Time (sec) against Frequency); Spectral Templates t_{ν,i} (Acoustic Events against Frequency)]
Controller
[Figure: ∆m_τ against ∆n_τ for τ = 0, 1, 2, 3, 4]
Controller
[Figure: Bar Position, Tempo (BPM), Robot’s Position and Robot’s Velocity against Time (sec)]
Results
[Figure: Bar Position against Time (seconds) for Robot’s Position and Tracker’s Position; Tempo (beat/min) against Time (seconds) for Robot Speed and Tracker Speed]
References (Realtime Tracking and Interaction)
• U. Simsekli, O. Sonmez, B. Kurt, A. T. Cemgil, Combined Perception and Control for Timing in Robotic Music Performances, EURASIP Journal on Audio, Speech, and Music Processing, 2011
• U. Simsekli, A. T. Cemgil, Probabilistic Models for Real-Time Acoustic Event Detection with Application to Pitch Tracking, Journal of New Music Research, 2011
• U. Simsekli, A. Jylha, C. Erkut and A. T. Cemgil, Real-Time Recognition of Percussive Sounds by a Model-Based Method, EURASIP Journal on Advances in Signal Processing, 2011.