Hierarchical Bayesian Models for Audio and Music Signal Processing
A. Taylan Cemgil
Signal Processing and Communications Lab.
8 December 2007
NIPS 07 Workshop on Music
Collaborators
• Onur Dikmen, Boğaziçi, Istanbul
• Paul Peeling, Cambridge
• Nick Whiteley, Cambridge
• Simon Godsill, Cambridge
• Cedric Fevotte, ENST, Paris Telecom
• David Barber, UCL, London
• Bert Kappen, Nijmegen, The Netherlands
Statistical Approaches
• Probabilistic
• Hierarchical signal models to incorporate prior knowledge/inspiration from various sources:
– Physics (acoustics, physical models, ...)
– Studies of human cognition and perception (masking, psychoacoustics, ...)
– Musicology (musical constructs, harmony, tempo, form ...)
• Consistent framework for developing inference algorithms
• Contrast with traditional/procedural approaches, where there is no clear distinction between “what” and “how”
• Need to overcome computational obstacles (time, memory)
Generative Models for audition
• Computer audition ⇔ inverse synthesis via Bayesian inference
p(Structure|Observations) ∝ p(Observations|Structure)p(Structure)
Goal: Develop flexible prior structures for modelling nonstationary sources
∗ source separation, transcription
∗ restoration, interpolation, localisation, identification
Bayesian Source Separation
• Joint estimation of Sources, given Observations
• Source Model, v: parameters of the source prior over s_{k,1} ... s_{k,N}
• Observation Model, λ: channel noise, mixing system; observations x_{k,1} ... x_{k,M}, k = 1 ... K

p(Src|Obs) ∝ ∫ dλ dv p(Obs|Src, λ) p(Src|v) p(v)
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
Polyphonic Music Transcription
• from sound ...
(Figure: spectrogram, f/Hz against t/sec)
• ... to score
Modelling and Computational issues
• Hierarchical
  – Signal level: pitch, onsets, timbre
  – Symbolic level: melody, motives, harmony, chords, tonality, rhythm, beat, tempo, articulation, instrumentation, voice ...
  – Cognitive level: expression, genre, form, style, mood, emotion
• Uncertainty
  – Parameter Learning: which pitch, rhythm, tempo, meter, time signature ... ?
  – Model Selection: how many notes, harmonics, onsets, sections ... ?
Generative Models for Music
(Figure: generative hierarchy — Score, Expression, Piano-Roll)
Hierarchical Modeling of Music
(Graphical model over time slices 1 ... t: global variable M; chains v_1 ... v_t, k_1 ... k_t, h_1 ... h_t, m_1 ... m_t; per-voice chains g_{j,1} ... g_{j,t}, r_{j,1} ... r_{j,t}, n_{j,1} ... n_{j,t}, x_{j,1} ... x_{j,t}, y_{j,1} ... y_{j,t}; observations y_1 ... y_t)
Modelling levels
• Physical - acoustical
• Time domain – state space, dynamical models
• Transform domain – Fourier representations, Generalised Linear model
• Feature Based
Research Questions:
What kinds of prior knowledge and modelling techniques are useful?
How can we do efficient inference ?
Signal Models for Audio
• Time domain – state space, dynamical models
  – Conditional Linear Dynamical Systems, Gaussian processes (e.g. AR, ARMA), switching state space models
  – Flexible, physically realistic
  – Analysis down to sample precision; computationally quite heavy
• Transform domain – Fourier representations, Generalised Linear model
  – Models on (orthogonal) transform coefficients; energy compaction
  – Practical, can make use of fast transforms (FFT, MDCT, ...)
  – Inherent limitations (analysis windows, frequency resolution)
Sinusoidal Modeling
• Sound is primarily about oscillations and resonance
• Cascade of second order systems
• Audio signals can often be compactly represented by sinusoids

(real)    y_n = Σ_{k=1}^{p} α_k e^{−γ_k n} cos(ω_k n + φ_k)

(complex) y_n = Σ_{k=1}^{p} c_k (e^{−γ_k + jω_k})^n

y = F(γ_{1:p}, ω_{1:p}) c
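The equivalence of the real and complex forms above can be checked numerically. A minimal sketch (parameter values are illustrative, not from the talk):

```python
import numpy as np

# Synthesise the damped sinusoidal model in both its real and complex forms
# and verify they agree: α_k cos(ω_k n + φ_k) e^{−γ_k n} = Re[c_k ρ_k^n]
# with c_k = α_k e^{jφ_k} and pole ρ_k = e^{−γ_k + jω_k}.
p, n = 3, np.arange(1000)
rng = np.random.default_rng(0)
gamma = rng.uniform(1e-3, 5e-3, p)        # damping coefficients γ_k
omega = rng.uniform(0.05, 0.5, p)         # angular frequencies ω_k
alpha = rng.uniform(0.5, 2.0, p)          # amplitudes α_k
phi = rng.uniform(-np.pi, np.pi, p)       # phases φ_k

# real form: y_n = Σ_k α_k exp(−γ_k n) cos(ω_k n + φ_k)
y_real = sum(a * np.exp(-g * n) * np.cos(w * n + f)
             for a, g, w, f in zip(alpha, gamma, omega, phi))

# complex form: y_n = Re Σ_k c_k (exp(−γ_k + jω_k))^n
c = alpha * np.exp(1j * phi)
poles = np.exp(-gamma + 1j * omega)
y_cplx = np.real(sum(ck * pk ** n for ck, pk in zip(c, poles)))

assert np.allclose(y_real, y_cplx)
```

The complex form is what the state space parametrisation on the next slide propagates one pole at a time.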
State space Parametrisation

x_{n+1} = diag(e^{−γ_1 + jω_1}, ..., e^{−γ_p + jω_p}) x_n ≡ A x_n

x_0 = (c_1, c_2, ..., c_p)^T

y_n = (1 1 ... 1 1) x_n ≡ C x_n

(Graphical model: state chain x_0, x_1, ..., x_{k−1}, x_k, ..., x_K with observations y_1, ..., y_{k−1}, y_k, ..., y_K)
Audio Restoration/Interpolation
• Estimate missing samples given observed ones
• Restoration, concatenative expressive speech synthesis, ...
Audio Interpolation
p(x_{¬κ} | x_κ) ∝ ∫ dH p(x_{¬κ}|H) p(x_κ|H) p(H),    H ≡ (parameters, hidden states)

(Figure: x_{¬κ} missing, x_κ observed)
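As a toy illustration of this conditioning (not the talk's state-space model), one can place a joint Gaussian prior on the whole signal and read off the conditional mean of a missing block given the observed samples. The squared-exponential covariance and all sizes below are assumptions for the sketch:

```python
import numpy as np

# Fill a missing gap x_miss with E[x_miss | x_obs] under a joint Gaussian
# prior. Here the prior is a smooth GP covariance; the talk uses richer
# hierarchical state-space priors, but the conditioning step is the same.
rng = np.random.default_rng(1)
n = 200
t = np.arange(n)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 15.0) ** 2) + 1e-6 * np.eye(n)

x = rng.multivariate_normal(np.zeros(n), K)    # ground-truth draw from the prior
miss = np.zeros(n, dtype=bool)
miss[80:120] = True                            # a contiguous missing gap

Koo = K[~miss][:, ~miss]
Kmo = K[miss][:, ~miss]
# E[x_miss | x_obs] = K_mo K_oo^{-1} x_obs
x_hat = Kmo @ np.linalg.solve(Koo, x[~miss])

err = np.mean((x_hat - x[miss]) ** 2) / np.var(x)
print(f"relative reconstruction error: {err:.3f}")
```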
Probabilistic Phase Vocoder (Cemgil and Godsill 2005)
(Graphical model: for each band ν = 0 ... W−1, a chain s_0^ν ... s_k^ν ... s_{K−1}^ν with transition matrix A^ν and noise covariance Q^ν, observed through x_0 ... x_k ... x_{K−1})

s_k^ν ∼ N(s_k^ν; A^ν s_{k−1}^ν, Q^ν)

A^ν ∼ N(A^ν; [cos(ω_ν) −sin(ω_ν); sin(ω_ν) cos(ω_ν)], Ψ)
Inference: Structured Variational Bayes
(Factorised approximation: q(A^α), q(Q^α), chain factors Π_k q(s_k^α | s_{k−1}^α) for α ∈ C, and q(x_k))

• Intuitive algorithm:
  – Subtract from the observed signal x the prediction of the frequency bands in ¬α.
Restoration
• Piano
  – Signal with missing samples (37%)
  – Reconstruction, 7.68 dB improvement
  – Original
• Trumpet
  – Signal with missing samples (37%)
  – Reconstruction, 7.10 dB improvement
  – Original
Hierarchical Factorial Models
• Each component models a latent process
• The observations are projections
(Graphical model: for each ν = 1 ... W, changepoint chain r_0^ν ... r_k^ν ... r_K^ν driving state chain θ_0^ν ... θ_k^ν ... θ_K^ν; observations y_k ... y_K)

• Generalises source-filter models
Harmonic model with changepoints
r_k | r_{k−1} ∼ p(r_k | r_{k−1}),    r_k ∈ {0, 1}

θ_k | θ_{k−1}, r_k ∼ [r_k = 0] N(A θ_{k−1}, Q)   (reg)
                   + [r_k = 1] N(0, S)           (new)

y_k | θ_k ∼ N(C θ_k, R)

A = diag(G_ω, G_ω^2, ..., G_ω^H),    G_ω = ρ_k [cos(ω) −sin(ω); sin(ω) cos(ω)]

with damping factor 0 < ρ_k < 1, frame length N, and damped sinusoidal basis matrix C of size N × 2H
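The generative model above is easy to simulate. A minimal sketch, with a single observed sample per frame (C is 1 × 2H here; in the talk C is an N × 2H damped sinusoidal basis) and all numeric settings illustrative:

```python
import numpy as np

# Simulate the harmonic changepoint model: a block-diagonal rotation A acts
# on H harmonic pairs; with probability p_cp the state is reinitialised
# ("new"), otherwise it evolves by the damped rotation ("reg").
rng = np.random.default_rng(2)
H, omega, rho, K = 4, 2 * np.pi * 0.01, 0.999, 200

G = rho * np.array([[np.cos(omega), -np.sin(omega)],
                    [np.sin(omega),  np.cos(omega)]])
A = np.zeros((2 * H, 2 * H))
for h in range(1, H + 1):                      # A = diag(G, G^2, ..., G^H)
    A[2*h-2:2*h, 2*h-2:2*h] = np.linalg.matrix_power(G, h)

p_cp, Q, S, R = 0.02, 1e-6, 1.0, 1e-2          # changepoint prob., variances
C = np.tile([1.0, 0.0], H)                     # read the cosine phase of each partial

theta = rng.normal(0.0, np.sqrt(S), 2 * H)
y, r = np.zeros(K), np.zeros(K, dtype=int)
for k in range(K):
    r[k] = int(rng.random() < p_cp)
    if r[k] == 1:                              # "new": state reinitialised
        theta = rng.normal(0.0, np.sqrt(S), 2 * H)
    else:                                      # "reg": damped harmonic rotation
        theta = A @ theta + rng.normal(0.0, np.sqrt(Q), 2 * H)
    y[k] = C @ theta + rng.normal(0.0, np.sqrt(R))
print("changepoint frames:", np.flatnonzero(r))
```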
Exact Inference in switching state space models is intractable
• In general, exact inference is NP hard
– Conditional Gaussians are not closed under marginalization
⇒ Unlike HMMs or KFMs, summing over r_k does not simplify the filtering density
⇒ The number of Gaussian kernels needed to represent the exact filtering density p(r_k, θ_k | y_{1:k}) increases exponentially
Exact Inference for Changepoint detection
• Exact inference is achievable in polynomial time/space
– Intuition: when a changepoint occurs, the state vector θ is reinitialised
⇒ The number of Gaussian kernels grows only polynomially (see, e.g., Barry and Hartigan 1992, Digalakis et al. 1993, Ó Ruanaidh and Fitzgerald 1996, Gustafsson 2000, Fearnhead 2003, Zoeter and Heskes 2006)

(Example trajectory: r_1 = 1, r_2 = 0, r_3 = 0, r_4 = 1, r_5 = 0, with states θ_0 ... θ_5 and observations y_1 ... y_5)

• The same structure can be exploited for the MMAP problem arg max_{r_{1:k}} p(r_{1:k} | y_{1:k})
⇒ Trajectories r_{1:k}^(i) which are dominated in terms of conditional evidence p(y_{1:k}, r_{1:k}^(i)) can be discarded without destroying optimality
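The polynomial-growth intuition can be demonstrated on a scalar reset model (a deliberately simplified stand-in for the harmonic model; all parameters are illustrative). The exact filtering density is a Gaussian mixture with one component per run length since the last changepoint, so the mixture grows linearly rather than exponentially:

```python
import math
import numpy as np

# Exact filtering for a scalar reset model:
#   r_k ~ Bernoulli(p);  θ_k = a θ_{k−1} + N(0, q) if r_k = 0, else θ_k ~ N(0, S)
#   y_k = θ_k + N(0, R)
# Each mixture component corresponds to one run length (time since reset).

def loglik(y, m, v):                      # log N(y; m, v)
    return -0.5 * (math.log(2 * math.pi * v) + (y - m) ** 2 / v)

def filter_step(components, y, p=0.05, a=0.99, q=0.01, S=1.0, R=0.1):
    new = [(math.log(p), 0.0, S)]         # reset component (run length 0)
    for lw, m, v in components:           # propagate surviving components
        new.append((lw + math.log(1 - p), a * m, a * a * v + q))
    out = []
    for lw, m, v in new:                  # Kalman measurement update per component
        lw += loglik(y, m, v + R)
        gain = v / (v + R)
        out.append((lw, m + gain * (y - m), (1 - gain) * v))
    lmax = max(lw for lw, _, _ in out)    # renormalise log-weights
    lse = lmax + math.log(sum(math.exp(lw - lmax) for lw, _, _ in out))
    return [(lw - lse, m, v) for lw, m, v in out]

rng = np.random.default_rng(3)
y = rng.normal(size=50)
comps = [(0.0, 0.0, 1.0)]
for yk in y:
    comps = filter_step(comps, yk)
print("mixture components after 50 steps:", len(comps))   # → 51
```

Pruning dominated components (as in the MMAP remark above) would keep this count bounded in practice.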
Monophonic model (Cemgil et al. 2006)
• We introduce a pitch label indicator m
• At each time k, the process can be in one of the {“mute”, “sound”} × M states.
(Graphical model: chains r_0 r_1 ... r_T, m_0 m_1 ... m_T, s_0 s_1 ... s_T, with observations y_1 ... y_T)

Monophonic Pitch Tracking
Monophonic Pitch Tracking = Online estimation (filtering) of p(r k , m k |y 1:k ).
• If pitch is constant exact inference is possible
Transcription
• Detecting onsets, offsets and pitch to sample precision (Cemgil et al. 2006, IEEE TASLP)
Tracking Pitch Variations
• Allow m to change with k.
• Intractable; need to resort to approximate inference (Mixture Kalman Filter – Rao-Blackwellized Particle Filter)
Factorial Generative models for Analysis of Polyphonic Audio
(Figure: latent changepoint processes per frequency ν, observed signal x_k)
• Each latent changepoint process ν = 1 . . . W corresponds to a “piano key”.
Single time slice - Bayesian Variable Selection
r_i ∼ C(r_i; π_on, π_off)

s_i | r_i ∼ [r_i = on] N(s_i; 0, Σ) + [r_i ≠ on] δ(s_i)

x | s_{1:W} ∼ N(x; C s_{1:W}, R),    C ≡ [C_1 ... C_i ... C_W]

(Graphical model: indicators r_1 ... r_W, sources s_1 ... s_W, observation x)

• Generalized Linear Model – columns of C are the basis vectors
• The exact posterior is a mixture of 2^W Gaussians
• When W is large, computation of posterior features becomes intractable.
• Sparsity by construction (Olshausen and Millman, Attias, ...)
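For small W the 2^W posterior can be enumerated directly, which makes the intractability argument concrete. A sketch with illustrative isotropic priors (Σ = σ_s² I, R = σ_x² I) and sizes:

```python
import itertools
import numpy as np

# Single-slice Bayesian variable selection: enumerate all 2^W on/off
# configurations r, score each by its marginal likelihood p(x | r) with the
# sources integrated out, and form the posterior over configurations.
rng = np.random.default_rng(4)
W, D = 4, 16                          # number of sources, observation dimension
sigma_s2, sigma_x2, pi_on = 1.0, 0.01, 0.5
C = rng.normal(size=(D, W))           # columns C_i are the basis vectors

r_true = np.array([1, 0, 1, 0])
s_true = r_true * rng.normal(0.0, np.sqrt(sigma_s2), W)
x = C @ s_true + rng.normal(0.0, np.sqrt(sigma_x2), D)

def log_marginal(r):
    # x | r ~ N(0, σ_s² C_r C_r^T + σ_x² I)
    Cr = C[:, np.array(r, dtype=bool)]
    Sigma = sigma_s2 * Cr @ Cr.T + sigma_x2 * np.eye(D)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + x @ np.linalg.solve(Sigma, x))

configs = list(itertools.product([0, 1], repeat=W))
logp = np.array([log_marginal(r)
                 + sum(r) * np.log(pi_on)
                 + (W - sum(r)) * np.log(1 - pi_on) for r in configs])
post = np.exp(logp - logp.max())
post /= post.sum()
print("MAP configuration:", configs[int(post.argmax())])
```

The cost of this enumeration doubles with every added source, which is exactly why the factorial models that follow need approximate inference.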
Factorial Switching State space model
r_{0,ν} ∼ C(r_{0,ν}; π_{0,ν})        θ_{0,ν} ∼ N(θ_{0,ν}; μ_ν, P_ν)

r_{k,ν} | r_{k−1,ν} ∼ C(r_{k,ν}; π_ν(r_{k−1,ν}))                 (Changepoint indicator)

θ_{k,ν} | θ_{k−1,ν} ∼ N(θ_{k,ν}; A_ν(r_k) θ_{k−1,ν}, Q_ν(r_k))   (Latent state)

y_k | θ_{k,1:W} ∼ N(y_k; C_k θ_{k,1:W}, R)                       (Observation)

(Graphical model: for each ν = 1 ... W, chains r_0^ν ... r_k^ν ... r_K^ν and s_0^ν ... s_k^ν ... s_K^ν)
Synthetic Data
Technical Difficulties
• Inference is quite heavy
• Vanilla Kalman filtering methods are not stable – computations with large matrices
  – Need advanced techniques from linear algebra
  – Interesting links to subspace methods
• Hyperparameter learning is necessary
Modelling levels
• Physical - acoustical
• Time domain – state space, dynamical models
• Transform domain – Fourier representations, Generalised Linear model
• Feature Based
Spectrogram
• Basis functions φ_k(t), centered around time-frequency atom k = k(ν, τ) = (Frequency, Time), such as STFT or MDCT.

x(t) = Σ_k s_k φ_k(t)
(Figure: spectrograms, f/Hz against t/sec — Speech and Piano)
Models for time-frequency Energy distributions
• Non-Negative Matrix factorisation (Sha, Saul, Lee 2002; Smaragdis, Brown 2003; Virtanen 2003; Abdallah, Plumbley 2004)

X_{ν,τ} = Σ_j W_{ν,j} S_{j,τ}

Spectrogram = Spectral Templates × Excitations

– ... however, spectrograms are not additive (a² + b² ≠ (a + b)²)
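The factorisation X ≈ W S can be fitted with the standard KL multiplicative updates of Lee and Seung (this is the generic NMF recipe, not the talk's probabilistic formulation; data and sizes below are synthetic):

```python
import numpy as np

# KL-divergence NMF via multiplicative updates, factorising a nonnegative
# "spectrogram" X into spectral templates W and excitations S.
rng = np.random.default_rng(5)
F, T, J = 30, 40, 3                       # freq bins, frames, templates
W_true = rng.gamma(2.0, 1.0, (F, J))
S_true = rng.gamma(2.0, 1.0, (J, T))
X = W_true @ S_true                       # exactly rank-J synthetic data

W = rng.gamma(1.0, 1.0, (F, J))
S = rng.gamma(1.0, 1.0, (J, T))
eps = 1e-12
ones = np.ones((F, T))
for _ in range(200):
    # multiplicative updates preserve nonnegativity by construction
    W *= ((X / (W @ S + eps)) @ S.T) / (ones @ S.T + eps)
    S *= (W.T @ (X / (W @ S + eps))) / (W.T @ ones + eps)

# generalised KL divergence D(X || WS)
kl = np.sum(X * np.log(X / (W @ S + eps)) - X + W @ S)
print(f"KL divergence after 200 updates: {kl:.4f}")
```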
Models for time-frequency Energy distributions
• Mask models (Roweis 2001; Reyes-Gomez, Jojic, Ellis 2005; ...)

X_{ν,τ} = [r_{ν,τ} = 0] S_{ν,τ}(0) + [r_{ν,τ} = 1] S_{ν,τ}(1)

Spectrogram = Mask × Source 0 + (1 − Mask) × Source 1
Prior structures on time-frequency Energy distributions
• Main Idea: Spectrogram is a point estimate of the energy at a time-frequency atom k(ν, τ ).
• We place a suitable prior on the variance of transform coefficients s k and tie the prior variances across harmonically and temporally related time-frequency atoms
p(s|v) p(v) = ( Π_k p(s_k | v_k) ) p(v)
One channel source separation, Gaussian source model
(Graphical model: variances v_{k,1} ... v_{k,N}, sources s_{k,1} ... s_{k,N}, observation x_k, for k = 1 ... K)

s_{k,n} | v_{k,n} ∼ N(s_{k,n}; 0, v_{k,n})
x_k | s_{k,1:N} = Σ_{n=1}^{N} s_{k,n}

• Straightforward application of Bayes’ theorem yields

p(s_{k,n} | v_{k,1:N}, x_k) = N(s_{k,n}; κ_{k,n} x_k, v_{k,n}(1 − κ_{k,n})),    κ_{k,n} = v_{k,n} / Σ_{n′} v_{k,n′}    (Responsibilities)
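The posterior above has a simple reading: each source receives a share of the observation proportional to its responsibility. A minimal numerical sketch (sizes and variances are illustrative):

```python
import numpy as np

# Gaussian one-channel separation: given prior variances v_{k,n} and the
# observed sum x_k = Σ_n s_{k,n}, compute the posterior mean and variance of
# each source. The responsibilities κ_{k,n} sum to one per frame, so the
# posterior means sum back to the observation exactly.
rng = np.random.default_rng(6)
N, K = 3, 5
v = rng.gamma(2.0, 1.0, (K, N))            # prior variances v_{k,n}
s = rng.normal(0.0, np.sqrt(v))            # sources s_{k,n} ~ N(0, v_{k,n})
x = s.sum(axis=1)                          # observed mixtures x_k

kappa = v / v.sum(axis=1, keepdims=True)   # responsibilities κ_{k,n}
post_mean = kappa * x[:, None]             # E[s_{k,n} | x_k] = κ_{k,n} x_k
post_var = v * (1.0 - kappa)               # Var[s_{k,n} | x_k]

assert np.allclose(post_mean.sum(axis=1), x)
```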
One channel source separation, Poisson source model
(Graphical model as above, with Poisson sources)

s_{k,n} | v_{k,n} ∼ PO(s_{k,n}; v_{k,n})
x_k | s_{k,1:N} = Σ_{n=1}^{N} s_{k,n}

• This is the generative model for NMF when we write

v_{k(ν,τ),n} = t_{ν,n} × e_{τ,n}    (Template × Excitation)
Gamma G(x; a, b) and Inverse Gamma IG(x; a, b)
(Figure: densities p(x) — Gamma with a ∈ {0.9, 1, 1.3, 2}, b = 1; Inverse Gamma with (a, b) ∈ {(1, 1), (1, 0.5), (2, 1)})

G(x; a, z) ≡ exp((a − 1) log x − z^{−1} x + a log z^{−1} − log Γ(a))

IG(x; a, z) ≡ exp((a + 1) log x^{−1} − z^{−1} x^{−1} + a log z^{−1} − log Γ(a))
Gamma Chains
We define an inverse Gamma-Markov chain for k = 1 . . . K as follows
v_k | z_k ∼ IG(v_k; a, z_k/a)

z_{k+1} | v_k ∼ IG(z_{k+1}; a_z, v_k/a_z)

(Graphical model: alternating chain z_1, v_1, z_2, v_2, ... with couplings a and a_z)

• Variance variables v are priors for sources
• Auxiliary variables z are needed for conjugacy and positive correlation
• Shape parameters a and a_z describe coupling strength and drift of the chain
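Drawing a realisation of the chain is straightforward. Note one parameterisation assumption: reading off the IG density two slides back, IG(x; a, z) corresponds to a standard inverse gamma with shape a and scale 1/z, so IG(a, z_k/a) can be sampled as the reciprocal of a Gamma(a, z_k/a) draw:

```python
import numpy as np

# Sample the inverse Gamma-Markov chain: v_k | z_k ~ IG(a, z_k/a),
# z_{k+1} | v_k ~ IG(a_z, v_k/a_z). Larger shape parameters give a more
# slowly drifting chain (compare the "typical draws" slides).
rng = np.random.default_rng(7)

def sample_chain(a, a_z, K=1000, z0=1.0):
    v = np.empty(K)
    z = z0
    for k in range(K):
        v[k] = 1.0 / rng.gamma(a, z / a)        # v_k | z_k
        z = 1.0 / rng.gamma(a_z, v[k] / a_z)    # z_{k+1} | v_k
    return v

for a_z in (4, 10, 40):
    v = sample_chain(a=10, a_z=a_z)
    print(f"a_z = {a_z:2d}: std of increments of log v = "
          f"{np.diff(np.log(v)).std():.2f}")
```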
Gamma Chains, typical draws
(Figure: typical draws of log v_k over k = 1 ... 1000 for a = 10 and a_z ∈ {10, 4, 40})
Gamma Chains with changepoints, typical draws
(Figure: typical draws of log v_k with changepoints, for a = 10 and a_z ∈ {10, 4, 40})
Gamma Chains
• The joint can be written as a product of singleton and pairwise potentials of the form

ψ_{k,k} = exp(−a z_k^{−1} v_k^{−1})    (Pairwise)

φ_{z_k} = exp((a_z + a + 1) log z_k^{−1})    (Singletons)
Gamma Fields
• The joint can be written as product of singleton and pairwise potentials
ψ_{i,j} = exp(−a_{i,j} ξ_i^{−1} ξ_j^{−1})    (Pairwise)

φ_i = exp((Σ_j a_{i,j} + 1) log ξ_i^{−1})    (Singletons)
Possible Model Topologies
Approximate Inference
• Stochastic
– Markov Chain Monte Carlo, Gibbs sampler – Sequential Monte Carlo, Particle Filtering
• Deterministic
– Variational Bayes
In all these, conjugacy helps.
VB or Gibbs

(Factor graph: ψ_{0,1}, z_1, ψ_{1,1}, v_1, ψ_{1,2}, z_2, ψ_{2,2}, v_2, ..., with likelihoods p(y_1|v_1), p(y_2|v_2))

• VB

q^(τ)(v_k) ← exp(φ_k + ⟨log ψ_{k,k} + log ψ_{k,k+1}⟩_{q^(τ)(z_k) q^(τ)(z_{k+1})})

• Gibbs

v_k^(τ) ∼ p(v_k | z_k, z_{k+1}) ∝ ψ_{k,k}(z_k^(τ)) ψ_{k,k+1}(z_{k+1}^(τ))
VB or Gibbs

(Factor graph as above)

• VB

q^(τ)(z_k) ← exp(φ_k + ⟨log ψ_{k,k−1} + log ψ_{k,k}⟩_{q^(τ)(v_{k−1}) q^(τ)(v_k)})

• Gibbs

z_k^(τ) ∼ p(z_k | v_{k−1}, v_k) ∝ ψ_{k,k−1}(v_{k−1}^(τ)) ψ_{k,k}(v_k^(τ))
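Conjugacy makes the Gibbs sweep particularly simple: each full conditional is again inverse gamma. The shape/scale algebra below is derived from the chain densities on the "Gamma Chains" slide (so treat it as an assumption), and the data term is omitted to keep the sketch short:

```python
import numpy as np

# Gibbs sweep over the inverse Gamma-Markov chain prior. By conjugacy:
#   v_k | z_k, z_{k+1} ~ IG(a + a_z,  a/z_k + a_z/z_{k+1})   (shape, scale)
#   z_k | v_{k-1}, v_k ~ IG(a_z + a,  a_z/v_{k-1} + a/v_k)
# where IG(shape, scale) is the standard inverse gamma.
rng = np.random.default_rng(8)
a, a_z, K = 10.0, 10.0, 100

def inv_gamma(shape, scale):
    return 1.0 / rng.gamma(shape, 1.0 / scale)

v = np.ones(K)
z = np.ones(K)          # z[0] (= z_1) held fixed as the chain's initial condition
for sweep in range(50):
    for k in range(K):
        shape, scale = a, a / z[k]          # prior term from z_k
        if k + 1 < K:                       # coupling to z_{k+1}, when stored
            shape, scale = shape + a_z, scale + a_z / z[k + 1]
        v[k] = inv_gamma(shape, scale)
    for k in range(1, K):
        z[k] = inv_gamma(a_z + a, a_z / v[k - 1] + a / v[k])
print("log v range after 50 sweeps:",
      np.log(v).min().round(2), np.log(v).max().round(2))
```

With a Gaussian observation term p(y_k | v_k) attached, the v-conditionals pick up an extra shape 1/2 and scale y_k²/2, which is exactly why the auxiliary z variables were introduced.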
Denoising - Speech (VB)
• Additive Gaussian noise with unknown variance
• Inference : Variational Bayes
(Figure: spectrograms of noisy input X, original Xorg, and reconstructions — Xh SNR 19.98, Xv SNR 20.79, Xb SNR 19.68, Xg SNR 19.97)
Denoising – Music
(Figure: spectrograms — Original, Noisy, PF SNR 8.53, Gibbs SNR 8.66, VB SNR 2.08)

“Tristram (Matt Uelmen)” + ∼0 dB white noise
Single Channel Source Separation (with Onur Dikmen)
• Source 1: Horizontal : Tie across time : harmonic continuity
• Source 2: Vertical : Tie across frequency : transients, percussive sounds
(Figure: Xorg, S_hor, S_ver over frequency bin ν)
Single Channel Source Separation with IGMCs
E-guitar “Matte Kudasai (King Crimson)” + Drums “Territory (Sepultura)” = Mix
                      ŝ_1                       ŝ_2
              SDR     SIR     SAR       SDR     SIR     SAR
VB           -4.74   -3.28    5.67     -1.58   15.46   -1.37
Gibbs        -4.5    -2.62    4.57      1.05   12.46    1.61
Gibbs EM     -4.23   -2.42    4.82      1.34   13.13    1.85
Pre-trained  -4.04   -3.15    8.13      3.56   11.44    4.64
Oracle        6.14   17.16    6.58     12.66   19.95   13.6
• Oracle: We use the square of the source coefficient as the latent variance estimate
• Pre-trained: We use the best coupling parameters a_z and a, trained on isolated sources
Single Channel Source Separation with IGMCs
“Vandringar I Vilsenhet (Anglagard)” + “Moby Dick (Led Zeppelin)” = Mix
                      ŝ_1                       ŝ_2
              SDR     SIR     SAR       SDR     SIR     SAR
VB           -7.8    -6.22    4.53     -2.35   18.4    -2.25
Gibbs        -8.46   -7.53    6.93     -4.04   14.59   -3.83
Gibbs EM     -7.74   -6.19    4.62     -1.14   16.62   -0.97
Pre-trained  -6.4    -5.39    6.95      3.8    16.39    4.14
Oracle       12.1    32.9    12.14     21.13   33.89   21.37
Harmonic-Transient Decomposition
(Figure: X_org, S_hor, S_ver over time τ and frequency bin ν — Original, Horizontal, Vertical decompositions)
Chord Detection - Signal model (with Paul Peeling)
(Figure: log σ_ν spectral template for the chord model)
Chord Detection
(Figure: MDCT of piano chord {41, 48, 51, 56} over time τ/s and frequency/Hz; log Σ_ν v_{ν,j,τ} over MIDI note j)
Multichannel Source Separation
• Hierarchical Prior Model (Fevotte and Godsill 2005, Cemgil et al. 2006)
λ_n ∼ G(λ_n; a_λ, b_λ),    n = 1 ... N

v_{k,n} ∼ IG(v_{k,n}; ν/2, 2/(ν λ_n))

s_{k,n} ∼ N(s_{k,n}; 0, v_{k,n})

x_{k,m} ∼ N(x_{k,m}; a_m^⊤ s_{k,1:N}, r_m),    m = 1 ... M, k = 1 ... K
Equivalent Gamma MRF
• A tree for each source
• λ n can be interpreted as the overall “volume” of source n
Source Separation
(Figure: spectrograms of the three sources, f/Hz against t/sec — Speech, Piano, Guitar)
Reconstructions
(Figure: reconstructed spectrograms, f/Hz against t/sec — Speech, Piano, Guitar)
Multimodality
Annealing, Bridging, Overrelaxation, Tempering

(Figure: MCMC traces of a, λ and r over 2000 epochs)
Tempo tracking and score performance matching
• Given expressive music data (onsets/detections/spectral features/...)
  – Determine the position of a performance on a score
  – Determine where a human listener would clap her hands
  – Create a quantized/human readable score
  – ...
• Online-Realtime or Offline-Batch
• All of these problems can be mapped to inference problems in a HMM
Bar position Pointer (Whiteley, Cemgil, Godsill 2006)
(Figure: state lattice — tempo level n_k ∈ {1, 2, 3} against bar position m_k ∈ {1 ... 8}, shown for 3/4 and 4/4 time)

• Each dot denotes a state x = (m, n) (Score Position – Tempo level)
• Directed arcs denote state transitions with positive probability
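A transition kernel in the spirit of this slide can be assembled directly: the pointer advances by its tempo level each frame (modulo the bar length) while the tempo level performs a lazy random walk. Sizes and probabilities below are illustrative, not the talk's:

```python
import numpy as np

# Bar-position/tempo pointer HMM sketch: state x = (m, n) with bar position
# m in {0..M-1} and tempo level n in {1..N}.
M, N = 8, 3
p_stay = 0.9                                  # tempo-level persistence

def idx(m, n):                                # flatten (m, n) -> state index
    return m * N + (n - 1)

P = np.zeros((M * N, M * N))
for m in range(M):
    for n in range(1, N + 1):
        m2 = (m + n) % M                      # advance pointer by tempo n
        for n2 in (n - 1, n, n + 1):          # lazy random walk on tempo
            if not 1 <= n2 <= N:
                continue
            p = p_stay if n2 == n else (1 - p_stay) / 2
            P[idx(m, n), idx(m2, n2)] += p

P /= P.sum(axis=1, keepdims=True)             # renormalise at tempo boundaries
assert np.allclose(P.sum(axis=1), 1.0)
```

Combined with the Poisson observation model a few slides on, standard HMM filtering over this matrix yields the p(x_k | y_{1:k}) densities shown next.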
Bar position Pointer - transition model
(Figure: transition kernel p(x_2 | x_1) over bar position and tempo level)
Bar position Pointer - filtering, k = 1 ... 10

(Figure sequence: filtering densities p(x_1), p(x_2), p(x_3), p(x_4), p(x_5 | y_{1:5}) with y_5 = 0, and p(x_10), over bar position and tempo level)
Bar position Pointer - observation model (Poisson)
• Observation model p(y_k | x_k): Poisson intensity µ_k as a function of bar position m_k

(Figure: intensity profiles µ_k for a triplet rhythm and a duplet rhythm)
Tempo, Rhythm, Meter analysis
Bar Pointer Model (Whiteley, Cemgil, Godsill 2006)
(Graphical model: chains n_0 ... n_3, θ_0 ... θ_3, m_0 ... m_3, r_0 ... r_3, λ_1 ... λ_3, observations y_1 ... y_3)
Filtering
(Figure: observed data y_k; filtering densities log p(m_k | y_{1:k}), log p(n_k | y_{1:k}) in quarter notes per minute, and p(r_k | y_{1:k}) over triplets/duplets, against frame index k)
Smoothing
(Figure: as above, with smoothed densities log p(m_k | y_{1:K}), log p(n_k | y_{1:K}), and p(r_k | y_{1:K}))
Time Signature
(Figure: observed waveform; smoothed densities log p(m_k | z_{1:K}), log p(n_k | z_{1:K}), and time-signature posterior p(θ_k | z_{1:K}) over 4/4 vs 3/4, against frame index k)
Score-Performance matching (ISMIR 2007)
• Given a musical score, associate note events with the audio
Score-Performance matching - Graphical Model
(Graphical model: for ν = 1 ... W, score-position chain t_1 ... t_K, indicators r_1 ... r_K, gains λ_1 ... λ_K, variances v_{ν,1} ... v_{ν,K} and coefficients s_{ν,1} ... s_{ν,K})

v_{ν,τ} ∼ IG(v_{ν,τ}; a, 1/(a λ σ_ν(r_τ)))
Score-Performance matching - Signal model
(Figure: log σ_ν spectral template per score event)
Score-Performance matching
(Figure: spectrogram data over time/s and frequency/Hz; MIDI data, MIDI note against score position)
Online (filtering) or Offline (smoothing) processing is possible
Transcription
(Figure: log p(r_τ | s_τ), MIDI note number against time/s)