• Sonuç bulunamadı

Generative models for audio and music processing

N/A
N/A
Protected

Academic year: 2021

Share "Generative models for audio and music processing"

Copied!
128
0
0

Yükleniyor.... (view fulltext now)

Tam metin

(1)

Generative models for audio and music processing

A. Taylan Cemgil

Signal Processing and Communications Lab.

26 June 2007

Microsoft Research, Cambridge

(2)

Motivation

• Audio: Core information processing modality. Central to the scientific understanding of human hearing abilities.

• Broad spectrum of applications

– Speech processing, Sound localisation, – Hearing aids, Auditory Scene Analysis,

– Music information retrieval, Music performance systems

• Realistic generative modelling appears to be more feasible than, for example, video.

⇒ Combined with modern approaches to inference in dynamical systems,

we may make significant progress in this area.

(3)

Traditional Analysis

• Digital filtering theory, Transform methods/Fourier techniques

• Deterministic model based, System identification

• Procedural – no clear distinction between “what” and “how”

(4)

Statistical Approaches

• Probabilistic

• Hierarchical signal models to incorporate prior knowledge/inspiration from various sources:

– Physics

(acoustics, physical models, ...)

– Studies of human cognition and perception

(masking, psychoacoustics, ...)

– Musicology

(musical constructs, harmony, tempo, form ...)

– Neuroscience

(efficient auditory representations, ...)

• Consistent framework for developing inference algorithms

• Practical obstacles such as memory requirement, computation time

(5)

Generative Models for audition

• Computer audition ⇔ inverse synthesis

p(Structure |Observations) ∝ p(Observations|Structure)p(Structure) – restoration, interpolation

– source separation, – transcription,

– localisation, – identification,

– coding, compression

– resynthesis, cross synthesis

– ...

(6)

Audio Restoration/Interpolation

• Estimate missing samples given observed ones

• Restoration, concatenative expressive speech synthesis, ...

0 50 100 150 200 250 300 350 400 450 500

0

(7)

Source Separation

sk,1 . . . sk,n . . . sk,N

xk,1 . . . xk,M

k = 1 . . . K

a1 r1 . . . aM rM

• Joint estimation Sources, Channel noise and mixing system xk,1:M ∼ N (xk,1:M; Ask,1:N, R)

(8)

Polyphonic Music Transcription

• from sound ...

t/sec

f/Hz

0 1 2 3 4 5 6 7 8

0 1000 2000 3000 4000 5000

0 10 20

(S)

• ... to score

(9)

Polyphonic Music Transcription

• Extracting a human readable representation from music data

– Auditory scene analysis task with a lot of structure (both natural and man-made) – Reminiscent to the “cocktail party” problem in speech recognition or “object

recognition” in computer vision

– Varies in modelling effort and computational complexity

∗ duets, simple piano music (easier)

∗ symphonies, contemporary music (hard)

• Applications

– Music Education, Interactive Music Performance

– Musicology, Music Perception and Cognition Research – Music Information Retrieval

content-based querying and retrieval, music recommendation and playlist generation, automatic classification, music summarisation, annotation

(10)

Modelling and Computational issues

• Hierarchical – Signal level

pitch, onsets, timbre – Symbolic level

melody, motives, harmony, chords, tonality, rhythm, beat, tempo, articulation, instrumentation, voice ...

– Cognitive level

expression, genre, form, style, mood, emotion

• Uncertainty

– Parameter Learning

Which pitch, rhythm, tempo, meter, time signature ... ? – Model Selection

How many notes, harmonics, onsets, sections ... ?

(11)

Generative Models for Music

(12)

Generative Models for Music

Score Expression

Piano-Roll

Signal

(13)

Hierarchical Modeling of Music

M



1



2

:::



t

v1 v2

:::

vt

k

1

k

2

::: k

t

h

1

h

2

::: h

t

1 2 ::: t

m1 m2 ::: mt

g

j;1

g

j;2

::: g

j;t

rj;1 rj;2

:::

rj;t

n

j;1

n

j;2

::: n

j;t

x

j;1

x

j;2

::: x

j;t

yj;1 yj;2

:::

yj;t

y1 y2

:::

yt

(14)

Research Questions

• What kinds of prior knowledge and modeling techniques are useful?

• How can we speed up inference to make them competitive with more

traditional approaches?

(15)

Outline

• Audio Signal, Time frequency representations, Spectrogram

• Transform Domain

– Gamma Chains and Fields

∗ Denoising

∗ Source Separation

∗ Score Following, transcription,

∗ Tempo, rhythm, meter estimation

(16)

Outline

• Time Domain/State Space

– Kalman Filter: Probabilistic Phase Vocoder

∗ Interpolation/Restoration

– Switching State space models, Changepoint models

∗ Pitch tracking, Onset/offset detection – Factorial Changepoint Model

∗ Bayesian model selection

∗ Polyphonic pitch tracking

• Conclusions and Final Remarks

(17)

Colaborators

• Nick Whiteley, Cambridge

• Paul Peeling, Cambridge

• Onur Dikmen, Bo˘gazi¸ci, Istanbul

• Simon Godsill, Cambridge

• Cedric Fevotte, ENST, Paris Telecom

• David Barber, UCL

• Bert Kappen, Nijmegen, The Netherlands

(18)

Audio Signal

x t

t (Speech)

t

x t

(Piano)

x = x1 . . . xt . . . 

(19)

Signal Models for Audio

• “Time Domain”

– State space modeling,

– Conditional Linear Dynamical Systems, Gaussian processes (e.g.

AR, ARMA)

– Flexible, Physically realistic,

– Analysis down to sample precision (if required) – Computationally quite heavy

• “Frequency Domain”

– Models on (orthogonal) transform coefficients, Energy compaction – Practical, can make use of fast transforms (FFT, MDCT, ...)

– Inherent limitations (analysis windows, frequency resolution)

(20)

Spectrogram

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

5 10 15 20 25 30

(Speech)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

0 10 20 30

(Piano)

• A linear expansion using a collection of basis functions φk(t) centered around time-frequency atom k = k(τ, ω) at time τ and frequency ω

x(t) = X

k

skφk(t)

• Popular transforms for audio: STFT, MDCT

• Spectrogram displays 2 log |sk| or |sk|2 (of STFT)

(21)

Models for time-frequency Energy distributions

• Non-Negative Matrix factorisation

(Sha, Saul, Lee 2002, Smaragdis, Brown 2003, Abdallah, Plumbley 2004, Virtanen 2005 ...)

X

ν,τ

= W

ν,j

S

j,τ

Spectrogram = Spectral Templates × Excitations

= ×

– ... however, spectrograms are not additive (a

2

+ b

2

6= (a + b)

2

)

(22)

Models for time-frequency Energy distributions

• Mask models

(Roweis 2001, Reyes-Gomez, Jojic, Ellis 2005 ...)

X

ν,τ

= [r

ν,τ

= 0]S

ν,τ(0)

+ [r

ν,τ

= 1]S

ν,τ(1)

Spectrogram = Mask × Source

0

+ (1 − Mask) × Source

1

= + +

– ... however, sources do overlap in time and frequency

(23)

Prior structures on time-frequency Energy distributions

• Main Idea: Spectrogram is a point estimate of the energy at a time-frequency atom k(ν, τ ). We place a suitable prior on the variance of transform coefficients s

k

– The inverse Gamma distribution is a natural candidate as a prior

• There is significant structure on a spectrogram

– Need to introduce correlations among variances of harmonically and temporally related time-frequency atoms

p(s |v)p(v) = Y

k

p(s

k

|v

k

)

!

p(v)

– Gabor regression (Wolfe and Godsill 2005): v discrete

(24)

Example, one channel source separation

vk,1 . . . vk,N

sk,1 . . . sk,N

xk

k = 1 : K

sk,n|vk,n ∼ N (sk,n; 0; vk,n)

xk|sk,1:N = PN

n=1 sk,n

• Straightforward application of Bayes’ theorem yields

p(s

k,n

|v

k,1:N

, x

k

) = N (s

k,n

; κ

k,n

x

k

, v

k,n

(1 − κ

k,n

)) κ

k,n

= v

k,n

/ X

n

v

k,n (Responsibilities)

(25)

Example, one channel source separation

p(s

n

|v

1:N

, x) = N (s

n

; κ

n

x, v

n

(1 − κ

n

)) κ

n

= v

n

/ X

n

v

n

0

or h1/vni−1 /

X

n′

h1/vn′i−1

1

A

• Each source coefficient sn gets a fraction κn of the observation x

• Model additive in variances Z

dsp(x|s1:N)p(s1:N|v1:N)p(v1:N) = N (x; 0,

XN j=n

vn)p(v1:N)

• Unconditional marginal p(x) is heavy tailed (e.g. Student-t)

(26)

Gamma Distribution

• Gamma Distribution with shape a and scale z

G(λ; a, z) ≡ exp((a − 1) log λ − z

−1

λ + a log z

−1

− log Γ(a))

– Conjugate prior for Gaussian precision, Poisson intensity and Inverse Gamma scale

– Sufficient statistics hλiG = az and hlog λiIG = Ψ(a) − log z−1

• Inverse Gamma Distribution

IG(v; a, z) ≡ exp((a + 1) log v

−1

− z

−1

v

−1

+ a log z

−1

− log Γ(a))

– Conjugate prior for Gaussian variance and Gamma scale – Sufficient statistics

v−1

IG = az and

log v−1

IG = Ψ(a) − log z−1

• λ ∼ G(λ; a, z) ⇔ λ

−1

= v ∼ IG(v; a, z)

(27)

Inverse Gamma Distribution

0 1 2 3 4 5

0 0.2 0.4 0.6 0.8 1 1.2 1.4

a=1 b=1

a=1 b=0.5 a=2 b=1

(28)

Gamma Chains

We define an Inverse Gamma-Markov chain for k = 1 . . . K as follows

v

k

|z

k

∼ IG(v

k

; a, z

k

/a)

z

k+1

|v

k

∼ IG(z

k+1

; a

z

, v

k

/a

z

)

z1 · · · vk−1 az zk a vk az zk+1 · · ·

p(z, v; a) ∝ ψ(b−1z , azz1−1)Y

k

φ(vk−1; a + az)φ(zk−1; a + az)ψ(azk−1, vk−1)ψ(azvk−1, zk+1−1 )

φ(ξ; α) = exp((α + 1) log ξ) ψ(ξ, η) = exp(−ξη)

(Singletons) (Pairwise)

(29)

The need for auxillary variables z

• We want conjugacy (full conditional p(v

k

|v

k−1

, v

k+1

) is IG) and positive correlation beetween v

k

and v

k−1

• Conjugate, but no positive correlation

v

k

|v

k−1

∼ IG(v

k

; a, v

k−1

/a)

• Positive correlation but not conjugate

v

k

|v

k−1

∼ IG(v

k

; a, (v

k−1

a)

−1

)

(30)

IG − IG Distribution

(Bernardo-Smith)

• Scale Mixture of Inverse Gamma p(vk|vk−1) =

Z

dzkIG(vk; a, zk/a)IG(zk; az, vk−1/az)

= Γ(a + az) Γ(az)Γ(a)

(azvk−1−1 )az(avk−1)a

(azvk−1−1 + avk−1)(az+a)vk−1

a = 0.1 a

z = 0.1

log10 v

k

log 10 v k−1

−5 0 5

−5 0 5

a = 2 a

z = 2

log10 v

k

log 10 v k−1

−5 0 5

−5 0 5

a = 0.1 a

z = 10

log10 v

k

log 10 v k−1

−5 0 5

−5 0 5

a = 3 a

z = 0.3

log10 v

k

log 10 v k−1

−5 0 5

−5 0 5

• Small a and az ⇒ weak coupling.

• az/a < 1 Systematic drift towards small variances ⇒ Potentially useful for damping effects in audio

(31)

Example: Nonstationary Gaussian Process

• Time varying variance – Stochastic Volatility Model

v1:K ∼ Inverse Gamma Chain(v1:K; a, az) yk ∼ N (yk; 0, vk)

0 200 400 600 800 1000

0 5 10 15 20

v k

0 200 400 600 800 1000

−10

−5 0 5 10

y k

k

True VB

(32)

Example: Nonhomogeneous Poisson Process

• Time varying intensity (L : length of a small interval where the intensity is constant)

λ1:K ∼ Gamma Chain(λ1:K; a, az)

ckk ∼ PO(ck; λkL) ∝ exp (ck log λk − Lλk)

0 50 100 150

λ k

0 0.2 0.4 0.6 0.8 1

0 0.2 0.4 0.6 0.8 1

c k

Arrival time

True VB

(33)

Inference : Structured Mean Field, Variational Bayes

(MacKay 1995, Attias 1999, Wiegerinck 2000, Ghahramani and Beal 2000, Winn and Bishop 2005)

Approximate P with a tractable distribution Q = Q

α Qα KL(Q||P) = hlog QiQ



log 1

Zxφ(S)



Q

hf(x)ip(x) ≡ R

dxp(x)f (x). Using KL ≥ 0, we obtain a lower bound on the evidence log Zx ≥ hlogφ(S)iQ − hlog QiQ

This leads to a set of fixed point equations

Qα ∝ exp hlog φ(S)iQ¬α

Depending upon the initial starting point Q(0), the iteration converges to a local maximum of the lower bound.

(34)

Gibbs Sampler (with grouping)

• MCMC: Construct a markov chain with stationary distribution as the desired posterior P

• Gibbs sampler: We visit each block α ∈ C and sample from full conditionals Cα ∼ p(Cα|C¬α)

• For a graphical model, full conditionals are functions of the variables of the Markov Blanket

• Algorithmically very similar to VB

(35)

Inverse Gamma Markov Random Field

• A conditional Markov Random Field

• IGMRF on ξ = {ξi}i∈V, Given

– an Undirected graph with vertex set V and undirected edge set E – A set of connection weights a = {ai,j}(i,j)∈E for i, j ∈ V and i 6= j

p(ξ; a) = 1 Za

Y

i∈V

φ(ξi−1;X

j

ai,j) Y

(i,j)∈E

ψ(ξi−1, (ai,j/2)ξj−1)

φ(ξ; α) = exp((α + 1) log ξ) (Singleton)

ψ(ξ, η) = exp(−ξη) (Pairwise)

• Gamma field φ(ξi;P

j ai,j), ψ(ξi, (ai,j/2)ξj) but with φ(ξ; α) = exp((α− 1) log ξ).

(36)

Possible Model Topologies

(37)

Harmonic-Transient Decomposition

• Horizontal : Tie across time : harmonic continuity

• Vertical : Tie across frequency : transients, pulse like sounds

(38)

Harmonic-Transient Decomposition

Time (τ)

Frequency Bin (ν)

Xorg S

hor S

ver

(Original) (Hor) (Vert)

(39)

Denoising - Piano

X

20 40 60 80 100 120 50

100 150 200 250 300 350 400 450 500

−16

−14

−12

−10

−8

−6

−4

−2 0

Xorg

20 40 60 80 100 120 50

100 150 200 250 300 350 400 450 500

−16

−14

−12

−10

−8

−6

−4

−2 0

(Noisy) (Original)

(Hor) (Ver) (Band) (Grid)

(40)

Denoising - Speech

X

20 40 60 80 100 120 50

100 150 200 250 300 350 400 450 500

−14

−12

−10

−8

−6

−4

−2 0

Xorg

20 40 60 80 100 120 50

100 150 200 250 300 350 400 450 500

−14

−12

−10

−8

−6

−4

−2 0

(Noisy) (Original)

(Hor) (Ver) (Band) (Grid)

(41)

Source Separation

sk,1 . . . sk,n . . . sk,N

xk,1 . . . xk,M

k = 1 . . . K

a1 r1 . . . aM rM

• Joint estimation Sources, Channel noise and mixing system xk,1:M ∼ N (xk,1:M; Ask,1:N, R)

(42)

Multichannel Source Separation

• Hierarchical Prior Model (Fevotte and Godsill 2005, Cemgil et. al. 2006)

λ1 . . . λn . . . λN ∼ G(λn; aλ, bλ)

vk,1 . . . vk,n · · · vk,N ∼ IG(vk,n; ν/2, 2/(νλn))

sk,1 . . . sk,n . . . sk,N ∼ N (sk,n; 0, vk,n)

xk,1 . . . xk,M

k = 1 . . . K

∼ N (xk,m; amsk,1:N, rm)

a1 r1 . . . aM

∼ N (am;· · · ) rM

∼ IG(rm;· · · )

(43)

Equivalent Gamma MRF

• A tree for each source

• λn can be interpreted as the over energy of source n

(44)

Factor Graph

(Kschischang et. al. 2001, Frey and Jojic 2005)

• Many approximation algorithms (LBP, VMP, EP, EC, Gibbs Sampling...) can be formulated as message passing schemata

λ1 . . . λn . . . λN

. . . . . .

vk,1 . . . vk,n · · · vk,N

. . . · · ·

sk,1:N

. . .

k = 1 . . . K

a1 r1 . . . aM rM

Factor Graph: Useful for visualising the inference problem and the approximating distribution Q = Qα Qα

Factor nodes: Black squares. Factor potentials defining the posterior P

Variable nodes: Circles. “factors” of the approximating distribution Q

Edges: denote membership. A variable is connected to a factor if it is a variable of the local function

(45)

Source Separation

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

5 10 15 20 25 30

(Speech)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

0 10 20 30

(Piano)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

0 5 10 15 20 25

(Guitar)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

10 15 20 25

(Mix)

(46)

Reconstructions

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

10 15 20 25 30

(Speech)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

0 10 20 30

(Piano)

t/sec

f/Hz

0 200 400 600 800 1000 1200

0 2000 4000 6000 8000 10000

5 10 15 20 25

(Guitar)

(47)

Multimodality

• Typically underdetermined (Channels < Sources) ⇒ Multimodal posterior

(48)

Annealing, Bridging, Overrelaxation, Tempering

• Define a sequence of inverse temparatures τ1, τ2, . . . , τT (an annealing schedule)

• Sample from/approximate

Pτi ∝ Pτi

s1

s 2

prior

exact posterior

factorized MF

s1

s 2

prior

exact posterior

factorized MF R1

R2

Rτ

(49)

Multimodality

Observations:

m

k

20 40 60 80 100 120 140 160 180 200

1

2 −1000

−500 0 500

0 500 1000 1500 2000

−0.8024 0.8295 2.0375

a

0 500 1000 1500 2000

7.2408 25.1398 36.2295

λ

0 500 1000 1500 2000

0.5451 1.8648

r

Epoch

(50)

Reconstructions

n

Epoch = 500 1

2 3

n

Epoch = 1000 1

2 3

n

Epoch = 1500 1

2 3

n

Epoch = 2000 1

2 3

n

k Original

20 40 60 80 100 120 140 160 180 200

1 2 3

Posterior surface is multimodal, each mode corresponding to a viable separation

(51)

Mixing System, Best Solution

−20 −15 −10 −5 0 5 10 15 20

−20

−15

−10

−5 0 5 10 15 20

x1

x 2

(52)

Mixing System, stable but suboptimal solutions

−20 −15 −10 −5 0 5 10 15 20

−20

−15

−10

−5 0 5 10 15 20

x1 x2

−20 −15 −10 −5 0 5 10 15 20

−20

−15

−10

−5 0 5 10 15 20

x1 x2

(53)

Reconstruction Error

60 80 100 120 140 160

0 10 20 30 40 50

Reconstruction Error

The histogram for 100 independent runs.

(54)

Training Gamma Chains and Fields (with Onur Dikmen)

• Chains and trees: simple since log Za has a known expression – By variational Bound maximisation

• Fields:– work in progress – partition function log Za is unknown (? – at least seems to be quite complicated)

• Approximate the derivative

– Contrastive Divergence (Hinton)

• Cancel out Za/Za term in acceptance probability

– Metropolis-Hastings with “artificial data” (Moller, Pettitt, Reeves, Berthelsen)

• Power EP

(55)

Further Applications in Music Processing

(with Paul Peeling)

• Score-Performance matching

• Musical Score guided source separation

• Transcription

• Chord Detection

(56)

Score-Performance matching

• Given a musical score, associate note events with the audio

4

t

x t

(57)

Score-Performance matching - Graphical Model

ν = 1, . . . , W

t1 t2 . . . tK

r1 r2 . . . rK

λ1 λ2 . . . λK

vν,1 vν,2 . . . vν,K

sν,1 sν,2 . . . sν,K

6 7 8 1 2 3 4 5

rk

0 500 1000 1500 2000 2500 3000 3500 4000

−12

−10

−8

−6

−4

−2 0

Frequency ν / Hz log σν

v ∼ IG(v ; a, 1/(aλσ (r )))

(58)

Score-Performance matching - Signal model

0 500 1000 1500 2000 2500 3000 3500 4000

−12

−10

−8

−6

−4

−2 0

Frequency ν / Hz log σν

(59)

Score-Performance matching

Spectrogram Data

Time / s

Frequency / Hz

0 2 4 6 8 10 12 14

0 1000 2000 3000 4000

50 100 150 200 250 300 350 400 450

55 60 65 70 75 80 85

MIDI Data

Score position

MIDI note

(60)

Monophonic Transcription

log p(rτ|sτ)

MIDInotenumber

Time / s

1 2 3 4

60 65 70 75 80

1 2 3 4

−10

−5 0 5 10

P

iw˜τ(i)λ˜(i)τ

Time / s

logλ

MDCT of audio (source: Daniel-Ben Pienaar)

Time / s

Frequency/Hz

1 2 3 4

0 500 1000 1500 2000 2500 3000 3500 4000

(61)

Tempo, Rhythm, Meter analysis (

Cemgil 2000; Whiteley, Cemgil, Godsill 2006) Example: A Performed Onset Sequence

1.18 0.59 0.29 0.34 0.44 0.34 0.39 0.6 0.63 0.3 0.28 0.3 0.35 1.19

Very accurate but too complex

                  

Simple but a very poor description of the rhythm

             

Desired quantization balances accuracy and simplicity

    

3

         

(62)

Bar Pointer Model

| | |

3

nk 2

1

1 2 3 4 5 6 7 8

mk

3/4 time 4/4 time

0 100 200 300 400 500 600 700 800 900 1000

0 2 4

mk

µ k

Triplet Rhythm

0 100 200 300 400 500 600 700 800 900 1000

0 2 4

mk

µ k

Duplet Rhythm

n0 n1 n2 n3

θ0 θ1 θ2 θ3

m0 m1 m2 m3

r0 r1 r2 r3

λ1 λ2 λ3

y1 y2 y3

(63)

Filtering

0 50 100 150 200 250 300 350 400 450

0 1 2

yk

Observed Data

m k

log p(mk|y1:k)

50 100 150 200 250 300 350 400 450

800 600 400 200

−10

−5 0

Quarter notes per min.

log p(n k|y

1:k)

50 100 150 200 250 300 350 400 450

180 120

60 −4

−2 0

p(rk|y 1:k)

Frame Index, k

50 100 150 200 250 300 350 400 450

Triplets

Duplets

0 0.2 0.4 0.6 0.8

(64)

Smoothing

0 50 100 150 200 250 300 350 400 450

0 1 2

yk

Observed Data

m k

log p(m k|y

1:K)

50 100 150 200 250 300 350 400 450

800 600 400

200 −10

−5 0

Quarter notes per min.

log p(n k|y

1:K)

50 100 150 200 250 300 350 400 450

180 120 60

−10

−5 0

p(rk|y 1:K)

Frame Index, k

50 100 150 200 250 300 350 400 450

Triplets

Duplets 0.2

0.4 0.6 0.8

(65)

Time Domain Modeling

(66)

Sinusoidal Modeling

• Sound is primarily about oscillations and resonance

• Cascade of second order sytems

• Audio signals can often be compactly represented by sinusoidals

(real) yn =

Xp k=1

αke−γkn cos(ωkn + φk)

(complex) yn =

Xp k=1

ck(e−γk+jωk)n

y = F (γ1:p, ω1:p)c

(67)

State space Parametrisation

xn+1 =

e−γ1+jω1

. ..

e−γp+jωp

| {z }

A

xn x0 =



 c1 c2 ...

cp



yn = 1 1 . . . 1 1 

| {z }

C

xn

x0 x1 . . . xk−1 xk . . . xK

y1 . . . yk−1 yk . . . yK

(68)

Diagonal Parametrisation for strictly real y

xdr0 =

0

B

B

B

B

B



c1 c1 ...

cp cp

1

C

C

C

C

C

A

xdrn+1 =





e1

e−jω1

. ..

ep

e−jωp





| {z }

A

dr

xdrn

yn = 1 1 . . . 1 1 

| {z }

C

dr

xdrn

(69)

Alternative realisations – Reparametrisations

• A linear system has infinitely many realisations, each corresponding to a state space representation

¯

xn+1 = A¯¯xn yn = C ¯¯xn with

¯

xn ≡ T xn A¯ ≡ T AT−1 C¯ ≡ CT−1

T xn+1 = T AT−1T xn yn = CT−1T xn

• There are many parametrisations, but some are more suitable for extracting relevant information or a compact representation

(70)

Transformation from Diagonal canonical form to rotation matrix

cos(θ) = e + e−jθ

2 sin(θ) = e − e−jθ 2j

T AT−1 = A¯

√1 2

 1 1

−j j

  e−γ+jω 0

0 e−γ−jω

 1√ 2

 1 j 1 −j



= e−γ

 cos(ω) − sin(ω) sin(ω) cos(ω)



CT−1 = C¯ 1 1  1

√2

 1 j 1 −j



= 1 0 

(71)

State Space Parametrisation

(72)

Audio Restoration/Interpolation

• Estimate missing samples given observed ones

• Restoration, concatenative expressive speech synthesis, ...

0 50 100 150 200 250 300 350 400 450 500

0

(73)

Audio Interpolation

p(x¬κ|xκ) ∝ Z

dHp(x¬κ|H)p(xκ|H)p(H) H ≡ (parameters, hidden states)

H

x¬κ xκ

Missing Observed

0

(74)

Probabilistic Phase Vocoder

(Cemgil and Godsill 2005)

Aν Qν

sν0 · · · sνk · · · sνK−1

ν = 0 . . . W − 1

x0 xk xK−1

sνk ∼ N (sνk; Aνsνk−1, Qν) Aν ∼ N

 Aν;

 cos(ων) − sin(ων) sin(ων) cos(ων)

 , Ψ



(75)

Exact Inference

1. Infer the posterior distribution

p(S, Θ|xκ) = 1

Zxp(xκ|S, Θ)p(S|Θ)p(Θ)

≡ 1

Zxφ(S, Θ) ≡ P

S = s0:W −10:K−1 Θ = (A, Q) Zx = p(xκ)

2. Then, compute the predictive distribution p(x¬κ|xκ) =

Z

dSdΘp(x¬κ|S, Θ)p(S, Θ|xκ) Exact evaluation of P is intractable.

(76)

Approximating distribution Q

Q ≡ Y

α∈C

q(sα0:K−1)q(Θα) ≡ Y

α∈C

Qα(Sα)Qαα)

where C = {ν1, . . . , νN} is a set of disjoint clusters of frequency bands such that νi ∩ νj = ∅ for i 6= j and S

i νi = {0, . . . , W − 1}.

q(sα0:K−1) ≡ q(sα0) YK k=1

q(sαk|sαk−1) q(Θα) ≡ q(Aα)q(Qα)

(77)

Inference: Structured Variational Bayes

Aα q(Aα) Qα q(Qα)

· · · sαk−1 sαk sαk+1 · · ·

α∈ C

Q

kq(sαk|sαk−1)

xk q(xk)

• Intuitive algorithm:

– Substract from the observed signal x the prediction of the frequency bands in ¬α.

– Compute a fit for α to this residual and iterate.

• For fixed A, Q, this is equivalent to Gauss-Seidel, an iterative method for solving linear systems of

(78)

Restoration

• Piano

– Signal with missing samples (37%) – Reconstruction, 7.68 dB improvement – Original

• Trumpet

– Signal with missing samples (37%)

– Reconstruction, 7.10 dB improvement

– Original

(79)

Hierarchical Factorial Models

• Each component models a latent process

• The observations are projections

rν0 · · · rνk · · · rνK

θν0 · · · θνk · · · θνK

ν = 1 . . . W

yk yK

• Generalises Source-filter models

(80)

Harmonic model with changepoints

rk|rk−1 ∼ p(rk|rk−1) rk ∈ {0, 1}

θkk−1, rk ∼ [rk = 0]N (Aθk−1, Q)

| {z }

reg

+ [rk = 1]N (0, S)

| {z }

new

ykk ∼ N (Cθk, R)

A =



Gω

G2ω

. ..

GHω



N

Gω = ρk

 cos(ω) − sin(ω) sin(ω) cos(ω)



damping factor 0 < ρk < 1, framelength N and damped sinusoidal basis matrix C of size N × 2H

(81)

Harmonic model with changepoints

r kfrequency

k k

x k

(S)

• Each changepoint denotes the onset of a new audio event

(82)

Inference

• Filtering

p(rk, θk|y1:k) ∝ X

r1:k−1

Z

1:k−1p(y1:k1:k)p(θ1:k|r1:k)p(r1:k)

• Smoothing, Marginal MAP – MMAP r1:K = arg max

r1:K

Z

θ1:Kp(y1:K1:K)p(θ1:K|r1:K)p(r1:K)

– Each configuration of r1:K encodes one of the possible 2K possible models, i.e., segmentation.

– MMAP is equivalent to Bayesian Model Selection

• All problems are similar, but MMAP is usually harder because max and R

do not commute

(83)

Exact Inference in switching state space models is intractable

• In general, exact inference is NP hard

– Conditional Gaussians are not closed under marginalization

⇒ Unlike HMM’s or KFM’s, summing over rk does not simplify the filtering density

⇒ Number of Gaussian kernels to represent exact filtering density p(rk, θk|y1:k) increases exponentially

−7.9036 6.6343

0.76292

−10.3422

−10.1982

−2.393

−2.7957

−0.4593

Referanslar

Benzer Belgeler

Bununla birlikte metnin yazınsallığının bir ölçütü olarak metinlerarasılığı yalnızca metinler arasında biçimsel ve anlamsal düzlemde gerçekleşen bir tür oyun alanı

In experimental results, it was observed that the integrated model performs better in terms of the number of frequencies assigned, for both protocol-based and SIR-based

Imagine the location on efficiency frontier, which expresses the possible maximum efficiency for the fourth branch (A). This location is expected for the fourth branch to

We consider the implementation of RFID technology to support the lifecycle of a durable product such as an aircraft turbine engine in an attempt to achieve the goal of improved

[r]

In Section 2, signal models for audio are developed in the time domain, including some examples of their inference for a musical acoustics problem.. Section 3 describes models in

music transcription, polyphonic pitch tracking, Bayesian signal processing, switching Kalman filters..

In this paper we describe some models developed recently for these tasks, which also have utility in audio and general signal processing applications; and investigate hybrid