
2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 16-19, 2005, New Paltz, NY

EFFICIENT VARIATIONAL INFERENCE FOR THE DYNAMIC HARMONIC MODEL

Ali Taylan Cemgil and Simon J. Godsill

Signal Processing Group, University of Cambridge

Department of Engineering, Trumpington Street, Cambridge, CB2 1PZ, UK {atc27,sjg}@eng.cam.ac.uk

ABSTRACT

In this paper, we develop a class of probability models that are potentially useful for various music applications such as polyphonic transcription, source separation, restoration or denoising. This class unifies and extends several models such as sinusoidal and harmonic models, the additive synthesis model, Gabor regression and the probabilistic phase vocoder. We overcome computational intractability issues by introducing structured variational (mean-field) approximations that lead to efficient local message passing algorithms.

1. INTRODUCTION

When excited, the vast majority of physical systems (the human vocal tract, many acoustical musical instruments, etc.) respond with quasi-periodic oscillations. Hence, on short time scales, the energy of signals from these sources is concentrated in a few narrow frequency bands. In modelling music or speech, this property is exploited extensively and it is a well-established practice to model audio as a superposition of sinusoids. This provides a compact representation of the audio waveform using a few amplitude and phase coefficients. However, since the frequency content of music or speech changes over time, the analysis needs to be applied to subsequent short (and possibly overlapping) time frames. This is, of course, the underlying principle behind many well-known methods such as short-time Fourier transform (STFT) analysis, the phase vocoder [1, 2], the sinusoidal model [3], and various harmonic-plus-noise models [4, 5, 6, 7].

However, in many applications, the most "interesting" regions of an audio signal are the transient regions associated with changepoints such as onsets or offsets, yet it is exactly at these points that the stationarity assumptions of a sinusoidal model are violated.

Correct characterisation of changepoints is important in various applications: in transcription we wish to estimate note onsets, offsets or frequency modulations precisely; in coding or time scaling, unfaithful reconstruction of attacks introduces annoying perceptual artifacts such as "flanging" or "phasiness", etc. [2].

In this paper, we approach signal analysis from a Bayesian perspective. Central to our approach is a probabilistic hierarchical (three-layer) signal model. The bottom layer p(x|s) describes how the audio signal x is generated from a latent dynamical process with state variable s. The intermediate layer, denoted by p(s|θ), is non-stationary and its dynamics are governed by a sequence of unknown parameters θ. The basic idea is to view the parameters θ as describing the characteristics of a hidden "excitation" sequence that drives the dynamical process. In signal analysis, we wish to estimate θ (possibly using some prior knowledge encoded by a top-layer process p(θ)). As a general inference problem, the posterior

distribution is given by Bayes' rule

p(θ|x) = (1/p(x)) ∫ ds p(x|s) p(s|θ) p(θ)    (1)

In the next section, we introduce this model class, which we name the dynamic harmonic model (DHM). An attractive property of the DHM is that transient signal characteristics, onsets or frequency fluctuations can be estimated to sample precision, if desired. Moreover, this class unifies and extends several models such as sinusoidal and harmonic models [8, 9], the additive synthesis model, Gabor regression [10] and the probabilistic phase vocoder [11].

Our purpose in this paper is twofold. First, we introduce the DHM, using the language of directed graphical models, as a generic model for harmonic signal modelling. Our second aim is to illustrate inference methods based on variational (structured mean-field type) approximations [12, 13, 14]. We will describe the method as an iterative local message passing algorithm on a factor graph [15]. Our focus here is to describe a particular variational inference scheme for the DHM and to demonstrate its effectiveness.

2. MODEL

Consider a single-channel audio signal x = (x_0, …, x_t, …, x_{T−1})^⊤. The harmonic model represents the signal as

x | s ∼ N(x; Cs, R)    (2)

The symbol N(x; μ, Σ) denotes a Gaussian distribution on x, with mean μ and covariance matrix Σ. In particular, we take R = σ_R I, where I is the identity matrix. C is a T × 2W matrix

Cs ≜ [ C_{0,0}      …    C_{0,W−1}   ] [ s_0     ]
     [    ⋮        C_{t,ν}     ⋮     ] [   ⋮     ]
     [ C_{T−1,0}    …    C_{T−1,W−1} ] [ s_{W−1} ]

with time (row) index t = 0, …, T − 1 and frequency (column) index ν = 0, …, W − 1. The block entries are defined by

C_{t,ν} ≜ ( cos(tνθ)   sin(tνθ) )

so that

x_t = ∑_ν C_{t,ν} s_ν

where θ = 2π∆f is the angular frequency, ∆f = F_s/L is the frequency resolution, F_s is the sampling frequency and L ≥ W. Note that when σ_R = 0 and T = W = L, C is equivalent (up to normalisation) to the inverse Fourier transform matrix, and s holds alternating real and imaginary parts of the Fourier coefficients.
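To make the static interpretation concrete, the following sketch (a minimal numpy illustration; the function name harmonic_basis and the choice of the normalised per-sample angle θ = 2π/L are our own, not taken from the paper) builds the T × 2W basis matrix C and synthesises x = Cs in the noiseless case σ_R = 0.

```python
import numpy as np

def harmonic_basis(T, W, L):
    """Build the T x 2W basis matrix with block entries
    C_{t,nu} = (cos(t*nu*theta), sin(t*nu*theta)),
    using the normalised per-sample angle theta = 2*pi/L."""
    t = np.arange(T)[:, None]            # time (row) index
    nu = np.arange(W)[None, :]           # frequency (column) index
    phase = 2.0 * np.pi * t * nu / L
    C = np.empty((T, 2 * W))
    C[:, 0::2] = np.cos(phase)           # cosine column of each block
    C[:, 1::2] = np.sin(phase)           # sine column of each block
    return C

# noiseless synthesis x = C s from random coefficients, cf. equation (2)
T = W = L = 64
rng = np.random.default_rng(0)
s = rng.standard_normal(2 * W)           # interleaved (cos, sin) coefficients
x = harmonic_basis(T, W, L) @ s
```

With T = W = L, the columns of this C span the same space as the inverse DFT basis (up to normalisation), as noted above.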


By a slight abuse of notation, we will refer to the ν'th coefficient as s_ν, keeping in mind that it is actually not a scalar but a 2 × 1 vector¹. The above interpretation highlights the well-known fact that we can view Fourier analysis (or any basis expansion) as a particular instance of the generalised linear model [16]. Note that this static interpretation, whilst suitable for fast transformations, does not reflect the fact that data arrives online. One observation is that we can define the model for arbitrary frame lengths. Let N be the length of a frame in samples and define a frame index k such that k = 0, …, T/N − 1. We denote

x_k ≜ (x_{kN}, …, x_{(k+1)N−1})

The alternative (and arguably physically more realistic) dynamic state-space interpretation hinges on the well-known trigonometric identity (more familiar in complex arithmetic as e^{j(t+N)θ} = e^{jNθ} e^{jtθ})

( cos((t+N)θ) )          ( cos(tθ) )
(             ) = B(Nθ)  (         )
( sin((t+N)θ) )          ( sin(tθ) )

where B(·) is a Givens rotation matrix,

B(ω) ≜ ( cos(ω)  −sin(ω) )
       ( sin(ω)   cos(ω) )

Using this identity, we see that the basis matrix can be "generated" by the recursion

C_{t+N,ν} = C_{t,ν} B(Nνθ)^⊤

Multiplying both sides by s_ν, we obtain, for t = 0, …, N − 1,

C_{t+N,ν} s_ν = C_{t,ν} B(Nνθ)^⊤ s_ν

We now define s_k^ν ≡ B(Nkνθ)^⊤ s_ν. According to this definition, we can represent each frame x_k by the recursion

s_k^ν = B(Nνθ)^⊤ s_{k−1}^ν        ν = 0, …, W − 1
x_k  = ∑_ν C_{0:N−1}^ν s_k^ν ≡ C_{0:N−1}^{0:W−1} s_k

where the notation C_{0:N−1}^{0:W−1} denotes the first N rows of the basis matrix and s_k ≡ (s_k^{0⊤}, …, s_k^{ν⊤}, …, s_k^{W−1⊤})^⊤. This means that, by rotating s_ν appropriately, i.e. by phase correction, we can use just the first N samples of each basis vector to represent x. For example, in the extreme case N = 1,

s_k^ν = B(νθ)^⊤ s_{k−1}^ν        ν = 0, …, W − 1
x_k  = ( 1  0  …  1  0 ) s_k

for k = 0, …, T − 1. Here, all information about the signal is conveyed in the latent dynamical process and the (overcomplete) basis C becomes trivial. Using this idea, we can "interpolate" between the dynamic and static interpretations. Of course, mathematically speaking, we haven't really gained anything and, in the noiseless case, all formulations for any frame length N are equivalent.

However, the dynamical system perspective opens up interesting possibilities to extend the basic model in a way that is not apparent in the static formulation.
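To illustrate the equivalence just stated, the following sketch (a minimal numpy example; the names givens and frames_by_rotation, the choice θ = 2π/L and the test dimensions are our own assumptions) generates the signal frame by frame via the rotation recursion and checks that it matches the static synthesis x = Cs in the noiseless case.

```python
import numpy as np

def givens(omega):
    """Givens rotation matrix B(omega)."""
    return np.array([[np.cos(omega), -np.sin(omega)],
                     [np.sin(omega),  np.cos(omega)]])

def frames_by_rotation(s, W, L, N, K):
    """Generate K frames of length N using s_k^nu = B(N*nu*theta)^T s_{k-1}^nu
    and x_k = sum_nu C^nu_{0:N-1} s_k^nu, with theta = 2*pi/L."""
    theta = 2.0 * np.pi / L
    t = np.arange(N)[:, None]
    # first N rows of each 2-column basis block C^nu_{0:N-1}
    C0 = [np.hstack([np.cos(t * nu * theta), np.sin(t * nu * theta)])
          for nu in range(W)]
    state = [s[2 * nu:2 * nu + 2].copy() for nu in range(W)]   # s_0^nu = s^nu
    frames = []
    for k in range(K):
        frames.append(sum(C0[nu] @ state[nu] for nu in range(W)))
        for nu in range(W):                                    # phase correction for next frame
            state[nu] = givens(N * nu * theta).T @ state[nu]
    return np.concatenate(frames)

# check against the static synthesis (noiseless case): both give the same x
W, L, N, K = 8, 64, 16, 4
rng = np.random.default_rng(1)
s = rng.standard_normal(2 * W)
theta = 2.0 * np.pi / L
t = np.arange(N * K)[:, None]
C = np.empty((N * K, 2 * W))
C[:, 0::2] = np.cos(t * np.arange(W)[None, :] * theta)
C[:, 1::2] = np.sin(t * np.arange(W)[None, :] * theta)
assert np.allclose(frames_by_rotation(s, W, L, N, K), C @ s)
```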

¹ One could avoid this technicality by describing the model using complex Gaussian distributions, but this approach precludes certain generalisations.

3. DYNAMIC HARMONIC MODEL

The dynamic harmonic model is a generative model that describes how the audio signal x is generated from a latent process with state variable s_k, for k = 0, …, T/N − 1. Each element s_k^ν of s_k follows its own independent piecewise linear dynamics and is subject to non-stationary Gaussian noise:

s_k^ν | A_k^ν, Q_k^ν ∼ N(s_k^ν; A_k^ν s_{k−1}^ν, Q_k^ν)

Here, the transition matrix A_k^ν and the noise covariance matrix Q_k^ν are 2 × 2. The k'th frame is generated by

x_k | s_k ∼ N(x_k; C s_k, R)

where the observation noise covariance is isotropic with R = σ_R I.

For brevity, we denote the observation matrix C ≡ C_{0:N−1}^{0:W−1}, A_k ≡ diag(A_k^0, …, A_k^{W−1}) and Q_k ≡ diag(Q_k^0, …, Q_k^{W−1}).

The prior structure on the transition matrices and noise covariances is taken as a hidden Markov model with discrete latent state r_k ∈ {0, 1, …, |r|−1}, where |r| denotes the number of distinct states:

r_k^ν | r_{k−1}^ν ∼ M(r_k^ν; π_0(r_{k−1}^ν), …, π_{|r|−1}(r_{k−1}^ν))
A_k^ν | r_k^ν ∼ N(A_k^ν; Θ(r_k^ν), Σ(r_k^ν))
Q_k^ν | r_k^ν ∼ IG(Q_k^ν; a(r_k^ν), b(r_k^ν))

The symbols M, N and IG denote multinomial, (matrix-valued) Gaussian and inverse gamma distributions respectively; their definitions are given in the appendix. The graphical model is shown in Figure 1. In this paper, we will demonstrate the DHM with the following parametrisation:

r_k ∈ {off, onset, on, offset}
Θ(r_k^ν) = ρ(r) B(νθ)^⊤

where ρ(r) is a scalar with 0 = ρ(onset) ≤ ρ(offset) = ρ(off) < ρ(on) ≤ 1. The interpretation is that the oscillations are damped more in the "off" state and less in the "on" state. We let Σ(r_k^ν) = 0 to ensure that A_k is a fixed rotation matrix. Only the following state transitions have nonzero probability:

π_on  = p(on → on)   = 1 − p(on → offset)
π_off = p(off → off) = 1 − p(off → onset)
1 = p(offset → off) = p(onset → on)

The transition noise is constrained to be isotropic, Q_k^ν = σ_k^ν I, with

σ_k^ν | r_k^ν ∼ IG(σ_k^ν; a(r_k^ν), b(r_k^ν))

The parameters a and b should be chosen such that the expected transition noise precision ⟨1/σ_k^ν⟩_IG is small when r_k^ν = onset and very large when r_k^ν ≠ onset. This mechanism models the scenario where the state variable s_k^ν is reinitialised during an onset by injecting a random amount of energy and otherwise follows an almost deterministic dynamics. In Figure 2, we demonstrate a typical sample from the model.
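To give a feel for this generative behaviour, here is a minimal forward-sampling sketch. All numerical values (the ρ table, the transition probabilities, the fixed per-regime noise levels standing in for the inverse gamma draw of σ_k^ν, and the per-frame rotation angle Nνθ taken from the Section 2 recursion) are our own illustrative choices; the paper does not specify them.

```python
import numpy as np

OFF, ONSET, ON, OFFSET = 0, 1, 2, 3
rho   = {OFF: 0.9, ONSET: 0.0, ON: 0.999, OFFSET: 0.9}   # 0 = rho(onset) <= rho(offset) = rho(off) < rho(on) <= 1
noise = {OFF: 1e-8, ONSET: 1.0, ON: 1e-6, OFFSET: 1e-8}  # fixed sigma_k^nu per regime (stand-in for the IG draw)
P = np.array([[0.95, 0.05, 0.00, 0.00],                  # off    -> off / onset
              [0.00, 0.00, 1.00, 0.00],                  # onset  -> on
              [0.00, 0.00, 0.98, 0.02],                  # on     -> on / offset
              [1.00, 0.00, 0.00, 0.00]])                 # offset -> off

def sample_dhm(W=8, L=64, N=16, K=60, sigma_R=1e-3, seed=0):
    """Draw regimes r_k^nu, states s_k and frames x_k from the generative model."""
    rng = np.random.default_rng(seed)
    theta = 2.0 * np.pi / L
    t = np.arange(N)[:, None]
    nus = np.arange(W)[None, :]
    C = np.empty((N, 2 * W))                              # first N rows of the basis
    C[:, 0::2] = np.cos(t * nus * theta)
    C[:, 1::2] = np.sin(t * nus * theta)

    r = np.zeros((K, W), dtype=int)
    s = np.zeros((K, 2 * W))
    x = np.zeros((K, N))
    state = np.zeros(2 * W)
    regime = np.full(W, OFF)
    for k in range(K):
        for nu in range(W):
            regime[nu] = rng.choice(4, p=P[regime[nu]])
            w = N * nu * theta                            # per-frame rotation angle
            B = np.array([[np.cos(w), -np.sin(w)],
                          [np.sin(w),  np.cos(w)]])
            A = rho[regime[nu]] * B.T                     # Theta(r), with Sigma(r) = 0
            blk = slice(2 * nu, 2 * nu + 2)
            state[blk] = A @ state[blk] + np.sqrt(noise[regime[nu]]) * rng.standard_normal(2)
        r[k], s[k] = regime.copy(), state.copy()
        x[k] = C @ state + np.sqrt(sigma_R) * rng.standard_normal(N)
    return r, s, x
```

Concatenating the rows of x yields a signal of the kind shown in Figure 2.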

4. INFERENCE

To solve the problem in (1) exactly, we first need to infer the posterior distribution

p(s, θ|x) = (1/Z_x) p(x|s) p(s|θ) p(θ) ≡ (1/Z_x) φ(s, θ) ≡ P    (3)


Figure 1: Graphical model for the DHM. The rectangle denotes a plate: W copies of the nodes inside.

where s = {s_k^ν} and θ = {(A_k^ν, Q_k^ν)} for all pairs (k, ν). Here, Z_x = p(x) is a normalising constant (also known as the evidence or data likelihood). The parameter posterior is obtained by the marginal p(θ|x) = ∫ ds p(s, θ|x). However, exact evaluation of the posterior distribution in (3) is intractable due to the couplings between θ and s, so we will resort to approximations.

4.1. Structured Mean Field

One possible approximation method that leads to a practical optimisation procedure is structured mean field, also known as variational Bayes; see [12, 13, 14] and references therein. In the particular case of (1), mean field boils down to approximating the exact posterior P in (3) with a simple distribution Q in such a way that the integrand in (1) becomes tractable. An intuitive interpretation of mean field is minimising the KL divergence with respect to (the parameters of) Q, where

KL(Q || P) = ⟨log Q⟩_Q − ⟨log (1/Z_x) φ(s, θ)⟩_Q    (4)
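Since Z_x does not depend on (s, θ), this can be expanded as

KL(Q || P) = ⟨log Q⟩_Q − ⟨log φ(s, θ)⟩_Q + log Z_x ≥ 0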

Here, ⟨f(x)⟩_{p(x)} ≜ ∫ dx p(x) f(x) denotes the expectation of f with respect to p. Using the non-negativity of KL [17], we obtain a lower bound on the evidence, log Z_x ≥ ⟨log φ(s, θ)⟩_Q − ⟨log Q⟩_Q, where maximising this lower bound is equivalent to finding the "nearest" Q to P in terms of KL. In this paper, we choose the approximating distribution Q of the form

Q ≜ Q_s ∏_{ν=0}^{W−1} ( Q_r^ν ∏_{k=1}^{K−1} Q_{A,k}^ν Q_{Q,k}^ν )
   = ∏_{k=1}^{K−1} ( q(s_k | s_{k−1}) ∏_{ν=0}^{W−1} q(r_k^ν | r_{k−1}^ν) q(A_k^ν) q(Q_k^ν) )    (5)

Although a closed-form solution for Q still cannot be found, it can easily be shown, see e.g. [18], that each factor Q_a of the optimal approximating distribution should satisfy the following fixed point equation

Q_a ∝ exp( ⟨log φ(s, θ)⟩_{Q/Q_a} )    (6)

Figure 2: (left) Short-time Fourier transform log-magnitude of a signal generated from the model; (right) corresponding indicators r_k^ν. (Horizontal axes: frame index.)

Figure 3: Factor graph corresponding to the variational approximation in (5). Solid links correspond to substructures where it is feasible to carry out exact inference via belief propagation, and dotted links correspond to the links "broken" by the variational approximation.

where Q/Q_a denotes the product of all factors excluding Q_a. Hence, the mean-field approach leads to a set of fixed point equations that need to be iterated.

The right-hand side of this fixed point iteration can be computed efficiently since log φ(s, θ) is of the form ∑_ξ ϕ(ne(ξ)), where the ϕ are factors (local functions) defined on a small set of variables denoted by ne(ξ) and ξ is an index that runs over the factors. The structure of this expression for the DHM can be conveniently "visualised" using the factor graph [15] depicted in Figure 3. Here, each factor potential ϕ(ne(ξ)) is shown as a black node adjacent to the variable nodes a ∈ ne(ξ). Using this notation, we see that (6) simplifies to

Q_a ∝ exp( ∑_{ξ ∈ ne(a)} ⟨ϕ(ne(ξ))⟩_{Q_{ne(ξ)}/Q_a} )

where ne(a) denotes the set of all factors in which a occurs as an argument. The expectations ⟨ϕ⟩ can be computed easily if all distributions are chosen from a conjugate-exponential family [13], which is the case for the DHM.

The structured mean field approximation for the DHM has a particularly intuitive interpretation: each chain-structured distribution Q_r^ν on r_{0:K−1}^ν corresponds to an HMM and, similarly, the chain-structured Q_s on s_{0:K−1} corresponds to a Kalman filter model. The observations of the HMMs and the transition model of the Kalman filter at each time slice are determined by the sufficient statistics of A_k^ν and Q_k^ν. In other words, the approximation boils down to iteratively running a sequence of Kalman and HMM smoothing algorithms.
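As one concrete ingredient of this iteration, the sketch below (a minimal numpy/scipy illustration; the function name, the particular transition probabilities and the flat per-frame evidence are our own assumptions for demonstration) implements the forward-backward smoothing pass for a single regime chain. In the full scheme, its per-frame log-evidence inputs would be expected log-factors computed under the current q(A_k^ν) and q(Q_k^ν), and it would alternate with a Kalman smoothing pass over s_{0:K−1}.

```python
import numpy as np
from scipy.special import logsumexp

def hmm_smoother(log_obs, log_P, log_prior):
    """Forward-backward smoothing for one regime chain q(r^nu_{0:K-1}).

    log_obs[k, i] : per-frame log-evidence for state i (in the mean-field
                    iteration, expected log-factors from the other q's).
    log_P[i, j]   : log transition probability p(r_k = j | r_{k-1} = i).
    log_prior[i]  : log p(r_0 = i).
    Returns smoothed marginals gamma[k, i] = q(r_k = i).
    """
    K, S = log_obs.shape
    log_alpha = np.full((K, S), -np.inf)
    log_beta = np.zeros((K, S))
    log_alpha[0] = log_prior + log_obs[0]
    for k in range(1, K):                      # forward pass
        log_alpha[k] = log_obs[k] + logsumexp(log_alpha[k - 1][:, None] + log_P, axis=0)
    for k in range(K - 2, -1, -1):             # backward pass
        log_beta[k] = logsumexp(log_P + (log_obs[k + 1] + log_beta[k + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

# example with the {off, onset, on, offset} transition structure of Section 3
P = np.array([[0.95, 0.05, 0.00, 0.00],
              [0.00, 0.00, 1.00, 0.00],
              [0.00, 0.00, 0.98, 0.02],
              [1.00, 0.00, 0.00, 0.00]])
with np.errstate(divide="ignore"):
    log_P = np.log(P)
    log_prior = np.log([1.0, 0.0, 0.0, 0.0])
gamma = hmm_smoother(np.zeros((60, 4)), log_P, log_prior)   # flat evidence as a stand-in
```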


Further computational savings can be obtained by considering the expressions for individual factors. For example, when the frame length N is chosen such that C is the inverse Fourier transform matrix and R = σ_R I,

ϕ(s_k) = −(1/2) Tr C^⊤ R^{−1} C s_k s_k^⊤ + Tr x_k^⊤ R^{−1} C s_k
       = −(1/(2σ_R)) Tr s_k s_k^⊤ + (1/σ_R) Tr (C^⊤ x_k) s_k^⊤

where the term C^⊤ x_k can be computed by the FFT. Other savings (for any N) can be obtained for ϕ(s_k, s_{k−1}, A_k, Q_k) when A_k is a block-diagonal rotation matrix and Q_k is diagonal. Due to page limitations, we will report these details in a future technical report.
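The FFT shortcut can be sketched as follows (our own minimal illustration; it assumes the frame length equals L and the normalised angle θ = 2π/L, with the first W bins retained):

```python
import numpy as np

def basis_correlations_fft(x_k, W):
    """Compute C^T x_k for the interleaved cos/sin basis with a single FFT.

    With X = FFT(x_k) and frame length L:
    (C^T x_k)[2*nu]   = sum_t x_t cos(2*pi*t*nu/L) =  Re X[nu]
    (C^T x_k)[2*nu+1] = sum_t x_t sin(2*pi*t*nu/L) = -Im X[nu]
    """
    X = np.fft.fft(x_k)[:W]
    out = np.empty(2 * W)
    out[0::2] = X.real
    out[1::2] = -X.imag
    return out

# direct check against the explicit basis matrix
L, W = 64, 16
rng = np.random.default_rng(2)
x_k = rng.standard_normal(L)
t = np.arange(L)[:, None]
nu = np.arange(W)[None, :]
C = np.empty((L, 2 * W))
C[:, 0::2] = np.cos(2 * np.pi * t * nu / L)
C[:, 1::2] = np.sin(2 * np.pi * t * nu / L)
assert np.allclose(basis_correlations_fft(x_k, W), C.T @ x_k)
```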

5. CONCLUSIONS

A very attractive feature of the message passing scheme is that the approximating structures can be freely chosen to trade off accuracy against computation time. Moreover, the computer code can be modularised and organised in such a way as to make distributed and parallel execution possible. In addition, by imposing mild restrictions on model parameters, inference can be further sped up by making use of fast transforms such as the fast Fourier transform.

The convergence properties of the resulting iterative algorithms can be further improved using deterministic annealing procedures.

We have recently applied a particular DHM successfully to signal restoration tasks [11]. Restoration results can be listened to at http://www-sigproc.eng.cam.ac.uk/∼atc27/em-restore/. In future work, we will investigate further applications such as polyphonic transcription and audio source separation, and compare variational inference with stochastic methods (e.g. MCMC).

A. APPENDIX

In this section, we define standard distributions in exponential form.

For each case, we denote the log-partition function (log normaliser) by ψ(·).

Matrix-valued Gaussian: Let X be an M_r × M_c matrix. Then vec X is the M_r M_c × 1 vector obtained by concatenating the columns of X. The mean Θ has the same size as X and the covariance Σ is M_r M_c × M_r M_c. A matrix-valued Gaussian is then

N(X; Θ, Σ) ≜ N(vec X; vec Θ, Σ)

Multinomial: Let π ≡ (π_0, …, π_i, …, π_{|r|−1}) be a vector such that π_i ≥ 0 for all i = 0, …, |r|−1. We denote a multinomial distribution in exponential form as

M(r; π) ≜ exp( ∑_{i=0}^{|r|−1} [r = i] log π_i − ψ(π) )

where ψ(π) = log ∑_j π_j and [r = i] is the indicator of the event that r = i. Note that the expectation ⟨[r = i]⟩_M = π_i / ∑_j π_j of the indicator simply gives the probability of the event.

Inverse gamma: Let Γ(a) denote the gamma function and define the digamma function as Ψ(a) ≡ ∂ log Γ(a)/∂a. Let Σ = diag(σ_1, …, σ_M) be a diagonal M × M matrix with σ_i ≥ 0, and define a ≡ diag(a_1, …, a_M) and b ≡ diag(b_1, …, b_M). The inverse gamma distribution in exponential form is then given by

IG(Σ; a, b) ≜ exp( −Tr (I + a) log Σ − Tr b^{−1} Σ^{−1} − ψ ),    ψ = ψ(a, b) ≜ Tr log Γ(a) + Tr a log b

⟨Σ^{−1}⟩_IG = diag(b_1 a_1, …, b_M a_M) ≡ ba

⟨log |Σ^{−1}|⟩_IG = ∑_{j=1}^{M} (Ψ(a_j) + log b_j) = Tr Ψ(a) + log |b|

Note that IG(Σ; a, b) = ∏_{j=1}^{M} IG(σ_j; a_j, b_j).
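For reference, the two expectations above can be computed directly; the following small sketch (the function name and the example hyperparameter values are our own, chosen only to illustrate the onset/non-onset precision discussion of Section 3) uses scipy's digamma.

```python
import numpy as np
from scipy.special import digamma

def ig_expectations(a, b):
    """Expectations under the diagonal inverse gamma parametrisation above:
    <1/sigma_j> = a_j * b_j and <log(1/sigma_j)> = digamma(a_j) + log(b_j)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a * b, digamma(a) + np.log(b)

# illustrative values: small <1/sigma> for "onset", large otherwise
inv_prec_onset, _ = ig_expectations(2.0, 0.05)     # <1/sigma> = 0.1
inv_prec_other, _ = ig_expectations(2.0, 5000.0)   # <1/sigma> = 1e4
```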

B. REFERENCES

[1] J. L. Flanagan and R. M. Golden, "Phase vocoder," Bell System Technical Journal, pp. 1493–1509, November 1966.

[2] M. Dolson, "The phase vocoder: A tutorial," Computer Music Journal, vol. 10, no. 4, pp. 14–27, 1986.

[3] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[4] X. Serra and J. O. Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1991.

[5] X. Rodet, "Musical sound signals analysis/synthesis: Sinusoidal + residual and elementary waveform models," Applied Signal Processing, 1998.

[6] L. Parra and U. Jain, "Approximate Kalman filtering for the harmonic plus noise model," in Proc. of IEEE WASPAA, New Paltz, 2001.

[7] R. A. Irizarry, "Weighted estimation of harmonic components in a musical sound signal," Journal of Time Series Analysis, vol. 23, 2002.

[8] P. J. Walmsley, "Signal separation of musical instruments," Ph.D. dissertation, University of Cambridge, 2000.

[9] M. Davy and S. J. Godsill, "Bayesian harmonic models for musical signal analysis," in Bayesian Statistics 7, 2003.

[10] P. J. Wolfe, S. J. Godsill, and W. Ng, "Bayesian variable selection and regularisation for time-frequency surface estimation," Journal of the Royal Statistical Society Series B, 2004, read paper (with discussion).

[11] A. T. Cemgil and S. J. Godsill, "Probabilistic Phase Vocoder and its application to Interpolation of Missing Values in Audio Signals," accepted to 13th European Signal Processing Conference, Antalya, Turkey: EURASIP, 2005.

[12] W. Wiegerinck, "Variational approximations between mean field theory and the junction tree algorithm," in UAI-2000 (16th conference), pp. 626–633.

[13] Z. Ghahramani and M. Beal, "Propagation algorithms for variational Bayesian learning," in Neural Information Processing Systems 13, 2000.

[14] M. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Department of Statistics, UC Berkeley, Tech. Rep. 649, September 2003.

[15] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, February 2001.

[16] A. J. Dobson, An Introduction to Generalized Linear Models, 2nd ed. London: Chapman and Hall/CRC, November 2001.

[17] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, Inc., 1991.

[18] J. Winn and C. Bishop, "Variational message passing," submitted to Journal of Machine Learning Research, 2004.
