
Bayesian Hierarchical Models and Inference for Musical Audio Processing

Paul Peeling, A. Taylan Cemgil, and Simon Godsill

Signal Processing and Communications Laboratory, Department of Engineering, University of Cambridge, UK

Abstract—In music transcription and related musical signal processing applications we are interested in determining the activity of a set of sound-generating sources over time, having extracted features such as DFT coefficients or spectral peaks from the audio. A probabilistic treatment requires the construction of a Bayesian hierarchy: a dynamical model of how the source activity changes over time, and a generative model of how the features are produced from the active sources. The resulting Bayesian network is of extremely large dimension, for which standard MCMC and variational inference methods struggle. In this paper we describe some models developed recently for these tasks, which also have utility in audio and general signal processing applications; and investigate hybrid inference strategies: sequential methods such as particle filtering on the source activity dynamics coupled with MCMC inference over latent parameters of the generative models, demonstrating their performance on real data.

I. INTRODUCTION

Musical signal processes are hierarchical in nature. At the highest level, the information contained in a musical score can be transformed into an activity map of sources, which are sounding (‘on’) or muted (‘off’) at any given time, and may also have volume and timbre information attached, in the same manner as MIDI Note messages. A number of dynamical models for source activity have been proposed in [1], [2], [3].

At the lowest level, individual samples of audio data can be modelled directly using oscillator concepts [4], or transforms are applied to windowed frames to extract relevant features. As the popular spectrogram representation of musical and audio signals demonstrates, there is a structured correlation between observed features, caused, for example, by the harmonics of a musical note, which has been exploited in [5], [6] for instance. A computationally attractive framework for modelling these correlations is a Markov random field of variables with conjugate priors in an exponential family of probability distributions. We use a hierarchical source model for non-parametric transform methods, such as the Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT):

p(s|v) p(v) = ( ∏_{ν,τ} p(s_{ν,τ} | v_{ν,τ}) ) p(v)  (1)

where s are vectors of transform coefficients s_{ν,τ} and v are the associated variances, corresponding to the energy of the particular time-frequency atom, over time indices τ and frequency indices ν. In particular, we will assume that the expansion coefficients s_{ν,τ} are conditionally Gaussian and the variances v_{ν,τ} are nonnegative random variables distributed according to an inverse-gamma distribution:

s_{ν,τ} ∼ N(s_{ν,τ}; 0, v_{ν,τ} I_d)  (2)
v_{ν,τ} ∼ IG(v_{ν,τ}; a, b)  (3)

I_d is an identity matrix, taken 1×1 for the DCT (real coefficients) and 2×2 for a DFT representation (real and complex coefficients). The marginal distribution on the transform coefficients is non-Gaussian and heavy-tailed by construction, which corresponds to the coefficient distributions found in speech and audio, see e.g. [7]. Here a and b are the shape and scale parameters of the inverse-gamma distribution, and may themselves be functions a(v¬), b(v¬) of the variance v¬ of any other time-frequency atom. In Figure 1 we show a realistic scenario, where the parameters of the variance of each atom in a particular time frame τ are correlated with an overall energy parameter λ_τ. For a musical note, λ_τ is the overall volume of the note, and this energy is distributed mainly in the harmonics of the note, giving rise to a harmonic ‘template’

as shown in Figure 2. Further details and alternative Markov field topologies are given in [8], [9].
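As a quick illustrative check of the hierarchy in (2)-(3), the following sketch draws variances from the inverse-gamma prior and coefficients conditionally (DCT case, I_d = 1×1), and verifies that the resulting marginal is heavy-tailed. The hyperparameter values a = 3, b = 1 are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shape/scale hyperparameters; the paper leaves a, b free here.
a, b = 3.0, 1.0
n = 100_000

# v ~ IG(a, b): draw as the reciprocal of a Gamma(a, rate=b) variate
v = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=n)

# s | v ~ N(0, v): conditionally Gaussian transform coefficients
s = rng.normal(loc=0.0, scale=np.sqrt(v))

# The marginal of s is a Student-t with 2a degrees of freedom; its sample
# kurtosis exceeds the Gaussian value of 3, confirming the heavy tails.
kurtosis = np.mean(s**4) / np.mean(s**2) ** 2
print(kurtosis)
```

The scale-mixture construction is what makes the marginal non-Gaussian: large draws of v occasionally permit very large coefficients, matching the sparse energy distribution of audio transforms.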

The model in Figure 1 takes the particular form:

v_{ν,τ} | λ_τ ∼ IG(v_{ν,τ}; a, (a b_ν λ_τ)^{-1})  (4)

and a gamma-Markov chain is used to model the energy process over time:

λ_τ | z_τ ∼ G(λ_τ; a_λ, (z_τ a_λ)^{-1})  (5)
z_τ | λ_{τ−1} ∼ G(z_τ; a_z, (λ_{τ−1} a_z)^{-1})  (6)

where we have included a latent variable z to ensure positive correlation between successive values of λ_τ.
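The chain (5)-(6) can be simulated directly. The sketch below (the hyperparameter values a_λ = a_z = 50 and the chain length are our own illustrative choices) confirms the positive correlation between successive energies that the latent variable z is designed to induce.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hyperparameters; large shape values give strong coupling.
a_lam, a_z = 50.0, 50.0
T = 2000

lam = np.empty(T)
lam[0] = 1.0
for t in range(1, T):
    # z_t | lam_{t-1} ~ G(a_z, (lam_{t-1} a_z)^{-1}):  E[z_t] ≈ 1 / lam_{t-1}
    z = rng.gamma(shape=a_z, scale=1.0 / (lam[t - 1] * a_z))
    # lam_t | z_t ~ G(a_lam, (z_t a_lam)^{-1}):        E[lam_t] ≈ 1 / z_t
    lam[t] = rng.gamma(shape=a_lam, scale=1.0 / (z * a_lam))

# z inverts the energy twice, so successive lam values are positively
# correlated; check the lag-1 correlation on the log scale.
r = np.corrcoef(np.log(lam[:-1]), np.log(lam[1:]))[0, 1]
print(r)
```

Conditioning λ_τ on z_τ rather than directly on λ_{τ−1} is what keeps the model conjugate: both full conditionals remain gamma distributions.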

Parametric transforms, such as sinusoidal modelling, which identifies frequencies present in the signal, can be modelled by point processes over the parameter space, using a Poisson source model:

x ∼ P(r) (7)

where r is a rate parameter with a Gamma-distribution as conjugate prior, hence we can use the same class of prior



Fig. 1. A gamma Markov random field for a musical source, where a DFT transform has distributed the energy into W bins. Each edge in this graph has a weight which describes the correlation between the connected variables. The high-level parameters λ_τ describe the overall energy of the source at time τ. The edge weights between λ_τ and the variances v_1, ..., v_W in each bin describe the proportion of the overall energy each individual bin receives, see Figure 2. Between λ_{τ−1} and λ_τ we have a correlation that encourages the overall energy to decrease over time, as for plucked strings (guitar) or struck strings (piano).

correlation structure as for the model described in (2). A form of this model has been used for music transcription in [10].

This paper will describe inference methods developed for musical signal models combining dynamical activity models and feature-vector source models. Standard techniques for Bayesian networks and Markov random fields, notably the Gibbs sampler and Variational Bayes, have been observed to perform poorly, as each possible configuration of the activity describes a mode under the source models, hence the posterior is inherently multi-modal. Hybrid techniques, combining these techniques with inference methods for dynamical systems, can help to overcome these issues. In Section II we describe inference on the source models, and in Section III we incorporate these methods into a hybrid scheme. Throughout, we provide indicative results of the performance of the algorithms.

II. SOURCE MODEL INFERENCE

We consider inference with the Gaussian source model (2) for STFT bins. Two samples are drawn from each Gaussian distribution, for the real and imaginary parts of the STFT coefficient. We use a gamma Markov random field on the variance parameters of the form shown in Figure 1.

Using this model we can obtain very reasonable estimates of the active notes present within a snapshot (‘chord’) of musical audio, see Figure 3 for example. We perform inference over a set of possible sources j = 1, . . . , J, and assume that the sources are mixed under Gaussian noise:

y ∼ N(C s^{(j)}, ρ)  (8)

where y are the observed STFT coefficients, ρ is the noise variance, and C denotes a mixing vector with each element set to 1. Here we will assume all possible sources are active, and infer the overall energy of each, suggesting that a high energy corresponds to an active note, and a low energy to an inactive note. We may use stochastic Markov chain Monte Carlo (MCMC) methods such


Fig. 2. Trained weights for the edges in Figure 1, using a database of individual piano notes and a DFT length of 1024. Dark regions correspond to high energy, and frequency bins corresponding to harmonics of the fundamental frequency will have a higher proportion of the overall energy.


Fig. 3. Detection of three notes (MIDI numbers 48, 64, 78) within a chord. The top panel shows the energy assigned to the 88 possibly active sources, corresponding to the notes on a piano. The additional note 76 is harmonically related to 64, and thus shares some of that note’s energy. The lower panels show how the energy decays over time.

as Gibbs’ sampler or deteministic methods such as Variational Bayes, which are both standard inference methods on Markov random fields. In this paper we will restrict our discussion to Monte Carlo methods.

The Gibbs sampler [11] is a particular Monte Carlo method that relies on generating samples {ξ^{(t)}}_{t=1,2,...} via simulation of a Markov chain with the desired target density p(ξ). The algorithm proceeds by sampling each random variable i ∈ V from the full conditional distribution p(ξ_i | ξ_{−i}^{(t−1)}), where −i ≡ V \ i.
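As a toy instance of such a scheme, the sketch below runs a two-variable Gibbs sweep for a single atom under the conjugate model (2)-(3), with a hypothetical noisy observation y = s + e, e ∼ N(0, ρ). The observation model and all numerical values here are our own illustrative assumptions; both full conditionals are closed-form because the priors are conjugate.

```python
import numpy as np

rng = np.random.default_rng(2)

a, b, rho = 2.0, 1.0, 0.1   # hypothetical hyperparameters and noise variance
y = 3.0                     # hypothetical observed coefficient

s, v = 0.0, 1.0
samples = []
for t in range(5000):
    # v | s ~ IG(a + 1/2, b + s^2/2): conjugate inverse-gamma update
    v = 1.0 / rng.gamma(shape=a + 0.5, scale=1.0 / (b + 0.5 * s**2))
    # s | v, y ~ N(v y / (v + rho), v rho / (v + rho)): conjugate Gaussian update
    mean = v * y / (v + rho)
    var = v * rho / (v + rho)
    s = rng.normal(mean, np.sqrt(var))
    if t >= 1000:            # discard burn-in
        samples.append(s)

post_mean = np.mean(samples)
print(post_mean)  # posterior mean of s; close to y since rho is small
```

Each sweep alternates the two conditional draws; for the full field of Figure 1 the same pattern is applied variable-by-variable across the graph.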

One problem with the Gibbs sampler is convergence speed.


In practice, the chain takes a prohibitively long time to converge and occasionally becomes trapped in local maxima.

To speed up convergence, special strategies, such as tempering (gradually changing the target density) or blocking (grouping random variables) need to be employed [11].

The models described in Section I are deliberately chosen such that the conditional distributions are of closed form and are straightforward to sample from, as the priors are chosen to be conjugate exponential. The forms of the conditional distributions are standard, and can be found in [12], with the exception of the conditional distribution of the source coefficients. For a single time-frequency atom we have

p(s^{(1)}, ..., s^{(J)} | y, v^{(1)}, ..., v^{(J)}) = N(ρ^{-1} Σ C^T y, Σ)  (9)

where the covariance matrix Σ is the inverse of a rank-1 update to the covariance of the prior. That is, if we write

s^{(1)}, ..., s^{(J)} | v^{(1)}, ..., v^{(J)} ∼ N(0, V)  (10)

where V is a diagonal matrix with elements V_{j,j} = v^{(j)}, then

Σ^{-1} = V^{-1} + (1/ρ) C^T C  (11)

and by the Woodbury formula [13],

Σ = V − (1/(ρ + Tr V)) V C^T C V  (12)

Rather than computing the Cholesky factorization of (11) directly, which is necessary to sample from the multivariate normal density (9) and has a computational complexity of O(n³), we observe that (12) is a rank-1 downdate of V and use one of the algorithms described in [14], [15], with complexity O(n²). Making use of the fact that V is a diagonal matrix and the downdating vector CV is just the diagonal of V allows further simplifications. Drawing samples from (9) is therefore not as computationally demanding as for a general full covariance matrix, and hence the Gibbs sampling scheme remains a feasible method of inference for these large models.
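The identity between (11) and (12), and the resulting sampling step, can be checked numerically. In this sketch the dimensions, variances, and observation are arbitrary illustrative values, and a full Cholesky factorisation stands in for the dedicated O(n²) downdate routine of [14] that would be used in practice.

```python
import numpy as np

rng = np.random.default_rng(3)

J = 6
v = rng.gamma(2.0, 1.0, size=J)          # hypothetical source variances (diag of V)
rho = 0.5                                 # hypothetical noise variance
C = np.ones((1, J))                       # mixing row vector, all elements 1

# Direct inverse of (11): Sigma^{-1} = V^{-1} + (1/rho) C^T C
Sigma_direct = np.linalg.inv(np.diag(1.0 / v) + C.T @ C / rho)

# Woodbury form (12): Sigma = V - V C^T C V / (rho + Tr V),
# a rank-1 downdate of the diagonal matrix V (C V is just the diagonal of V)
u = v.copy()
Sigma_woodbury = np.diag(v) - np.outer(u, u) / (rho + v.sum())

assert np.allclose(Sigma_direct, Sigma_woodbury)

# Sample from (9): mean rho^{-1} Sigma C^T y, covariance Sigma
y = 1.7                                   # hypothetical observed STFT coefficient
mu = Sigma_woodbury @ C.T.ravel() * y / rho
L = np.linalg.cholesky(Sigma_woodbury)
sample = mu + L @ rng.standard_normal(J)
print(sample.shape)
```

The downdated matrix stays positive definite because the subtracted rank-1 term is bounded by the Woodbury construction, so the Cholesky factor always exists.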

We can assess the performance of these models for chord detection, by ordering the possible notes by overall inferred energy λτ, and using average precision as a performance measure of this ordering. We have been able to obtain an average precision of 90% or more on a database of 100 musically relevant chords selected from the MIDI files in the Poliner-Ellis database [16]. However in a practical setting, we will have a much larger number of different sources to compare with, corresponding to different notes, instruments and volumes. This reduces the validity of the model, since it is unreasonable to begin with the assumption that every possible note is playing with some energy.

We therefore consider ways of inferring which sources are active at any given time. The number of possible combinations of active sources is prohibitively large, so we cannot perform exact inference.

III. ACTIVITY INFERENCE

Here we focus on Monte Carlo techniques for performing approximate inference. We distinguish between static and dynamic problems for the activity: static means that the activity does not change over the period of time considered (the musical chord detection problem), while in the dynamic case it does (the music transcription problem). For the static problem, we do not know the number of notes in the chord and so need to allow jumps between activity configurations of different dimension, requiring a reversible jump Monte Carlo scheme [17]. For the dynamical system, sequential Monte Carlo (SMC) methods [18], which are based on sequential importance sampling (also known as particle filtering), have proved to be powerful, especially for dynamic or time series models. SMC methods have the advantage that they are inherently online, simple to implement and flexible. In many applications, merely increasing the amount of computation tends to improve the estimation results. SMC methods approximate a target density (often the posterior of a time series model) p(x_{0:K}|y_{0:K}) of a hidden Markov process x_{0:K} as a set of N particle trajectories {x^{(i)}_{0:K}, i = 1, ..., N} drawn from an importance function π, which allows the normalised importance weights

w̃^{(i)}_k = w^{*(i)}_k / Σ_{i=1}^N w^{*(i)}_k,  where  w^{*(i)}_k = [p(y_k|x^{(i)}_k) p(x^{(i)}_k|x^{(i)}_{k−1}) / π(x^{(i)}_k | x^{(i)}_{0:k−1}, y_{0:k})] w^{*(i)}_{k−1}

to be computed sequentially. The particle trajectories x^{(i)}_{0:k} = (x^{(i)}_{0:k−1}, x^{(i)}_k) are constructed sequentially by sampling the importance function: x^{(i)}_k ∼ π(x_k | x^{(i)}_{0:k−1}, y_{0:k}). To keep the approximation accurate over time (by keeping the variance of the importance weights bounded), periodic resampling steps are applied. Here, particles are drawn according to a distribution based on the importance weights and the importance weights are reset to 1/N. The bootstrap filter [18] is the simplest SMC method, where the importance function is simply the prior, i.e. π(x_{0:K}|y_{0:K}) = p(x_{0:K}), and resampling is carried out at every iteration by copying each particle N^{(i)}_k times according to a multinomial distribution with parameters w̃^{(i)}_k. However, whilst the bootstrap filter works in many time series analysis problems, it is well known that when the latent state dimension is high, it can quickly become ineffective. To render the SMC approach feasible many improvements are needed. One such improvement is “Rao-Blackwellisation”, i.e. exploiting model structure to reduce the sampling dimension by analytically integrating out some of the variables, conditioned on the others.
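A minimal bootstrap filter on a toy scalar state-space model shows the propagate-weight-resample loop described above. The random-walk model and all parameter values here are illustrative assumptions, not the musical model of this paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy model: x_k = x_{k-1} + w_k, y_k = x_k + e_k (hypothetical parameters)
T, N = 50, 500
q, r = 0.1, 0.5   # process / observation noise standard deviations

# Simulate a ground-truth trajectory and noisy observations
x_true = np.cumsum(rng.normal(0, q, size=T))
y = x_true + rng.normal(0, r, size=T)

particles = np.zeros(N)
estimates = []
for k in range(T):
    # Proposal = prior (bootstrap filter): propagate through the dynamics
    particles = particles + rng.normal(0, q, size=N)
    # Weight by the observation likelihood p(y_k | x_k), normalised stably
    logw = -0.5 * ((y[k] - particles) / r) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    estimates.append(np.sum(w * particles))
    # Multinomial resampling at every step; weights implicitly reset to 1/N
    particles = particles[rng.choice(N, size=N, p=w)]

rmse = np.sqrt(np.mean((np.array(estimates) - x_true) ** 2))
print(rmse)  # should be well below the raw observation noise r
```

The filtered estimate improves on the raw observations because each step fuses the dynamics prior with the current likelihood, which is exactly the behaviour exploited in the hybrid schemes below.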

Both schemes require a proposal distribution, which helps the algorithm to concentrate on relevant combinations. One way of generating such a proposal is shown in Figure 4, where the probability that the chord contains only a single note is estimated. We would expect that notes present in the chord would have a higher likelihood than notes not present, and the figure shows this to be indeed the case.

As a demonstration, we present two example applications of hybrid inference schemes using a simple dynamical model



Fig. 4. Log likelihood p(y|r_j) of each possible note r_j considered in isolation, for the chord in Figure 3. The correct MIDI numbers 48, 64, 78 are indicated by arrows on the plot, and are shown to have high likelihood. An exhaustive search across all possible combinations of notes is infeasible, but this figure shows that approximate inference and search strategies will rapidly converge to the correct solution. The log likelihood itself is approximated by running the Gibbs sampler and computing the likelihood of the configuration found.

for the current position of the music in the score, which we call the ‘score pointer’. If the score pointer r_k at time k is at position n in the score, the probability that the score pointer moves to the next position n + 1 in the score at time k + 1 is denoted by:

p(r_{k+1} = n + 1 | r_k = n) = p(t)  (13)

where the transition probability p(t) is an increasing function of the tempo t. This model is chosen because inference is straightforward: we can formulate it as a hidden Markov model (HMM) and, for short extracts at least, apply the Viterbi or forward-backward algorithms described in [19], [20]. Figure 5 shows the inferred position of the score pointer at regular times for a score-alignment application where the score, but not the tempo, is already known, using DFT feature vectors and the Viterbi algorithm, which gives the most likely sequence r_k for the score pointer. However, the Viterbi algorithm implements an exhaustive search of complexity O(N²K), where N is the number of possible score positions and K is the number of frames. For larger scores, SMC methods are more appropriate.
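A sketch of Viterbi decoding for the score-pointer model (13) follows: a left-to-right HMM where the pointer either stays at position n or advances to n + 1 with probability p_t. The function name is our own, and the emission log-likelihoods are synthetic stand-ins for the feature-based observation model.

```python
import numpy as np

def viterbi_score_pointer(loglik, p_t):
    """Most likely pointer trajectory under the stay/advance model (13).

    loglik[k, n]: observation log-likelihood of frame k at score position n.
    """
    K, N = loglik.shape
    log_stay, log_adv = np.log1p(-p_t), np.log(p_t)
    delta = np.full((K, N), -np.inf)
    back = np.zeros((K, N), dtype=int)
    delta[0, 0] = loglik[0, 0]            # pointer starts at the first position
    for k in range(1, K):
        for n in range(N):
            stay = delta[k - 1, n] + log_stay
            adv = delta[k - 1, n - 1] + log_adv if n > 0 else -np.inf
            back[k, n] = n if stay >= adv else n - 1
            delta[k, n] = max(stay, adv) + loglik[k, n]
    # Backtrack the most likely trajectory from the best final position
    path = [int(np.argmax(delta[-1]))]
    for k in range(K - 1, 0, -1):
        path.append(back[k, path[-1]])
    return path[::-1]

# Synthetic example: each of 4 score positions is "heard" for 3 frames
true_path = np.repeat(np.arange(4), 3)
loglik = np.full((12, 4), -5.0)
loglik[np.arange(12), true_path] = 0.0    # correct position is most likely
print(viterbi_score_pointer(loglik, p_t=0.3))
```

Because the transition structure only permits stay or single-step advance, the inner loop touches two predecessors per state, so this implementation runs in O(NK) rather than the general O(N²K).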

Figure 6 shows the inferred transcription for a monophonic piece of piano music, using DCT vectors. The overall energy of each note is jointly inferred, and the characteristic attack and decay of piano notes can be seen in this figure.

IV. CONCLUSION

There are a set of available and flexible models for musical signal processes, consisting of dynamic models describing how the source activity, or a musical score, progresses over time, and low-level signal models of transform coefficients.

When these models are linked together hierarchically and a Bayesian probabilistic treatment is applied, the resulting modelling scheme is intuitive, and can be applied to a diverse range of applications and situations. In this paper we have


Fig. 5. Score alignment by determining the Viterbi path of the score pointer. The vertical bars in this figure denote the position of the score pointer matched in the MIDI representation below and the spectrogram representation above. The observation likelihood p(y_{τ,ν}|r_τ) can be estimated by assuming λ_τ and s_{τ,ν} observed and marginalizing the remaining latent variables v_{τ,ν} (frequency index dropped for clarity). We have found that this estimate is adequate for score alignment purposes.

[Figure 6 panels: log p(r_τ|s_τ) over MIDI note number and time; the filtered energy estimate Σ_i w̃^{(i)}_τ λ^{(i)}_τ (log λ) over time; and the MDCT of the audio (source: Daniel-Ben Pienaar), frequency vs. time.]

Fig. 6. Monophonic piano transcription, using a bootstrap particle filtering algorithm. The marginal filtering density over the possible notes r_τ and the minimum mean-squared-error estimate of λ_τ are shown. The observation likelihood p(y_{τ,ν}|r_τ, λ_τ) is estimated by importance sampling: we draw N = 1000 samples at each step from p(s_{τ,ν}|λ_τ), marginalizing the latent variables v_{τ,ν}, and weight each sample by the likelihood function (8). Denoting s^i_{τ,ν}, i = 1, ..., N as the set of samples thus drawn for a given λ_τ, r_τ, the Monte Carlo estimate is p(y_{τ,ν}|r_τ, λ_τ) ≈ (1/N) Σ_{i=1}^N p(y_{τ,ν}|s^i_{τ,ν}). The proposal distribution for the bootstrap filter is factored independently as q(r_τ|y_{τ,ν}) q(λ_τ|λ_{τ−1}), with q(r_τ|y_{τ,ν}) ≡ p(y_{τ,ν}|r_τ) being the monophonic likelihood, as shown for example in Figure 4, and q(λ_τ|λ_{τ−1}) implemented by simulating a gamma-Markov chain (equations (5) and (6)). 100 particles are used to approximate the posterior distribution, using the weighting and resampling methods described in Section III.

demonstrated how inference schemes designed to work on one of these models, such as the Gibbs sampler for the DFT source model and particle filtering for the score dynamics model, can be combined hierarchically in the same manner as the models themselves are combined, and that these hybrid inference schemes provide acceptable solutions for real musical data.

ACKNOWLEDGMENT

This research is funded by EPSRC (Engineering and Physical Sciences Research Council of UK) under the grant EP/D03261X/1 entitled “Probabilistic Modelling of Musical Audio for Machine Listening”.


REFERENCES

[1] P. H. Peeling, A. T. Cemgil, and S. J. Godsill, “A probabilistic framework for matching music representations,” in 8th International Symposium on Music Information Retrieval (ISMIR), Vienna, Austria, September 2007. [Online]. Available: http://www-sigproc.eng.cam.ac.uk/php23/peeling07ismir.html

[2] N. Whiteley, A. T. Cemgil, and S. J. Godsill, “Bayesian modelling of temporal structure in musical audio,” in 7th International Symposium on Music Information Retrieval (ISMIR), Victoria, BC, Canada, October 2006.

[3] C. Raphael, “A hybrid graphical model for aligning polyphonic audio with musical scores,” in 5th International Symposium on Music Infor- mation Retrieval (ISMIR), Barcelona, Spain, October 2004.

[4] A. T. Cemgil, H. J. Kappen, and D. Barber, “A generative model for music transcription,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 2, pp. 679–694, March 2006.

[5] M. Reyes-Gomez, N. Jojic, and D. Ellis, “Deformable spectrograms,” in AI and Statistics Conference, Barbados, 2005.

[6] C. Févotte, L. Daudet, S. J. Godsill, and B. Torrésani, “Sparse regression with structured priors: Application to audio denoising,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.

[7] R. Martin, “Speech enhancement based on minimum mean square error estimation and supergaussian priors,” IEEE Trans. Speech and Signal Processing, vol. 13, no. 5, 2005.

[8] A. T. Cemgil and O. Dikmen, “Conjugate gamma Markov random fields for modelling nonstationary sources,” in 7th International Conference on Independent Component Analysis and Signal Separation (ICA), 2007.

[9] A. T. Cemgil, P. H. Peeling, O. Dikmen, and S. J. Godsill, “Prior structures for time-frequency energy distributions,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, October 2007. [Online]. Available: http://www-sigproc.eng.cam.ac.uk/php23/cemgil07waspaa.html

[10] P. H. Peeling, C. Li, and S. J. Godsill, “Poisson point process modeling for polyphonic music transcription,” Journal of the Acoustical Society of America Express Letters, vol. 121, no. 4, pp. EL168–EL175, April 2007. [Online]. Available: http://scitation.aip.org/journals/doc/JASMAN-ft/vol 121/iss 4/EL168 1-div0.html

[11] J. S. Liu, Monte Carlo strategies in scientific computing. Springer, 2004.

[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed. CRC Press, 2003.

[13] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore: Johns Hopkins University Press, 1996.

[14] P. Gill, G. Golub, W. Murray, and M. Saunders, “Methods for modifying matrix factorizations,” Mathematics of Computation, vol. 28, pp. 505–535, 1974.

[15] K. B. Petersen and M. S. Pedersen, “The Matrix Cookbook,” February 2008, version 20070905. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3274

[16] G. Poliner and D. Ellis, “A discriminative model for polyphonic piano transcription,” Eurasip Journal of Advances in Signal Processing, special issue on Music Signal Processing, 2007.

[17] C. Robert and G. Casella, Monte Carlo Statistical Methods. New York, NY: Springer, 2000.

[18] A. Doucet, N. de Freitas, and N. J. Gordon, Eds., Sequential Monte Carlo Methods in Practice. Springer Verlag, 2001.

[19] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[20] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February 1989.
