2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 21-24, 2007, New Paltz, NY

MULTI-OBJECT TRACKING OF SINUSOIDAL COMPONENTS IN AUDIO WITH THE GAUSSIAN MIXTURE PROBABILITY HYPOTHESIS DENSITY FILTER

Daniel Clark∗, Ali-Taylan Cemgil, Paul Peeling and Simon Godsill

Signal Processing and Communications Laboratory, University of Cambridge, UK

{dec30, atc27, php23, sjg30}@cam.ac.uk

ABSTRACT

We address the problem of identifying individual sinusoidal tracks from audio signals using multi-object stochastic filtering techniques. These techniques have attractive properties for audio analysis: it is conceptually straightforward to distinguish between measurements that are generated by actual targets and those which are false alarms. Moreover, we can estimate target states when observations are missing and can maintain the identity of these targets between time-frames. We illustrate a particularly useful variant, the Probability Hypothesis Density (PHD) filter, on measurements of musical harmonics determined by high resolution subspace methods, which provide very accurate estimates of amplitudes, frequencies and damping coefficients of individual sinusoidal components. We demonstrate this approach in a musical audio signal processing application for extracting frequency tracks of harmonics of notes played on a piano.

1. INTRODUCTION

Sinusoidal modelling is a well-established technique in the coding and analysis of audio [1]. Typically, the sinusoidal components are detected frame by frame using Fourier techniques, and consecutive frames are connected by heuristic tracking methods. In recent years, subspace-based high resolution techniques have become popular in the audio community; with these, highly accurate estimates of signal poles (damping factors and frequencies) and complex amplitudes can be obtained, at least in high SNR conditions [2, 3, 4]. However, since the analysis is carried out frame by frame, the sinusoidal tracks need to be estimated separately, a step that is crucial in many signal processing applications such as coding, time/pitch modification or feature extraction.

We attempt to address the problem of identifying the individual frequencies over their duration by using a technique devised by the object-tracking community called the Probability Hypothesis Density filter [5]. In the next section, we briefly describe the subspace method used to generate the measurements from the musical audio data. Section 3 presents the Probability Hypothesis Density filter and its closed-form solution [6]. The tracking model used in our implementation is presented in section 4, together with results of the sinusoidal tracking. A discussion of the proposed technique for this application of audio tracking is given in the conclusions.

∗This work was funded by the EU Framework 6, Network MUSCLE project and the EPSRC project "Probabilistic Modelling of Musical Audio for Machine Listening".

2. SINUSOIDAL ESTIMATION USING SUBSPACE TECHNIQUES

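In this section we review the parameter extraction techniques based on subspace methods [7, 8, 2, 3, 4]. The objective is to find the parameters of a set of damped sinusoids that represent a short frame of audio data. The damped sinusoidal model is given by

$$\tilde{y}_n(\alpha, d, \omega, \phi) = \alpha e^{-dn} \cos(\omega n + \phi) \tag{1}$$

where $\alpha$ is the (real) amplitude, $d$ is the log-damping coefficient, $\omega$ is the frequency and $\phi$ is the phase. An equivalent representation by complex poles and amplitudes can be derived by using the basic trigonometric identity $\cos(\theta) = (e^{j\theta} + e^{-j\theta})/2$:

$$\begin{aligned}
\tilde{y}_n &= \alpha e^{-dn} \cos(\omega n + \phi) = \frac{\alpha}{2} e^{-dn} \left( e^{j\omega n + j\phi} + e^{-j\omega n - j\phi} \right) \\
&= \frac{\alpha}{2} \left( e^{(-d + j\omega) n} e^{j\phi} + e^{(-d - j\omega) n} e^{-j\phi} \right) \\
&= c z^n + c^* z^{*n} \equiv \tilde{y}_n(c, z)
\end{aligned}$$

Here, $*$ denotes the complex conjugate, and we use the fact that $e^{(-d + j\omega) n} = (e^{-d + j\omega})^n$ and define

$$c \equiv \frac{\alpha}{2} e^{j\phi} \quad \text{(complex amplitude)}, \qquad z \equiv e^{-d + j\omega} \quad \text{(complex pole)}.$$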
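The equivalence between the real form (1) and the pole form can be checked numerically. The following is a minimal sketch using only the Python standard library; the parameter values are illustrative assumptions and are not taken from the paper.

```python
import cmath
import math

def damped_sinusoid(n, alpha, d, omega, phi):
    """Real form of equation (1): alpha * exp(-d n) * cos(omega n + phi)."""
    return alpha * math.exp(-d * n) * math.cos(omega * n + phi)

def pole_form(n, c, z):
    """Pole form: c z^n + conj(c) conj(z)^n, which is real by construction."""
    return (c * z**n + c.conjugate() * z.conjugate()**n).real

# Illustrative parameter values (assumed, not from the paper).
alpha, d, omega, phi = 1.5, 0.01, 0.3, 0.7
c = 0.5 * alpha * cmath.exp(1j * phi)   # complex amplitude
z = cmath.exp(-d + 1j * omega)          # complex pole

# Largest discrepancy between the two forms over one short frame.
max_err = max(abs(damped_sinusoid(n, alpha, d, omega, phi) - pole_form(n, c, z))
              for n in range(64))

# The physical parameters can be read back from the pole and the amplitude.
d_back, omega_back = -math.log(abs(z)), cmath.phase(z)
alpha_back, phi_back = 2.0 * abs(c), cmath.phase(c)
```

Reading the parameters back this way assumes $0 < \omega < \pi$ and $-\pi < \phi \le \pi$, so that the principal value of the complex argument is the intended one.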

Audio signals encountered in practice can be represented compactly by a sum of $W$ damped sinusoids, where

$$y_n = \sum_{k=1}^{W} \tilde{y}_n(c_k, z_k) = \sum_{k=1}^{W} \left( c_k z_k^n + c_k^* z_k^{*n} \right) \tag{2}$$

This notation highlights the Vandermonde structure of the system, which is useful for efficient calculations via fast transforms. For example, note the equivalence to the inverse Fourier transform when $z_k = e^{2\pi j k/N}$, so that $z_k^n = e^{2\pi j k n/N}$.
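To make the Vandermonde structure of (2) concrete, the sketch below builds the matrix with entries $z_k^n$ and synthesizes a frame as twice the real part of the matrix-vector product. The function names and the two example components are hypothetical and chosen only for illustration.

```python
import cmath
import math

def vandermonde(poles, num_samples):
    """Vandermonde matrix with entries V[n][k] = z_k ** n."""
    return [[z**n for z in poles] for n in range(num_samples)]

def synthesize(amplitudes, poles, num_samples):
    """Equation (2): y_n = sum_k (c_k z_k^n + conj(c_k z_k^n)) = 2 Re(sum_k c_k z_k^n)."""
    rows = vandermonde(poles, num_samples)
    return [2.0 * sum(v * c for v, c in zip(row, amplitudes)).real for row in rows]

# Two illustrative components as (alpha, d, omega, phi); assumed values.
params = [(1.0, 0.02, 0.25, 0.0), (0.5, 0.05, 0.50, 1.0)]
amplitudes = [0.5 * a * cmath.exp(1j * p) for (a, _, _, p) in params]
poles = [cmath.exp(-d + 1j * w) for (_, d, w, _) in params]

y = synthesize(amplitudes, poles, 32)

# Reference: direct sum of real damped sinusoids, equation (1).
reference = [sum(a * math.exp(-d * n) * math.cos(w * n + p) for (a, d, w, p) in params)
             for n in range(32)]
```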

Sinusoidal estimation is then equivalent to estimation of the signal poles $z_{1:W}$ and complex amplitudes $c_{1:W}$. In principle, this can be achieved by a direct approach where we minimize a cost function of the form

$$\sum_n \left| y_n - \sum_k \left( c_k z_k^n + c_k^* z_k^{*n} \right) \right|^2$$

There is some analytic structure in this model: when conditioned on the poles $z$ (i.e. the frequencies $\omega$ and log-damping coefficients $d$), the complex amplitudes $c$ can be found easily by least squares.

However, finding the poles can be hard because the objective function has many local minima. Consequently, a direct approach turns out to be computationally expensive and is applied only in low SNR conditions. Subspace methods rely on an alternative algebraic formulation of the model: using a fast singular value decomposition, one can estimate the signal poles $z_k$ and the complex amplitudes from an observed data vector $y$ [7, 2, 4].
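The conditional least-squares step mentioned above can be sketched for the simplest case of a single component ($W = 1$) with a known pole. Writing $y_n = 2\,\mathrm{Re}(c z^n)$ makes the model linear in the real and imaginary parts of $c$, so the 2-by-2 normal equations can be solved in closed form. This is an illustrative sketch only, not the subspace method used in the paper, and `fit_amplitude` is a hypothetical helper name.

```python
import cmath

def fit_amplitude(y, z):
    """Least-squares complex amplitude c for one known pole z.

    Model: y_n = 2 Re(c z^n) = c_r * a_n + c_i * b_n,
    with a_n = 2 Re(z^n) and b_n = -2 Im(z^n).
    """
    a = [2.0 * (z**n).real for n in range(len(y))]
    b = [-2.0 * (z**n).imag for n in range(len(y))]
    saa = sum(ai * ai for ai in a)
    sab = sum(ai * bi for ai, bi in zip(a, b))
    sbb = sum(bi * bi for bi in b)
    say = sum(ai * yi for ai, yi in zip(a, y))
    sby = sum(bi * yi for bi, yi in zip(b, y))
    det = saa * sbb - sab * sab          # nonzero unless the two regressors are collinear
    c_r = (say * sbb - sab * sby) / det  # Cramer's rule on the normal equations
    c_i = (saa * sby - sab * say) / det
    return complex(c_r, c_i)

# Noise-free synthetic frame with assumed parameter values.
z_true = cmath.exp(-0.01 + 0.3j)
c_true = 0.75 * cmath.exp(0.7j)
frame = [2.0 * (c_true * z_true**n).real for n in range(64)]
c_hat = fit_amplitude(frame, z_true)
```

For $W > 1$ the same idea applies with a larger linear system; the hard part, as the text notes, is finding the poles themselves.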

3. THE PROBABILITY HYPOTHESIS DENSITY (PHD) FILTER

The mathematical theory of filtering is concerned with estimating the state of a stochastic process recursively based on (partial) noisy observations up to the current time. A natural application of filtering theory is the problem of target tracking, i.e. tracking the state of a target (e.g. position, velocity, area etc.) based on detections observed from a sensor. If the state transition and observation functions are linear and the process and observation noises are Gaussian, then an optimal solution can be computed analytically in the form of a Kalman filter [9]. When the problem is extended to a multiple-target tracking scenario, where the sensor produces multiple noisy observations with false alarms and missed detections, the problem becomes much more complex. Most approaches to this problem involve associating measurements to individual filters using data association techniques such as Nearest Neighbour (NN), Joint Probabilistic Data Association (JPDA), and Multiple Hypothesis Tracking (MHT) [10]. These techniques can be difficult to implement and there is no notion of optimality.

The Probability Hypothesis Density (PHD) filter is a multiple-target filtering algorithm which recursively updates a distribution based on sets of measurements [5]. From this distribution, we can determine how many targets there are and the states of these targets. This algorithm has the advantage that it can eliminate false alarms (measurements which do not arise from a real target) and can still estimate target states when there are no measurements arising from a target. The PHD filter is a recursion which propagates the first-order moment of a multi-target posterior distribution, which is known in the tracking community as the PHD [5]. Sets of target estimates are determined by finding the peaks of the PHD. Recent studies have shown that this approach compares favourably to traditional approaches such as JPDA and MHT [11, 12].

3.1. The PHD Recursion

The PHD recursion involves predicting an intensity function of a spatial point process through a Markov transition, and updating it with a set of noisy observations of targets. Let $v_{k|k-1}$ and $v_k$ denote the respective intensities for the multi-target prediction and update steps of the recursion. The prediction equation is given by

$$v_{k|k-1}(x) = \int p_{S,k}(\zeta) f_{k|k-1}(x|\zeta) v_{k-1}(\zeta) \, d\zeta + \gamma_k(x), \tag{3}$$

where $f_{k|k-1}(\cdot|\zeta)$ is the single target transition density at time $k$, $p_{S,k}(\zeta)$ is the probability of target survival at time $k$, and $\gamma_k(\cdot)$ is the intensity of spontaneous births at time $k$. The update equation is given by

$$v_k(x) = [1 - p_{D,k}(x)] v_{k|k-1}(x) + \sum_{z \in Z_k} \frac{p_{D,k}(x) g_k(z|x) v_{k|k-1}(x)}{\kappa_k(z) + \int p_{D,k}(\xi) g_k(z|\xi) v_{k|k-1}(\xi) \, d\xi}, \tag{4}$$

where $Z_k$ is the measurement set at time $k$, $g_k(\cdot|x)$ is the single target measurement likelihood at time $k$, $p_{D,k}(x)$ is the probability of target detection at time $k$, and $\kappa_k(\cdot)$ is the intensity of clutter measurements at time $k$.

3.2. The Gaussian Mixture PHD filter

A closed-form solution to the PHD filter, called the Gaussian Mixture PHD (GM-PHD) filter, was derived by Vo and Ma [6] under linear assumptions on the system and observation equations and Gaussian process and observation noises. The PHD is approximated at each stage with a weighted mixture of Gaussians, where the means and covariances of the Gaussian components are calculated according to the Kalman filter equations [9], and the weights are calculated according to the PHD filter equations [5]. The multiple target states in the GM-PHD filter are estimated by taking the Gaussian components with the highest weights. This leads naturally to an extension of the GM-PHD filter which allows the evolution of individual target states to be determined over time, which ensures the continuity of the target tracks [12]. This has been used for tracking objects in forward-scan sonar images [13].
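The state extraction step described above can be sketched as follows. The sketch assumes that the number of targets is estimated by rounding the total mixture weight, and that the means of that many highest-weighted components are reported; this selection rule and the example mixture are assumptions made for illustration and are not specified in the text.

```python
def extract_states(components):
    """Return the means of the highest-weighted Gaussian components.

    components: list of (weight, mean, variance) tuples.
    Assumption: the target count is the rounded sum of the weights.
    """
    n_targets = int(round(sum(w for (w, _, _) in components)))
    ranked = sorted(components, key=lambda comp: comp[0], reverse=True)
    return [m for (_, m, _) in ranked[:n_targets]]

# Illustrative mixture over frequency in Hz (assumed values).
mixture = [(0.95, 440.0, 1.0), (0.05, 700.0, 9.0), (0.90, 880.0, 1.0), (0.10, 1320.0, 4.0)]
states = extract_states(mixture)
```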

Each target follows a linear Gaussian dynamical and observation model, i.e.

$$f_{k|k-1}(x|\zeta) = \mathcal{N}(x; F_{k-1}\zeta, Q_{k-1}), \tag{5}$$

$$g_k(z|x) = \mathcal{N}(z; H_k x, R_k), \tag{6}$$

where $\mathcal{N}(\cdot; m, P)$ denotes a Gaussian density with mean $m$ and covariance $P$, $F_{k-1}$ is the state transition matrix, $Q_{k-1}$ is the process noise covariance, $H_k$ is the observation matrix, and $R_k$ is the observation noise covariance. The survival and detection probabilities are state independent, i.e.

$$p_{S,k}(x) = p_{S,k}, \tag{7}$$

$$p_{D,k}(x) = p_{D,k}. \tag{8}$$

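The PHD for the birth of new targets is a Gaussian mixture of the form

$$\gamma_k(x) = \sum_{i=1}^{J_{\gamma,k}} w_{\gamma,k}^{(i)} \mathcal{N}(x; m_{\gamma,k}^{(i)}, P_{\gamma,k}^{(i)}), \tag{9}$$

where $J_{\gamma,k}$, $w_{\gamma,k}^{(i)}$, $m_{\gamma,k}^{(i)}$, $P_{\gamma,k}^{(i)}$, $i = 1, \ldots, J_{\gamma,k}$, are given model parameters that determine the birth intensity.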

GM-PHD Prediction

Suppose that the posterior intensity at time $k-1$ is a Gaussian mixture of the form

$$v_{k-1}(x) = \sum_{i=1}^{J_{k-1}} w_{k-1}^{(i)} \mathcal{N}(x; m_{k-1}^{(i)}, P_{k-1}^{(i)}). \tag{10}$$

Substituting the linear transition density $f_{k|k-1}$, the posterior intensity $v_{k-1}$ and the birth intensity $\gamma_k$ into the PHD prediction, we have

$$v_{k|k-1}(x) = p_{S,k} \sum_{i=1}^{J_{k-1}} w_{k-1}^{(i)} \mathcal{N}(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)}) + \gamma_k(x), \tag{11}$$

where the means $m_{k|k-1}^{(i)}$ and covariances $P_{k|k-1}^{(i)}$ of each component in the Gaussian mixture are given by the Kalman prediction

$$m_{k|k-1}^{(i)} = F_{k-1} m_{k-1}^{(i)}, \tag{12}$$

$$P_{k|k-1}^{(i)} = Q_{k-1} + F_{k-1} P_{k-1}^{(i)} F_{k-1}^T. \tag{13}$$

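The prediction step in (11) to (13) can be sketched for a scalar state, where $F$ and $Q$ are plain numbers and each Gaussian component is a `(weight, mean, variance)` tuple. The scalar simplification, the tuple representation and the numerical values are assumptions made to keep the example short; the general case replaces the products with matrix products.

```python
def gm_phd_predict(components, F, Q, p_survival, births):
    """GM-PHD prediction for a scalar state, equations (11)-(13).

    components, births: lists of (weight, mean, variance) tuples.
    """
    survived = [(p_survival * w,      # weight scaled by the survival probability
                 F * m,               # Kalman-predicted mean, equation (12)
                 Q + F * P * F)       # Kalman-predicted variance, equation (13)
                for (w, m, P) in components]
    # The birth intensity (9) is appended unchanged as additional components.
    return survived + list(births)

# One surviving track near 440 Hz and one birth component (assumed values).
posterior = [(1.0, 440.0, 4.0)]
births = [(0.1, 1000.0, 100.0)]
predicted = gm_phd_predict(posterior, F=1.0, Q=1.0, p_survival=0.9, births=births)
expected_count = sum(w for (w, _, _) in predicted)
```

The sum of the predicted weights is the expected number of targets before the measurements at time $k$ are taken into account.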

GM-PHD Update

Now suppose that the predicted intensity to time $k$ is a Gaussian mixture of the form

$$v_{k|k-1}(x) = \sum_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)} \mathcal{N}(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)}). \tag{14}$$

Substituting the predicted intensity $v_{k|k-1}$ and the linear Gaussian observation likelihood $g_k$ into the PHD update equation, we get

$$v_k(x) = (1 - p_{D,k}) v_{k|k-1}(x) + \sum_{z \in Z_k} \sum_{j=1}^{J_{k|k-1}} w_k^{(j)}(z) \mathcal{N}(x; m_{k|k}^{(j)}(z), P_{k|k}^{(j)}), \tag{15}$$

where the means and covariances are calculated with the Kalman update equations and the weights of the components are given by

$$w_k^{(j)}(z) = \frac{p_{D,k} w_{k|k-1}^{(j)} \mathcal{N}(z; \hat{z}_{k|k-1}^{(j)}, S_{k|k-1}^{(j)})}{\kappa_k(z) + p_{D,k} \sum_{\ell=1}^{J_{k|k-1}} w_{k|k-1}^{(\ell)} \mathcal{N}(z; \hat{z}_{k|k-1}^{(\ell)}, S_{k|k-1}^{(\ell)})}, \tag{16}$$

$$m_{k|k}^{(j)}(z) = m_{k|k-1}^{(j)} + K_k^{(j)} (z - \hat{z}_{k|k-1}^{(j)}), \tag{17}$$

$$P_{k|k}^{(j)} = [I - K_k^{(j)} H_k] P_{k|k-1}^{(j)}, \tag{18}$$

$$\hat{z}_{k|k-1}^{(j)} = H_k m_{k|k-1}^{(j)}, \tag{19}$$

$$K_k^{(j)} = P_{k|k-1}^{(j)} H_k^T (S_{k|k-1}^{(j)})^{-1}, \tag{20}$$

$$S_{k|k-1}^{(j)} = H_k P_{k|k-1}^{(j)} H_k^T + R_k. \tag{21}$$

To alleviate the computational complexity of an increasing number of Gaussian components, methods for pruning and merging are introduced [6, 14].
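The update step in (15) to (21) can be sketched in the same scalar setting, with each component stored as a `(weight, mean, variance)` tuple. The clutter intensity $\kappa_k(z)$ is assumed constant here, and `prune` is a deliberately simple stand-in that only discards low-weight components; merging is omitted, and the pruning and merging schemes of [6, 14] are not reproduced. All numerical values are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gm_phd_update(predicted, measurements, H, R, p_detect, clutter):
    """GM-PHD update for a scalar state, equations (15)-(21)."""
    # Missed-detection terms: predicted components down-weighted by (1 - p_D).
    updated = [((1.0 - p_detect) * w, m, P) for (w, m, P) in predicted]
    for z in measurements:
        terms = []
        for (w, m, P) in predicted:
            z_hat = H * m                  # predicted measurement, equation (19)
            S = H * P * H + R              # innovation variance, equation (21)
            K = P * H / S                  # Kalman gain, equation (20)
            likelihood = gaussian_pdf(z, z_hat, S)
            terms.append((p_detect * w * likelihood,
                          m + K * (z - z_hat),     # equation (17)
                          (1.0 - K * H) * P))      # equation (18)
        norm = clutter + sum(t[0] for t in terms)  # denominator of equation (16)
        updated.extend((wt / norm, m, P) for (wt, m, P) in terms)
    return updated

def prune(components, threshold=1e-5):
    """Drop components whose weight does not exceed the threshold (assumed value)."""
    return [comp for comp in components if comp[0] > threshold]

predicted = [(1.0, 440.0, 4.0)]
updated = gm_phd_update(predicted, [441.0], H=1.0, R=1.0, p_detect=0.9, clutter=1e-4)
expected_targets = sum(w for (w, _, _) in updated)
```

With one well-matched measurement, almost all of the weight moves to the measurement-updated component, whose mean is pulled towards the measurement.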

4. TRACKING RESULTS

In this section we describe the tracking model for the frequency and amplitude sinusoid tracking. Essentially we use 2-dimensional tracking in the joint frequency and amplitude domain and consider the sinusoids to be the targets we track. In the frequency domain, we use a constant position model, and in the amplitude domain we use the damping coefficient as a coefficient of the state in the prediction model. More specifically, we use linear Gaussian dynamics with the following state space model:

$$x_k = \begin{pmatrix} 1 & 0 \\ 0 & d_k \end{pmatrix} x_{k-1} + v_{k-1}, \tag{22}$$

and observation model:

$$z_k = x_k + \epsilon_k. \tag{23}$$

$v_k$ and $\epsilon_k$ are the uncorrelated process and measurement noises, respectively, and $d_k$ is the mean damping coefficient at time $k$. The state vector $x_k$ is defined to be the frequency and amplitude of the sinusoidal component, $(\omega \;\; \alpha)^T$, and the observation vector $z_k$ is defined as the noisy estimate of this, determined with the subspace method.
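A single Kalman prediction and update step under the model (22) and (23) can be sketched as below. The sketch assumes that the process and measurement noise covariances are diagonal, so the frequency and amplitude dimensions decouple and the state covariance stays diagonal; it also treats $d_k$ directly as the per-frame multiplicative factor on the amplitude. The noise variances and state values are illustrative assumptions and are not the settings used in the paper.

```python
def predict_state(mean, var, d_k, q):
    """Prediction under equation (22) with diagonal covariances.

    mean = (frequency, amplitude); var and q hold the per-dimension variances.
    """
    freq, amp = mean
    var_f, var_a = var
    q_f, q_a = q
    new_mean = (freq, d_k * amp)                    # constant frequency, damped amplitude
    new_var = (var_f + q_f, d_k * d_k * var_a + q_a)
    return new_mean, new_var

def update_state(mean, var, z, r):
    """Update under equation (23), where the observation matrix is the identity."""
    gains = [p / (p + r_i) for p, r_i in zip(var, r)]
    new_mean = tuple(m + g * (z_i - m) for m, g, z_i in zip(mean, gains, z))
    new_var = tuple((1.0 - g) * p for g, p in zip(gains, var))
    return new_mean, new_var

# Assumed values: a 440 Hz partial whose amplitude halves between frames.
pred_mean, pred_var = predict_state((440.0, 1000.0), (4.0, 100.0), d_k=0.5, q=(1.0, 1.0))
post_mean, post_var = update_state(pred_mean, pred_var, z=(441.0, 480.0), r=(5.0, 26.0))
```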

A sequence of notes was played on the piano, and the frequency, amplitude and damping of the sinusoidal components were determined with the subspace method. For each time-step, the mean of the damping coefficients was used to determine the prediction. Measurements of the sinusoidal components determined by the subspace method, consisting of the frequency and amplitude of the components, are used as input to the tracker as noisy data, $z_k$. The tracker provides estimates of the states of the targets, i.e. the amplitude and frequency, at each iteration and describes the trajectories of each individual target over time. The results of the tracking over the sequence are presented in figures 1 to 3. Figure 1 shows the 2-dimensional tracking of frequency and amplitude over time, and figures 2 and 3 show the tracking in the frequency and amplitude domains, respectively. Figure 4 shows the estimated number of sinusoids at each time-step from the PHD filter. We can clearly see from figure 4 that the estimated number of targets corresponds well to the onset and termination of the notes played. When a new note is played, the estimated number of harmonics rapidly increases, showing that the harmonics are rapidly identified. Similarly, when the note terminates, there is a corresponding dip in the estimated number of harmonics.

Figure 1: Tracking Results over time (axes: Time, Hz, Amplitude)

The identification of the correct harmonics can also be seen from figure 2, which shows the tracking in the frequency domain. Since the frequencies of the harmonics are generally constant, we can identify the harmonic tracks as the horizontal lines. The distinct notes are clear and the onset and termination of harmonics are easy to identify. Figure 3 shows the tracking in the amplitude domain, which is more challenging than in the frequency domain due to the rapid decay of the amplitude. In the amplitude domain, there is sometimes difficulty in identifying the onset of the strongest harmonics due to the large variation in the first few amplitudes. However, the tracker is able to identify most of the harmonics reasonably quickly and maintain them throughout the course of each note.
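The relationship between the estimated number of harmonics and note boundaries can be illustrated with a simple jump detector on the cardinality sequence. This is a hypothetical post-processing sketch and is not part of the method described here; the threshold `min_jump` and the example sequence are assumed values.

```python
def detect_note_events(counts, min_jump=3):
    """Flag time-steps where the estimated number of harmonics jumps.

    counts: estimated number of targets per time-step.
    Returns (onsets, offsets) as lists of time-step indices.
    """
    onsets, offsets = [], []
    for k in range(1, len(counts)):
        change = counts[k] - counts[k - 1]
        if change >= min_jump:
            onsets.append(k)       # sharp rise: candidate note onset
        elif change <= -min_jump:
            offsets.append(k)      # sharp drop: candidate note termination
    return onsets, offsets

# Illustrative cardinality sequence (assumed, not the data shown in figure 4).
counts = [0, 0, 6, 6, 5, 5, 1, 1, 7, 7]
onsets, offsets = detect_note_events(counts)
```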

5. CONCLUSIONS

The Gaussian mixture Probability Hypothesis Density filter has been applied to the problem of multiple-target tracking of sinusoidal components from harmonics of notes played on a piano. The filter uses a 2-dimensional model for tracking the frequency and amplitude of these components. The state space model assumes a constant position model for the frequency domain and uses the mean damping coefficient at each time-step to predict the amplitude of the sinusoids. The results show that the Gaussian mixture PHD filter can effectively identify the onset and termination of individual harmonics produced from notes played on a piano and maintain their identities. We have also shown that the PHD filter can be used to estimate the number of harmonics and that this information is particularly useful for identifying the times when notes begin and end. These estimates may then be incorporated in subsequent processing algorithms, such as sinusoidal coding or music transcription.

Figure 2: Estimated tracks in frequency domain (axes: Time, Frequency (Hz))

Figure 3: Estimated tracks in amplitude domain (axes: Time, Amplitude)

Figure 4: PHD filter estimated number of targets (axes: Time Step, Number of Harmonics Estimated)

6. REFERENCES

[1] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[2] M. Viberg, “Subspace-based methods for the identification of linear time-invariant systems,” Automatica, vol. 31, no. 12, pp. 1835–1851, 1995.

[3] R. Badeau, R. Boyer, and B. David, “EDS parametric modelling and tracking of audio signals,” in DAFx-02, Hamburg, Germany, September 2002.

[4] B. David, G. Richard, and R. Badeau, “An EDS modelling tool for tracking and modifying musical signals,” in Proc. of Stockholm Music Acoustics Conference, SMAC-03, Stockholm, Sweden, August 2003.

[5] R. Mahler, “Multitarget Bayes filtering via first-order multitarget moments,” IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1152–1178, 2003.

[6] B. Vo and W. K. Ma, “The Gaussian Mixture Probability Hypothesis Density Filter,” IEEE Transactions on Signal Processing, 2006.

[7] B. D. Rao and K. S. Arun, “Model based processing of signals: a state space approach,” Proceedings of the IEEE, vol. 80, no. 2, pp. 283–309, Feb 1992.

[8] T. K. Sarkar and O. Pereira, “Using the matrix pencil method to estimate the parameters of a sum of complex exponentials,” IEEE Antennas and Propagation Magazine, vol. 37, no. 1, February 1995.

[9] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.

[10] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1988.

[11] B. T. Vo, B. Vo, and A. Cantoni, “Analytic Implementations of the Cardinalized Probability Hypothesis Density Filter,” IEEE Transactions on Signal Processing, in press, 2007.

[12] D. Clark, K. Panta, and B. Vo, “The GM-PHD Filter Multiple Target Tracker,” in Proc. International Conference on Information Fusion, Florence, July 2006.

[13] D. Clark, B. Vo, and J. Bell, “GM-PHD Filter Multi-target Tracking in Sonar Images,” in Proc. SPIE Defense and Security Symposium, Orlando, Florida [6235-29], 2006.

[14] D. Clark and B. Vo, “Convergence Analysis of the Gaussian Mixture PHD Filter,” IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1204–1212, 2007.