
Variational and stochastic inference for Bayesian Source Separation

A. Taylan Cemgil$^{a,1}$, Cédric Févotte$^{b}$, Simon J. Godsill$^{a}$

$^a$ Engineering Dept, University of Cambridge, Trumpington St, CB2 1PZ, Cambridge, UK ({atc27,sjg}@eng.cam.ac.uk)

$^b$ Dept Signal-Image, Télécom Paris, 37-39 rue Dareau, 75014 Paris, France (fevotte@tsi.enst.fr)

Abstract

We tackle the general linear instantaneous model (possibly underdetermined and noisy) where we model the source prior with a Student t distribution. The conjugate-exponential characterisation of the t distribution as an infinite mixture of scaled Gaussians enables us to do efficient inference. We study two well-known inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. We derive both techniques as local message passing algorithms to highlight their algorithmic similarities and to contrast their different convergence characteristics and computational requirements. Our simulation results suggest that typical posterior distributions in source separation have multiple local maxima. Therefore we propose a hybrid approach where we explore the state space with a Gibbs sampler and then switch to a deterministic algorithm. This approach seems to be able to combine the speed of the variational approach with the robustness of the Gibbs sampler.

Key words: Source Separation, Variational Bayes, Markov Chain Monte Carlo, Gibbs sampler

1 Introduction

In Bayesian source separation (Mohammad-Djafari, 1997, 1999; Knuth, 1998, 1999; Knuth and Vaughan, 1998; Rowe, 2002, 2003), the task is to infer $N$ source signals $s_{k,n}$ given $M$ observed signals $x_{k,m}$, where $n = 1 \dots N$ and $m = 1 \dots M$. Here, $k$ is an index with $k = 1 \dots K$ that may correspond directly to time or indirectly to an expansion coefficient of the sources in a (linear) transform domain. By letting $s \equiv s_{1:K,1:N}$ and $x \equiv x_{1:K,1:M}$, a generic hierarchical Bayesian formulation of the problem can be stated as

$$p(s|x) = \frac{1}{Z_x} \int d\Theta_m\, d\Theta_p\; p(x|s, \Theta_m)\, p(s|\Theta_p)\, p(\Theta_m)\, p(\Theta_p) \qquad (1)$$

The mixing process is characterised by the (possibly degenerate, deterministic) conditional distribution $p(x|s, \Theta_m)$, which is known as the observation model. Here, $\Theta_m$ denotes the collection of mixing parameters such as the mixing matrix, observation noise variance, etc. The prior term $p(s|\Theta_p)$, the source model, describes the statistical properties of the sources via their own prior parameters $\Theta_p$. The hierarchical model is completed by postulating hyperpriors over the nuisance parameters $\Theta_m$ and $\Theta_p$. The normalisation term $Z_x = p(x)$ is the marginal probability of the data under the model, also known as the evidence (MacKay, 2003). While its actual numerical value is not directly relevant for separation, it plays a key role in model order selection, for example when the number of sources $N$ is unknown and needs to be estimated (Miskin and Mackay, 2001).

$^1$ This research is sponsored by the EPSRC grant EP/D03261X/1 "Probabilistic Modelling of Musical Audio for Machine Listening" and the MUSCLE Network of Excellence supported by the European Commission under FP6.

In this paper we assume the mixing process to be linear and instantaneous. In reality, a linear mixing process may still involve convolutions (i.e. reverberation) and may evolve with time (i.e. nonstationarity, moving sources). While such general scenarios are definitely of practical interest and pose significant computational challenges, these may be addressed in the same Bayesian framework of Eq. 1. Another common assumption is that of a-priori independent sources; i.e., the source model factorises as $p(s|\Theta_p) = \prod_n p(s_n|\Theta_p)$, which underlies the well-known Independent Component Analysis model (Hyvärinen et al., 2001; Cichocki and Amari, 2003). While the validity of the a-priori independence assumption is debatable, especially for musical audio, it is nevertheless a quite commonly made assumption, mainly due to its simplicity.

Note however that some authors have considered source models where mutual dependence is taken into account (Rowe, 2002, 2003).

Once the posterior is computed by evaluating the integral, the estimates of the sources can be obtained as a marginal maximum-a-posteriori (MMAP) or minimum-mean-square-error (MMSE) estimate

$$s^* = \arg\max_{s}\, p(s|x) \qquad\qquad \langle s \rangle_{p(s|x)} = \int s\, p(s|x)\, ds$$

Here, and elsewhere, the notation $\langle f(x) \rangle_{p(x)}$ will denote the expectation of the function $f(x)$ under the distribution $p(x)$, i.e. $\langle f(x) \rangle_{p} \equiv \int dx\, f(x)\, p(x)$.


Unfortunately, the computation of the posterior distribution p(s|x) via the integral in Eq.1 is intractable for almost all relevant observation and source models, even under conditionally Gaussian and independence assumptions.

Hence, approximate numerical integration techniques have to be employed.

In summary, the success of a practical Bayesian source separation system hinges upon the following points:

• Topology (structure) of the model and the parameter regime

• The accuracy, solution quality and the computational cost of the inference algorithm

Obviously, some tradeoff has to be made between model complexity and computational effort. However, for a particular application, the details of how to make this tradeoff are far from clear, especially given that there are many candidate models and approximate inference algorithms. Moreover, it is not uncommon that the difficulty of the inference problem is determined by the particular parameter regime, not only by model structure. For example, the difficulty of a source separation task depends upon the particular details of the mixing system where the effective rank of the mixing matrix and the number of observed data points play a critical role. This affects the convergence properties of approximate inference algorithms and success often depends upon initialisation and other design choices (such as tempering or propagation schedule).

In this paper, we will investigate some of the issues raised above. We will focus on the underdetermined case ($M < N$) with noisy mixtures. The underdetermined case is challenging because, contrary to the (over)determined case, estimating the mixing system is not sufficient for reconstruction of the sources, and the prior structure $p(s|\Theta)$ becomes increasingly important for reconstructing the sources in noisy environments.

We will follow the model proposed in (Févotte et al., 2004; Févotte and Godsill, in press), where audio sources are decomposed on an MDCT basis (a local cosine transform orthonormal basis) and the coefficient prior $p(s|\cdot)$ is a Student t distribution. This prior favours sparsity of the sources on a given dictionary, which intuitively means that the MAP solution $s^*$ or MMSE solution $\langle s \rangle$ typically contains few coefficients that are significantly non-zero. The use of source sparsity to handle underdetermined source separation finds its origin in the deterministic approaches of (Zibulevsky et al., 2001; Jourjine et al., 2000).

In (Févotte and Godsill, in press), a Gibbs sampler is derived to sample from the posterior distribution of the parameters (the mixing matrix, the source coefficients, the additive noise variance and the hyper-parameters). Minimum Mean Square Error estimates of the coefficients of the sources are then computed, and time domain estimates of the sources are reconstructed by applying the inverse MDCT. The results were reproducible, i.e., independent of initialisation, and the method could compute a perceptually reasonable separation of the sources.

The main advantage of MCMC is its generality, robustness and attractive theoretical properties. However, the method comes at the price of a heavy computational burden, which may render it impractical for certain applications. An alternative approach for computing the required integrals is based on deterministic fixed point iterations (Variational Bayes – Structured Mean Field) (Ghahramani and Beal, 2000; Wainwright and Jordan, 2003; Bishop, 2006). This set of methods has direct links with the well-known expectation-maximisation (EM) type of algorithms.

Variational methods have been extensively applied to various models for source separation by a number of authors (Attias, 1999; Lappalainen, 1999; Valpola, 2000; Girolami, 2001; Miskin and Mackay, 2001; Hojen-Sorensen et al., 2002; Chan et al., 2003; Winther and Petersen, 2006). In this contribution, we derive both VB and the Gibbs sampler as local message passing algorithms on a factor graph (Kschischang et al., 2001; Bishop, 2006). This approach has two advantages: first, it provides a pedagogical perspective to highlight the algorithmic similarities and to contrast objectively the convergence characteristics and computational requirements of different inference algorithms. Secondly, the framework facilitates generalisation to more complex models and automation of the code generation procedure, including alternative message passing methods.

2 Model

The source separation model defines a generative model for the source coefficients. The model is applicable to the actual time series $\hat{x}_{t,m}$ for $t = 1 \dots T$ observed at channel $m$, as well as when transformed by an orthogonal basis to a new series $x_{k,m}$ where the new index $k$ typically encodes both the frequency band and frame number (a particular tile in a time-frequency plane).

We assume linear instantaneous mixing$^2$:

$$x_{k,m} \,|\, s_{k,1:N}, a_m, r_m \;\sim\; \mathcal{N}(x_{k,m};\; a_m^\top s_{k,1:N},\; r_m)$$

where $s_{k,1:N}$ is the vector of source coefficients, $a_m$ is an $N \times 1$ vector that denotes the mixing proportions for the $m$'th channel, and $r_m$ is the variance of the observation noise in the $m$'th channel. Since both of these quantities are typically not known exactly, we place informative priors on them$^3$

$$r_m \sim \mathcal{IG}(r_m;\; a_r, b_r) \qquad\qquad a_m \sim \mathcal{N}(a_m;\; \mu_a, P_a)$$

$^2$ $\mathcal{N}$, $\mathcal{IG}$ and $\mathcal{G}$ denote Gaussian (Normal), Inverse Gamma and Gamma distributions respectively and are defined in the appendix.

In many domains (audio, natural images), the distribution of source coefficients $s_{k,n}$ tends to be heavy tailed. For example, in typical audio signals this is a simple consequence of the fact that the vast majority of physical systems (the human vocal tract, many acoustical musical instruments, etc.) produce quasi-periodic oscillations. Hence, over short time scales, the energy of signals from these sources is concentrated in a few narrow frequency bands, which enables a sparse representation using just a few expansion coefficients. In this respect, a sparse prior can be justified by the underlying physics. We assume the source coefficients to be zero mean Gaussian

$$s_{k,n} \,|\, v_{k,n} \;\sim\; \mathcal{N}(s_{k,n};\; 0,\; v_{k,n})$$

where $v_{k,n}$ is an unknown variance. Note that the variance $v_{k,n}$ (for the zero mean case) is equal to the expected amount of energy. Since this quantity is in general unknown, we place an informative prior on it as

$$v_{k,n} \,|\, \lambda_n \;\sim\; \mathcal{IG}(v_{k,n};\; \nu/2,\; 2/(\nu\lambda_n))$$

We note that this hierarchical prior model on the source coefficients is equivalent to a t-distribution

$$\mathcal{T}_\nu(s;\, \mu, \lambda) \;=\; \int dv\; \mathcal{N}(s;\, \mu, v)\; \mathcal{IG}\!\left(v;\; \frac{\nu}{2},\; \frac{2}{\nu\lambda}\right) \;\equiv\; \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi\lambda}\;\Gamma(\nu/2)} \left(1 + \frac{1}{\nu}\,\frac{(s-\mu)^2}{\lambda}\right)^{-(\nu+1)/2}$$

where the parameter $\lambda_n$ directly controls the variance of the t-distribution and hence the power of the $n$'th source. We place an informative prior on $\lambda_n$ as

$$\lambda_n \;\sim\; \mathcal{G}(\lambda_n;\; a_\lambda, b_\lambda)$$

In this paper the degrees of freedom $\nu$ is fixed to a low value, yielding a distribution with heavy tails to model sparsity. However, any estimation procedure for the degrees of freedom of an inverse Gamma distribution could be employed to learn $\nu$ as well. As such, in (Cemgil et al., 2005), we have considered a lower bound maximisation.

$^3$ Note that one can break the source separation gain indeterminacy $(ca)^\top s = a^\top (cs)$ by placing appropriate priors on the mixing proportions $a$ or source coefficients $s$. The posterior then becomes only invariant to permutations of the sources.


[Figure 1: plate diagram with nodes $\lambda_n$, $v_{k,n}$, $s_{k,n}$, $x_{k,m}$, $a_m$ and $r_m$, annotated with the conditional distributions given in the text.]

Fig. 1. Graphical model for the source separation model. The rectangle denotes a plate, i.e. $K$ copies of the nodes inside.

Similarly, an empirical adaptive scheme is considered in (Févotte and Godsill, in press), and a Metropolis-Hastings procedure is proposed in (Moussaoui et al., to appear). The graphical model of the full model is shown in Figure 1.
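To make the generative structure above concrete, the following is a minimal sketch (ours, not the authors' code) of ancestral sampling from the hierarchical model in Python/NumPy; the hyper-parameter names follow the text and all numerical settings are arbitrary.

```python
# A minimal sketch (ours) of ancestral sampling from the hierarchical model of
# Section 2. Parametrisations follow Appendix A: G(a, b) has mean a*b and
# IG(a, b) is the reciprocal of a Gamma draw with shape a and scale b.
import numpy as np

rng = np.random.default_rng(0)
K, N, M = 200, 3, 2                                     # coefficients, sources, channels
a_lam, b_lam, a_r, b_r, nu = 2.0, 10.0, 2.0, 1.0, 1.0   # hyper-parameters (arbitrary)

lam = rng.gamma(shape=a_lam, scale=b_lam, size=N)       # lambda_n ~ G(a_lam, b_lam)
v = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / (nu * lam), size=(K, N))  # v_{k,n} ~ IG(nu/2, 2/(nu*lam_n))
s = np.sqrt(v) * rng.standard_normal((K, N))            # s_{k,n} ~ N(0, v_{k,n})

A = rng.standard_normal((M, N))                         # a_m ~ N(mu_a, P_a), here mu_a = 0, P_a = I
r = 1.0 / rng.gamma(shape=a_r, scale=b_r, size=M)       # r_m ~ IG(a_r, b_r)
x = s @ A.T + np.sqrt(r) * rng.standard_normal((K, M))  # x_{k,m} ~ N(a_m^T s_{k,1:N}, r_m)
```

Marginally over $v_{k,n}$ (given $\lambda_n$), each coefficient $s_{k,n}$ drawn this way follows the Student t distribution above; for $\lambda_n = 1$ this can be checked empirically against, e.g., NumPy's `standard_t` draws with $\nu$ degrees of freedom.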

3 Inference

The model described in the previous section is an instance of the basic Bayesian source separation model stated in (1), where the mixing model parameters are $\Theta_m = \{a_{1:M}, r_{1:M}\} \equiv \{A, R\}$ and the source prior parameters are $\Theta_p = \{v_{1:K,1:N}, \lambda_{1:N}\} \equiv \{v, \lambda\}$. We will refer to all parameters by $\Theta \equiv (\Theta_p, \Theta_m)$. The remaining hyper-parameters $a_\lambda, b_\lambda, a_r, b_r, \mu_a, P_a$ and $\nu$ are assumed to be known. Formally, we need to compute the posterior distribution

$$p(s|x) = \frac{1}{Z_x} \int \phi_x(s, A, R, v, \lambda)\; dA\, dR\, dv\, d\lambda \qquad (2)$$

where $\phi_x$ is the joint distribution evaluated at the observed $x$, given as


$$\phi_x(s, A, R, v, \lambda) \;\equiv\; p(x|s, A, R)\, p(s|v)\, p(A)\, p(R)\, p(v|\lambda)\, p(\lambda)$$
$$= \prod_{k=1}^{K} \left( \prod_{n=1}^{N} p(v_{k,n}|\lambda_n)\, p(s_{k,n}|v_{k,n}) \right) \left( \prod_{m=1}^{M} p(x_{k,m}|s_{k,1:N}, a_m, r_m) \right) \times \left( \prod_{m=1}^{M} p(a_m)\, p(r_m) \right) \left( \prod_{n=1}^{N} p(\lambda_n) \right)$$

and estimate the sources by, e.g., MMSE, given as $\langle s \rangle_{p(s|x)}$. Unfortunately, the required integral cannot be computed exactly and numerical techniques have to be employed.

In the following two sections, we will review two alternative approaches for numerical computation: a stochastic method (Markov Chain Monte Carlo) and a deterministic method (Structured Mean field - Variational Bayes). Both methods are well understood and a vast amount of literature is available on both subjects. In this paper, we will introduce both techniques in the context of source separation to highlight the algorithmic similarities in order to facilitate a comparison.

3.1 Markov Chain Monte Carlo

Suppose we could generate $I$ independent samples $(s^{(i)}, \Theta^{(i)})$, $i = 1 \dots I$, from the joint posterior

$$\frac{1}{Z_x}\,\phi_x(s, \Theta) \;\equiv\; \frac{1}{Z_x}\,\phi_x(s, A, R, v, \lambda)$$

Then, the intractable integral can be trivially approximated by simply discarding $\Theta^{(i)}$ and the sources could be recovered by

$$\langle s \rangle \;\approx\; \frac{1}{I} \sum_{i=1}^{I} s^{(i)} \qquad (3)$$

Unfortunately, generating independent samples from the joint posterior is a difficult task. On the other hand, it is usually easier to generate dependent samples, that is, we generate $(s^{(i+1)}, \Theta^{(i+1)})$ by making use of $(s^{(i)}, \Theta^{(i)})$. Perhaps surprisingly, even though subsequent samples are correlated (and provided certain ergodicity conditions are satisfied), Eq. 3 remains valid and the estimated quantities converge to their true values as the number of samples $I$ goes to infinity (Gilks et al., 1996).

A sequence of dependent samples $(s^{(i)}, \Theta^{(i)})$ is generated by sampling from a Markov chain that has the desired joint posterior $\phi_x(s, \Theta)/Z_x$ as its stationary distribution. The chain is defined by a collection of transition probabilities, i.e., a transition kernel $T(s^{(i+1)}, \Theta^{(i+1)} \,|\, s^{(i)}, \Theta^{(i)})$. The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) provides a simple way of defining an ergodic kernel that has the desired stationary distribution. Suppose we have a sample $(s^{(i)}, \Theta^{(i)})$. A new candidate $(s', \Theta')$ is generated by sampling from a proposal distribution $q(s, \Theta \,|\, s^{(i)}, \Theta^{(i)})$. We define the acceptance probability $a$ by

$$a(s', \Theta' \leftarrow s, \Theta) \;=\; \min\!\left( 1,\; \frac{\phi_x(s', \Theta')/Z_x}{\phi_x(s, \Theta)/Z_x}\; \frac{q(s, \Theta \,|\, s', \Theta')}{q(s', \Theta' \,|\, s, \Theta)} \right)$$

The new candidate $(s', \Theta')$ is accepted as the next sample $(s^{(i+1)}, \Theta^{(i+1)})$ with probability $a$, otherwise $(s^{(i+1)}, \Theta^{(i+1)}) \leftarrow (s^{(i)}, \Theta^{(i)})$. The algorithm is initialised by generating the first sample $(s^{(0)}, \Theta^{(0)})$ according to an (arbitrary) proposal distribution.
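In code, a single Metropolis-Hastings step can be written as follows. This is a generic sketch of ours (not from the paper), working in the log domain so that $Z_x$ cancels; `log_phi` is any unnormalised log-target and `propose` and `log_q` are hypothetical user-supplied proposal routines.

```python
import numpy as np

def mh_step(theta, log_phi, propose, log_q, rng):
    """One Metropolis-Hastings transition in the log domain."""
    theta_new = propose(theta, rng)                      # draw a candidate
    log_a = (log_phi(theta_new) - log_phi(theta)         # unnormalised target ratio
             + log_q(theta, theta_new)                   # q(old | new)
             - log_q(theta_new, theta))                  # q(new | old)
    if np.log(rng.uniform()) < min(0.0, log_a):          # accept with probability a
        return theta_new
    return theta                                         # otherwise keep the current state
```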

However, for a given transition kernel $T$, it is hard to assess the time required to converge to the stationary distribution, so in practice one has to run the simulation until a very large number of samples has been obtained (Roberts and Rosenthal, 1998). The choice of the proposal distribution $q$ is also very critical. A poor choice may lead to the rejection of many new candidates and consequently to very slow convergence to the stationary distribution. In the context of source separation, where the dimension is large, it is not easy to design a proposal that makes global jumps, as suggested by $T(s^{(i+1)}, \Theta^{(i+1)} \,|\, s^{(i)}, \Theta^{(i)})$, where a whole block of variables is updated in parallel.

3.2 Gibbs sampling

A simpler approach is to sample the variables one by one, or in smaller, mutually disjoint blocks. For this purpose, we group all variables into mutually exclusive sets $C_\alpha$, which we call clusters. The subscript $\alpha = 1 \dots A$ denotes the unique cluster index and $\mathcal{C} = \{C_1, \dots, C_\alpha, \dots, C_A\}$ is the set of all clusters.

Often, there is a huge number of choices for grouping the variables. A natural choice for the source separation problem is$^4$

$$\mathcal{C} = \{\, s_{1,1:N}, \dots, s_{K,1:N},\; a_1, \dots, a_M,\; r_1, \dots, r_M,\; v_{1,1}, \dots, v_{k,n}, \dots, v_{K,N},\; \lambda_1, \dots, \lambda_N \,\} \qquad (4)$$

$^4$ Note that in this choice, for each $k$, we cluster the sources $s_{k,1}, \dots, s_{k,N}$ jointly as $s_{k,1:N}$. This particular blocking leads to better convergence because whilst $s_{k,1:N}$ are a-priori independent, clearly they are a-posteriori dependent given $x_{k,1:M}$.


Given a clustering, one can apply a more specialised Markov chain Monte Carlo (MCMC) algorithm, the Gibbs sampler, by choosing the proposal density as the full conditional distributions $p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha})$, where $\mathcal{C}_{\neg\alpha} \equiv \mathcal{C} \setminus C_\alpha$, i.e., all clusters but $C_\alpha$. It also turns out that the acceptance probability becomes $a = 1$, so all proposals are accepted by default.

To complete the algorithm, we need to decide upon the order in which we will visit the individual clusters $\alpha$. We will call such a sequence a propagation schedule $\alpha = \pi(j)$, where $j = 1 \dots AI$.

To generate a new candidate for a cluster $C_\alpha$, we freeze the configuration of the remaining clusters $\mathcal{C}_{\neg\alpha}^{(j-1)}$ (where $\alpha = \pi(j)$) and compute the full conditional distribution $p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \propto \phi_x(C_\alpha, \mathcal{C}_{\neg\alpha}^{(j-1)})$.

Theoretically, the actual order is not important and can be randomised, provided that each cluster is visited infinitely often in the limit when $j \to \infty$. In practice however, given limited computational resources, a particular propagation schedule may be favourable, since it may converge faster. Typically, periodic propagation schedules are employed where each cluster is visited once according to some permutation. The permutation is repeated $I$ times, so each cluster is visited $I$ times by the end of the simulation. It is also common practice to let the chain run for an initial burn-in period where each cluster is visited $I_0$ times without computing any statistics.

The key observation is that, given a clustering of the variables $\mathcal{C}$, the joint posterior admits the factorisation

$$\phi_x(\mathcal{C}) \;=\; \prod_{\beta} \phi_{x,\beta}\big(\{C_\alpha\}_{\alpha \in \mathrm{neigh}(\beta)}\big) \qquad (5)$$

Here, the $\phi_{x,\beta}$ are positive and possibly unnormalised functions. This expression says that if cluster $\alpha$ appears in factor $\phi_{x,\beta}$, we have $\alpha \in \mathrm{neigh}(\beta)$, and the joint can be written as a product of such terms$^5$.

The structure of the factorisation can be conveniently visualised using a factor graph, see Fig. 2 (Kschischang et al., 2001). Each black square corresponds to a factor $\phi_{x,\beta}$, where $\phi_x = \prod_\beta \phi_{x,\beta}$, and the circles denote the individual cluster variables $C_\alpha$. A factor $\phi_{x,\beta}$ is connected to a cluster node $\alpha$ if $\alpha \in \mathrm{neigh}(\beta)$, i.e. a variable in $C_\alpha$ appears in the defining expression of $\phi_{x,\beta}$. Obviously the relation is symmetric: a cluster $\alpha$ is connected to $\phi_{x,\beta}$ when $\beta \in \mathrm{neigh}(\alpha)$.

$^5$ As an example, suppose $\phi(s, v) = p(s|v)p(v) = \mathcal{N}(s; 0, v)\,\mathcal{IG}(v; 1, 1)$ and we choose $\mathcal{C} = \{C_1, C_2\} = \{\{v\}, \{s\}\}$. Then
$$\phi(s, v) \equiv \exp\!\left( -\frac{1}{2}\frac{s^2}{v} - \frac{1}{2}\log 2\pi v \right) \exp\!\left( -2\log v - \frac{1}{v} \right) \propto \exp\!\left( -\frac{1}{2}\frac{s^2}{v} \right) \exp\!\left( -\frac{5}{2}\log v - \frac{1}{v} \right) = \phi_1(s, v)\,\phi_2(v) \equiv \prod_{\beta=1}^{2} \phi_\beta$$
and we would have $\mathrm{neigh}(\beta = 1) = \{1, 2\}$ and $\mathrm{neigh}(\beta = 2) = \{1\}$.

For a given cluster $\alpha$, we could write Eq. (5) as

$$\log \phi_x(\mathcal{C}) = \sum_{\beta \in \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \;+\; \sum_{\beta \notin \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \qquad (6)$$

$$p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \;\propto\; \phi_x(C_\alpha;\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \;=\; \exp\!\left( \sum_{\beta \in \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \right) \qquad (7)$$

Given the configuration $\mathcal{C}_{\neg\alpha}^{(j-1)}$, the second term in Eq. 6 does not depend on $\alpha$, hence only a small subset of the clusters in $\neg\alpha$ actually affects the expression of the full conditional. This subset, which is known as the Markov blanket of $\alpha$, will be denoted by $\widehat{\neg\alpha}$. The Markov blanket consists of nodes that share a common factor with $\alpha$, i.e. $\widehat{\neg\alpha} \equiv \{\alpha' : \alpha' \in \neg\alpha \,\wedge\, \exists\beta\ \mathrm{s.t.}\ (\beta \in \mathrm{neigh}(\alpha') \wedge \beta \in \mathrm{neigh}(\alpha))\}$. For example, in Fig. 2, the Markov blanket of $\{v_{k,n}\}$ consists of $\{\{\lambda_n\}, \{s_{k,1:N}\}\}$.
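The neighbourhood and Markov blanket structure is easy to compute mechanically from the factorisation; the following small sketch (ours, using the hypothetical two-factor graph of footnote 5) shows the bookkeeping that both the Gibbs sampler and VB rely on.

```python
from collections import defaultdict

# factor -> clusters it touches; the toy graph of footnote 5:
# phi_1(s, v) touches both clusters, phi_2(v) touches only {v}
neigh_factor = {"phi1": {"s", "v"}, "phi2": {"v"}}

neigh_cluster = defaultdict(set)        # cluster -> factors touching it
for beta, clusters in neigh_factor.items():
    for alpha in clusters:
        neigh_cluster[alpha].add(beta)

def markov_blanket(alpha):
    """All clusters sharing at least one factor with alpha, excluding alpha itself."""
    blanket = set()
    for beta in neigh_cluster[alpha]:
        blanket |= neigh_factor[beta]
    blanket.discard(alpha)
    return blanket

print(markov_blanket("v"))              # -> {'s'}
```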

This simplification reduces the amount of computation required for computing the full conditional distribution. A further computational simplification is obtained when the expression for the full conditional corresponds to a known distribution. One way to ensure this is to choose the hierarchical model such that all priors are conjugate-exponential$^6$. In this case, all conditional marginals have a closed form solution. The Gibbs sampling algorithm is summarised below:

• Initialise (or burn in): For $\alpha = 1 \dots A$, $C_\alpha^{(0)} \sim q(C_\alpha)$

• Sample the chain (typically $J = AI$): For $j = 1 \dots J$: $\alpha = \pi(j)$, $C_\alpha^{(j)} \sim p(C_\alpha \,|\, \mathcal{C}_{\widehat{\neg\alpha}}^{(j-1)})$

• Estimate: $\langle C_\alpha \rangle \approx \dfrac{1}{I} \displaystyle\sum_{j\ \mathrm{s.t.}\ \pi(j) = \alpha} C_\alpha^{(j)}$
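For the single-channel de-noising case studied in Section 4 ($N = M = 1$, $A = 1$) all full conditionals are conjugate, and the whole sweep fits in a few lines. The sketch below is ours (not the authors' code); it uses the parametrisations of Appendix A, initialises roughly at prior means, and returns the MMSE estimate of Eq. (3).

```python
import numpy as np

def gibbs_denoise(x, nu=0.5, a_lam=0.5, b_lam=10.0, a_r=0.5, b_r=10.0,
                  n_iter=1000, burn_in=100, rng=None):
    """Gibbs sampler for x_k = s_k + noise with a Student-t (scale mixture) prior on s_k."""
    rng = rng or np.random.default_rng()
    K = len(x)
    lam, r, v = a_lam * b_lam, 1.0, np.ones(K)           # simple initialisation
    s_sum, n_kept = np.zeros(K), 0
    for it in range(n_iter):
        # s_k | v_k, r, x_k  ~  N(m_k, 1/prec_k)  with  prec_k = 1/r + 1/v_k
        prec = 1.0 / r + 1.0 / v
        m = (x / r) / prec
        s = m + rng.standard_normal(K) / np.sqrt(prec)
        # v_k | s_k, lam  ~  IG((nu+1)/2, 2/(nu*lam + s_k^2))
        v = 1.0 / rng.gamma((nu + 1) / 2, 2.0 / (nu * lam + s ** 2))
        # r | s, x  ~  IG(a_r + K/2, 1/(1/b_r + sum_k (x_k - s_k)^2 / 2))
        r = 1.0 / rng.gamma(a_r + K / 2,
                            1.0 / (1.0 / b_r + 0.5 * np.sum((x - s) ** 2)))
        # lam | v  ~  G(a_lam + K*nu/2, 1/(1/b_lam + (nu/2) * sum_k 1/v_k))
        lam = rng.gamma(a_lam + K * nu / 2,
                        1.0 / (1.0 / b_lam + 0.5 * nu * np.sum(1.0 / v)))
        if it >= burn_in:                                # accumulate the MMSE estimate, Eq. (3)
            s_sum += s
            n_kept += 1
    return s_sum / n_kept
```

With data generated as in the sketch of Section 2 (restricted to $N = M = 1$, $A = 1$), `s_rec = gibbs_denoise(x)` gives the posterior mean estimate that is compared against $s_{\text{orig}}$ in the SNR comparison of Section 4.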

$^6$ Consider the example from the previous footnote:
$$p(v \,|\, s = \hat{s}) \propto \phi(\hat{s}, v) \propto \exp\!\left( -\frac{1}{2}\frac{\hat{s}^2}{v} \right) \exp\!\left( -\frac{5}{2}\log v - \frac{1}{v} \right) = \exp\!\left( -\frac{5}{2}\log v - \left( 1 + \frac{1}{2}\hat{s}^2 \right)\frac{1}{v} \right) \propto \mathcal{IG}\big(v;\, 3/2,\; 2/(2 + \hat{s}^2)\big)$$
The full conditional $p(v \,|\, s = \hat{s})$ has the same form as the prior $p(v)$.


[Figure 2: factor graph with factor nodes (black squares) connecting the cluster variables $\lambda_n$, $v_{k,n}$, $s_{k,1:N}$, $a_m$ and $r_m$.]

Fig. 2. Factor graph corresponding to the clustering of the variables as defined in Eq. 4.

3.3 Structured Mean Field – Variational Bayes

An alternative approximation method that leads to an iterative optimisation procedure is the structured mean field method, also known as variational Bayes, see (Wiegerinck, 2000; Ghahramani and Beal, 2000; Wainwright and Jordan, 2003; MacKay, 2003; Bishop, 2006) and references therein. In our case, mean field boils down to approximating the integrand $\mathcal{P} = \phi_x / Z_x$ in (2) with a simple distribution $\mathcal{Q}$ in such a way that the integral becomes tractable. An intuitive interpretation of the mean field method is minimising the KL divergence with respect to (the parameters of) $\mathcal{Q}$, where

$$KL(\mathcal{Q} \,\|\, \mathcal{P}) \;=\; \langle \log \mathcal{Q} \rangle_{\mathcal{Q}} \;-\; \left\langle \log \frac{1}{Z_x}\phi_x \right\rangle_{\mathcal{Q}} \qquad (8)$$

Using the non-negativity of KL (Cover and Thomas, 1991), we obtain a lower bound on the evidence

$$\log Z_x \;\geq\; \langle \log \phi_x \rangle_{\mathcal{Q}} - \langle \log \mathcal{Q} \rangle_{\mathcal{Q}} \;\equiv\; \mathcal{F}[\mathcal{P}, \mathcal{Q}] + H[\mathcal{Q}]$$

Here, $\mathcal{F}$ is interpreted as a negative energy term and $H[\mathcal{Q}]$ is the entropy of the approximating distribution. The maximisation of this lower bound is equivalent to finding the "nearest" $\mathcal{Q}$ to $\mathcal{P}$ in terms of KL divergence, and this solution is obtained by a joint maximisation of the entropy $H$ and $\mathcal{F}$ (minimisation of the energy) (Neal and Hinton, 1999).

In this paper, we choose a factorised approximating distribution $\mathcal{Q}$ that respects a factorisation of the form

$$\mathcal{Q} = \prod_{\alpha} \mathcal{Q}_\alpha(C_\alpha)$$

where the clusters are defined as in (4). Although a closed form solution for $\mathcal{Q}$ still cannot be found, it can easily be shown, e.g. see (Wainwright and Jordan, 2003; Winn and Bishop, 2005), that each factor $\mathcal{Q}_\alpha$ of the optimal approximating distribution should satisfy the following fixed point equation

$$\mathcal{Q}_\alpha \;\propto\; \exp\!\big( \langle \log \phi_x \rangle_{\mathcal{Q}_{\neg\alpha}} \big) \qquad (9)$$

where $\mathcal{Q}_{\neg\alpha} \equiv \mathcal{Q} / \mathcal{Q}_\alpha$, that is, the joint distribution of all factors excluding $\mathcal{Q}_\alpha$. Hence, the mean field approach leads to a set of (deterministic) fixed point equations that need to be iterated until convergence.

The right hand side of this fixed point equation can be computed efficiently, since the joint posterior admits the factorisation in (5), which translates to

$$\langle \log \phi_x \rangle_{\mathcal{Q}_{\neg\alpha}} \;=\; \sum_{\beta} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\neg\alpha}} \;\stackrel{+}{=}\; \sum_{\beta \in \mathrm{neigh}(\alpha)} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\mathrm{neigh}(\beta)\setminus\alpha}}$$

where $f(x) \stackrel{+}{=} g(x)$ denotes equality up to an irrelevant constant $c$, i.e., $f(x) = g(x) + c$. The expectations $\langle \log \phi_{x,\beta} \rangle$ can be computed easily if all distributions are chosen to be in a conjugate-exponential family, for example see (Ghahramani and Beal, 2000). One additional nice feature of this fixed point iteration is that, at every step, the lower bound on $\log Z_x$ defined above is guaranteed to increase. This feature is useful in ensuring that an implementation is bug-free.

The Variational Bayes algorithm is summarised below:

• Initialise: For $\alpha = 1 \dots A$, set $\mathcal{Q}_\alpha^{(0)}$

• Iterate the fixed point equations (typically $J = AI$): For $j = 1 \dots J$: $\alpha = \pi(j)$,
$$\mathcal{Q}_\alpha^{(j)} \;\propto\; \exp\!\left( \sum_{\beta \in \mathrm{neigh}(\alpha)} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}^{(j-1)}_{\mathrm{neigh}(\beta)\setminus\alpha}} \right)$$

• Estimate: $\langle C_\alpha \rangle_{\mathcal{P}} \approx \langle C_\alpha \rangle_{\mathcal{Q}^{(J)}}$
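For the de-noising model of Section 4 this fixed point sweep has exactly the same structure as the Gibbs sweep sketched earlier, except that each node stores sufficient statistics and the draws are replaced by expectations ($\langle 1/v \rangle = ab$ for an $\mathcal{IG}(a,b)$ factor and $\langle \lambda \rangle = ab$ for a $\mathcal{G}(a,b)$ factor, Appendix A). Again, this is a minimal sketch of ours rather than the authors' implementation.

```python
import numpy as np

def vb_denoise(x, nu=0.5, a_lam=0.5, b_lam=10.0, a_r=0.5, b_r=10.0, n_iter=100):
    """Structured mean field / VB fixed point iterations for x_k = s_k + noise."""
    K = len(x)
    E_inv_v = np.ones(K)                 # <1/v_k> under Q(v_k)
    E_inv_r, E_lam = 1.0, a_lam * b_lam  # <1/r>, <lambda>
    for _ in range(n_iter):
        # Q(s_k) = N(m_k, S_k), with precision <1/r> + <1/v_k>
        prec = E_inv_r + E_inv_v
        S = 1.0 / prec
        m = E_inv_r * x * S
        E_s2 = S + m ** 2                                             # <s_k^2>
        # Q(v_k) = IG((nu+1)/2, 2/(nu*<lam> + <s_k^2>))
        E_inv_v = ((nu + 1) / 2) * (2.0 / (nu * E_lam + E_s2))
        # Q(r) = IG(a_r + K/2, 1/(1/b_r + sum_k <(x_k - s_k)^2>/2))
        E_inv_r = (a_r + K / 2) / (1.0 / b_r + 0.5 * np.sum((x - m) ** 2 + S))
        # Q(lam) = G(a_lam + K*nu/2, 1/(1/b_lam + (nu/2) * sum_k <1/v_k>))
        E_lam = (a_lam + K * nu / 2) / (1.0 / b_lam + 0.5 * nu * np.sum(E_inv_v))
    return m                              # approximate posterior mean <s>_Q
```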

It is important to note the similarity between the Gibbs sampler and variational Bayes. Given the factor graph of Fig. 2, both algorithms have a simple "visual" interpretation. Remember that the factor nodes $\beta \in \mathrm{neigh}(\alpha)$ (black squares in Fig. 2) define local compatibility functions between $\alpha$ and its Markov blanket $\alpha' \in \widehat{\neg\alpha}$.

In the Gibbs sampler, on each cluster node $\alpha$, we store a configuration. If we wish to sample the cluster node $\alpha$ at step $j$ of the simulation, all we need is the current configuration of the Markov blanket $\widehat{\neg\alpha}^{(j-1)}$, and we need to evaluate $\phi_{x,\beta}$ for $\beta \in \mathrm{neigh}(\alpha)$ pointwise. When the nodes are chosen as conjugate-exponential, the expression for this full conditional density is already available in closed form as a function of the configuration of the Markov blanket.

In variational Bayes, on each cluster node $\alpha$, we store sufficient statistics. If we wish to update (the sufficient statistics of) the approximating distribution $\mathcal{Q}_\alpha$, all we need is the current sufficient statistics of $\mathcal{Q}_{\widehat{\neg\alpha}}$, and we need to evaluate the expectations $\langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\widehat{\neg\alpha}}}$ for $\beta \in \mathrm{neigh}(\alpha)$. Again, if the nodes are chosen as conjugate-exponential, the expression for $\mathcal{Q}_\alpha$ is already available in closed form as a function of the sufficient statistics of its Markov blanket$^7$.

3.4 Tempering

As illustrated above, while the underlying theoretical principles are different, both the Gibbs sampler and variational Bayes share the same local updating strategy, and both can suffer from slow convergence or get stuck in a local mode of the posterior. One general strategy to circumvent these problems is tempering$^8$. The idea is to define a sequence of target distributions

$$\phi_x(\tau_1), \dots, \phi_x(\tau_j), \dots \;\to\; \phi_x$$

that converge to the desired posterior. The sequence $\tau_j$ is referred to as a tempering schedule. The idea is to use the samples or sufficient statistics obtained from $\phi_x(\tau_{j-1})$ as the starting point for $\phi_x(\tau_j)$. In general, the schedule can be quite arbitrary, but one simple strategy is to define a sequence $0 \leq \tau_1 \leq \dots \leq \tau_j \leq \dots \leq 1$ and to raise the posterior to the power $\tau_j$, i.e. $\phi_x(\tau_j) = \phi_x^{\tau_j}$. When $\tau$ is near zero, the posterior will be flatter and the modes will not be very pronounced.

$^7$ Consider the example from the first footnote and assume $\mathcal{Q}_1(v) = \mathcal{IG}(v; a_v, b_v)$ and $\mathcal{Q}_2(s) = \mathcal{N}(s; m_s, S_s)$, where $a_v, b_v, m_s, S_s$ are variational parameters. By Eq. 9, we have the following fixed point equations:
$$\log \mathcal{Q}_1(v) \stackrel{+}{=} \langle \log \phi_1(s, v)\phi_2(v) \rangle_{\mathcal{Q}_2(s)} = \langle \log \phi_1(s, v) \rangle_{\mathcal{Q}_2(s)} + \log \phi_2(v) = -\frac{1}{2}\langle s^2 \rangle \frac{1}{v} - \frac{5}{2}\log v - \frac{1}{v} \;\stackrel{+}{=}\; \log \mathcal{IG}\big(v;\, 3/2,\; 2/(2 + S_s + m_s^2)\big)$$
$$\log \mathcal{Q}_2(s) \stackrel{+}{=} \langle \log \phi_1(s, v)\phi_2(v) \rangle_{\mathcal{Q}_1(v)} \stackrel{+}{=} \langle \log \phi_1(s, v) \rangle_{\mathcal{Q}_1(v)} = -\frac{1}{2} s^2 \left\langle \frac{1}{v} \right\rangle \;\stackrel{+}{=}\; \log \mathcal{N}\big(s;\, 0,\; 1/(a_v b_v)\big)$$

$^8$ Different communities use different terms for this idea, such as bridging, annealing or overrelaxation. We prefer to reserve the term simulated annealing for the particular case where the inverse temperature $\tau$ goes to infinity to locate the modes of the posterior.

4 Simulation Results

The first simulation is designed to illustrate both algorithms on a simpler problem where there is only one source (N = 1) that is observed through a single noisy observation channel (M = 1). This problem is in fact a de-noising problem where we wish to “separate” a single source from Gaussian noise.

We sample 200 independent problem instances from the model with $K = 180$ data points, using hyper-parameters $(a_\lambda, b_\lambda, a_r, b_r, \nu) = (0.5, 10, 0.5, 10, 0.5)$. The mixing matrix is set to $A = a_1 = 1$ and is also assumed to be known.

For each problem instance $(x, s, v, r, \lambda)_{\text{orig}}$, we infer the posterior mean of the sources $s_{\text{rec}} = \langle s \rangle_{p(s|x_{\text{orig}})}$ and compare it with $s_{\text{orig}}$. The parameter $\lambda_{\text{orig}}$ defines the scale of the source and $r_{\text{orig}}$ denotes the variance of the observation noise.

The difficulty of de-noising depends upon the signal to noise ratio λ/r, where cases with low λ/r are typically more difficult to de-noise.

For each of the 200 instances, we initialise the parameters as follows: prior means for Gibbs and prior sufficient statistics for VB. We run both algorithms using the same propagation schedule where we update the parameters periodically with the order $\{s, v, r, \lambda\}$. We run the Gibbs sampler for $I = 1000$ iterations with a burn-in period of $I_0 = 100$ iterations. The variational algorithm, in contrast, is only run for $I = 100$ iterations, which seems to be sufficient to ensure its convergence.

The results for two typical problem instances are shown in Figure 3, where we show the approximation to the posterior marginal $p(\lambda, r|x)$ as computed by both inference methods. In both cases, the Gibbs sampler seems to be capturing the posterior marginal. In contrast, in the harder case, VB underestimates the observation noise variance by about an order of magnitude.

The results of the experiment are summarised in Figure 4. We measure the quality of reconstruction by the SNR defined by

$$\text{SNR}(s_{\text{orig}}, s_{\text{rec}}) \;=\; 10 \log_{10}\!\left( \frac{|s_{\text{orig}}|^2}{|s_{\text{orig}} - s_{\text{rec}}|^2} \right)$$
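As a direct transcription (ours) of this criterion, interpreting $|\cdot|^2$ as the squared Euclidean norm over all coefficients:

```python
import numpy as np

def snr_db(s_orig, s_rec):
    """SNR in dB between an original signal and its reconstruction."""
    return 10.0 * np.log10(np.sum(np.abs(s_orig) ** 2)
                           / np.sum(np.abs(s_orig - s_rec) ** 2))
```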


[Figure 3: scatter of Gibbs samples and contour plot of the variational approximation to $p(\lambda, r|x)$ for two problem instances; panel annotations: SNR VB 35.96 / SNR Gibbs 35.24 and SNR VB 78.74 / SNR Gibbs 79.22.]

Fig. 3. Two typical runs of the inference algorithms. (Upper) easy, low noise, and (lower) harder, high noise. The dots correspond to the samples obtained from the Gibbs sampler and the contour plot corresponds to the variational approximation. The round point denotes $(\lambda_{\text{orig}}, r_{\text{orig}})$. Note that this picture only shows the approximation to the posterior marginal $p(\lambda, r|x)$.


In the top panel of Figure 4, we show each problem instance as a point $(\lambda_{\text{orig}}, r_{\text{orig}})$ and classify it by the SNR criterion. Although this is a rather crude comparison, the emerging pattern suggests that Gibbs sampling achieves better performance for the harder cases, whereas VB seems to be a viable choice for the easier instances. A pairwise comparison of the SNR values for the two methods shows that we can expect comparable performance from both, with a slight tilt towards the Gibbs sampler in the difficult cases. We obtain qualitatively similar behaviour in similar experiments in other reasonable parameter regimes, although the details of these pictures vary (such as the absolute dB levels).

We observe that, in the easy instances both algorithms achieve essentially the same performance level with VB requiring far less computation. In more difficult cases, the Gibbs sampler tends to be slightly superior. In a few isolated cases, the Gibbs sampler failed to converge in I = 1000 iterations, whereas VB converged quite quickly to a viable solution.

4.1 Source separation, synthetic example

Next, we illustrate both algorithms on synthetic data ($K = 200$) sampled from the model with $N = 3$ sources and $M = 2$ observation channels, with $(a_\lambda, b_\lambda, a_r, b_r, \nu) = (2, 10, 2, 1, 1)$. The first row of the mixing matrix $A$ is set to $a_1 = [1\ 1\ 1]^\top$ in order to remove the BSS indeterminacy on gain. The second row $a_2$ is sampled from the prior and is assumed to be unknown. The observations $x$ are shown in Figure 5.

We note that we use the same clustering of the variables for both Gibbs and VB, as given in Eq. 4. In particular, we cluster the sources $s_{k,1}, \dots, s_{k,N}$ jointly as $s_{k,1:N}$. This is because whilst $s_{k,1:N}$ are a-priori independent, clearly they are a-posteriori dependent given $x_{k,1:M}$. In MCMC jargon, this is a simple form of blocking which improves convergence to the stationary distribution (Gilks et al., 1996). In the case of VB, a detailed comparison of using a richer approximation versus a simpler factorised approximation is given by Ilin and Valpola (2005) for ICA. It is also easy to see that given this particular model topology, introducing an even richer approximation that tries to capture correlations across $k$ (e.g. such as $\mathcal{Q}(s_{1:K,1:N})$), whilst in principle tractable, will not give us a better approximation; see Barber and Wiegerinck (1999) for a discussion of choosing tractable approximations in general graphical models.

[Figure 4: scatter plots over the 200 de-noising instances (panel title: $a_r = 0.5$, $b_r = 10.00$, $a_l = 0.5$, $b_l = 10.00$, $\nu = 0.5$, $N = 180$), comparing VB and Gibbs.]

Fig. 4. Denoising problem. (Up) 200 problem instances denoted by $(\lambda_{\text{orig}}, r_{\text{orig}})$: instances where the Gibbs sampler (VB) achieves a higher SNR are shown with crosses (triangles). (Down) Comparison of both algorithms in terms of SNR on the same set of instances.

Our initial simulations with various $K \leq 2000$ have revealed that the posterior is multimodal and that the solutions obtained by both methods depend critically upon the starting configuration. The Gibbs sampler was slightly superior, but the performance was not too different. Similarly to (Févotte and Godsill, in press), we have added a tempering schedule where we modified the scale parameter of the observation noise variance as $\log_{10} b_r = -8 \dots 0$ in $I/2$ epochs. This has the effect of slowly blending in the data contributions. Other approaches, such as tempering $\nu$, were not as useful.
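A small sketch (ours) of one way to implement the schedule just described: sweeping $\log_{10} b_r$ from $-8$ to $0$ over the first $I/2$ epochs keeps the implied noise variance large at first, so the data are blended in gradually; after the tempering phase $b_r$ is held at its final value.

```python
import numpy as np

def br_schedule(n_epochs):
    """Tempering schedule for the noise-variance hyper-parameter b_r."""
    n_temper = n_epochs // 2
    ramp = np.logspace(-8.0, 0.0, n_temper)          # log10(b_r) swept from -8 to 0
    hold = np.full(n_epochs - n_temper, ramp[-1])    # then keep b_r fixed
    return np.concatenate([ramp, hold])
```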


[Figure 5: image of the two observed channels over $k = 1 \dots 200$.]

Fig. 5. Observations $x_{k,m}$.

[Figure 6: parameter traces over 2000 epochs.]

Fig. 6. Source separation, synthetic example with 250 epochs of Gibbs with tempering and 250 epochs of VB.

We have found a hybrid approach most effective: switching between Gibbs sampler and VB during the simulation cycles. In Figure 6, we show a typical simulation run where we switch between Gibbs and VB every 250 epochs.

When we switch to the Gibbs sampler, we apply a tempering schedule for 300 epochs. This enables the chain to jump between modes. Once we switch back to VB, the chain quickly converges to a “nearby” solution. In Figure 7, we show the reconstruction of the sources as obtained each time before switching to Gibbs sampling. As a careful investigation shows, each reconstruction assigns source coefficients with high amplitude to different sources.


[Figure 7: source coefficients of the reconstructions at epochs 500, 1000, 1500 and 2000, and of the original sources.]

Fig. 7. Reconstructions (from top) and original (bottom). As shown in Fig. 6, during the first switch at epoch 500, the simulation has not converged and the reconstruction is poor (top panel). Subsequent reconstructions are close to the original, with only a few source coefficients assigned incorrectly. The final reconstruction is close to the original and the mixing system is estimated accurately.

4.2 Audio source separation example

We study a linear instantaneous mixture of $N = 3$ audio sources (speech, piano, guitar) with $M = 2$ observation channels.$^9$ The signals are sampled at 8 kHz with length $T = 65356$ ($\approx 8$ s). The actual time series $\hat{x}_{t,m}$ for $t = 1 \dots T$ observed at channel $m$ are transformed by an MDCT basis (Mallat, 1998) to new series $x_{k,m}$, where the new index $k$ encodes both the frequency band and frame number jointly (a particular tile in a time-frequency plane). The MDCT was used with a sine window with a time resolution of 64 ms.

In the first experiment, we set the first row $a_1^\top$ of the mixing matrix to $a_1 = [1\ 1\ 1]^\top$. The second row $a_2^\top$ is set to $[\tan\psi_1\ \tan\psi_2\ \tan\psi_3]$ with $\psi_1 = -45^\circ$, $\psi_2 = 15^\circ$ and $\psi_3 = 75^\circ$.

Sources are reconstructed by inverse MDCT of the estimated source coefficients $s$. The reconstructions are compared to the original sources using the source separation evaluation criteria defined in appendix D and described in great detail by (Vincent et al., 2006). The SDR (Source to Distortion Ratio) provides an overall separation performance criterion, the SIR (Source to Interferences Ratio) measures the level of interference from the other sources in each source estimate, the SNR (Source to Noise Ratio) measures the error due to the additive noise on the sensors, and the SAR (Source to Artifacts Ratio) measures the level of artifacts in the source estimates. The performance criteria are reported in Table 1. We point out that the performance criteria are invariant to a change of basis, so that figures can be computed either on the time sequences or on the MDCT coefficients.

$^9$ These results are reproduced from an earlier version of this paper (Cemgil et al., 2005).

[Figure 8: parameter traces across epochs.]

Fig. 8. Source separation, synthetic example with 70 epochs of Gibbs and 430 epochs of VB. Original parameters are shown with dashed lines.

In the second experiment we have relaxed the condition that we know the first row. We set the mixing matrix such that the columns are almost linearly dependent, with $a_1 = [1\ 1\ 1]^\top$ and $a_2 = [\tan\psi_1\ \tan\psi_2\ \tan\psi_3]$ with $\psi_1 = 57^\circ$, $\psi_2 = 66^\circ$ and $\psi_3 = 75^\circ$. The inference problem becomes harder and we observe that both the Gibbs sampler and VB can get trapped in local maxima easily, and multiple restarts are necessary. To render the approach feasible it was important to find a reasonable initialisation for the mixing matrix. For this purpose, we have introduced a fast initialisation scheme which can be obtained from a simpler prior structure imposed upon the sources in the zero noise limit (see appendix E). Our approach is analogous to the initialisation of Bayesian Gaussian mixture estimation with a fast method such as k-means clustering. The initialisation and reconstruction results are shown in Figure 9.

As expected, the initialisation assigns each time-frequency atom to only one source. Perhaps surprisingly, the separation quality of the initialisation is comparable to the final reconstruction. In subsequent runs with the original model, the source assignments become less crisp; while the quality of separation is not significantly higher, the sound quality of the reconstruction is slightly better (increasing SDR and SNR in Table 2).

Table 1
Performance criteria of estimated sources with both methods in the first experiment.

                    SDR    SIR    SAR    SNR
  ŝ1  MCMC           6.3   15.0    7.3   20.4
      Variational    6.4   15.4    7.3   21.7
  ŝ2  MCMC           5.1   14.3    5.8   27.8
      Variational    5.2   15.0    5.8   24.8
  ŝ3  MCMC          16.6   23.7   17.8   29.8
      Variational   16.6   25.3   17.5   29.7

Table 2
Performance criteria of estimated sources in the second experiment.

  Initialisation    SDR    SIR    SAR    SNR
      s1             5.1   20.5    5.3   48.6
      s2             4.2   16.0    4.6   49.9
      s3            10.6   25.5   10.8   58.7
  Reconstruction    SDR    SIR    SAR    SNR
      s1             5.9   15.0    6.6   64.5
      s2             4.5   14.1    5.2   62.9
      s3            16.1   25.0   16.7   74.1

The estimated sources can be downloaded from http://www-sigproc.eng.cam.ac.uk/~atc27/papers/cemgil-dsp-bss.html, which is perhaps the best way to assess the audio quality of the results.


[Figure 9: time-frequency (frame vs. frequency) images of the three sources: originals (top row), initialisation (middle row) and reconstructions (bottom row).]

Fig. 9. The logarithm of the absolute value of the MDCT coefficients of the original signal (top row), initialisation (middle row) and reconstructions (bottom row).

5 Conclusions

We have studied two inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. While both methods are algorithmically similar, they have different characteristics. Our simulations suggest that the inference problem can be hard and the posterior may display multiple local maxima.

In many source separation scenarios, the posterior distribution tends to be multimodal, with each mode typically corresponding to an alternative interpretation of the observed data. For example, the exact posterior marginal on the mixing matrix $A$ has multiple modes (apart from the $N!$ equivalent modes corresponding to the BSS indeterminacy on permutation of sources). Each of these modes implies an alternative hypothesis about the mixing process, and typically only one of these modes corresponds to the desired/original separation; see (Chan et al., 2003; Hansen and Petersen, 2003) for illustrative examples. The implication of the multimodality is that local algorithms such as the Gibbs sampler or greedy algorithms such as VB will get trapped in a single mode, depending upon the starting configuration. In theory, the basic Gibbs sampler should be able to visit all modes, but this usually takes prohibitively long. Therefore, it seems advisable in practice to use hybrid approaches where a stochastic algorithm is used to explore the state space at a high temperature (i.e., overrelaxation), occasionally switching to a deterministic algorithm to converge quickly to a nearby mode. Alternatively, multiple independent restarts can be tried to locate a starting point in the vicinity of a mode.

Another important issue is the convergence rate. It is known that algorithms based on fixed point iterations such as VB (see Petersen et al. (2005) and references therein) suffer from slow convergence, especially in low observation noise regimes. Intuitively, low observation noise implies that the latent sources and hyperparameters tend to be strongly correlated under the posterior due to "explaining away" effects. This renders coordinate ascent methods prohibitively slow. For similar reasons, the convergence of the Gibbs sampler to the target posterior is also hampered. Overrelaxation is a standard technique that can be applied to speed up convergence (Neal, 1998). It is also possible to compute the gradient of the data likelihood and use gradient descent with adaptive step sizes (Winther and Petersen, 2006).

Typically, in extreme cases, when there are a lot of data points or only a few data points, the posterior distribution may be rendered unimodal. In the first case, the posterior is peaked around the MAP solution, and in the latter case it is diffuse and typically quite close to the prior. In both cases, the inference problem may in fact be easy, and one would not expect big qualitative differences in the solutions found by the two algorithms. This seems to be the picture in our earlier simulation studies with real data.

As we show in the appendix, the two algorithms (VB and the Gibbs sampler) differ only in a few lines of code, and with minimal coding effort either can be implemented once the other is already implemented. Hence, computation speed should be the only determining factor in choosing the inference algorithm. Moreover, the factor graph formalism enables one to mechanise the derivations and automate the code generation process. For example, the LaTeX code for the update equations in the appendix is automatically generated from the model specification by a computer program (and only manually edited to enhance visual appearance). This is particularly attractive since, for real applications, one often has to build complex hierarchical prior models and the coding/testing cycle can be cumbersome.

We conclude by noting that VB and the Gibbs sampler are only two possible choices of inference algorithm. It will be interesting to see whether sequential Monte Carlo techniques (Doucet et al., 2000; Sudderth et al., 2003; Briers et al., 2006) or alternative deterministic message passing algorithms (EP, EC) (Minka, 2005; Opper and Winther, 2005; Winther and Petersen, 2006) will be useful for large scale Bayesian source separation problems.


A  Standard Distributions in Exponential Form, their sufficient statistics and entropies

• Gamma
$$\mathcal{G}(\lambda;\, a, b) \equiv \exp\!\big( (a-1)\log\lambda - \tfrac{1}{b}\lambda - \log\Gamma(a) - a\log b \big)$$
$$\langle \lambda \rangle_{\mathcal{G}} = ab \qquad \langle \log\lambda \rangle_{\mathcal{G}} = \Psi(a) + \log b$$
$$H[\mathcal{G}] \equiv -\langle \log\mathcal{G} \rangle_{\mathcal{G}} = -(a-1)\Psi(a) + \log b + a + \log\Gamma(a)$$
Here, $\Psi$ denotes the digamma function defined as $\Psi(a) \equiv d\log\Gamma(a)/da$.

• Inverse Gamma
$$\mathcal{IG}(r;\, a, b) \equiv \exp\!\big( -(a+1)\log r - \tfrac{1}{br} - \log\Gamma(a) - a\log b \big)$$
$$\langle 1/r \rangle_{\mathcal{IG}} = ab \qquad \langle \log r \rangle_{\mathcal{IG}} = -(\Psi(a) + \log b)$$
$$H[\mathcal{IG}] \equiv -\langle \log\mathcal{IG} \rangle_{\mathcal{IG}} = -(a+1)\Psi(a) - \log b + a + \log\Gamma(a)$$

• Multivariate Gaussian
$$\mathcal{N}(x;\, \mu, P) = \exp\!\big( -\tfrac{1}{2}x^\top P^{-1}x + \mu^\top P^{-1}x - \tfrac{1}{2}\mu^\top P^{-1}\mu - \tfrac{1}{2}\log|2\pi P| \big)$$
$$\langle x \rangle_{\mathcal{N}} = \mu \qquad \langle xx^\top \rangle_{\mathcal{N}} = P + \mu\mu^\top \qquad H[\mathcal{N}] \equiv -\langle \log\mathcal{N} \rangle_{\mathcal{N}} = \tfrac{1}{2}\log|2\pi e P|$$
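These moments are easy to sanity-check numerically; the small Monte Carlo sketch below (ours) verifies the Gamma statistics $\langle \lambda \rangle = ab$ and $\langle \log\lambda \rangle = \Psi(a) + \log b$ in the shape/scale parametrisation used here, relying on SciPy's digamma.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
a, b = 2.5, 0.7
lam = rng.gamma(shape=a, scale=b, size=1_000_000)    # draws from G(lambda; a, b)

print(lam.mean(), a * b)                             # <lambda>      vs  a*b
print(np.log(lam).mean(), digamma(a) + np.log(b))    # <log lambda>  vs  Psi(a) + log(b)
```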

B Summary of the Model

B.1 Generative Model

$\lambda_n$ : scale parameter of the $n$'th source, $\lambda_n \sim \mathcal{G}(\lambda_n;\, a_\lambda, b_\lambda)$

$v_{k,n}$ : variance of the $k$'th coefficient of the $n$'th source, $v_{k,n} | \lambda_n \sim \mathcal{IG}(v_{k,n};\, \nu/2, 2/(\nu\lambda_n))$

$s_{k,n}$ : source coefficient, $s_{k,n} | v_{k,n} \sim \mathcal{N}(s_{k,n};\, 0, v_{k,n})$

$a_m$ : vector of mixing proportions for the $m$'th channel, $a_m \sim \mathcal{N}(a_m;\, \mu_a, P_a)$
