
Variational and stochastic inference for Bayesian Source Separation

A. Taylan Cemgil$^{a,1}$, Cédric Févotte$^{b}$, Simon J. Godsill$^{a}$

$^a$ Engineering Dept, University of Cambridge, Trumpington St, CB2 1PZ, Cambridge, UK ({atc27,sjg}@eng.cam.ac.uk)

$^b$ Dept Signal-Image, Télécom Paris, 37-39 rue Dareau, 75014 Paris, France (fevotte@tsi.enst.fr)

Abstract

We tackle the general linear instantaneous model (possibly underdetermined and noisy) where we model the source prior with a Student t distribution. The conjugate-exponential characterisation of the t distribution as an infinite mixture of scaled Gaussians enables us to do efficient inference. We study two well-known inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. We derive both techniques as local message passing algorithms to highlight their algorithmic similarities and to contrast their different convergence characteristics and computational requirements. Our simulation results suggest that typical posterior distributions in source separation have multiple local maxima. Therefore we propose a hybrid approach where we explore the state space with a Gibbs sampler and then switch to a deterministic algorithm. This approach seems to be able to combine the speed of the variational approach with the robustness of the Gibbs sampler.

Key words: Source Separation, Variational Bayes, Markov Chain Monte Carlo, Gibbs sampler

1 Introduction

In Bayesian source separation (Mohammad-Djafari, 1997, 1999; Knuth, 1998, 1999; Knuth and Vaughan, 1998; Rowe, 2002, 2003), the task is to infer $N$ source signals $s_{k,n}$ given $M$ observed signals $x_{k,m}$, where $n = 1 \dots N$ and $m = 1 \dots M$. Here, $k$ is an index with $k = 1 \dots K$ that may correspond directly to time or indirectly to an expansion coefficient of the sources in a (linear) transform domain. By letting $s \equiv s_{1:K,1:N}$ and $x \equiv x_{1:K,1:M}$, a generic hierarchical Bayesian formulation of the problem can be stated as

$$p(s|x) = \frac{1}{Z_x} \int d\Theta_m\, d\Theta_p\; p(x|s, \Theta_m)\, p(s|\Theta_p)\, p(\Theta_m)\, p(\Theta_p) \qquad (1)$$

The mixing process is characterised by the (possibly degenerate, deterministic) conditional distribution $p(x|s, \Theta_m)$, which is known as the observation model. Here, $\Theta_m$ denotes the collection of mixing parameters such as the mixing matrix, observation noise variance, etc. The prior term $p(s|\Theta_p)$, the source model, describes the statistical properties of the sources via their own prior parameters $\Theta_p$. The hierarchical model is completed by postulating hyperpriors over the nuisance parameters $\Theta_m$ and $\Theta_p$. The normalisation term $Z_x = p(x)$ is the marginal probability of the data under the model, also known as the evidence (MacKay, 2003). While its actual numerical value is not directly relevant for separation, it plays a key role in model order selection, for example when the number of sources $N$ is unknown and needs to be estimated (Miskin and Mackay, 2001).

$^1$ This research is sponsored by the EPSRC grant EP/D03261X/1 "Probabilistic Modelling of Musical Audio for Machine Listening" and the MUSCLE Network of Excellence supported by the European Commission under FP6.

In this paper we assume the mixing process to be linear and instantaneous. In reality, a linear mixing process may still involve convolutions (i.e. reverberation) and may evolve with time (i.e. nonstationarity, moving sources). While such general scenarios are definitely of practical interest and pose significant computational challenges, these may be addressed in the same Bayesian framework of Eq. 1. Another common assumption is that of a-priori independent sources; i.e., the source model factorises as $p(s|\Theta_p) = \prod_n p(s_n|\Theta_p)$, which underlies the well-known Independent Component Analysis model (Hyvärinen et al., 2001; Cichocki and Amari, 2003). While the validity of the a-priori independence assumption is debatable, especially for musical audio, it is nevertheless a quite commonly made assumption, mainly due to its simplicity.

Note however that some authors have considered source models where mutual dependence is taken into account (Rowe, 2002, 2003).

Once the posterior is computed by evaluating the integral, the estimates of the sources can be obtained as a marginal maximum-a-posteriori (MMAP) or minimum-mean-square-error (MMSE) estimate

$$s^* = \arg\max_{s}\, p(s|x) \qquad\qquad \langle s \rangle_{p(s|x)} = \int s\, p(s|x)\, ds$$

Here, and elsewhere, the notation $\langle f(x) \rangle_{p(x)}$ will denote the expectation of the function $f(x)$ under the distribution $p(x)$, i.e. $\langle f(x) \rangle_{p} \equiv \int dx\, f(x)\, p(x)$.


Unfortunately, the computation of the posterior distribution p(s|x) via the integral in Eq.1 is intractable for almost all relevant observation and source models, even under conditionally Gaussian and independence assumptions.

Hence, approximate numerical integration techniques have to be employed.

In summary, the success of a practical Bayesian source separation system hinges upon the following points:

• Topology (structure) of the model and the parameter regime

• The accuracy, solution quality and the computational cost of the inference algorithm

Obviously, some tradeoff has to be made between model complexity and computational effort. However, for a particular application, the details of how to make this tradeoff are far from clear, especially given that there are many candidate models and approximate inference algorithms. Moreover, it is not uncommon that the difficulty of the inference problem is determined by the particular parameter regime, not only by model structure. For example, the difficulty of a source separation task depends upon the particular details of the mixing system where the effective rank of the mixing matrix and the number of observed data points play a critical role. This affects the convergence properties of approximate inference algorithms and success often depends upon initialisation and other design choices (such as tempering or propagation schedule).

In this paper, we will investigate some of the issues raised above. We will focus on the underdetermined case ($M < N$) with noisy mixtures. The underdetermined case is challenging because, contrary to the (over)determined case, estimating the mixing system is not sufficient for reconstruction of the sources, and the prior structure $p(s|\Theta)$ becomes increasingly important for reconstructing the sources in noisy environments.

We will follow the model proposed in (Févotte et al., 2004; Févotte and Godsill, in press), where audio sources are decomposed on an MDCT basis (a local cosine transform orthonormal basis) and the coefficient prior $p(s|\cdot)$ is a Student t distribution. This prior favours sparsity of the sources on a given dictionary, which intuitively means that the MAP solution $s^*$ or MMSE solution $\langle s \rangle$ typically contains few coefficients that are significantly non-zero. The use of source sparsity to handle underdetermined source separation finds its origin in the deterministic approaches of (Zibulevsky et al., 2001; Jourjine et al., 2000).

In (Févotte and Godsill, in press), a Gibbs sampler is derived to sample from the posterior distribution of the parameters (the mixing matrix, the source coefficients, the additive noise variance and the hyper-parameters). Minimum Mean Square Error estimates of the coefficients of the sources are then computed, and time domain estimates of the sources are reconstructed by applying the inverse MDCT. The results were reproducible, i.e., independent of initialisation, and the method could compute a perceptually reasonable separation of the sources.

The main advantage of MCMC is its generality, robustness and attractive theoretical properties. However, the method comes at the price of a heavy computational burden, which may render it impractical for certain applications. An alternative approach for computing the required integrals is based on deterministic fixed point iterations (Variational Bayes – Structured Mean Field) (Ghahramani and Beal, 2000; Wainwright and Jordan, 2003; Bishop, 2006). This set of methods has direct links with the well-known expectation-maximisation (EM) type of algorithms.

Variational methods have been extensively applied to various models for source separation by a number of authors (Attias, 1999; Lappalainen, 1999; Valpola, 2000; Girolami, 2001; Miskin and Mackay, 2001; Hojen-Sorensen et al., 2002; Chan et al., 2003; Winther and Petersen, 2006). In this contribution, we derive both VB and the Gibbs sampler as local message passing algorithms on a factor graph (Kschischang et al., 2001; Bishop, 2006). This approach has two advantages: first, it provides a pedagogical perspective to highlight the algorithmic similarities and to contrast objectively the convergence characteristics and computational requirements of different inference algorithms. Secondly, the framework facilitates generalisation to more complex models and automation of the code generation procedure, including alternative message passing methods.

2 Model

The source separation model defines a generative model for the source coefficients. The model is applicable to the actual time series $\hat{x}_{t,m}$ for $t = 1 \dots T$ observed at channel $m$, as well as when transformed by an orthogonal basis to a new series $x_{k,m}$ where the new index $k$ typically encodes both the frequency band and frame number (a particular tile in a time-frequency plane).

We assume linear instantaneous mixing$^2$:

$$x_{k,m} \,|\, s_{k,1:N}, a_m, r_m \;\sim\; \mathcal{N}(x_{k,m};\; a_m^\top s_{k,1:N},\; r_m)$$

where $s_{k,1:N}$ is the vector of source coefficients, $a_m$ is an $N \times 1$ vector that denotes the mixing proportions for the $m$'th channel, and $r_m$ is the variance of the observation noise in the $m$'th channel. Since both of these quantities are typically not known exactly, we place informative priors on them$^3$

$$r_m \sim \mathcal{IG}(r_m;\; a_r, b_r) \qquad\qquad a_m \sim \mathcal{N}(a_m;\; \mu_a, P_a)$$

$^2$ $\mathcal{N}$, $\mathcal{IG}$ and $\mathcal{G}$ denote Gaussian (Normal), Inverse Gamma and Gamma distributions respectively and are defined in the appendix.

In many domains (audio, natural images), the distribution of source coefficients $s_{k,n}$ tends to be heavy tailed. For example, in typical audio signals this is a simple consequence of the fact that the vast majority of physical systems (the human vocal tract, many acoustical musical instruments, etc.) produce quasi-periodic oscillations. Hence, over short time scales, the energy of signals from these sources is concentrated in a few narrow frequency bands, which enables a sparse representation using just a few expansion coefficients. In this respect, a sparse prior can be justified by the underlying physics. We assume the source coefficients to be zero mean Gaussian

$$s_{k,n} \,|\, v_{k,n} \;\sim\; \mathcal{N}(s_{k,n};\; 0,\; v_{k,n})$$

where $v_{k,n}$ is an unknown variance. Note that the variance $v_{k,n}$ (for the zero mean case) is equal to the expected amount of energy. Since this quantity is in general unknown, we place an informative prior on it as

$$v_{k,n} \,|\, \lambda_n \;\sim\; \mathcal{IG}(v_{k,n};\; \nu/2,\; 2/(\nu\lambda_n))$$

We note that this hierarchical prior model on the source coefficients is equivalent to a t-distribution

$$\mathcal{T}_\nu(s;\, \mu, \lambda) \;=\; \int dv\; \mathcal{N}(s;\, \mu, v)\; \mathcal{IG}\!\left(v;\; \frac{\nu}{2},\; \frac{2}{\nu\lambda}\right) \;\equiv\; \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi\lambda}\;\Gamma(\nu/2)} \left(1 + \frac{1}{\nu}\,\frac{(s-\mu)^2}{\lambda}\right)^{-(\nu+1)/2}$$

where the parameter $\lambda_n$ directly controls the variance of the t-distribution and hence the power of the $n$'th source. We place an informative prior on $\lambda_n$ as

$$\lambda_n \;\sim\; \mathcal{G}(\lambda_n;\; a_\lambda, b_\lambda)$$

In this paper the degrees of freedom $\nu$ is fixed to a low value, yielding a distribution with heavy tails to model sparsity. However, any estimation procedure for the degrees of freedom of an inverse Gamma distribution could be employed to learn $\nu$ as well. As such, in (Cemgil et al., 2005), we have considered a lower bound maximisation.

$^3$ Note that one can break the source separation gain indeterminacy $(ca)^\top s = a^\top (cs)$ by placing appropriate priors on the mixing proportions $a$ or source coefficients $s$. The posterior then becomes only invariant to permutations of the sources.


[Figure 1: plate diagram with nodes $\lambda_n$, $v_{k,n}$, $s_{k,n}$, $x_{k,m}$, $a_m$ and $r_m$, annotated with the conditional distributions given in the text.]

Fig. 1. Graphical model for the source separation model. The rectangle denotes a plate, i.e. $K$ copies of the nodes inside.

Similarly, an empirical adaptive scheme is considered in (Févotte and Godsill, in press), and a Metropolis-Hastings procedure is proposed in (Moussaoui et al., to appear). The graphical model of the full model is shown in Figure 1.
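To make the generative structure above concrete, the following is a minimal sketch (ours, not the authors' code) of ancestral sampling from the hierarchical model in Python/NumPy; the hyper-parameter names follow the text and all numerical settings are arbitrary.

```python
# A minimal sketch (ours) of ancestral sampling from the hierarchical model of
# Section 2. Parametrisations follow Appendix A: G(a, b) has mean a*b and
# IG(a, b) is the reciprocal of a Gamma draw with shape a and scale b.
import numpy as np

rng = np.random.default_rng(0)
K, N, M = 200, 3, 2                                     # coefficients, sources, channels
a_lam, b_lam, a_r, b_r, nu = 2.0, 10.0, 2.0, 1.0, 1.0   # hyper-parameters (arbitrary)

lam = rng.gamma(shape=a_lam, scale=b_lam, size=N)       # lambda_n ~ G(a_lam, b_lam)
v = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / (nu * lam), size=(K, N))  # v_{k,n} ~ IG(nu/2, 2/(nu*lam_n))
s = np.sqrt(v) * rng.standard_normal((K, N))            # s_{k,n} ~ N(0, v_{k,n})

A = rng.standard_normal((M, N))                         # a_m ~ N(mu_a, P_a), here mu_a = 0, P_a = I
r = 1.0 / rng.gamma(shape=a_r, scale=b_r, size=M)       # r_m ~ IG(a_r, b_r)
x = s @ A.T + np.sqrt(r) * rng.standard_normal((K, M))  # x_{k,m} ~ N(a_m^T s_{k,1:N}, r_m)
```

Marginally over $v_{k,n}$ (given $\lambda_n$), each coefficient $s_{k,n}$ drawn this way follows the Student t distribution above; for $\lambda_n = 1$ this can be checked empirically against, e.g., NumPy's `standard_t` draws with $\nu$ degrees of freedom.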

3 Inference

The model described in the previous section is an instance of the basic Bayesian source separation model stated in (1), where the mixing model parameters are $\Theta_m = \{a_{1:M}, r_{1:M}\} \equiv \{A, R\}$ and the source prior parameters are $\Theta_p = \{v_{1:K,1:N}, \lambda_{1:N}\} \equiv \{v, \lambda\}$. We will refer to all parameters by $\Theta \equiv (\Theta_p, \Theta_m)$. The remaining hyper-parameters $a_\lambda, b_\lambda, a_r, b_r, \mu_a, P_a$ and $\nu$ are assumed to be known. Formally, we need to compute the posterior distribution

$$p(s|x) = \frac{1}{Z_x} \int \phi_x(s, A, R, v, \lambda)\; dA\, dR\, dv\, d\lambda \qquad (2)$$

where $\phi_x$ is the joint distribution evaluated at the observed $x$, given as


$$\phi_x(s, A, R, v, \lambda) \;\equiv\; p(x|s, A, R)\, p(s|v)\, p(A)\, p(R)\, p(v|\lambda)\, p(\lambda)$$
$$= \prod_{k=1}^{K} \left( \prod_{n=1}^{N} p(v_{k,n}|\lambda_n)\, p(s_{k,n}|v_{k,n}) \right) \left( \prod_{m=1}^{M} p(x_{k,m}|s_{k,1:N}, a_m, r_m) \right) \times \left( \prod_{m=1}^{M} p(a_m)\, p(r_m) \right) \left( \prod_{n=1}^{N} p(\lambda_n) \right)$$

and estimate the sources by, e.g., MMSE, given as $\langle s \rangle_{p(s|x)}$. Unfortunately, the required integral cannot be computed exactly and numerical techniques have to be employed.

In the following two sections, we will review two alternative approaches for numerical computation: a stochastic method (Markov Chain Monte Carlo) and a deterministic method (Structured Mean field - Variational Bayes). Both methods are well understood and a vast amount of literature is available on both subjects. In this paper, we will introduce both techniques in the context of source separation to highlight the algorithmic similarities in order to facilitate a comparison.

3.1 Markov Chain Monte Carlo

Suppose we could generate $I$ independent samples $(s^{(i)}, \Theta^{(i)})$, $i = 1 \dots I$, from the joint posterior

$$\frac{1}{Z_x}\,\phi_x(s, \Theta) \;\equiv\; \frac{1}{Z_x}\,\phi_x(s, A, R, v, \lambda)$$

Then, the intractable integral can be trivially approximated by simply discarding $\Theta^{(i)}$ and the sources could be recovered by

$$\langle s \rangle \;\approx\; \frac{1}{I} \sum_{i=1}^{I} s^{(i)} \qquad (3)$$

Unfortunately, generating independent samples from the joint posterior is a difficult task. On the other hand, it is usually easier to generate dependent samples, that is, we generate $(s^{(i+1)}, \Theta^{(i+1)})$ by making use of $(s^{(i)}, \Theta^{(i)})$. Perhaps surprisingly, even though subsequent samples are correlated (and provided certain ergodicity conditions are satisfied), Eq. 3 remains valid and the estimated quantities converge to their true values as the number of samples $I$ goes to infinity (Gilks et al., 1996).

A sequence of dependent samples $(s^{(i)}, \Theta^{(i)})$ is generated by sampling from a Markov chain that has the desired joint posterior $\phi_x(s, \Theta)/Z_x$ as its stationary distribution. The chain is defined by a collection of transition probabilities, i.e., a transition kernel $T(s^{(i+1)}, \Theta^{(i+1)} \,|\, s^{(i)}, \Theta^{(i)})$. The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) provides a simple way of defining an ergodic kernel that has the desired stationary distribution. Suppose we have a sample $(s^{(i)}, \Theta^{(i)})$. A new candidate $(s', \Theta')$ is generated by sampling from a proposal distribution $q(s, \Theta \,|\, s^{(i)}, \Theta^{(i)})$. We define the acceptance probability $a$ by

$$a(s', \Theta' \leftarrow s, \Theta) \;=\; \min\!\left( 1,\; \frac{\phi_x(s', \Theta')/Z_x}{\phi_x(s, \Theta)/Z_x}\; \frac{q(s, \Theta \,|\, s', \Theta')}{q(s', \Theta' \,|\, s, \Theta)} \right)$$

The new candidate $(s', \Theta')$ is accepted as the next sample $(s^{(i+1)}, \Theta^{(i+1)})$ with probability $a$, otherwise $(s^{(i+1)}, \Theta^{(i+1)}) \leftarrow (s^{(i)}, \Theta^{(i)})$. The algorithm is initialised by generating the first sample $(s^{(0)}, \Theta^{(0)})$ according to an (arbitrary) proposal distribution.
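In code, a single Metropolis-Hastings step can be written as follows. This is a generic sketch of ours (not from the paper), working in the log domain so that $Z_x$ cancels; `log_phi` is any unnormalised log-target and `propose` and `log_q` are hypothetical user-supplied proposal routines.

```python
import numpy as np

def mh_step(theta, log_phi, propose, log_q, rng):
    """One Metropolis-Hastings transition in the log domain."""
    theta_new = propose(theta, rng)                      # draw a candidate
    log_a = (log_phi(theta_new) - log_phi(theta)         # unnormalised target ratio
             + log_q(theta, theta_new)                   # q(old | new)
             - log_q(theta_new, theta))                  # q(new | old)
    if np.log(rng.uniform()) < min(0.0, log_a):          # accept with probability a
        return theta_new
    return theta                                         # otherwise keep the current state
```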

However, for a given transition kernel $T$, it is hard to assess the time required to converge to the stationary distribution, so in practice one has to run the simulation until a very large number of samples has been obtained (Roberts and Rosenthal, 1998). The choice of the proposal distribution $q$ is also very critical. A poor choice may lead to the rejection of many new candidates and consequently to very slow convergence to the stationary distribution. In the context of source separation, where the dimension is large, it is not easy to design a proposal that makes global jumps, as suggested by $T(s^{(i+1)}, \Theta^{(i+1)} \,|\, s^{(i)}, \Theta^{(i)})$, where a whole block of variables is updated in parallel.

3.2 Gibbs sampling

A simpler approach is to sample the variables one by one, or in smaller, mutually disjoint blocks. For this purpose, we group all variables into mutually exclusive sets $C_\alpha$, which we call clusters. The subscript $\alpha = 1 \dots A$ denotes the unique cluster index and $\mathcal{C} = \{C_1, \dots, C_\alpha, \dots, C_A\}$ is the set of all clusters.

Often, there is a huge number of choices for grouping the variables. A natural choice for the source separation problem is$^4$

$$\mathcal{C} = \{\, s_{1,1:N}, \dots, s_{K,1:N},\; a_1, \dots, a_M,\; r_1, \dots, r_M,\; v_{1,1}, \dots, v_{k,n}, \dots, v_{K,N},\; \lambda_1, \dots, \lambda_N \,\} \qquad (4)$$

$^4$ Note that in this choice, for each $k$, we cluster the sources $s_{k,1}, \dots, s_{k,N}$ jointly as $s_{k,1:N}$. This particular blocking leads to better convergence because whilst $s_{k,1:N}$ are a-priori independent, clearly they are a-posteriori dependent given $x_{k,1:M}$.


Given a clustering, one can apply a more specialised Markov chain Monte Carlo (MCMC) algorithm, the Gibbs sampler, by choosing the proposal density as the full conditional distributions $p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha})$, where $\mathcal{C}_{\neg\alpha} \equiv \mathcal{C} \setminus C_\alpha$, i.e., all clusters but $C_\alpha$. It also turns out that the acceptance probability becomes $a = 1$, so all proposals are accepted by default.

To complete the algorithm, we need to decide upon the order in which we will visit the individual clusters $\alpha$. We will call such a sequence a propagation schedule $\alpha = \pi(j)$, where $j = 1 \dots AI$.

To generate a new candidate for a cluster $C_\alpha$, we freeze the configuration of the remaining clusters $\mathcal{C}_{\neg\alpha}^{(j-1)}$ (where $\alpha = \pi(j)$) and compute the full conditional distribution $p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \propto \phi_x(C_\alpha, \mathcal{C}_{\neg\alpha}^{(j-1)})$.

Theoretically, the actual order is not important and can be randomised, provided that each cluster is visited infinitely often in the limit when $j \to \infty$. In practice however, given limited computational resources, a particular propagation schedule may be favourable, since it may converge faster. Typically, periodic propagation schedules are employed where each cluster is visited once according to some permutation. The permutation is repeated $I$ times, so each cluster is visited $I$ times by the end of the simulation. It is also common practice to let the chain run for an initial burn-in period where each cluster is visited $I_0$ times without computing any statistics.

The key observation is that, given a clustering of the variables $\mathcal{C}$, the joint posterior admits the factorisation

$$\phi_x(\mathcal{C}) \;=\; \prod_{\beta} \phi_{x,\beta}\big(\{C_\alpha\}_{\alpha \in \mathrm{neigh}(\beta)}\big) \qquad (5)$$

Here, the $\phi_{x,\beta}$ are positive and possibly unnormalised functions. This expression says that if cluster $\alpha$ appears in factor $\phi_{x,\beta}$, we have $\alpha \in \mathrm{neigh}(\beta)$, and the joint can be written as a product of such terms$^5$.

The structure of the factorisation can be conveniently visualised using a factor graph, see Fig. 2 (Kschischang et al., 2001). Each black square corresponds to a factor $\phi_{x,\beta}$, where $\phi_x = \prod_\beta \phi_{x,\beta}$, and the circles denote the individual cluster variables $C_\alpha$. A factor $\phi_{x,\beta}$ is connected to a cluster node $\alpha$ if $\alpha \in \mathrm{neigh}(\beta)$, i.e. a variable in $C_\alpha$ appears in the defining expression of $\phi_{x,\beta}$. Obviously the relation is symmetric: a cluster $\alpha$ is connected to $\phi_{x,\beta}$ when $\beta \in \mathrm{neigh}(\alpha)$.

$^5$ As an example, suppose $\phi(s, v) = p(s|v)p(v) = \mathcal{N}(s; 0, v)\,\mathcal{IG}(v; 1, 1)$ and we choose $\mathcal{C} = \{C_1, C_2\} = \{\{v\}, \{s\}\}$. Then
$$\phi(s, v) \equiv \exp\!\left( -\frac{1}{2}\frac{s^2}{v} - \frac{1}{2}\log 2\pi v \right) \exp\!\left( -2\log v - \frac{1}{v} \right) \propto \exp\!\left( -\frac{1}{2}\frac{s^2}{v} \right) \exp\!\left( -\frac{5}{2}\log v - \frac{1}{v} \right) = \phi_1(s, v)\,\phi_2(v) \equiv \prod_{\beta=1}^{2} \phi_\beta$$
and we would have $\mathrm{neigh}(\beta = 1) = \{1, 2\}$ and $\mathrm{neigh}(\beta = 2) = \{1\}$.

For a given cluster $\alpha$, we could write Eq. (5) as

$$\log \phi_x(\mathcal{C}) = \sum_{\beta \in \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \;+\; \sum_{\beta \notin \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \qquad (6)$$

$$p(C_\alpha \,|\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \;\propto\; \phi_x(C_\alpha;\, \mathcal{C}_{\neg\alpha}^{(j-1)}) \;=\; \exp\!\left( \sum_{\beta \in \mathrm{neigh}(\alpha)} \log \phi_{x,\beta}\big(\{C_{\alpha'}\}_{\alpha' \in \mathrm{neigh}(\beta)}\big) \right) \qquad (7)$$

Given the configuration $\mathcal{C}_{\neg\alpha}^{(j-1)}$, the second term in Eq. 6 does not depend on $\alpha$, hence only a small subset of the clusters in $\neg\alpha$ actually affects the expression of the full conditional. This subset, which is known as the Markov blanket of $\alpha$, will be denoted by $\widehat{\neg\alpha}$. The Markov blanket consists of nodes that share a common factor with $\alpha$, i.e. $\widehat{\neg\alpha} \equiv \{\alpha' : \alpha' \in \neg\alpha \,\wedge\, \exists\beta\ \mathrm{s.t.}\ (\beta \in \mathrm{neigh}(\alpha') \wedge \beta \in \mathrm{neigh}(\alpha))\}$. For example, in Fig. 2, the Markov blanket of $\{v_{k,n}\}$ consists of $\{\{\lambda_n\}, \{s_{k,1:N}\}\}$.
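The neighbourhood and Markov blanket structure is easy to compute mechanically from the factorisation; the following small sketch (ours, using the hypothetical two-factor graph of footnote 5) shows the bookkeeping that both the Gibbs sampler and VB rely on.

```python
from collections import defaultdict

# factor -> clusters it touches; the toy graph of footnote 5:
# phi_1(s, v) touches both clusters, phi_2(v) touches only {v}
neigh_factor = {"phi1": {"s", "v"}, "phi2": {"v"}}

neigh_cluster = defaultdict(set)        # cluster -> factors touching it
for beta, clusters in neigh_factor.items():
    for alpha in clusters:
        neigh_cluster[alpha].add(beta)

def markov_blanket(alpha):
    """All clusters sharing at least one factor with alpha, excluding alpha itself."""
    blanket = set()
    for beta in neigh_cluster[alpha]:
        blanket |= neigh_factor[beta]
    blanket.discard(alpha)
    return blanket

print(markov_blanket("v"))              # -> {'s'}
```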

This simplification reduces the amount of computation required for computing the full conditional distribution. A further computational simplification is obtained when the expression for the full conditional corresponds to a known distribution. One way to ensure this is to choose the hierarchical model such that all priors are conjugate-exponential$^6$. In this case, all conditional marginals have a closed form solution. The Gibbs sampling algorithm is summarised below:

• Initialise (or burn in): For $\alpha = 1 \dots A$, $C_\alpha^{(0)} \sim q(C_\alpha)$

• Sample the chain (typically $J = AI$): For $j = 1 \dots J$: $\alpha = \pi(j)$, $C_\alpha^{(j)} \sim p(C_\alpha \,|\, \mathcal{C}_{\widehat{\neg\alpha}}^{(j-1)})$

• Estimate: $\langle C_\alpha \rangle \approx \dfrac{1}{I} \displaystyle\sum_{j\ \mathrm{s.t.}\ \pi(j) = \alpha} C_\alpha^{(j)}$
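For the single-channel de-noising case studied in Section 4 ($N = M = 1$, $A = 1$) all full conditionals are conjugate, and the whole sweep fits in a few lines. The sketch below is ours (not the authors' code); it uses the parametrisations of Appendix A, initialises roughly at prior means, and returns the MMSE estimate of Eq. (3).

```python
import numpy as np

def gibbs_denoise(x, nu=0.5, a_lam=0.5, b_lam=10.0, a_r=0.5, b_r=10.0,
                  n_iter=1000, burn_in=100, rng=None):
    """Gibbs sampler for x_k = s_k + noise with a Student-t (scale mixture) prior on s_k."""
    rng = rng or np.random.default_rng()
    K = len(x)
    lam, r, v = a_lam * b_lam, 1.0, np.ones(K)           # simple initialisation
    s_sum, n_kept = np.zeros(K), 0
    for it in range(n_iter):
        # s_k | v_k, r, x_k  ~  N(m_k, 1/prec_k)  with  prec_k = 1/r + 1/v_k
        prec = 1.0 / r + 1.0 / v
        m = (x / r) / prec
        s = m + rng.standard_normal(K) / np.sqrt(prec)
        # v_k | s_k, lam  ~  IG((nu+1)/2, 2/(nu*lam + s_k^2))
        v = 1.0 / rng.gamma((nu + 1) / 2, 2.0 / (nu * lam + s ** 2))
        # r | s, x  ~  IG(a_r + K/2, 1/(1/b_r + sum_k (x_k - s_k)^2 / 2))
        r = 1.0 / rng.gamma(a_r + K / 2,
                            1.0 / (1.0 / b_r + 0.5 * np.sum((x - s) ** 2)))
        # lam | v  ~  G(a_lam + K*nu/2, 1/(1/b_lam + (nu/2) * sum_k 1/v_k))
        lam = rng.gamma(a_lam + K * nu / 2,
                        1.0 / (1.0 / b_lam + 0.5 * nu * np.sum(1.0 / v)))
        if it >= burn_in:                                # accumulate the MMSE estimate, Eq. (3)
            s_sum += s
            n_kept += 1
    return s_sum / n_kept
```

With data generated as in the sketch of Section 2 (restricted to $N = M = 1$, $A = 1$), `s_rec = gibbs_denoise(x)` gives the posterior mean estimate that is compared against $s_{\text{orig}}$ in the SNR comparison of Section 4.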

$^6$ Consider the example from the previous footnote:
$$p(v \,|\, s = \hat{s}) \propto \phi(\hat{s}, v) \propto \exp\!\left( -\frac{1}{2}\frac{\hat{s}^2}{v} \right) \exp\!\left( -\frac{5}{2}\log v - \frac{1}{v} \right) = \exp\!\left( -\frac{5}{2}\log v - \left( 1 + \frac{1}{2}\hat{s}^2 \right)\frac{1}{v} \right) \propto \mathcal{IG}\big(v;\, 3/2,\; 2/(2 + \hat{s}^2)\big)$$
The full conditional $p(v \,|\, s = \hat{s})$ has the same form as the prior $p(v)$.


[Figure 2: factor graph with factor nodes (black squares) connecting the cluster variables $\lambda_n$, $v_{k,n}$, $s_{k,1:N}$, $a_m$ and $r_m$.]

Fig. 2. Factor graph corresponding to the clustering of the variables as defined in Eq. 4.

3.3 Structured Mean Field – Variational Bayes

An alternative approximation method that leads to an iterative optimisation procedure is the structured mean field method, also known as variational Bayes, see (Wiegerinck, 2000; Ghahramani and Beal, 2000; Wainwright and Jordan, 2003; MacKay, 2003; Bishop, 2006) and references therein. In our case, mean field boils down to approximating the integrand $\mathcal{P} = \phi_x / Z_x$ in (2) with a simple distribution $\mathcal{Q}$ in such a way that the integral becomes tractable. An intuitive interpretation of the mean field method is minimising the KL divergence with respect to (the parameters of) $\mathcal{Q}$, where

$$KL(\mathcal{Q} \,\|\, \mathcal{P}) \;=\; \langle \log \mathcal{Q} \rangle_{\mathcal{Q}} \;-\; \left\langle \log \frac{1}{Z_x}\phi_x \right\rangle_{\mathcal{Q}} \qquad (8)$$

Using the non-negativity of KL (Cover and Thomas, 1991), we obtain a lower bound on the evidence

$$\log Z_x \;\geq\; \langle \log \phi_x \rangle_{\mathcal{Q}} - \langle \log \mathcal{Q} \rangle_{\mathcal{Q}} \;\equiv\; \mathcal{F}[\mathcal{P}, \mathcal{Q}] + H[\mathcal{Q}]$$

Here, $\mathcal{F}$ is interpreted as a negative energy term and $H[\mathcal{Q}]$ is the entropy of the approximating distribution. The maximisation of this lower bound is equivalent to finding the "nearest" $\mathcal{Q}$ to $\mathcal{P}$ in terms of KL divergence, and this solution is obtained by a joint maximisation of the entropy $H$ and $\mathcal{F}$ (minimisation of the energy) (Neal and Hinton, 1999).

In this paper, we choose a factorised approximating distribution $\mathcal{Q}$ that respects a factorisation of the form

$$\mathcal{Q} = \prod_{\alpha} \mathcal{Q}_\alpha(C_\alpha)$$

where the clusters are defined as in (4). Although a closed form solution for $\mathcal{Q}$ still cannot be found, it can easily be shown, e.g. see (Wainwright and Jordan, 2003; Winn and Bishop, 2005), that each factor $\mathcal{Q}_\alpha$ of the optimal approximating distribution should satisfy the following fixed point equation

$$\mathcal{Q}_\alpha \;\propto\; \exp\!\big( \langle \log \phi_x \rangle_{\mathcal{Q}_{\neg\alpha}} \big) \qquad (9)$$

where $\mathcal{Q}_{\neg\alpha} \equiv \mathcal{Q} / \mathcal{Q}_\alpha$, that is, the joint distribution of all factors excluding $\mathcal{Q}_\alpha$. Hence, the mean field approach leads to a set of (deterministic) fixed point equations that need to be iterated until convergence.

The right hand side of this fixed point equation can be computed efficiently, since the joint posterior admits the factorisation in (5), which translates to

$$\langle \log \phi_x \rangle_{\mathcal{Q}_{\neg\alpha}} \;=\; \sum_{\beta} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\neg\alpha}} \;\stackrel{+}{=}\; \sum_{\beta \in \mathrm{neigh}(\alpha)} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\mathrm{neigh}(\beta)\setminus\alpha}}$$

where $f(x) \stackrel{+}{=} g(x)$ denotes equality up to an irrelevant constant $c$, i.e., $f(x) = g(x) + c$. The expectations $\langle \log \phi_{x,\beta} \rangle$ can be computed easily if all distributions are chosen to be in a conjugate-exponential family, for example see (Ghahramani and Beal, 2000). One additional nice feature of this fixed point iteration is that, at every step, the lower bound on $\log Z_x$ defined above is guaranteed to increase. This feature is useful in ensuring that an implementation is bug-free.

The Variational Bayes algorithm is summarised below:

• Initialise: For $\alpha = 1 \dots A$, set $\mathcal{Q}_\alpha^{(0)}$

• Iterate the fixed point equations (typically $J = AI$): For $j = 1 \dots J$: $\alpha = \pi(j)$,
$$\mathcal{Q}_\alpha^{(j)} \;\propto\; \exp\!\left( \sum_{\beta \in \mathrm{neigh}(\alpha)} \langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}^{(j-1)}_{\mathrm{neigh}(\beta)\setminus\alpha}} \right)$$

• Estimate: $\langle C_\alpha \rangle_{\mathcal{P}} \approx \langle C_\alpha \rangle_{\mathcal{Q}^{(J)}}$
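For the de-noising model of Section 4 this fixed point sweep has exactly the same structure as the Gibbs sweep sketched earlier, except that each node stores sufficient statistics and the draws are replaced by expectations ($\langle 1/v \rangle = ab$ for an $\mathcal{IG}(a,b)$ factor and $\langle \lambda \rangle = ab$ for a $\mathcal{G}(a,b)$ factor, Appendix A). Again, this is a minimal sketch of ours rather than the authors' implementation.

```python
import numpy as np

def vb_denoise(x, nu=0.5, a_lam=0.5, b_lam=10.0, a_r=0.5, b_r=10.0, n_iter=100):
    """Structured mean field / VB fixed point iterations for x_k = s_k + noise."""
    K = len(x)
    E_inv_v = np.ones(K)                 # <1/v_k> under Q(v_k)
    E_inv_r, E_lam = 1.0, a_lam * b_lam  # <1/r>, <lambda>
    for _ in range(n_iter):
        # Q(s_k) = N(m_k, S_k), with precision <1/r> + <1/v_k>
        prec = E_inv_r + E_inv_v
        S = 1.0 / prec
        m = E_inv_r * x * S
        E_s2 = S + m ** 2                                             # <s_k^2>
        # Q(v_k) = IG((nu+1)/2, 2/(nu*<lam> + <s_k^2>))
        E_inv_v = ((nu + 1) / 2) * (2.0 / (nu * E_lam + E_s2))
        # Q(r) = IG(a_r + K/2, 1/(1/b_r + sum_k <(x_k - s_k)^2>/2))
        E_inv_r = (a_r + K / 2) / (1.0 / b_r + 0.5 * np.sum((x - m) ** 2 + S))
        # Q(lam) = G(a_lam + K*nu/2, 1/(1/b_lam + (nu/2) * sum_k <1/v_k>))
        E_lam = (a_lam + K * nu / 2) / (1.0 / b_lam + 0.5 * nu * np.sum(E_inv_v))
    return m                              # approximate posterior mean <s>_Q
```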

It is important to note the similarity between the Gibbs sampler and variational Bayes. Given the factor graph of Fig. 2, both algorithms have a simple "visual" interpretation. Remember that the factor nodes $\beta \in \mathrm{neigh}(\alpha)$ (black squares in Fig. 2) define local compatibility functions between $\alpha$ and its Markov blanket $\alpha' \in \widehat{\neg\alpha}$.

In the Gibbs sampler, on each cluster node $\alpha$, we store a configuration. If we wish to sample the cluster node $\alpha$ at step $j$ of the simulation, all we need is the current configuration of the Markov blanket $\widehat{\neg\alpha}^{(j-1)}$, and we need to evaluate $\phi_{x,\beta}$ for $\beta \in \mathrm{neigh}(\alpha)$ pointwise. When the nodes are chosen as conjugate-exponential, the expression for this full conditional density is already available in closed form as a function of the configuration of the Markov blanket.

In variational Bayes, on each cluster node $\alpha$, we store sufficient statistics. If we wish to update (the sufficient statistics of) the approximating distribution $\mathcal{Q}_\alpha$, all we need is the current sufficient statistics of $\mathcal{Q}_{\widehat{\neg\alpha}}$, and we need to evaluate the expectations $\langle \log \phi_{x,\beta} \rangle_{\mathcal{Q}_{\widehat{\neg\alpha}}}$ for $\beta \in \mathrm{neigh}(\alpha)$. Again, if the nodes are chosen as conjugate-exponential, the expression for $\mathcal{Q}_\alpha$ is already available in closed form as a function of the sufficient statistics of its Markov blanket$^7$.

3.4 Tempering

As illustrated above, while the underlying theoretical principles are different, both the Gibbs sampler and variational Bayes share the same local updating strategy, and both can suffer from slow convergence or get stuck in a local mode of the posterior. One general strategy to circumvent these problems is tempering$^8$. The idea is to define a sequence of target distributions

$$\phi_x(\tau_1), \dots, \phi_x(\tau_j), \dots \;\to\; \phi_x$$

that converge to the desired posterior. The sequence $\tau_j$ is referred to as a tempering schedule. The idea is to use the samples or sufficient statistics obtained from $\phi_x(\tau_{j-1})$ as the starting point for $\phi_x(\tau_j)$. In general, the schedule can be quite arbitrary, but one simple strategy is to define a sequence $0 \leq \tau_1 \leq \dots \leq \tau_j \leq \dots \leq 1$ and to raise the posterior to the power $\tau_j$, i.e. $\phi_x(\tau_j) = \phi_x^{\tau_j}$. When $\tau$ is near zero, the posterior will be flatter and the modes will not be very pronounced.

$^7$ Consider the example from the first footnote and assume $\mathcal{Q}_1(v) = \mathcal{IG}(v; a_v, b_v)$ and $\mathcal{Q}_2(s) = \mathcal{N}(s; m_s, S_s)$, where $a_v, b_v, m_s, S_s$ are variational parameters. By Eq. 9, we have the following fixed point equations:
$$\log \mathcal{Q}_1(v) \stackrel{+}{=} \langle \log \phi_1(s, v)\phi_2(v) \rangle_{\mathcal{Q}_2(s)} = \langle \log \phi_1(s, v) \rangle_{\mathcal{Q}_2(s)} + \log \phi_2(v) = -\frac{1}{2}\langle s^2 \rangle \frac{1}{v} - \frac{5}{2}\log v - \frac{1}{v} \;\stackrel{+}{=}\; \log \mathcal{IG}\big(v;\, 3/2,\; 2/(2 + S_s + m_s^2)\big)$$
$$\log \mathcal{Q}_2(s) \stackrel{+}{=} \langle \log \phi_1(s, v)\phi_2(v) \rangle_{\mathcal{Q}_1(v)} \stackrel{+}{=} \langle \log \phi_1(s, v) \rangle_{\mathcal{Q}_1(v)} = -\frac{1}{2} s^2 \left\langle \frac{1}{v} \right\rangle \;\stackrel{+}{=}\; \log \mathcal{N}\big(s;\, 0,\; 1/(a_v b_v)\big)$$

$^8$ Different communities use different terms for this idea, such as bridging, annealing or overrelaxation. We prefer to reserve the term simulated annealing for the particular case where the inverse temperature $\tau$ goes to infinity to locate the modes of the posterior.

4 Simulation Results

The first simulation is designed to illustrate both algorithms on a simpler problem where there is only one source (N = 1) that is observed through a single noisy observation channel (M = 1). This problem is in fact a de-noising problem where we wish to “separate” a single source from Gaussian noise.

We sample 200 independent problem instances from the model with $K = 180$ data points, using hyper-parameters $(a_\lambda, b_\lambda, a_r, b_r, \nu) = (0.5, 10, 0.5, 10, 0.5)$. The mixing matrix is set to $A = a_1 = 1$ and is also assumed to be known.

For each problem instance $(x, s, v, r, \lambda)_{\text{orig}}$, we infer the posterior mean of the sources $s_{\text{rec}} = \langle s \rangle_{p(s|x_{\text{orig}})}$ and compare it with $s_{\text{orig}}$. The parameter $\lambda_{\text{orig}}$ defines the scale of the source and $r_{\text{orig}}$ denotes the variance of the observation noise.

The difficulty of de-noising depends upon the signal to noise ratio λ/r, where cases with low λ/r are typically more difficult to de-noise.

For each of the 200 instances, we initialise the parameters as follows: prior means for Gibbs and prior sufficient statistics for VB. We run both algorithms using the same propagation schedule where we update the parameters periodically with the order $\{s, v, r, \lambda\}$. We run the Gibbs sampler for $I = 1000$ iterations with a burn-in period of $I_0 = 100$ iterations. The variational algorithm, in contrast, is only run for $I = 100$ iterations, which seems to be sufficient to ensure its convergence.

The results for two typical problem instances are shown in Figure 3, where we show the approximation to the posterior marginal $p(\lambda, r|x)$ as computed by both inference methods. In both cases, the Gibbs sampler seems to be capturing the posterior marginal. In contrast, in the harder case, VB underestimates the observation noise variance by about an order of magnitude.

The results of the experiment are summarised in Figure 4. We measure the quality of reconstruction by the SNR defined by

$$\text{SNR}(s_{\text{orig}}, s_{\text{rec}}) \;=\; 10 \log_{10}\!\left( \frac{|s_{\text{orig}}|^2}{|s_{\text{orig}} - s_{\text{rec}}|^2} \right)$$
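As a direct transcription (ours) of this criterion, interpreting $|\cdot|^2$ as the squared Euclidean norm over all coefficients:

```python
import numpy as np

def snr_db(s_orig, s_rec):
    """SNR in dB between an original signal and its reconstruction."""
    return 10.0 * np.log10(np.sum(np.abs(s_orig) ** 2)
                           / np.sum(np.abs(s_orig - s_rec) ** 2))
```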


[Figure 3: scatter of Gibbs samples and contour plot of the variational approximation to $p(\lambda, r|x)$ for two problem instances; panel annotations: SNR VB 35.96 / SNR Gibbs 35.24 and SNR VB 78.74 / SNR Gibbs 79.22.]

Fig. 3. Two typical runs of the inference algorithms. (Upper) easy, low noise, and (lower) harder, high noise. The dots correspond to the samples obtained from the Gibbs sampler and the contour plot corresponds to the variational approximation. The round point denotes $(\lambda_{\text{orig}}, r_{\text{orig}})$. Note that this picture only shows the approximation to the posterior marginal $p(\lambda, r|x)$.


In the top panel of Figure 4, we show each problem instance as a point $(\lambda_{\text{orig}}, r_{\text{orig}})$ and classify it by the SNR criterion. Although this is a rather crude comparison, the emerging pattern suggests that Gibbs sampling achieves better performance for the harder cases, whereas VB seems to be a viable choice for the easier instances. A pairwise comparison of the SNR values for the two methods shows that we can expect comparable performance from both, with a slight tilt towards the Gibbs sampler in the difficult cases. We obtain qualitatively similar behaviour in similar experiments in other reasonable parameter regimes, although the details of these pictures vary (such as the absolute dB levels).

We observe that, in the easy instances both algorithms achieve essentially the same performance level with VB requiring far less computation. In more difficult cases, the Gibbs sampler tends to be slightly superior. In a few isolated cases, the Gibbs sampler failed to converge in I = 1000 iterations, whereas VB converged quite quickly to a viable solution.

4.1 Source separation, synthetic example

Next, we illustrate both algorithms on synthetic data ($K = 200$) sampled from the model with $N = 3$ sources and $M = 2$ observation channels, with $(a_\lambda, b_\lambda, a_r, b_r, \nu) = (2, 10, 2, 1, 1)$. The first row of the mixing matrix $A$ is set to $a_1 = [1\ 1\ 1]^\top$ in order to remove the BSS indeterminacy on gain. The second row $a_2$ is sampled from the prior and is assumed to be unknown. The observations $x$ are shown in Figure 5.

We note that we use the same clustering of the variables for both Gibbs and VB, as given in Eq. 4. In particular, we cluster the sources $s_{k,1}, \dots, s_{k,N}$ jointly as $s_{k,1:N}$. This is because whilst $s_{k,1:N}$ are a-priori independent, clearly they are a-posteriori dependent given $x_{k,1:M}$. In MCMC jargon, this is a simple form of blocking which improves convergence to the stationary distribution (Gilks et al., 1996). In the case of VB, a detailed comparison of using a richer approximation versus a simpler factorised approximation is given by Ilin and Valpola (2005) for ICA. It is also easy to see that given this particular model topology, introducing an even richer approximation that tries to capture correlations across $k$ (e.g. such as $\mathcal{Q}(s_{1:K,1:N})$), whilst in principle tractable, will not give us a better approximation; see Barber and Wiegerinck (1999) for a discussion of choosing tractable approximations in general graphical models.

[Figure 4: scatter plots over the 200 de-noising instances (panel title: $a_r = 0.5$, $b_r = 10.00$, $a_l = 0.5$, $b_l = 10.00$, $\nu = 0.5$, $N = 180$), comparing VB and Gibbs.]

Fig. 4. Denoising problem. (Up) 200 problem instances denoted by $(\lambda_{\text{orig}}, r_{\text{orig}})$: instances where the Gibbs sampler (VB) achieves a higher SNR are shown with crosses (triangles). (Down) Comparison of both algorithms in terms of SNR on the same set of instances.

Our initial simulations with various $K \leq 2000$ have revealed that the posterior is multimodal and that the solutions obtained by both methods depend critically upon the starting configuration. The Gibbs sampler was slightly superior, but the performance was not too different. Similarly to (Févotte and Godsill, in press), we have added a tempering schedule where we modified the scale parameter of the observation noise variance as $\log_{10} b_r = -8 \dots 0$ in $I/2$ epochs. This has the effect of slowly blending in the data contributions. Other approaches, such as tempering $\nu$, were not as useful.
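A small sketch (ours) of one way to implement the schedule just described: sweeping $\log_{10} b_r$ from $-8$ to $0$ over the first $I/2$ epochs keeps the implied noise variance large at first, so the data are blended in gradually; after the tempering phase $b_r$ is held at its final value.

```python
import numpy as np

def br_schedule(n_epochs):
    """Tempering schedule for the noise-variance hyper-parameter b_r."""
    n_temper = n_epochs // 2
    ramp = np.logspace(-8.0, 0.0, n_temper)          # log10(b_r) swept from -8 to 0
    hold = np.full(n_epochs - n_temper, ramp[-1])    # then keep b_r fixed
    return np.concatenate([ramp, hold])
```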


[Figure 5: image of the two observed channels over $k = 1 \dots 200$.]

Fig. 5. Observations $x_{k,m}$.

[Figure 6: parameter traces over 2000 epochs.]

Fig. 6. Source separation, synthetic example with 250 epochs of Gibbs with tempering and 250 epochs of VB.

We have found a hybrid approach most effective: switching between Gibbs sampler and VB during the simulation cycles. In Figure 6, we show a typical simulation run where we switch between Gibbs and VB every 250 epochs.

When we switch to the Gibbs sampler, we apply a tempering schedule for 300 epochs. This enables the chain to jump between modes. Once we switch back to VB, the chain quickly converges to a “nearby” solution. In Figure 7, we show the reconstruction of the sources as obtained each time before switching to Gibbs sampling. As a careful investigation shows, each reconstruction assigns source coefficients with high amplitude to different sources.


[Figure 7: source coefficients of the reconstructions at epochs 500, 1000, 1500 and 2000, and of the original sources.]

Fig. 7. Reconstructions (from top) and original (bottom). As shown in Fig. 6, during the first switch at epoch 500, the simulation has not converged and the reconstruction is poor (top panel). Subsequent reconstructions are close to the original, with only a few source coefficients assigned incorrectly. The final reconstruction is close to the original and the mixing system is estimated accurately.

4.2 Audio source separation example

We study a linear instantaneous mixture of $N = 3$ audio sources (speech, piano, guitar) with $M = 2$ observation channels.$^9$ The signals are sampled at 8 kHz with length $T = 65356$ ($\approx 8$ s). The actual time series $\hat{x}_{t,m}$ for $t = 1 \dots T$ observed at channel $m$ are transformed by an MDCT basis (Mallat, 1998) to new series $x_{k,m}$, where the new index $k$ encodes both the frequency band and frame number jointly (a particular tile in a time-frequency plane). The MDCT was used with a sine window with a time resolution of 64 ms.

In the first experiment, we set the first row $a_1^\top$ of the mixing matrix to $a_1 = [1\ 1\ 1]^\top$. The second row $a_2^\top$ is set to $[\tan\psi_1\ \tan\psi_2\ \tan\psi_3]$ with $\psi_1 = -45^\circ$, $\psi_2 = 15^\circ$ and $\psi_3 = 75^\circ$.

Sources are reconstructed by inverse MDCT of the estimated source coefficients $s$. The reconstructions are compared to the original sources using the source separation evaluation criteria defined in appendix D and described in great detail by (Vincent et al., 2006). The SDR (Source to Distortion Ratio) provides an overall separation performance criterion, the SIR (Source to Interferences Ratio) measures the level of interference from the other sources in each source estimate, the SNR (Source to Noise Ratio) measures the error due to the additive noise on the sensors, and the SAR (Source to Artifacts Ratio) measures the level of artifacts in the source estimates. The performance criteria are reported in Table 1. We point out that the performance criteria are invariant to a change of basis, so that figures can be computed either on the time sequences or on the MDCT coefficients.

$^9$ These results are reproduced from an earlier version of this paper (Cemgil et al., 2005).

[Figure 8: parameter traces across epochs.]

Fig. 8. Source separation, synthetic example with 70 epochs of Gibbs and 430 epochs of VB. Original parameters are shown with dashed lines.

In the second experiment we have relaxed the condition that we know the first row. We set the mixing matrix such that the columns are almost linearly dependent, with $a_1 = [1\ 1\ 1]^\top$ and $a_2 = [\tan\psi_1\ \tan\psi_2\ \tan\psi_3]$ with $\psi_1 = 57^\circ$, $\psi_2 = 66^\circ$ and $\psi_3 = 75^\circ$. The inference problem becomes harder and we observe that both the Gibbs sampler and VB can get trapped in local maxima easily, and multiple restarts are necessary. To render the approach feasible it was important to find a reasonable initialisation for the mixing matrix. For this purpose, we have introduced a fast initialisation scheme which can be obtained from a simpler prior structure imposed upon the sources in the zero noise limit (see appendix E). Our approach is analogous to the initialisation of Bayesian Gaussian mixture estimation with a fast method such as k-means clustering. The initialisation and reconstruction results are shown in Figure 9.

As expected, the initialisation assigns each time-frequency atom to only one source. Perhaps surprisingly, the separation quality of the initialisation is comparable to the final reconstruction. In subsequent runs with the original model, the source assignments become less crisp; while the quality of separation is not significantly higher, the sound quality of the reconstruction is slightly better (increasing SDR and SNR in Table 2).

Table 1
Performance criteria of estimated sources with both methods in the first experiment.

                    SDR    SIR    SAR    SNR
  ŝ1  MCMC           6.3   15.0    7.3   20.4
      Variational    6.4   15.4    7.3   21.7
  ŝ2  MCMC           5.1   14.3    5.8   27.8
      Variational    5.2   15.0    5.8   24.8
  ŝ3  MCMC          16.6   23.7   17.8   29.8
      Variational   16.6   25.3   17.5   29.7

Table 2
Performance criteria of estimated sources in the second experiment.

  Initialisation    SDR    SIR    SAR    SNR
      s1             5.1   20.5    5.3   48.6
      s2             4.2   16.0    4.6   49.9
      s3            10.6   25.5   10.8   58.7
  Reconstruction    SDR    SIR    SAR    SNR
      s1             5.9   15.0    6.6   64.5
      s2             4.5   14.1    5.2   62.9
      s3            16.1   25.0   16.7   74.1

The estimated sources can be downloaded from http://www-sigproc.eng.cam.ac.uk/~atc27/papers/cemgil-dsp-bss.html, which is perhaps the best way to assess the audio quality of the results.


[Figure 9: time-frequency (frame vs. frequency) images of the three sources: originals (top row), initialisation (middle row) and reconstructions (bottom row).]

Fig. 9. The logarithm of the absolute value of the MDCT coefficients of the original signal (top row), initialisation (middle row) and reconstructions (bottom row).

5 Conclusions

We have studied two inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. While both methods are algorithmically similar, they have different characteristics. Our simulations suggest that the inference problem can be hard and the posterior may display multiple local maxima.

In many source separation scenarios, the posterior distribution tends to be multimodal, with each mode typically corresponding to an alternative interpretation of the observed data. For example, the exact posterior marginal on the mixing matrix $A$ has multiple modes (apart from the $N!$ equivalent modes corresponding to the BSS indeterminacy on permutation of sources). Each of these modes implies an alternative hypothesis about the mixing process, and typically only one of these modes corresponds to the desired/original separation; see (Chan et al., 2003; Hansen and Petersen, 2003) for illustrative examples. The implication of the multimodality is that local algorithms such as the Gibbs sampler or greedy algorithms such as VB will get trapped in a single mode, depending upon the starting configuration. In theory, the basic Gibbs sampler should be able to visit all modes, but this usually takes prohibitively long. Therefore, it seems advisable in practice to use hybrid approaches where a stochastic algorithm is used to explore the state space at a high temperature (i.e., overrelaxation), occasionally switching to a deterministic algorithm to converge quickly to a nearby mode. Alternatively, multiple independent restarts can be tried to locate a starting point in the vicinity of a mode.

Another important issue is the convergence rate. It is known that algorithms based on fixed point iterations such as VB (see Petersen et al. (2005) and references therein) suffer from slow convergence, especially in low observation noise regimes. Intuitively, low observation noise implies that the latent sources and hyperparameters tend to be strongly correlated under the posterior due to "explaining away" effects. This renders coordinate ascent methods prohibitively slow. For similar reasons, the convergence of the Gibbs sampler to the target posterior is also hampered. Overrelaxation is a standard technique that can be applied to speed up convergence (Neal, 1998). It is also possible to compute the gradient of the data likelihood and use gradient descent with adaptive step sizes (Winther and Petersen, 2006).

Typically, in extreme cases, when there are a lot of data points or only a few data points, the posterior distribution may be rendered unimodal. In the first case, the posterior is peaked around the MAP solution, and in the latter case it is diffuse and typically quite close to the prior. In both cases, the inference problem may in fact be easy, and one would not expect big qualitative differences in the solutions found by the two algorithms. This seems to be the picture in our earlier simulation studies with real data.

As we show in the appendix, the two algorithms (VB and the Gibbs sampler) differ only in a few lines of code, and with minimal coding effort either can be implemented once the other is already implemented. Hence, computation speed should be the only determining factor in choosing the inference algorithm. Moreover, the factor graph formalism enables one to mechanise the derivations and automate the code generation process. For example, the LaTeX code for the update equations in the appendix is automatically generated from the model specification by a computer program (and only manually edited to enhance visual appearance). This is particularly attractive since, for real applications, one often has to build complex hierarchical prior models and the coding/testing cycle can be cumbersome.

We conclude by noting that VB and the Gibbs sampler are only two possible choices of inference algorithm. It will be interesting to see whether sequential Monte Carlo techniques (Doucet et al., 2000; Sudderth et al., 2003; Briers et al., 2006) or alternative deterministic message passing algorithms (EP, EC) (Minka, 2005; Opper and Winther, 2005; Winther and Petersen, 2006) will be useful for large scale Bayesian source separation problems.


A  Standard Distributions in Exponential Form, their sufficient statistics and entropies

• Gamma
$$\mathcal{G}(\lambda;\, a, b) \equiv \exp\!\big( (a-1)\log\lambda - \tfrac{1}{b}\lambda - \log\Gamma(a) - a\log b \big)$$
$$\langle \lambda \rangle_{\mathcal{G}} = ab \qquad \langle \log\lambda \rangle_{\mathcal{G}} = \Psi(a) + \log b$$
$$H[\mathcal{G}] \equiv -\langle \log\mathcal{G} \rangle_{\mathcal{G}} = -(a-1)\Psi(a) + \log b + a + \log\Gamma(a)$$
Here, $\Psi$ denotes the digamma function defined as $\Psi(a) \equiv d\log\Gamma(a)/da$.

• Inverse Gamma
$$\mathcal{IG}(r;\, a, b) \equiv \exp\!\big( -(a+1)\log r - \tfrac{1}{br} - \log\Gamma(a) - a\log b \big)$$
$$\langle 1/r \rangle_{\mathcal{IG}} = ab \qquad \langle \log r \rangle_{\mathcal{IG}} = -(\Psi(a) + \log b)$$
$$H[\mathcal{IG}] \equiv -\langle \log\mathcal{IG} \rangle_{\mathcal{IG}} = -(a+1)\Psi(a) - \log b + a + \log\Gamma(a)$$

• Multivariate Gaussian
$$\mathcal{N}(x;\, \mu, P) = \exp\!\big( -\tfrac{1}{2}x^\top P^{-1}x + \mu^\top P^{-1}x - \tfrac{1}{2}\mu^\top P^{-1}\mu - \tfrac{1}{2}\log|2\pi P| \big)$$
$$\langle x \rangle_{\mathcal{N}} = \mu \qquad \langle xx^\top \rangle_{\mathcal{N}} = P + \mu\mu^\top \qquad H[\mathcal{N}] \equiv -\langle \log\mathcal{N} \rangle_{\mathcal{N}} = \tfrac{1}{2}\log|2\pi e P|$$
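These moments are easy to sanity-check numerically; the small Monte Carlo sketch below (ours) verifies the Gamma statistics $\langle \lambda \rangle = ab$ and $\langle \log\lambda \rangle = \Psi(a) + \log b$ in the shape/scale parametrisation used here, relying on SciPy's digamma.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
a, b = 2.5, 0.7
lam = rng.gamma(shape=a, scale=b, size=1_000_000)    # draws from G(lambda; a, b)

print(lam.mean(), a * b)                             # <lambda>      vs  a*b
print(np.log(lam).mean(), digamma(a) + np.log(b))    # <log lambda>  vs  Psi(a) + log(b)
```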

B Summary of the Model

B.1 Generative Model

$\lambda_n$ : scale parameter of the $n$'th source, $\lambda_n \sim \mathcal{G}(\lambda_n;\, a_\lambda, b_\lambda)$

$v_{k,n}$ : variance of the $k$'th coefficient of the $n$'th source, $v_{k,n} | \lambda_n \sim \mathcal{IG}(v_{k,n};\, \nu/2, 2/(\nu\lambda_n))$

$s_{k,n}$ : source coefficient, $s_{k,n} | v_{k,n} \sim \mathcal{N}(s_{k,n};\, 0, v_{k,n})$

$a_m$ : vector of mixing proportions for the $m$'th channel, $a_m \sim \mathcal{N}(a_m;\, \mu_a, P_a)$
