
NONNEGATIVE MATRIX FACTORIZATIONS AS PROBABILISTIC INFERENCE IN COMPOSITE MODELS

Cédric FÉVOTTE¹ and A. Taylan CEMGIL²

¹CNRS LTCI; Télécom ParisTech, 37-39, rue Dareau, 75014 Paris, France [email protected]

²Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey [email protected]

ABSTRACT

We develop an interpretation of nonnegative matrix factorization (NMF) methods based on Euclidean distance, Kullback-Leibler and Itakura-Saito divergences in a probabilistic framework. We describe how these factorizations are implicit in a well-defined statistical model of superimposed components, either Gaussian or Poisson distributed, and are equivalent to maximum likelihood estimation of either mean, variance or intensity parameters. By treating the components as hidden variables, NMF algorithms can be derived in a typical data augmentation setting. This setting can in particular accommodate regularization constraints on the matrix factors through Bayesian priors. We describe multiplicative, Expectation-Maximization, Markov chain Monte Carlo and Variational Bayes algorithms for the NMF problem. This paper describes both new and known algorithms in a unified framework and aims at providing statistical insights into NMF.

1. INTRODUCTION

Given a data matrix $V$ of dimensions $F \times N$ with nonnegative entries, NMF is the problem of finding a factorization

$$V \approx WH = \hat{V} \qquad (1)$$

where $W$ and $H$ are nonnegative matrices of dimensions $F \times K$ and $K \times N$, respectively. $K$ is usually chosen such that $FK + KN \ll FN$, hence $\hat{V}$ becomes a low-rank matrix with a reduced number of parameters. In the following, the entries of matrices $V$, $W$, $H$ and $\hat{V}$ are denoted $v_{fn}$, $w_{fk}$, $h_{kn}$ and $\hat{v}_{fn}$, respectively. We use the colon notation ":" to denote all column or row indices, so that $W = [w_{:,1}, \ldots, w_{:,K}]$ and $H = [h_{1,:}^T, \ldots, h_{K,:}^T]^T$.

NMF has been applied to diverse problems (such as pattern recognition, clustering, mining, source separation, collaborative filtering) in many areas (such as bioinformatics, audio and image processing, and finance). In the literature, the factorization (1) is usually sought through the minimization problem

$$\min_{W,H \ge 0} D(V|WH) = D(V|\hat{V}) \stackrel{\text{def}}{=} \sum_{f=1}^{F} \sum_{n=1}^{N} d(v_{fn}|\hat{v}_{fn}) \qquad (2)$$

where $d(x|y)$ is a scalar cost function. Popular choices are the squared Euclidean distance, the (generalized) Kullback-Leibler (KL) divergence, also referred to as I-divergence, and the Itakura-Saito (IS) divergence, defined as

$$d_{EUC}(x|y) = \frac{1}{2}(x - y)^2 \qquad (3)$$

$$d_{KL}(x|y) = x \log\frac{x}{y} - x + y \qquad (4)$$

$$d_{IS}(x|y) = \frac{x}{y} - \log\frac{x}{y} - 1 \qquad (5)$$

All cost functions are nonnegative and attain their single minimum, 0, at $x = y$.
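As a small illustration of the costs (3), (4), (5) and of criterion (2), the following Python sketch evaluates them on scalars and on matrices stored as nested lists. The function names are hypothetical and chosen only for this example; strictly positive arguments are assumed for the KL and IS costs.

```python
import math


def d_euc(x, y):
    """Squared Euclidean cost, Eq. (3)."""
    return 0.5 * (x - y) ** 2


def d_kl(x, y):
    """Generalized KL divergence (I-divergence), Eq. (4); assumes x, y > 0."""
    return x * math.log(x / y) - x + y


def d_is(x, y):
    """Itakura-Saito divergence, Eq. (5); assumes x, y > 0."""
    return x / y - math.log(x / y) - 1.0


def criterion(V, V_hat, d):
    """Criterion (2): sum of the scalar cost d over all matrix entries."""
    return sum(d(v, vh)
               for row, row_hat in zip(V, V_hat)
               for v, vh in zip(row, row_hat))
```

For instance, `criterion(V, V_hat, d_is)` returns the IS cost between a data matrix and its approximation.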

The interpretation of NMF as a low-rank matrix approximation in the sense of minimizing a given distance metric $d$ may be sufficient for the derivation of useful signal decomposition algorithms. Certainly, many alternative divergence criteria could also be contemplated [4, 3, 12]. However, for many applications it is not clear which distance metric to take or what the dimension of the latent matrices $W$ and $H$ should be. Such model selection questions are inherently related to the underlying statistical properties of $V$ and can be approached in a principled manner via a Bayesian treatment.

We recast NMF with the popular Euclidean, KL and IS costs from a statistical perspective. We show in Section 2 how these factorizations are underlain by a well-defined statistical model of superimposed components, either Gaussian or Poisson distributed, and are equivalent to maximum likelihood estimation of either mean, variance or intensity parameters. By treating the components as hidden variables, we derive NMF algorithms in Section 3, based on Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC) and Variational Bayes (VB). We also review standard multiplicative algorithms and elaborate on the connections between cost functions (3), (4), (5) and Bregman and $\beta$-divergences [3, 4]. Finally, we discuss in Section 4 the potential of such probabilistic interpretations of NMF. Parts of the statistical analysis and some of the algorithms presented here have already been published in the literature (see subsequent references); this paper aims at describing these related works in a unified statistical setting.

2. STATISTICAL MODELS

2.1 Observation models

The choice of a certain cost function $d(\cdot|\cdot)$ to measure the fit between $v_{fn}$ and $\hat{v}_{fn}$ implies certain statistical assumptions about how $v_{fn}$ is generated from $\hat{v}_{fn}$. It was already pointed out in various papers, e.g., [5, 2], that Euclidean, KL and IS NMF correspond to the following generative models:

$$v_{fn} \sim \mathcal{N}(v_{fn}; \hat{v}_{fn}, \sigma^2) \qquad \text{EUC-NMF} \qquad (6)$$

$$v_{fn} \sim \mathcal{P}(v_{fn}; \hat{v}_{fn}) \qquad \text{KL-NMF} \qquad (7)$$

$$v_{fn} \sim \mathcal{G}(v_{fn}; a, a/\hat{v}_{fn}) \qquad \text{IS-NMF} \qquad (8)$$

where $\mathcal{N}$, $\mathcal{P}$, $\mathcal{G}$ refer to the Gaussian, Poisson and Gamma distributions, respectively, defined in the Appendix, and where $\hat{v}_{fn}$ obeys the parametrization $\hat{v}_{fn} = \sum_k w_{fk} h_{kn}$. The likelihood of the parameters $W$ and $H$ under the latter models can be mapped to the corresponding cost function (2), so that NMF is actually equivalent to maximum likelihood estimation. In other words, EUC-NMF corresponds to additive Gaussian noise, KL-NMF to Poisson noise and IS-NMF to multiplicative Gamma noise.

As a matter of fact, all three cost functions belong to the family of regular Bregman divergences, which are in one-to-one correspondence with families of regular exponential distributions [1]. For scalars, a Bregman divergence is defined with respect to a (differentiable) convex function $\phi$ as follows (see, e.g., [1, 12])

$$d_\phi(x|y) = \phi(x) - \left(\phi(y) + \phi'(y)(x - y)\right).$$

We have the following correspondences: $d_{EUC}(x|y) \leftrightarrow \phi(y) = y^2/2$, $d_{KL}(x|y) \leftrightarrow \phi(y) = y \log y - y$, $d_{IS}(x|y) \leftrightarrow \phi(y) = -\log y$. NMF with Bregman divergences has been studied in [4], where various multiplicative algorithms are described.
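The three correspondences can be checked numerically. The sketch below plugs each generator $\phi$ and its derivative into the Bregman definition and compares the result with the closed forms (3), (4), (5); the names `bregman`, `GENERATORS`, `DIRECT` and `max_gap` are hypothetical and used only for this illustration.

```python
import math


def bregman(phi, dphi, x, y):
    """Scalar Bregman divergence generated by phi, with derivative dphi."""
    return phi(x) - (phi(y) + dphi(y) * (x - y))


# Generator phi and its derivative for each of the three costs.
GENERATORS = {
    "EUC": (lambda y: 0.5 * y * y, lambda y: y),
    "KL": (lambda y: y * math.log(y) - y, lambda y: math.log(y)),
    "IS": (lambda y: -math.log(y), lambda y: -1.0 / y),
}

# Closed forms of Eqs. (3), (4), (5).
DIRECT = {
    "EUC": lambda x, y: 0.5 * (x - y) ** 2,
    "KL": lambda x, y: x * math.log(x / y) - x + y,
    "IS": lambda x, y: x / y - math.log(x / y) - 1.0,
}


def max_gap(name, points):
    """Largest absolute gap between the Bregman form and the closed form."""
    phi, dphi = GENERATORS[name]
    return max(abs(bregman(phi, dphi, x, y) - DIRECT[name](x, y))
               for x, y in points)
```

On any set of strictly positive pairs $(x, y)$ the gap should vanish up to rounding error.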

2.2 Composite models

An interesting property of the Gaussian and Poisson distributions is that they are closed under summation: when $x = \sum_k c_k$ and the $c_k$ are independent Poisson (or Gaussian), $x$ is Poisson (or Gaussian). Conversely, any such $x$ can be decomposed as $\sum_k c_k$ without changing the underlying model. In the sequel, we will elaborate on these specific models by pointing out and exploiting their composite structure. We here introduce the following generative model

$$x_{fn} = \sum_k c_{k,fn} \qquad (9)$$

$$c_{k,fn} \sim p(c_{k,fn}|\theta_k) \qquad (10)$$

where $\theta_k = \{w_{:,k}, h_{k,:}\}$. The next paragraphs describe how Euclidean, KL and IS NMF are equivalent to ML estimation of $\theta = \{\theta_1, \ldots, \theta_K\}$ in specific cases of the latter model, with either $v_{fn} = x_{fn}$ or $v_{fn} = |x_{fn}|^2$. We denote by $C_k$ and $X$ the $F \times N$ matrices with entries $\{c_{k,fn}\}_{fn}$ and $\{x_{fn}\}_{fn}$, respectively. In the sequel we will refer to $C_k$ as a component.

NMF with the Euclidean distance (EUC-NMF) The corresponding generative model is

$$c_{k,fn} \sim \mathcal{N}(c_{k,fn}; w_{fk} h_{kn}, \sigma^2/K) \qquad (11)$$

It is easily shown that

$$-\log p(X|W, H, \sigma^2) = \frac{1}{\sigma^2} D_{EUC}(X|WH) + \frac{NF}{2} \log(2\pi\sigma^2)$$

Hence, ML estimation of $W$ and $H$ is equivalent to NMF of $V = X$ into $WH$ where the Euclidean distance is used.

There is however an interpretability ambiguity with the generative model defined by Eqs. (9), (10), (11), as it may produce negative data. As such, even though the resulting optimization problem is in the end the same provided that the available data $X$ is nonnegative, there is a semantic difference between the two points of view given by EUC-NMF and ML estimation in the Gaussian composite generative model. A more suitable approach would be to assume the components to be generated from a truncated normal distribution, but this would break the formal correspondence between the two approaches due to the necessary re-normalization of the component distributions.

NMF with the generalized KL divergence (KL-NMF) Assume the following generative model

$$c_{k,fn} \sim \mathcal{P}(c_{k,fn}; w_{fk} h_{kn}) \qquad (12)$$

It is easily shown that

$$-\log p(X|W, H) \stackrel{c}{=} D_{KL}(X|WH)$$

where $\stackrel{c}{=}$ denotes equality up to a constant. Hence, ML estimation of $W$ and $H$ is equivalent to NMF of $V = X$ into $WH$ where the KL divergence is used. The data $X$ produced by the generative model defined by Eqs. (9), (10), (12) is nonnegative, but there is still an interpretability ambiguity with real-valued data, as the Poisson process produces integers.

NMF with the IS divergence (IS-NMF) Assume the following generative model

$$c_{k,fn} \sim \mathcal{N}_c(c_{k,fn}; 0, w_{fk} h_{kn})$$

The data $X$ generated from this model is complex (but we could also assume a real Gaussian pdf instead of a complex one).

It is easily shown that [5]

$$-\log p(X|W, H) \stackrel{c}{=} D_{IS}(|X|^{.2}\,|\,WH),$$

where $|X|^{.2}$ is the matrix with entries $|x_{fn}|^2$. Hence, ML estimation of $W$ and $H$ is equivalent to NMF of $V = |X|^{.2}$ into $WH$ where the IS divergence is used. This also corresponds to $a = 1$, i.e., exponential multiplicative noise in Eq. (8).
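The three equivalences above can be checked entry by entry: for a fixed observation, the negative log-likelihood and the corresponding cost should differ by a term that does not depend on the model parameter $\hat{v}_{fn}$. The sketch below is an illustration under the densities of the Appendix (with the sum $x_{fn}$ of the components distributed according to closure under summation); all function names are hypothetical.

```python
import math


def gap_euc(x, v_hat, sigma2):
    """Gaussian NLL of x ~ N(v_hat, sigma2) minus d_EUC(x|v_hat) / sigma2."""
    nll = 0.5 * (x - v_hat) ** 2 / sigma2 + 0.5 * math.log(2 * math.pi * sigma2)
    return nll - 0.5 * (x - v_hat) ** 2 / sigma2


def gap_kl(x, v_hat):
    """Poisson NLL of an integer x >= 1 with intensity v_hat, minus d_KL(x|v_hat)."""
    nll = v_hat - x * math.log(v_hat) + math.lgamma(x + 1)
    return nll - (x * math.log(x / v_hat) - x + v_hat)


def gap_is(x, v_hat):
    """Complex Gaussian NLL of x ~ N_c(0, v_hat) minus d_IS(|x|^2 | v_hat)."""
    power = abs(x) ** 2
    nll = math.log(math.pi * v_hat) + power / v_hat
    return nll - (power / v_hat - math.log(power / v_hat) - 1.0)
```

For a fixed observation, each gap should take the same value whatever `v_hat` is, which is what "equality up to a constant" means here.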

3. ALGORITHMS

3.1 Multiplicative algorithms

The multiplicative gradient descent approach taken in [8, 3] is akin to updating each parameter by multiplying its value at the previous iteration by the ratio of the negative and positive parts of the derivative of the criterion w.r.t. this parameter, namely $\theta \leftarrow \theta \,.\, [\nabla f(\theta)]_- / [\nabla f(\theta)]_+$, where $\nabla f(\theta) = [\nabla f(\theta)]_+ - [\nabla f(\theta)]_-$ and the summands are both nonnegative. This ensures nonnegativity of the parameter updates, provided initialization with a nonnegative value. A fixed point $\theta^\star$ of the algorithm implies either $\nabla f(\theta^\star) = 0$ or $\theta^\star = 0$. This leads to the following updates,

$$H \leftarrow H \,.\, \frac{W^T\left((WH)^{.[\beta-2]} \,.\, X\right)}{W^T (WH)^{.[\beta-1]}} \qquad (13)$$

$$W \leftarrow W \,.\, \frac{\left((WH)^{.[\beta-2]} \,.\, X\right) H^T}{(WH)^{.[\beta-1]} H^T} \qquad (14)$$

where $\beta = 2$ corresponds to EUC-NMF, $\beta = 1$ to KL-NMF and $\beta = 0$ to IS-NMF, and ‘.’ and ‘./.’ denote entrywise operations (the fraction bars in (13) and (14) are entrywise divisions). Other values of $\beta$ correspond to performing NMF with the $\beta$-divergence $d_\beta(x|y)$ [3, 5], which is actually the Bregman divergence corresponding to $\phi(y) = \frac{1}{\beta(\beta-1)} y^\beta$, for $\beta \notin \{0, 1\}$, and which takes the KL and IS costs as limiting cases when $\beta$ goes to 1 and 0, respectively.
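Updates (13) and (14) translate almost literally into array code. The following NumPy sketch is a minimal illustration, not a tuned implementation: the function names, the random initialization, the fixed number of iterations and the absence of any safeguard against zero entries are all assumptions made for this example. The data matrix is assumed strictly positive.

```python
import numpy as np


def beta_divergence(X, V_hat, beta):
    """Criterion (2) for the beta-divergence; beta = 2, 1, 0 give EUC, KL, IS."""
    if beta == 1:
        return np.sum(X * np.log(X / V_hat) - X + V_hat)
    if beta == 0:
        return np.sum(X / V_hat - np.log(X / V_hat) - 1.0)
    num = X ** beta + (beta - 1) * V_hat ** beta - beta * X * V_hat ** (beta - 1)
    return np.sum(num) / (beta * (beta - 1))


def multiplicative_nmf(X, K, beta, n_iter=100, seed=0):
    """Alternate updates (13) and (14) from a random positive initialization."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    W = rng.random((F, K)) + 0.1
    H = rng.random((K, N)) + 0.1
    costs = []
    for _ in range(n_iter):
        V_hat = W @ H
        H *= (W.T @ (V_hat ** (beta - 2) * X)) / (W.T @ V_hat ** (beta - 1))  # Eq. (13)
        V_hat = W @ H
        W *= ((V_hat ** (beta - 2) * X) @ H.T) / (V_hat ** (beta - 1) @ H.T)  # Eq. (14)
        costs.append(beta_divergence(X, W @ H, beta))
    return W, H, costs
```

The returned `costs` list makes it easy to inspect the behaviour of criterion (2) along the iterations for a given value of `beta`.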

Lee & Seung [8] showed that criterion (2) is nonincreasing under the latter updates for $\beta = 2$ (Euclidean distance) and $\beta = 1$ (KL divergence), and the proof was extended by Kompass [6] to values $1 \le \beta \le 2$, i.e., where $d_\beta(x|y)$ is convex w.r.t. $y$. Solving the simpler problem $v_{:,n} \approx W h_{:,n}$ with $W$ fixed, the proof is simply based on the construction of the functional

$$G(h_{:,n}, \tilde{h}_{:,n}) = \sum_{f,k} \lambda_{kfn}\, d\!\left(v_{fn} \,\Big|\, \frac{w_{fk} h_{kn}}{\lambda_{kfn}}\right) \quad \text{with} \quad \lambda_{kfn} = \frac{w_{fk} \tilde{h}_{kn}}{[W\tilde{h}]_{fn}}$$

which is easily shown to be a suitable auxiliary function for $C(h) = D(v|Wh)$ (i.e., $G(h, h) = C(h)$ and $G(h, \tilde{h}) \ge C(h)$) by convexity of $d(x|y)$ and using Jensen's inequality. A similar auxiliary function can be built to solve $v_{f,:}^T \approx H^T w_{f,:}^T$ with $H$ fixed.
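The two defining properties of the auxiliary function can be verified numerically for a single column. The sketch below does so for the KL cost, which is convex in its second argument; the function names are hypothetical and all quantities are assumed strictly positive.

```python
import numpy as np


def d_kl(x, y):
    """Entrywise generalized KL divergence, Eq. (4)."""
    return x * np.log(x / y) - x + y


def cost(v, W, h):
    """C(h) = D(v | W h) for one data column v."""
    return np.sum(d_kl(v, W @ h))


def auxiliary(v, W, h, h_tilde):
    """G(h, h_tilde) with weights lambda_kf = w_fk h_tilde_k / [W h_tilde]_f."""
    lam = W * h_tilde / (W @ h_tilde)[:, None]  # shape (F, K), rows sum to one
    return np.sum(lam * d_kl(v[:, None], W * h / lam))
```

With random positive `v`, `W`, `h` and `h_tilde`, one should observe `auxiliary(v, W, h, h)` equal to `cost(v, W, h)` up to rounding, and `auxiliary(v, W, h, h_tilde)` no smaller than `cost(v, W, h)`.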

However, the criterion was observed by many authors [4, 3, 5] to be still nonincreasing under updates (13) and (14) for values of $\beta$ outside the interval $[1, 2]$ (and in particular for $\beta = 0$, corresponding to the IS divergence), but no proof is available.

Though popularized by Lee & Seung for NMF within the machine learning community in the last decade, the multiplicative updates for each factor in Euclidean and KL NMF correspond to well-known algorithms for image restoration in the inverse problems community, see [7] and references therein.

3.2 EM algorithms

In Section 2 we have shown how EUC, KL and IS-NMF underlie statistical composite models. The components act as latent variables and may be used as complete data in the EM algorithm. In this setting the following functional, the negative expected complete-data log-likelihood, has to be minimized iteratively

$$Q(\theta|\theta') \stackrel{\text{def}}{=} -\int_C \log p(C|\theta)\, p(C|X, \theta')\, dC,$$

where $\theta = \{W, H\}$ and $C$ is the tensor with slices $C_k$ and elements $c_{k,fn}$. The convergence of this algorithm to a stationary point is guaranteed. Using conditional independence

$$p(C|\theta) = \prod_k p(C_k|\theta_k)$$

the EM functional can be written

$$Q(\theta|\theta') = \sum_k Q_k(\theta_k|\theta'),$$

$$Q_k(\theta_k|\theta') \stackrel{\text{def}}{=} -\int_{C_k} \log p(C_k|\theta_k)\, p(C_k|X, \theta')\, dC_k. \qquad (15)$$

Under suitable i.i.d. assumptions the functional is further reduced to

$$Q_k(\theta_k|\theta') = -\sum_{f,n} \int_{c_{k,fn}} \log p(c_{k,fn}|\theta_k)\, p(c_{k,fn}|x_{fn}, \theta')\, dc_{k,fn}. \qquad (16)$$

We now make the EM algorithm explicit in the specific cases of Euclidean, KL and IS NMF. Note that in the following we are not able to minimize $Q_k(w_{:,k}, h_{k,:}|\theta')$ jointly in $w_{:,k}$ and $h_{k,:}$, but only to perform coordinate descent, i.e., produce $w^{(i+1)}_{:,k}$ and $h^{(i+1)}_{k,:}$ such that $Q_k(w^{(i+1)}_{:,k}, h^{(i+1)}_{k,:}|\theta^{(i)}) \le Q_k(w^{(i)}_{:,k}, h^{(i+1)}_{k,:}|\theta^{(i)}) \le Q_k(w^{(i)}_{:,k}, h^{(i)}_{k,:}|\theta^{(i)})$, which leads, strictly speaking, to a (converging) generalized EM (GEM) algorithm instead of pure EM. In the following, the prime symbol $'$ will refer to parameter values at the previous iteration $(i)$.

3.2.1 EUC-NMF

$$-\log p(c_{k,fn}|\theta_k) \stackrel{c}{=} \frac{1}{2\sigma^2}(c_{k,fn} - w_{fk} h_{kn})^2$$

$$p(c_{k,fn}|x_{fn}, \theta) = \mathcal{N}(c_{k,fn}|\mu^{post}_{k,fn}, \lambda^{post}_{k,fn})$$

with

$$\mu^{post}_{k,fn} = w_{fk} h_{kn} + \frac{1}{K}(x_{fn} - \hat{x}_{fn}), \qquad \lambda^{post}_{k,fn} = \frac{K-1}{K^2}\sigma^2 \qquad (17)$$

where here $\hat{x}_{fn} = \hat{v}_{fn} = \sum_k w_{fk} h_{kn}$. Hence, the minimization of functional (16) subject to nonnegativity constraints leads to

$$h_{kn} = \left\lfloor \frac{\sum_f w_{fk}\left[\frac{1}{K}(x_{fn} - \hat{x}'_{fn}) + w'_{fk} h'_{kn}\right]}{\sum_f w_{fk}^2} \right\rfloor_+ \qquad (18)$$

$$w_{fk} = \left\lfloor \frac{\sum_n h_{kn}\left[\frac{1}{K}(x_{fn} - \hat{x}'_{fn}) + w'_{fk} h'_{kn}\right]}{\sum_n h_{kn}^2} \right\rfloor_+ \qquad (19)$$

where $\lfloor x \rfloor_+ = \max\{x, 0\}$. These update equations differ from the usual multiplicative updates given by Eqs. (13) and (14).
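One sweep of updates (18) and (19) can be sketched as follows. This is an illustrative reading of the equations, assuming that each $h_{k,:}$ is updated with the current $w_{:,k}$ and that $w_{:,k}$ is then updated with the new $h_{k,:}$ (the coordinate descent described above); the names `em_euc_step` and `euc_cost` and the small floor `eps` on the denominators are assumptions made for this example.

```python
import numpy as np


def euc_cost(X, W, H):
    """Euclidean criterion D_EUC(X | WH)."""
    return 0.5 * np.sum((X - W @ H) ** 2)


def em_euc_step(X, W, H, eps=1e-12):
    """One GEM sweep over all components for EUC-NMF, Eqs. (17)-(19)."""
    K = W.shape[1]
    R = (X - W @ H) / K  # residual share (x_fn - x_hat'_fn) / K
    W_new, H_new = W.copy(), H.copy()
    for k in range(K):
        mu = np.outer(W[:, k], H[k]) + R  # posterior mean of C_k, Eq. (17)
        w = W[:, k]
        H_new[k] = np.maximum(w @ mu / max(w @ w, eps), 0.0)  # Eq. (18)
        h = H_new[k]
        W_new[:, k] = np.maximum(mu @ h / max(h @ h, eps), 0.0)  # Eq. (19)
    return W_new, H_new
```

Repeated calls such as `W, H = em_euc_step(X, W, H)` can be monitored with `euc_cost`.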

3.2.2 KL-NMF

$$-\log p(c_{k,fn}|\theta_k) \stackrel{c}{=} w_{fk} h_{kn} - c_{k,fn} \log(w_{fk} h_{kn})$$

$$p(c_{k,fn}|x_{fn}, \theta) = \mathcal{B}(c_{k,fn}|v_{fn}, \pi_{k,fn})$$

where $\pi_{k,fn} = w_{fk} h_{kn} / \hat{x}_{fn}$ and here $\hat{x}_{fn} = \hat{v}_{fn} = \sum_k w_{fk} h_{kn}$. This leads to

$$h_{kn} = h'_{kn}\, \frac{\sum_f w'_{fk}\, \frac{x_{fn}}{\hat{x}'_{fn}}}{\sum_f w_{fk}}, \qquad w_{fk} = w'_{fk}\, \frac{\sum_n h'_{kn}\, \frac{x_{fn}}{\hat{x}'_{fn}}}{\sum_n h_{kn}} \qquad (20)$$

which coincides with the usual multiplicative updates given by Eqs. (13) and (14).
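The coincidence with the multiplicative update can be illustrated numerically for $H$. The sketch below computes the posterior means of the components (E-step), applies the M-step with $W$ held at its current value, and compares the result with update (13) for $\beta = 1$. Function names are hypothetical, and strictly positive `X`, `W`, `H` are assumed.

```python
import numpy as np


def em_kl_update_H(X, W, H):
    """EM update of H for KL-NMF: posterior means of c_k,fn, then M-step."""
    V_hat = W @ H
    # E-step: <c_k,fn> = x_fn * pi_k,fn with pi_k,fn = w_fk h_kn / x_hat_fn.
    C_mean = X[None, :, :] * (W.T[:, :, None] * H[:, None, :]) / V_hat[None, :, :]
    # M-step: h_kn = sum_f <c_k,fn> / sum_f w_fk.
    return C_mean, C_mean.sum(axis=1) / W.sum(axis=0)[:, None]


def multiplicative_kl_update_H(X, W, H):
    """Multiplicative update (13) with beta = 1."""
    return H * (W.T @ (X / (W @ H))) / (W.T @ np.ones_like(X))
```

The posterior means also sum over $k$ to the data, which reflects the composite structure (9).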

3.2.3 IS-NMF

$$-\log p(c_{k,fn}|\theta_k) \stackrel{c}{=} \log(w_{fk} h_{kn}) + \frac{|c_{k,fn}|^2}{w_{fk} h_{kn}}$$

$$p(c_{k,fn}|x_{fn}, \theta) = \mathcal{N}(c_{k,fn}|\mu^{post}_{k,fn}, \lambda^{post}_{k,fn})$$

with

$$\mu^{post}_{k,fn} = \frac{w_{fk} h_{kn}}{\sum_l w_{fl} h_{ln}}\, x_{fn}, \qquad \lambda^{post}_{k,fn} = \frac{w_{fk} h_{kn}}{\sum_l w_{fl} h_{ln}} \sum_{l \ne k} w_{fl} h_{ln}, \qquad (21)$$

leading to

$$h_{kn} = \frac{1}{F} \sum_f \frac{v'_{k,fn}}{w_{fk}}, \qquad w_{fk} = \frac{1}{N} \sum_n \frac{v'_{k,fn}}{h_{kn}}, \qquad (22)$$

with $v'_{k,fn} = |\mu^{post\,\prime}_{k,fn}|^2 + \lambda^{post\,\prime}_{k,fn}$. These update equations differ from the multiplicative updates given by Eqs. (13) and (14), and are equivalent to the SAGE algorithm described in [5].
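A sweep of updates (22) only needs the power matrix $V = |X|^{.2}$, since $|\mu^{post}_{k,fn}|^2$ depends on $x_{fn}$ through $|x_{fn}|^2$. The sketch below is an illustrative reading of Eqs. (21) and (22), assuming that all posterior statistics are computed from the previous parameters and that $w_{:,k}$ is updated with the freshly updated $h_{k,:}$; function names are hypothetical and $V$ is assumed strictly positive.

```python
import numpy as np


def is_cost(V, W, H):
    """Itakura-Saito criterion D_IS(V | WH)."""
    R = V / (W @ H)
    return np.sum(R - np.log(R) - 1.0)


def em_is_step(V, W, H):
    """One GEM sweep over all components for IS-NMF, Eqs. (21)-(22)."""
    V_hat = W @ H
    W_new, H_new = W.copy(), H.copy()
    for k in range(W.shape[1]):
        P = np.outer(W[:, k], H[k])  # prior variances w_fk h_kn
        G = P / V_hat  # gain w_fk h_kn / sum_l w_fl h_ln
        V_k = G ** 2 * V + G * (V_hat - P)  # |mu_post|^2 + lambda_post
        H_new[k] = np.mean(V_k / W[:, k][:, None], axis=0)  # Eq. (22), h_kn
        W_new[:, k] = np.mean(V_k / H_new[k][None, :], axis=1)  # Eq. (22), w_fk
    return W_new, H_new
```

As with the other sketches, `is_cost` can be tracked across repeated calls to `em_is_step`.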

3.2.4 Bayesian maximum a posteriori

It is interesting to note that the EM framework readily accommodates Bayesian approaches for which prior information about the parameters $W$ and $H$ is available in the form of prior distributions $p(H)$ and $p(W)$. The complete data likelihood term $-\log p(C_k|\theta_k)$ need only be replaced by $-\log p(\theta_k|C_k)$ in Eq. (15), leading to the following functional to be minimized

$$Q^{MAP}_k(\theta_k|\theta') = Q_k(\theta_k|\theta') - \log p(w_{:,k}) - \log p(h_{k,:})$$

so that only the M-step is changed.

3.3 MCMC algorithms

Monte Carlo methods [9] are powerful computational techniques to estimate expectations of the form

$$E = \langle \psi(\theta) \rangle_{p(\theta)} \approx \frac{1}{L} \sum_{i=1}^{L} \psi(\theta^{(i)}) = \tilde{E}_L$$

where $\theta^{(i)}$ are samples drawn from $p(\theta)$. Under mild conditions on the test function $\psi$, the estimate $\tilde{E}_L$ converges to the true expectation for $L \to \infty$. The difficulty here is obtaining independent samples $\{\theta^{(i)}\}_{i=1 \ldots L}$ from complicated distributions. MCMC techniques generate subsequent samples from a Markov chain. One particularly convenient and simple procedure is the Gibbs sampler, where one samples each block of variables from its full conditional distribution. In the Bayesian setting for the NMF model, a possible Gibbs sampler is

$C^{(i)} \sim p(C|W^{(i-1)}, H^{(i-1)}, X)$
for $k = 1 : K$ do
    $h^{(i)}_{k,:} \sim p(h_{k,:}|C^{(i)}_k, w^{(i-1)}_{:,k})$
    $w^{(i)}_{:,k} \sim p(w_{:,k}|C^{(i)}_k, h^{(i)}_{k,:})$
end for

Denoting $c_{fn} = [c_{1,fn}, \ldots, c_{K,fn}]^T$, the posterior of the hidden components reads

$$p(C|W, H, X) = \prod_{f,n} p(c_{fn}|w_{f,:}, h_{:,n}, x_{fn})$$

Next, we derive the full conditionals for the three considered models.

3.3.1 EUC-NMF

The posterior of $c_{fn}$ is given by

$$p(c_{fn}|w_{f,:}, h_{:,n}, x_{fn}) = \mathcal{N}(c_{fn}|\mu^{post}_{fn}, \Sigma^{post}_{fn})$$

with $\mu^{post}_{fn} = [\mu^{post}_{1,fn} \ldots \mu^{post}_{K,fn}]^T$, where $\mu^{post}_{k,fn}$ is defined in Eq. (17), and $\Sigma^{post}_{fn} = \frac{\sigma^2}{K}\left(I_K - \frac{1}{K} e_K e_K^T\right)$. The diagonal terms correspond to the posterior variance in Eq. (17). In the unconstrained case, conjugate priors for $h_{:,n}$ and $w_{f,:}$ would be Gaussian. However, more sophisticated sampling schemes are required to enforce nonnegativity, typically by using Gamma priors, see, e.g., [10, 11].

3.3.2 KL-NMF

The full conditional of $c_{fn}$ is given by

$$p(c_{fn}|w_{f,:}, h_{:,n}, x_{fn}) = \mathcal{M}(c_{fn}|x_{fn}, \pi_{fn})$$

where $\mathcal{M}$ refers to the multinomial distribution defined in the Appendix and $\pi_{fn} = [\pi_{1,fn}, \ldots, \pi_{K,fn}]$, with $\pi_{k,fn} = w_{fk} h_{kn} / \hat{x}_{fn}$, as defined in Section 3.2.2. Using conjugate priors $p(w_{fk}) = \mathcal{G}(w_{fk}|\alpha_w, \beta_w)$, $p(h_{kn}) = \mathcal{G}(h_{kn}|\alpha_h, \beta_h)$, the full conditionals can be derived as [2]

$$p(w_{fk}|C_k, h_{k,:}) = \mathcal{G}\Big(w_{fk} \,\Big|\, \alpha_w + \sum_n c_{k,fn},\; \alpha_w \beta_w + \sum_n h_{kn}\Big)$$

$$p(h_{kn}|C_k, w_{:,k}) = \mathcal{G}\Big(h_{kn} \,\Big|\, \alpha_h + \sum_f c_{k,fn},\; \alpha_h \beta_h + \sum_f w_{fk}\Big)$$
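The Gibbs sampler for the KL case can be sketched in a few lines. This is a minimal illustration, assuming integer-valued data and a Gamma parametrized by shape and rate as in the Appendix; the prior contributions to the shape and to the rate of each conditional are passed as plain numbers under the hypothetical names `shape_w`, `rate_w`, `shape_h`, `rate_h`, so that the sketch does not commit to a particular prior parametrization. The function name and the initialization from the prior are also assumptions.

```python
import numpy as np


def gibbs_kl_nmf(X, K, n_iter=20, shape_w=1.0, rate_w=1.0,
                 shape_h=1.0, rate_h=1.0, seed=0):
    """Gibbs sampler sketch for the Poisson composite model (integer X)."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    W = rng.gamma(shape_w, 1.0 / rate_w, size=(F, K))
    H = rng.gamma(shape_h, 1.0 / rate_h, size=(K, N))
    C = np.zeros((K, F, N), dtype=np.int64)
    for _ in range(n_iter):
        # Components: multinomial split of each count x_fn with weights pi_fn.
        for f in range(F):
            for n in range(N):
                p = W[f, :] * H[:, n]
                C[:, f, n] = rng.multinomial(X[f, n], p / p.sum())
        # Factors: Gamma full conditionals (NumPy uses scale = 1 / rate).
        for k in range(K):
            H[k] = rng.gamma(shape_h + C[k].sum(axis=0),
                             1.0 / (rate_h + W[:, k].sum()))
            W[:, k] = rng.gamma(shape_w + C[k].sum(axis=1),
                                1.0 / (rate_w + H[k].sum()))
    return W, H, C
```

By construction, the sampled components always sum to the data, `C.sum(axis=0) == X`, at every iteration.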

3.3.3 IS-NMF

Denoting $\lambda_{fn} = [w_{f1} h_{1n} \ldots w_{fK} h_{Kn}]^T$, the posterior of $c_{fn}$ is given by

$$p(c_{fn}|w_{f,:}, h_{:,n}, x_{fn}) = \mathcal{N}(c_{fn}|\mu^{post}_{fn}, \Sigma^{post}_{fn})$$

with $\mu^{post}_{fn} = [\mu^{post}_{1,fn} \ldots \mu^{post}_{K,fn}]^T$, where $\mu^{post}_{k,fn}$ is defined in Eq. (21), and $\Sigma^{post}_{fn} = \mathrm{diag}(\lambda_{fn}) - \frac{1}{\hat{v}_{fn}} \lambda_{fn} \lambda_{fn}^T$. The diagonal terms correspond to the posterior variance in Eq. (21).

Using conjugate inverse-Gamma priors $p(h_{kn}) = \mathcal{IG}(h_{kn}|\alpha_h, \beta_h)$, $p(w_{fk}) = \mathcal{IG}(w_{fk}|\alpha_w, \beta_w)$, the full conditionals of $h_{k,:}$ and $w_{:,k}$ read

$$p(w_{fk}|C_k, h_{k,:}) = \mathcal{IG}\Big(w_{fk} \,\Big|\, \alpha_w + N,\; \beta_w + \sum_n |c_{k,fn}|^2 / h_{kn}\Big)$$

$$p(h_{kn}|C_k, w_{:,k}) = \mathcal{IG}\Big(h_{kn} \,\Big|\, \alpha_h + F,\; \beta_h + \sum_f |c_{k,fn}|^2 / w_{fk}\Big)$$

3.4 Variational Bayes

We finally describe how the composite structure of Euclidean, KL and IS NMF can be exploited to derive a variational Bayes algorithm [13]. The idea is to bound the marginal likelihood from below

$$\mathcal{L}_X(\vartheta) \equiv \log p(X|\vartheta) \ge \mathcal{B}_{VB}[q] \equiv \int q \log \frac{p(X, C, W, H|\vartheta)}{q}\, d(C, W, H) = \langle \log p(X, C, H, W|\vartheta) \rangle_q + H[q]$$

where $\vartheta$ denotes the hyperparameters and $q$ is defined as

$$q = \Big(\prod_{f,n} q(c_{fn})\Big) \Big(\prod_{f,k} q(w_{fk})\Big) \Big(\prod_{k,n} q(h_{kn})\Big) \equiv \prod_{\alpha \in \mathcal{C}} q_\alpha$$

The integral over $C$ will be a summation when the components are discrete (i.e., Poisson components in the KL case). Here, $\alpha \in \mathcal{C} = \{C, W, H\}$ denotes the set of disjoint clusters of variables.

A local optimum can be attained by the following fixed point iteration:

$$q^{(i+1)}_\alpha \propto \exp\Big(\langle \log p(X, C, W, H|\vartheta) \rangle_{q^{(i)}_{\neg\alpha}}\Big)$$

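where $q_{\neg\alpha} = q/q_\alpha$. The expectations $\langle \log p(X, C, W, H|\vartheta) \rangle$ are functions of the sufficient statistics of $q$. It turns out that the variational update equations have very similar forms to the full conditionals derived for the Gibbs sampler. Here, due to lack of space, we only give the equations for the KL case:

$$q(c_{fn}) = \mathcal{M}(c_{fn}|x_{fn}, \pi_{fn})$$

where $\pi_{fn} = [\pi_{1,fn}, \ldots, \pi_{K,fn}]$ and $\langle c_{k,fn} \rangle = x_{fn} \pi_{k,fn}$ with

$$\pi_{k,fn} \equiv \frac{\exp(\langle \log w_{fk} \rangle + \langle \log h_{kn} \rangle)}{\sum_l \exp(\langle \log w_{fl} \rangle + \langle \log h_{ln} \rangle)}$$

The factors for $W$ and $H$ can be derived as [2]

$$q(w_{fk}) = \mathcal{G}\Big(w_{fk} \,\Big|\, \alpha_w + \sum_n \langle c_{k,fn} \rangle,\; \alpha_w \beta_w + \sum_n \langle h_{kn} \rangle\Big)$$

$$q(h_{kn}) = \mathcal{G}\Big(h_{kn} \,\Big|\, \alpha_h + \sum_f \langle c_{k,fn} \rangle,\; \alpha_h \beta_h + \sum_f \langle w_{fk} \rangle\Big)$$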
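The fixed point iteration for the KL case only requires a few expectations under the Gamma and multinomial factors. As a reminder, and assuming the shape and rate parametrization of the Gamma density given in the Appendix, these standard identities are as follows, where $\psi$ denotes the digamma function:

```latex
u \sim \mathcal{G}(u \mid \alpha, \beta)
\;\Longrightarrow\;
\langle u \rangle = \frac{\alpha}{\beta},
\qquad
\langle \log u \rangle = \psi(\alpha) - \log \beta ,
\\[4pt]
c \sim \mathcal{M}(c \mid n, p)
\;\Longrightarrow\;
\langle c_k \rangle = n\, p_k .
```

Applied to the current Gamma factors of $w_{fk}$ and $h_{kn}$, the first line gives the expectations entering the Gamma rates and the multinomial weights; the second line gives the expected component counts entering the Gamma shapes.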

One attractive feature of VB is that the hyperparameters can be optimized by maximizing the variational bound $\mathcal{B}_{VB}[q]$.

While this is not guaranteed to increase the true marginal likelihood, it leads in this application to algorithms that enable one to do full Bayesian model selection much faster than MCMC-based sampling approaches, where calculation of the marginal likelihood is trickier. For a detailed discussion see [2].

4. DISCUSSION

In this overview paper, we have discussed the probabilistic interpretation of various NMF models in maximum likelihood, MAP and full Bayesian settings. In all the algorithms we discuss, we exploit the closure under summation property of the observation model and the closed-form availability of all the full conditionals. It should be noted that this is not the case for all divergence measures. In other cases, other optimization techniques need to be employed.

Prior structures are needed to control the decompositions for exploratory data analysis or various problems in signal processing. There is an emphasis on optimization strategies for maximum likelihood or MAP estimation in NMF models but less research on efficient Bayesian integration methods (with a few exceptions such as [14, 2, 11]). Moreover, as the number of alternatives for data modelling increases (for example, consider the number of factorization options with increasing data dimension in tensor factorization), there is a need to do model order selection and model averaging in a principled manner, for which ML approaches are known to be inappropriate. Due to lack of space, we do not give simulation results with the developed algorithms in this paper but refer the reader to other work, such as [2, 5]. A detailed and exhaustive comparison of the algorithms in terms of effectiveness for various signal decomposition tasks is a natural next step and is currently in progress.

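A. STANDARD DISTRIBUTIONS

Multivariate Gaussian, with $c = 1/2$ or $1$ (real/complex case)

$$\mathcal{N}(x|\mu, \Sigma) = |\pi\Sigma/c|^{-c} \exp\left(-c\,(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

Poisson

$$\mathcal{P}(x|\lambda) = \exp(-\lambda)\, \frac{\lambda^x}{x!}$$

Binomial

$$\mathcal{B}(x|n, p) = \binom{n}{x}\, p^x (1 - p)^{n-x}$$

Multinomial

$$\mathcal{M}(c|n, p) = \binom{n}{c_1\, c_2 \ldots c_K}\, p_1^{c_1} p_2^{c_2} \cdots p_K^{c_K}\, \delta\Big(n - \sum_k c_k\Big)$$

Gamma

$$\mathcal{G}(u|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, u^{\alpha-1} \exp(-\beta u), \quad u \ge 0$$

inv.-Gamma

$$\mathcal{IG}(u|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, u^{-(\alpha+1)} \exp\Big(-\frac{\beta}{u}\Big), \quad u \ge 0$$

Acknowledgements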

We wish to thank the reviewers for many very helpful comments and suggestions, as well as O. Cappé for discussions related to this work.

REFERENCES

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[2] A. T. Cemgil. Bayesian inference in non-negative matrix factorisation models. Technical Report CUED/F-INFENG/TR.609, University of Cambridge, July 2008. Accepted for publication in Computational Intelligence and Neuroscience.

[3] A. Cichocki, R. Zdunek, and S. Amari. Csiszar’s divergences for nonnegative matrix factorization: Family of new algorithms. In Proc. 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA’06), pages 32–39, Charleston SC, USA, Mar. 2006.

[4] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. Advances in Neural Information Processing Systems (NIPS), 19, 2005.

[5] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3), Mar. 2009.

[6] R. Kompass. A generalized divergence measure for nonnegative matrix factorization. Neural Computation, 19(3):780–791, 2007.

[7] H. Lantéri, M. Roche, O. Cuevas, and C. Aime. A general method to devise maximum-likelihood signal restoration multiplicative algorithms with non-negativity constraints. Signal Processing, 81(5):945–974, May 2001.

[8] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2001.

[9] J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2002.

[10] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret. Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling. IEEE Trans. on Signal Processing, 54(11):4133–4145, Nov. 2006.

[11] M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Proc. 8th International Conference on Independent Component Analysis and Signal Separation (ICA’09), Paraty, Brazil, Mar. 2009.

[12] A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2008), Part II, number 5212 in LNAI, pages 358–373. Springer, 2008.

[13] M. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.

[14] O. Winther and K. B. Petersen. Bayesian independent component analysis: Variational methods and non-negative decompositions. Digital Signal Processing, 17(5):858–872, Sep. 2007.