
Algorithms for probabilistic latent tensor factorization

Y. Kenan Yılmaz*, A. Taylan Cemgil

Department of Computer Engineering, Boğaziçi University, 34342 Istanbul, Turkey

* Corresponding author. E-mail addresses: kenan@sibnet.com.tr (Y.K. Yılmaz), taylan.cemgil@boun.edu.tr (A.T. Cemgil).

Article info

Article history:
Received 1 March 2011
Received in revised form 13 July 2011
Accepted 16 September 2011

Keywords:
Tensor factorization
β-divergence
Exponential dispersion models
EM algorithm
Multiplicative update rules
Matricization

Abstract

We propose a general probabilistic framework for modelling multiway data. Our approach establishes a novel link between graphical representation of probability measures and tensor factorization models that allows us to design arbitrary tensor factorization models while retaining simplicity. Using an expectation-maximization (EM) approach for maximizing the likelihood of the exponential dispersion models (EDM), we obtain iterative update equations for Kullback–Leibler (KL), Euclidean (EU) or Itakura–Saito (IS) costs as special cases. Besides EM, we derive alternative algorithms with multiplicative update rules (MUR) and alternating projections. We also provide algorithms for MAP estimation with conjugate priors. All of the algorithms can be formulated as message passing algorithms on a graph where vertices correspond to indices and cliques represent factors of the tensor decomposition.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Advances in computing power, data acquisition and storage technologies have made it possible to collect and process huge amounts of data in many disciplines. In order to extract useful information, effective and efficient computational tools are needed. In this context, matrix factorization techniques have emerged as a useful paradigm [20,27].

Clustering [5], independent component analysis (ICA) [9,13], non-negative matrix factorization (NMF) [23,20], latent semantic indexing (LSI) [10], collaborative filtering [19], topic models [29] and many related methods can be expressed and understood as matrix factorization problems. Thinking of a matrix as the basic data structure maps well onto special purpose hardware (such as a graphics processing unit, GPU) to make algorithms run faster via parallelization. Moreover, matrix computations come with a toolbox of well understood algorithms with precise error analysis, performance guarantees and extremely efficient standard implementations (e.g., SVD).

A useful method in multiway analysis is tensor factorization (TF), used to extract hidden structure in data that consists of more than two entities [18]. However, since there are many more natural ways to factorize a multiway array, there exists a plethora of related models with distinct names, such as canonical decomposition (CP), PARAFAC and TUCKER, to name a few. These are discussed in detail in recent excellent tutorial reviews [18,1]. A recent book [7] outlines various optimization algorithms for non-negative TF for alpha and beta divergences. It is also informative to view matrix and tensor factorization from a statistical perspective, by viewing these as hierarchical probabilistic generative models. This approach is central in probabilistic matrix factorization (PMF) [25] or non-negative matrix factorization [6,12]. Statistical interpretations on similar lines for TF focus on a given factorization structure only, such as CP or TUCKER. For example, sparse non-negative TUCKER is discussed in [22], while probabilistic non-negative PARAFAC is discussed in [24,26].

The motivation behind probabilistic latent tensor factorization (PLTF) is to pave the way to a unifying framework where any arbitrary TF model can be constructed and the associated inference algorithm can be derived automatically using matrix computation primitives. This is very useful in many application domains such as audio processing, network analysis, collaborative filtering or vision, where it becomes necessary to design application specific, tailored factorizations.

The main contributions of this paper are: (1) the derivation of update equations for maximum likelihood (ML) estimation based on expectation maximization (EM), multiplicative update rules (MUR) and alternating projections methods for any arbitrary TF model, for a large class of statistical models (the so-called exponential dispersion models, EDMs); (2) a maximum a posteriori (MAP) estimation framework for EDMs via conjugate priors; (3) a novel representation of TF models that closely resembles probabilistic graphical models [14] (undirected graphs and factor graphs) and the development of a practical message passing framework that exploits this link; and (4) a practical implementation technique via matricization and tensorization.

This paper is organized as follows: Section 2 introduces the notation and the probability model for PLTF and derives the update equation for the special case of the KL divergence; the purpose of this section is to introduce the notation and the graphical representation. Section 3 generalizes the TF problem as the optimization of the likelihood of the exponential dispersion models, including the use of conjugate priors. Section 4 introduces a method for matricizing and tensorizing the element-wise update equations developed in earlier sections, along with examples. Finally, we conclude with a discussion section.

The Appendix contains technical details.

2. Latent tensor factorization (TF) model

We define a tensor $\Lambda$ as a multiway array with an index set $V = \{i_1, i_2, \ldots, i_N\}$, where each index $i_n = 1 \ldots |i_n|$ for $n = 1 \ldots N$. Here, $|i_n|$ denotes the cardinality of the index $i_n$. An element of the tensor $\Lambda$ is a scalar that we denote by $\Lambda(i_1, i_2, \ldots, i_N)$, $\Lambda_{i_1, i_2, \ldots, i_N}$ or, as a shorthand notation, by $\Lambda(v)$. Here, $v$ will be a particular configuration from the product space of all indices in $V$. For our purposes, it will be convenient to define a collection of tensors $Z_{1:N} = \{Z_\alpha\}$ for $\alpha = 1 \ldots N$, sharing a set of indices $V$. Here, each tensor $Z_\alpha$ has a corresponding index set $V_\alpha$ such that $\bigcup_{\alpha=1}^{N} V_\alpha = V$. Then, $v_\alpha$ denotes a particular configuration of the indices for $Z_\alpha$, while $\bar{v}_\alpha$ denotes a configuration of the complement $\bar{V}_\alpha = V / V_\alpha$.

A tensor contraction or marginalization is simply adding the elements of a tensor over a given index set, i.e., for two tensors $\Lambda$ and $\hat{X}$ with index sets $V$ and $V_0$ we write $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)$ or $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v_0, \bar{v}_0)$. To clarify our notation, consider as an example the ordinary matrix multiplication $\hat{X}(i,j) = \sum_k Z_1(i,k) Z_2(k,j)$, which is a tensor contraction operation. Although never done in practical computation, formally we can define $\Lambda(i,j,k) = Z_1(i,k) Z_2(k,j)$ and sum over the index $k$ to find the result. In our formalism, we define $V = \{i,j,k\}$, where $V_0 = \{i,j\}$, $V_1 = \{i,k\}$ and $V_2 = \{k,j\}$. Hence, $\bar{V}_0 = \{k\}$ and we write $\hat{X}(v_0) = \sum_{\bar{v}_0} Z_1(v_1) Z_2(v_2)$.

A tensor factorization (TF) model is the product of a collection of tensors $Z_{1:N} = \{Z_\alpha\}$ for $\alpha = 1 \ldots N$, each defined on the corresponding index set $V_\alpha$, collapsed over a set of indices $\bar{V}_0$. Given a particular TF model, the latent TF problem is to estimate the set of latent tensors $Z_{1:N}$ that solves

minimize $D(X \| \hat{X})$  subject to  $\hat{X}(v_0) = \sum_{\bar{v}_0} \prod_{\alpha} Z_\alpha(v_\alpha)$    (1)

where $X$ is an observed tensor and $\hat{X}$ is the 'prediction'. Here, both objects are defined over the same index set $V_0$ and are compared element-wise. The function $D(\cdot \| \cdot) \geq 0$ is a cost function. In this paper, we first use the Kullback–Leibler (KL) divergence as our cost. Later, in the generalization section, we introduce a form of the β-divergence that unifies the Euclidean (EU), KL and Itakura–Saito (IS) cost functions.

Example 1 (TUCKER3 factorization). The TUCKER3 factorization [17,18] aims to find $Z_\alpha$ for $\alpha = 1 \ldots 4$ that solve the following optimization problem:

minimize $D(X \| \hat{X})$  subject to  $\hat{X}_{i,j,k} = \sum_{p,q,r} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r} \quad \forall i,j,k$    (2)

Both for visualization and for efficient computation, it is useful to introduce a graphical notation to represent the factorization implied by a particular TF model. We define an undirected graph $G = (V, E)$ with vertex set V and edge set E, and associate each index with a vertex of G. For each pair of indices appearing in a factor index set $V_\alpha$, for $\alpha = 1 \ldots N$, we add an edge to the edge set of the graph. Consequently, each clique (fully connected subgraph) of G corresponds to a factor index set $V_\alpha$. Table 1 illustrates the representation of several popular models.

In the following section, we introduce a probabilistic model in which we cast the minimization problem as an equivalent maximum likelihood estimation problem, i.e., solving the TF problem (1) will be equivalent to maximization of $\log p(X | Z_{1:N})$ with respect to $Z_\alpha$ [8,12].

2.1. Probability model

The PLTF generative model, represented as the directed acyclic graph (DAG) in Table 1, is defined as

\Lambda(v) = \prod_{\alpha}^{N} Z_\alpha(v_\alpha)    (intensity)    (3)

S(v) \sim \mathcal{PO}(S; \Lambda(v))    (KL cost)    (4)

X(v_0) = \sum_{\bar{v}_0} S(v)    (observation)    (5)

\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)    (parameter)    (6)

M(v_0) = \begin{cases} 0 & X(v_0) \text{ is missing} \\ 1 & \text{otherwise} \end{cases}    (mask array)    (7)

Here, we refer to $\Lambda(v)$ as the intensity or the latent intensity field. Due to the reproductivity property [21, p. 217] of the Poisson density, the observation $X(v_0)$ has the same type of distribution as $S(v)$. Moreover, missing data is handled smoothly via the likelihood [25,6]

p(X, S | Z) = \prod_{v} \left( p(X(v_0) | S(v))\, p(S(v) | \Lambda(v)) \right)^{M(v_0)}    (8)
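For intuition, the generative model (3)-(5) can be simulated directly. The following Matlab/Octave sketch samples a small NMF-structured data set; it assumes the Statistics Toolbox functions gamrnd and poissrnd are available, and all sizes and variable names are illustrative rather than taken from the paper.

    % Sampling from the PLTF generative model (3)-(5) for the NMF case V = {i,j,r}.
    I = 20; J = 30; R = 3;                    % illustrative sizes
    Z1 = gamrnd(1, 1, I, R);                  % non-negative factors (any non-negative values work)
    Z2 = gamrnd(1, 1, R, J);
    S  = zeros(I, R, J);                      % latent tensor S(i,r,j)
    for r = 1:R
        Lam = Z1(:, r) * Z2(r, :);            % intensity Lambda(i,r,j) = Z1(i,r)*Z2(r,j)
        S(:, r, :) = reshape(poissrnd(Lam), I, 1, J);
    end
    X = squeeze(sum(S, 2));                   % observation X(i,j) = sum_r S(i,r,j)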

2.2. Fixed point update equation for the KL cost

The log likelihood $\mathcal{L}_{KL}$ is given as

\mathcal{L}_{KL} = \sum_{v} M(v_0) \left( S(v) \log \Lambda(v) - \Lambda(v) - \log(S(v)!) \right)    (9)

subject to the constraint $X(v_0) = \sum_{\bar{v}_0} S(v)$ whenever $M(v_0) = 1$.

We can easily optimize $\mathcal{L}_{KL}$ for $Z_\alpha$ by an EM algorithm. In the E-step we calculate the posterior expectation $\langle S(v) \rangle_{S(v) | X(v_0)}$ by identifying the posterior $p(S | X, Z)$ as a multinomial distribution [21,6]. In the M-step we solve the optimization problem $\partial \mathcal{L}_{KL} / \partial Z_\alpha(v_\alpha) = 0$ to get the fixed point update

(E-step)  \langle S(v) \rangle_{S(v) | X(v_0)} = \frac{X(v_0)}{\hat{X}(v_0)}\, \Lambda(v)    (10)

(M-step)  Z_\alpha(v_\alpha) = \frac{\sum_{\bar{v}_\alpha} M(v_0)\, \langle S(v) \rangle_{S(v) | X(v_0)}}{\sum_{\bar{v}_\alpha} M(v_0)\, \partial \Lambda(v) / \partial Z_\alpha(v_\alpha)}    (11)

with the following equalities:

\frac{\partial \Lambda(v)}{\partial Z_\alpha(v_\alpha)} = \partial_\alpha \Lambda(v) = \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'}), \qquad \Lambda(v) = Z_\alpha(v_\alpha)\, \partial_\alpha \Lambda(v)    (12)

After substituting (10) in (11) and noting that $Z_\alpha(v_\alpha)$ is independent of the sum $\sum_{\bar{v}_\alpha}$, we obtain the following multiplicative fixed point iteration for $Z_\alpha$:

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} M(v_0)\, \frac{X(v_0)}{\hat{X}(v_0)}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} M(v_0)\, \partial_\alpha \Lambda(v)}    (13)

Example 2 (KL-NMF update). It is easy to see that the multiplicative NMF algorithm with KL cost appears as a special case:

\hat{X}_{i,j} = \sum_{r} Z^1_{i,r} Z^2_{r,j}, \qquad Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_j (M_{i,j} X_{i,j} / \hat{X}_{i,j})\, Z^2_{r,j}}{\sum_j Z^2_{r,j}}    (14)
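In code, the update (14) and its counterpart for $Z^2$ amount to a handful of matrix operations. The following Matlab/Octave lines are a minimal sketch using the masked form of (13); X, M, R and MAXITER are assumed inputs and all names are illustrative.

    % Minimal sketch of masked KL-NMF multiplicative updates, cf. (13) and (14).
    [I, J] = size(X);                          % X: non-negative data, M: binary mask of the same size
    Z1 = rand(I, R); Z2 = rand(R, J);          % random non-negative initialization
    for it = 1:MAXITER
        Xhat = Z1 * Z2;
        Z1 = Z1 .* (((M .* X ./ Xhat) * Z2') ./ (M * Z2'));
        Xhat = Z1 * Z2;
        Z2 = Z2 .* ((Z1' * (M .* X ./ Xhat)) ./ (Z1' * M));
    end

With M equal to the all-ones matrix, the denominators reduce to the plain column sums of (14).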

2.3. The priors and the constraints

In this section we incorporate the prior information into the PLTF. The conjugate prior for the Poisson observation model is the Gamma density ZaðvaÞ GðZaðvaÞ; AaðvaÞ, BaðvaÞÞ. We optimize the log posterior for ZaðvaÞas below noting that the first part is due to the log likelihood of the Poisson while the second part is due to the conjugate gamma prior

@L

@ZaðvaÞ¼ X

va

SðvÞLðvÞ LðvÞ @aLðvÞ 0

@

1

Aþ AaðvaÞ1 ZaðvaÞ BaðvaÞ

 

ð15Þ

ZaðvaÞ’

ðAaðvaÞ1Þ þ ZaðvaÞP

vaMðv0ÞXðv0Þ X ðv^ 0Þ@aLðvÞ BaðvaÞ þP

vaMðv0Þ@aLðvÞ ð16Þ

This update converges to the mode of the posterior distribution $p(Z_{1:N} | X)$. It simply computes the mode of the full conditional $p(Z_\alpha | X, Z_{-\alpha})$, which is a Gamma distribution for each element $Z_\alpha(v_\alpha)$. Here, $Z_{-\alpha}$ denotes all other factors $Z_{\alpha'}$ for $\alpha' = 1 \ldots N$ such that $\alpha' \neq \alpha$.

This prior can be used to impose sparsity: we can take the prior $\mathcal{G}(x; a, a/m)$ with mean $m$ and variance $m^2/a$. For small $a$, most of the elements of $Z$ are expected to be around zero, with only a few large ones.
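As an illustration of (16), the MAP update for the first factor of the NMF model reduces to a one-line matrix expression. The Matlab/Octave sketch below assumes element-wise Gamma hyperparameter arrays A1 and B1 of the same size as Z1, together with X, M, Z1 and Z2 as in the earlier NMF sketch; the names are ours.

    % MAP update (16) for Z1 in the NMF model Xhat = Z1*Z2 with Gamma(A1, B1) priors.
    Xhat = Z1 * Z2;
    Num  = (A1 - 1) + Z1 .* ((M .* X ./ Xhat) * Z2');   % (A-1) + Z * sum_j M (X/Xhat) dLambda
    Den  = B1 + M * Z2';                                % B + sum_j M dLambda
    Z1   = Num ./ Den;

Setting A1 = a and B1 = a/m element-wise with a small a gives the sparsity-inducing prior described above.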

2.4. Relation to graphical models

Table 1
Graphical representation of popular TF models (MF: matrix factorization, CP: PARAFAC or the canonical decomposition, and a TUCKER decomposition). Any TF model can be visualized using the semantics of undirected graphical models, where cliques (fully connected subgraphs) correspond to individual factors and vertices correspond to indices. The shaded vertices are latent, i.e., correspond to indices that are not elements of V_0, the index set of X. Whilst algebraically identical to a probabilistic undirected graphical model, our representation merely captures the factorization of the TF model and does not have a probabilistic semantics (we do not represent a probability measure and the indices are not random variables). The underlying probability model of PLTF in the conventional sense is given by the DAG on the left (as a Bayesian network model), where X is the observed multiway data and the Z_α are the model parameters. The latent tensor S allows us to treat the problem in a data augmentation setting and apply the EM algorithm.

                        MF                        CP                                   TUCKER3
 Model     V            {i,j,r}                   {i,j,k,r}                            {i,j,k,p,q,r}
 Observed  V_0          {i,j}                     {i,j,k}                              {i,j,k}
 Latent    V̄_0          {r}                       {r}                                  {p,q,r}
 Factors   V_1, V_2     {i,r}, {j,r}              {i,r}, {j,r}                         {i,p}, {j,q}
           V_3, V_4                               {k,r}                                {k,r}, {p,q,r}
 Estimate  X̂            Σ_r Z^1_{i,r} Z^2_{r,j}   Σ_r Z^1_{i,r} Z^2_{j,r} Z^3_{k,r}    Σ_{p,q,r} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}

An important observation that leads to computational savings and a compact representation of the element-wise update equation is that the update (13) consists of structurally similar terms in both the denominator and the numerator. For large TF models, this structure needs to be exploited for computational efficiency. Therefore, we define the following tensor valued function, where $|A|$ denotes the dimensionality of the object A.

Definition 1. We define a tensor valued function $\Delta_\alpha(Q): \mathbb{R}^{|Q|} \to \mathbb{R}^{|Z_\alpha|}$

\Delta_\alpha^{p}(Q) = \left[ \sum_{\bar{v}_\alpha} Q(v_0) \left( \partial_\alpha \Lambda(v) \right)^{p} \right]    (17)

where $\Delta_\alpha(Q)$ and the argument $Q$ are tensors with the same size as $Z_\alpha$ and the observation $X$, respectively. The non-negative integer $p$ on the right-hand side denotes an element-wise power, while on the left it should be interpreted as a parameter of the function. Using this notation, we rewrite the update (13) compactly as

Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha(M \circ X / \hat{X})}{\Delta_\alpha(M)}    (18)

where $\circ$ and $/$ stand for element-wise multiplication and division, respectively. Later we develop the explicit matrix forms of these updates.

Interestingly, computing $\Delta_\alpha(Q)$ is equivalent to computing a tensor contraction, a marginal potential over the vertices $\bar{V}_\alpha$. More precisely, we construct a 'joint tensor' $P = X(v_0) \prod_\alpha Z_\alpha(v_\alpha)$. For each $\alpha$, we remove $Z_\alpha$ from $P$, which we denote as $P / Z_\alpha$, and compute the 'marginal tensor' by summing over $\bar{V}_\alpha$. An example for the TUCKER model is shown in Fig. 1. This operation can be viewed as the computation of certain marginal densities given an undirected graphical model [14,29].

3. Generalization of TF for the β-divergence

The development of the update equations for TF with the EU and IS costs can proceed similarly to Section 2 by considering the latent tensor field $S(v)$ as

S(v) \sim \mathcal{N}(S(v); \Lambda(v), 1)    (EU cost)    (19)

S(v) \sim \mathcal{N}(S(v); 0, \Lambda(v))    (IS cost)    (20)

However, we prefer a single development for the β-divergence that unifies the EU, KL and IS costs in the following expression for $p = \{0, 1, 2\}$, respectively [7]:

d_\beta(x, y) =
\begin{cases}
\frac{1}{(2-p)(1-p)} \left( x^{2-p} + (1-p)\, y^{2-p} - (2-p)\, x y^{1-p} \right) & p = 0 \ (\mathrm{EU}) \\
x \log\frac{x}{y} - x + y & p = 1 \ (\mathrm{KL}) \\
\frac{x}{y} - \log\frac{x}{y} - 1 & p = 2 \ (\mathrm{IS})
\end{cases}    (21)

where we keep $p$ in the first line since that expression is valid for a range of $p$, whereas $p = 0$ is only the special case corresponding to the EU cost. The β-divergence is associated with the Tweedie distributions, which belong to the exponential dispersion models (EDM) [15]. The dispersion models relate the variance of the distribution to some power $p$ of its mean $\Lambda(v)$ as $\mathrm{Var}(S(v)) = \varphi_s^{-1}\, v(\Lambda(v))$, with $\varphi_s^{-1}$ being the dispersion (or scale) parameter and $v(\Lambda(v)) = \Lambda(v)^p$ being the variance function [15].
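To make the three special cases in (21) concrete, the following small Matlab/Octave function evaluates the element-wise β-divergence for p = 0, 1, 2; it is only a sketch for monitoring costs during iterations, and the function name is ours.

    function d = beta_divergence(x, y, p)
    % Element-wise beta-divergence d_p(x, y) of (21) for p = 0 (EU), 1 (KL), 2 (IS).
    switch p
        case 0                      % Euclidean: (1/2)(x - y)^2
            d = 0.5 * (x - y).^2;
        case 1                      % Kullback-Leibler: x log(x/y) - x + y
            d = x .* log(x ./ y) - x + y;
        case 2                      % Itakura-Saito: x/y - log(x/y) - 1
            d = x ./ y - log(x ./ y) - 1;
        otherwise
            error('p must be 0, 1 or 2 in this sketch');
    end
    end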

This section follows a similar outline to the previous section: we obtain various update rules for TF models with the β-divergence via EM. Here we present the main results and leave the technical details to Appendix C. In addition, to keep the notation simpler, we will drop iteration indices from the update equations and ignore the missing mask array M, assuming all elements of X are observed.

3.1. EM updates for the ML estimate

Starting with the E-step, the posterior expectation $\langle S(v) | X(v_0) \rangle$ for the EDMs is identified as

(E-step)  \langle S(v) | X(v_0) \rangle = \Lambda(v) + \frac{\Lambda(v)^p}{\sum_{\bar{v}_0} \Lambda(v)^p} \left( X(v_0) - \hat{X}(v_0) \right)    for p \geq 0    (22)

(E-step)  \langle S(v) | X(v_0) \rangle = \Lambda(v) + \rho\, \frac{\Lambda(v)^p}{\hat{X}(v_0)^p} \left( X(v_0) - \hat{X}(v_0) \right)    for p = \{0, 1, 2\}    (23)

where $\rho$ is the ratio of the dispersion parameters $\varphi_s^{-1} / \varphi_x^{-1}$, set to $\{1/K,\ 1,\ \hat{X}(v_0)/\Lambda(v)\}$ for $p = \{0, 1, 2\}$. Here K is the cardinality of the invisible vertex set $\bar{V}_0$, i.e. $K = |\bar{V}_0|$, recalling $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)$. Note that for $p = \{0, 1, 2\}$, i.e. for the Gaussian, Poisson and gamma (for i.i.d. variables), both expressions are identical. Indeed, for $p = 1$ they recover $\langle S(v) | X(v_0) \rangle = (X(v_0)/\hat{X}(v_0))\, \Lambda(v)$, the posterior expectation of the Poisson identified in Section 2, and for $p = 0$ they recover the posterior expectation of the Gaussian given in Table 2. When assuming i.i.d. variables with the gamma distribution, the ratio $\rho$ is further parameterized as $\rho = K^{p-1}$.

In this section we use the posterior expectation in (23), which is the special case for the Gaussian, Poisson and gamma, since this expression simplifies better when plugged into the M-step than the general case (22), which carries an extra term $\sum_{\bar{v}_0} \Lambda(v)^p$. However, for the cases $p \notin \{0, 1, 2\}$, e.g. for $p \in (1, 2)$ (the compound Poisson) or for $p = 3$ (inverse Gaussian) [4], one can follow similar steps for the derivation of the update rules, i.e. plug (22) as the posterior expectation into the M-step.

Fig. 1. Undirected graphical models for the computation of the $\Delta_\alpha$ of the TUCKER3 decomposition. (a) Joint tensor P defined as $P = X \prod_\alpha Z_\alpha$; (b) graph for $P/Z_1$ used for the computation of $\Delta_1$, where $V_1 = \{i,p\}$; (c) graph for $P/Z_4$ used for $\Delta_4$, corresponding to the core tensor, where $V_4 = \{p,q,r\}$. Double-circled indices are not summed over. Rather than using a naive direct approach, the contraction can be computed efficiently by distributing the summation over the product, a procedure that is algebraically identical to variable elimination for inference in probabilistic graphical models.

The M-step with the expectation in (23) then gives the ML estimate

(M-step)  Z_\alpha(v_\alpha) = \frac{\sum_{\bar{v}_\alpha} \langle S(v) | X(v_0) \rangle\, \partial_\alpha \Lambda(v)^{1-p}}{\sum_{\bar{v}_\alpha} \partial_\alpha \Lambda(v)^{2-p}}    (24)

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^p\, \frac{\sum_{\bar{v}_\alpha} \rho\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \partial_\alpha \Lambda(v)^{2-p}}    (25)

For $p = 1$ this recovers the update for KL developed in Section 2. For $p = 2$ (IS cost) it can be compared to the Gaussian variance modelling of the IS cost given in Appendix B.
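As a concrete instance of (25), consider the NMF model $\hat{X} = Z^1 Z^2$ with the EU cost (p = 0), for which $\rho = 1/K$ and K equals R, the size of the latent index. A minimal Matlab/Octave sketch of one such sweep, continuing with the variables of the earlier NMF sketch, is:

    % One EM update (25) for Z1 in the NMF model with EU cost (p = 0), rho = 1/R.
    Xhat = Z1 * Z2;
    E    = X - Xhat;                                   % residual X - Xhat
    D    = repmat(sum(Z2.^2, 2)', size(Z1, 1), 1);     % sum_j dLambda^2 = sum_j Z2(r,j)^2, replicated over rows
    Z1   = Z1 + (E * Z2') ./ (R * D);                  % Z1 <- Z1 + sum_j (1/R)(X - Xhat) dLambda / sum_j dLambda^2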

3.2. Multiplicative updates for the ML estimate

Although the EM update can also be formulated in the multiplicative form as

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} \left( \rho\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p} + \Lambda(v)^{1-p} \right) \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \Lambda(v)^{1-p}\, \partial_\alpha \Lambda(v)}    (26)

in general this is different from the multiplicative update rule (MUR) in the sense of [20], due to the subtraction (potentially resulting in a negative value) in the numerator of (26). The MUR equations were popularized by [20] for the NMF and have been extensively analyzed [7,12]. They ensure non-negative parameter updates as long as the initial values are non-negative. For the PLTF models, the MUR update equation is given as

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (27)

While this coincides with the EM update for the KL case (p = 1), it differs for the general β-divergence. This equation successfully recovers the NMF updates in [20] and the update for CP for the β-divergence in [7, such as Eq. (7.59)]. The realization of the EM_ML and MUR_ML update equations for p = {0, 1, 2} is given in Table 2. Further details can be found in Appendix C.2.

The multiplicative update rules MUR_ML can be obtained by plugging the heuristic $\Lambda(v)^{1-p} = \rho\, \hat{X}(v_0)^{1-p}$ into the EM_ML update (26). A mathematical justification of this substitution would directly prove the convergence of MUR; however, while we always observe convergence in practice, it is an open problem to rigorously establish the convergence of MUR for the general β-divergence.

Example 3 (CP MUR update for Z^1 for KL and EU costs).

Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_{j,k} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^2_{j,r} Z^3_{k,r}}{\sum_{j,k} Z^2_{j,r} Z^3_{k,r}}, \qquad Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_{j,k} X_{i,j,k}\, Z^2_{j,r} Z^3_{k,r}}{\sum_{j,k} \hat{X}_{i,j,k}\, Z^2_{j,r} Z^3_{k,r}}    (28)
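The KL update in (28) can be run directly on the mode-1 unfolding using a Khatri–Rao product, as in the following Matlab/Octave sketch for a fully observed third-order array X with latent dimension R. It is an illustration with our own variable names; the Khatri–Rao product is formed explicitly rather than with a toolbox routine.

    % CP/PARAFAC MUR update (28), KL cost, for the first factor Z1 (I x R).
    [I, J, K] = size(X);
    X1 = reshape(X, I, J*K);                   % mode-1 unfolding, column index (k-1)*J + j
    KR = zeros(J*K, R);                        % Khatri-Rao product of Z3 (K x R) and Z2 (J x R)
    for r = 1:R
        KR(:, r) = kron(Z3(:, r), Z2(:, r));   % column-wise Kronecker product
    end
    Xhat1 = Z1 * KR';                          % prediction, mode-1 unfolded
    Z1 = Z1 .* (((X1 ./ Xhat1) * KR) ./ (ones(I, J*K) * KR));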

Example 4 (TUCKER3 MUR update for Z^1 and Z^4 (core tensor) for EU cost).

Z^1_{i,p} \leftarrow Z^1_{i,p}\, \frac{\sum_{j,k,q,r} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}}{\sum_{j,k,q,r} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}}, \qquad Z^4_{p,q,r} \leftarrow Z^4_{p,q,r}\, \frac{\sum_{i,j,k} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^1_{i,p} Z^2_{j,q} Z^3_{k,r}}{\sum_{i,j,k} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r}}    (29)

3.3. Direct solution via alternating projections

The method of alternating projections attempts to solve the estimation problem by optimizing one variable while keeping the others fixed [7, Chapter 4]. It is a direct solution obtained by equating the gradient to zero and solving directly as

\sum_{\bar{v}_\alpha} X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v) = \sum_{\bar{v}_\alpha} \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)    (30)

Table 2
(Up) EM update equations and posterior expectations. Note that the variance function v(Λ(v)) is some power of the expectation Λ(v). (Down) The EM update (multiplicative form) and the MUR update are compared. All sums run over v̄_α, and the arguments (v), (v_0) and (v_α) are suppressed for brevity.

(Up)
 β (general p):  v(Λ) = Λ^p;   EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ^{1-p} / Σ ∂_αΛ^{2-p};   ρ = φ_x/φ_s;   ⟨S|X⟩ = Λ + ρ (Λ^p/X̂^p)(X - X̂)
 EU (p = 0):     v(Λ) = 1;     EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ / Σ ∂_αΛ^2;             ρ = 1/K;       ⟨S|X⟩ = Λ + (1/K)(X - X̂)
 KL (p = 1):     v(Λ) = Λ;     EM: Z_α = Σ ⟨S|X⟩ / Σ ∂_αΛ;                    ρ = 1;         ⟨S|X⟩ = Λ X/X̂
 IS (p = 2):     v(Λ) = Λ^2;   EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ^{-1} / Σ 1;             ρ = X̂/Λ;       ⟨S|X⟩ = Λ + (Λ/X̂)(X - X̂)

(Down)
 β:   EM:  Z_α ← Z_α Σ (ρ(X - X̂)X̂^{-p} + Λ^{1-p}) ∂_αΛ / Σ Λ^{1-p} ∂_αΛ
      MUR: Z_α ← Z_α Σ X X̂^{-p} ∂_αΛ / Σ X̂^{1-p} ∂_αΛ
 EU:  EM:  Z_α ← Σ (Λ + (1/K)(X - X̂)) ∂_αΛ / Σ ∂_αΛ^2
      MUR: Z_α ← Z_α Σ X ∂_αΛ / Σ X̂ ∂_αΛ
 KL:  EM:  Z_α ← Z_α Σ X X̂^{-1} ∂_αΛ / Σ ∂_αΛ
      MUR: Z_α ← Z_α Σ X X̂^{-1} ∂_αΛ / Σ ∂_αΛ
 IS:  EM:  Z_α ← Z_α Σ ((X - X̂)X̂^{-1} + 1) / Σ 1
      MUR: Z_α ← Z_α Σ X X̂^{-2} ∂_αΛ / Σ X̂^{-1} ∂_αΛ

In the matricization section we show how this is further simplified for the EU cost (p = 0) by use of the pseudo-inverse operation, where it turns into alternating least squares (ALS).

3.4. Update rules for MAP estimation

We can incorporate prior belief in the form of a conjugate prior $p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha))$, with $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$ being the hyperparameters for $Z_\alpha(v_\alpha)$. As specified in [11], $N^0_\alpha(v_\alpha)$ might be thought of as a prior sample size, while $Z^0_\alpha(v_\alpha)$ is a prior expectation of the mean parameter. The exact definitions of $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$ can be identified for each distribution separately by following the definition of a conjugate prior given in Appendix C.3. The EM_MAP estimate is

Z_\alpha(v_\alpha) = \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \langle S(v) | X(v_0) \rangle\, \varphi_s\, \partial_\alpha \Lambda(v)^{1-p}}{N^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \varphi_s\, \partial_\alpha \Lambda(v)^{2-p}}

or, in multiplicative form,

Z_\alpha(v_\alpha) \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha) \sum_{\bar{v}_\alpha} \left\{ \varphi_s\, \partial_\alpha \Lambda(v)^{1-p} + Z_\alpha(v_\alpha)^{p-1}\, \varphi_x\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p} \right\} \partial_\alpha \Lambda(v)}{N^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \varphi_s\, \partial_\alpha \Lambda(v)^{2-p}}    (31)

while the multiplicative update rule MUR_MAP becomes

Z_\alpha(v_\alpha) \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^p \sum_{\bar{v}_\alpha} \varphi_x\, X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{N^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^{p-1} \sum_{\bar{v}_\alpha} \varphi_x\, \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (32)

Prior belief can always be specified in terms of $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$. However, if the prior distribution is to be specified explicitly, they can be tied to the prior parameters similarly to [16].

Example 5 (Conjugate prior for the Poisson (p = 1)). The Gamma distribution is the conjugate prior for the Poisson $p(S | Z)$, as $p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)) = \mathcal{G}(Z_\alpha(v_\alpha); A_\alpha(v_\alpha), B_\alpha(v_\alpha))$ with

N^0_\alpha(v_\alpha) = B_\alpha(v_\alpha), \qquad N^0_\alpha(v_\alpha)\, Z^0_\alpha(v_\alpha) = A_\alpha(v_\alpha) - 1    (33)

With these, for p = 1, EM_MAP and MUR_MAP successfully recover the update for KL (16) in Section 2.

Example 6 (Conjugate prior for the Gaussian mean (p = 0)). The Gaussian distribution is the conjugate prior for the Gaussian $p(S | Z, \Sigma)$ with unknown mean and known variance $\Sigma$, with

p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)) = \mathcal{N}(Z_\alpha(v_\alpha); A_\alpha(v_\alpha), B_\alpha(v_\alpha))    (34)

N^0_\alpha(v_\alpha) = \Sigma / B_\alpha(v_\alpha), \qquad Z^0_\alpha(v_\alpha) = A_\alpha(v_\alpha)    (35)

3.5. Missing data and tensor forms

It is straightforward to handle missing data in the EM and MUR updates. We recall that the scalar value $M(v_0)$ is simply a multiplier inside the sum $\sum_{\bar{v}_\alpha}$ of the initial M-step equation $\sum_{\bar{v}_\alpha} M(v_0) \left( \langle S(v) | X(v_0) \rangle - \Lambda(v) \right) \varphi_s\, \Lambda(v)^{-p}\, \partial_\alpha \Lambda(v)$. Hence we simply put it back inside the sums $\sum_{\bar{v}_\alpha}$. For example, the MUR_ML update becomes

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} M(v_0)\, X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} M(v_0)\, \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (36)

As we did in Section 2, we represent all the update equations in tensor form by use of the $\Delta_\alpha$ abstraction. In Table 3 we give a summary of the general update rules in tensor form. For alternating projections, considering the least squares approximation, it is natural to set p = 0 and make use of the pseudo-inverse: we need to solve $\Delta_\alpha(X) = \Delta_\alpha(\hat{X})$. This equation leads to a linear matrix equation of the form $L X R = L \hat{X} R$, where L and R are structured matrices determined by the form of the TF model. It can be solved via pseudo-inverses in a least squares sense when there is no missing data (i.e. M = 1). For general M, we could not derive a compact equation without considering tedious reindexings.

Table 3
Updates in tensor form via the Δ_α(·) function. All multiplications and divisions are element-wise.

 EM:   Q_1 = M ∘ (X - X̂) ∘ X̂^{-p},   Q_2 = M
   ML:   Z_α ← Z_α + Z_α^p ∘ Δ_α(ρ Q_1) / Δ_α^{2-p}(Q_2)
   MAP:  Z_α ← [N^0_α ∘ Z^0_α + Z_α ∘ Δ_α^{2-p}(φ_s Q_2) + Z_α^p ∘ Δ_α(φ_x Q_1)] / [N^0_α + Δ_α^{2-p}(φ_s Q_2)]
 MUR:  Q_1 = M ∘ X ∘ X̂^{-p},   Q_2 = M ∘ X̂^{1-p}
   ML:   Z_α ← Z_α ∘ Δ_α(Q_1) / Δ_α(Q_2)
   MAP:  Z_α ← [N^0_α ∘ Z^0_α + Z_α^p ∘ Δ_α(φ_x Q_1)] / [N^0_α + Z_α^{p-1} ∘ Δ_α(φ_x Q_2)]
 Alternating projection:
   ML:   solve Δ_α(X) = Δ_α(X̂), assuming M = 1

4. Representation and implementation

For the tensorized forms of the update equations, such as the KL update $Z_\alpha = Z_\alpha \circ \Delta_\alpha(M \circ X / \hat{X}) / \Delta_\alpha(M)$, the main task is the computation of the $\Delta_\alpha(\cdot)$ function. Below, we define a matricization procedure that converts $\Delta_\alpha(\cdot)$ (indeed, any element-wise equation) into matrix form in terms of matrix multiplication primitives such as the Kronecker product and the Khatri–Rao product.

4.1. Matricization

Matricization as defined originally in [17,18] is the operation of converting a multiway array into a matrix by reordering the column fibers. In this paper we refer to this definition as 'unfolding', and refer to matricization as the procedure of converting an element-wise equation into a corresponding matrix form. We use Einstein's summation convention, where repeated indices are summed over. The conversion rules are given in Table 4. Our notation is best illustrated with an example: consider a matrix $X_{i,j}$ with row index i and column index j. If we assume a column-by-column memory layout, we refer to the vectorization vec X (vertical concatenation of columns) as $X^{ji}$, adopting a 'the faster index last' convention, and we drop the comma. Here i is the faster index, since when traversing the elements of the matrix X in sequence, i changes more rapidly. With this, we arrive at the following definition.

Definition 2. Consider a multiway array $X \in \mathbb{R}^{I_1 \times \cdots \times I_M}$ with a generic element denoted by $X_{i_1, i_2, \ldots, i_M}$. The mode-n unfolding of X is the matrix $X_{(n)} \in \mathbb{R}^{I_n \times \prod_{k \neq n} I_k}$ with row index $i_n$, written as

X_{(n)} \equiv X^{\,i_M \ldots i_{n+1} i_{n-1} \ldots i_2 i_1}_{\,i_n}    (37)

where the fastest index is in the order $i_1, i_2, \ldots, i_M$.

Here we follow the natural ordering: for the mode-1 unfolding, mode-2 rows are placed before mode-3 rows; similarly, for the mode-2 unfolding, mode-1 rows are placed before mode-3 rows, and so on. Hence, during matricization we start with the scalar terms and, while sticking the desired (outcome) indices up, we may freely reorder the terms inside the sum, transpose them, unfold them, and join them using the Kronecker and Khatri–Rao products. In addition, if needed, we may introduce a ones matrix 1 of any size when the indices of the terms are insufficient, as in $Y^1_q = \sum_p Y^p_q = \sum_p 1^1_p Y^p_q = (1Y)^1_q$, where the index p is marginalized out.
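A compact way to realize the mode-n unfolding of Definition 2 in Matlab/Octave is via permute and reshape; the sketch below follows the natural ordering described above. The function name is ours and is not the N-way Toolbox routine nshape, whose ordering should be checked before substituting one for the other.

    function Xn = unfold(X, n)
    % Mode-n unfolding of a multiway array X (Definition 2): rows indexed by i_n,
    % columns by the remaining indices in the natural (column-major) ordering.
    sz    = size(X);
    order = [n, setdiff(1:ndims(X), n)];   % bring mode n to the front
    Xn    = reshape(permute(X, order), sz(n), []);
    end

For a three-way array, unfold(X, 1) coincides with reshape(X, size(X,1), []), since mode 1 is already the fastest index in column-major storage.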

Below are examples of matricization and of the MUR and ALS updates; Table 5 summarizes the updates for well-known models. To get rid of the annoying subindices, here we relax the notation by using A, B, C, G for the factors $Z_\alpha$.

Example 7 (Matricization of TUCKER3 factorization). For the mode-1 unfolding $\hat{X}_{(1)}$:

\hat{X}_{i,j,k} = \sum_{p,q,r} G_{p,q,r}\, A_{i,p} B_{j,q} C_{k,r}    (start with the element-wise equation)    (38)

(\hat{X}_{(1)})^{kj}_{i} = (G_{(1)})^{rq}_{p}\, A^{p}_{i} B^{q}_{j} C^{r}_{k}    (place row/column indices)
 = (A G_{(1)})^{rq}_{i}\, (C \otimes B)^{rq}_{kj}    (reorder and form the Kronecker product)
 = ((A G_{(1)})(C \otimes B)^T)^{kj}_{i}    (transpose and drop matching indices)

\hat{X}_{(1)} = A G_{(1)} (C \otimes B)^T    (remove remaining indices)    (39)

Example 8 (Update equations of TUCKER3 factorization). The functions $\Delta_A(Q)$ and $\Delta_G(Q)$ are

\Delta_A(Q) = (Q_{(1)})^{kj}_{i}\, B^{q}_{j} C^{r}_{k} G^{rq}_{p} = Q_{(1)} (C \otimes B) G_{(1)}^T    (40)

\Delta_G(Q) = (Q_{(1)})^{kj}_{i}\, A^{p}_{i} B^{q}_{j} C^{r}_{k} = A^T Q_{(1)} (C \otimes B)    (41)

For MUR_ML, the general form of the update is $Z_\alpha \leftarrow Z_\alpha \circ \Delta_\alpha(Q_1) / \Delta_\alpha(Q_2)$ with $Q_1 = M \circ X \circ \hat{X}^{-p}$ and $Q_2 = M \circ \hat{X}^{1-p}$. Then

A \leftarrow A \circ \frac{Q_{1(1)} (C \otimes B) G_{(1)}^T}{Q_{2(1)} (C \otimes B) G_{(1)}^T}, \qquad G_{(1)} \leftarrow G_{(1)} \circ \frac{A^T Q_{1(1)} (C \otimes B)}{A^T Q_{2(1)} (C \otimes B)}    (42)

where, for example, for KL (p = 1) we evaluate $Q_1 = M \circ (X / \hat{X})$ and $Q_2 = M$.

Table 4
Index notation used to matricize an element-wise equation into matrix form. Following the Einstein convention, duplicate indices are summed over. The Khatri–Rao product and mode-n unfolding are implemented in the N-way Toolbox [3].

 Equivalence                           Matlab         Remark
 X^j_i ≡ X                             X              Matrix notation
 X^{kj}_i ≡ X_(1)                      nshape(X,1)    Array (mode-1 unfolding)
 X^j_i ≡ (X^T)^i_j                     X'             Transpose
 vec X ≡ X^1_{ji}                      X(:)           Vectorize
 X^j_i Y^p_j = (XY)^p_i                X*Y            Matrix product
 X^p_i Y^p_j = (X ⊙ Y)^p_{ij}          krb(X,Y)       Khatri–Rao product
 X^p_i Y^q_j = (X ⊗ Y)^{pq}_{ij}       kron(X,Y)      Kronecker product
 X^j_i X^j_i = (X ∘ X)^j_i             X.*X           Hadamard product
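Since the Khatri–Rao product appears repeatedly in Table 5, a self-contained Matlab/Octave implementation is sketched below for readers without the N-way Toolbox; the function name is ours.

    function P = khatri_rao(A, B)
    % Column-wise Kronecker (Khatri-Rao) product: A is (I x R), B is (J x R), P is (I*J x R).
    R = size(A, 2);
    P = zeros(size(A, 1) * size(B, 1), R);
    for r = 1:R
        P(:, r) = kron(A(:, r), B(:, r));
    end
    end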

Table 5
Model output, the Δ_α function, and the MUR and ALS updates for the typical components of the NMF, CP, TUCKER3 (T3) and PARATUCK2 (PT2) models. For PARATUCK2, F is given as F_(1) = ((C_a ⊙ G^T) ⊙ C_b)^T. The symbols ⊗, ⊙ and ∘ denote the Kronecker, Khatri–Rao and Hadamard products, respectively; division is element-wise and (·)^† denotes the pseudo-inverse.

 NMF:  X̂ = AB
       Δ_A(Q) = Q B^T
       MUR: A = A ∘ (Q_1 B^T) / (Q_2 B^T)
       ALS (p = 0): A = X B^†

 CP:   X̂_(1) = A (C ⊙ B)^T
       Δ_A(Q) = Q_(1) (C ⊙ B)
       MUR: A = A ∘ (Q_{1(1)} (C ⊙ B)) / (Q_{2(1)} (C ⊙ B))
       ALS (p = 0): A = X_(1) ((C ⊙ B)^T)^†

 T3:   X̂_(1) = A G_(1) (C ⊗ B)^T
       Δ_A(Q) = Q_(1) (C ⊗ B) G_(1)^T
       MUR: A = A ∘ (Q_{1(1)} (C ⊗ B) G_(1)^T) / (Q_{2(1)} (C ⊗ B) G_(1)^T)
       ALS (p = 0): A = X_(1) (G_(1) (C ⊗ B)^T)^†
       Δ_G(Q) = A^T Q_(1) (C ⊗ B)
       MUR: G_(1) = G_(1) ∘ (A^T Q_{1(1)} (C ⊗ B)) / (A^T Q_{2(1)} (C ⊗ B))
       ALS (p = 0): G_(1) = A^† X_(1) ((C ⊗ B)^T)^†

 PT2:  X̂_(1) = A F_(1) (1 ⊗ B^T)
       Δ_A(Q) = Q_(1) (1 ⊗ B) F^T
       MUR: A = A ∘ (Q_{1(1)} (1 ⊗ B) F^T) / (Q_{2(1)} (1 ⊗ B) F^T)

If MUR_MAP, the update for the factor A with the hyperparameters $N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)$ is as below. Here $Q_1, Q_2$ are given by $Q_1 = M \circ X \circ \hat{X}^{-p}$ and $Q_2 = M \circ \hat{X}^{1-p}$, as for MUR_ML. A sample implementation is given in Algorithm 1.

A \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + A^{p} \circ \left( \varphi_x\, Q_{1(1)} (C \otimes B) G_{(1)}^T \right)}{N^0_\alpha(v_\alpha) + A^{p-1} \circ \left( \varphi_x\, Q_{2(1)} (C \otimes B) G_{(1)}^T \right)}    (43)



If ALS (for p = 0) and assuming no missing values, we solve $X = \hat{X}$ for the core tensor G:

X_{(1)} = \hat{X}_{(1)} = A G_{(1)} (C \otimes B)^T \;\Rightarrow\; G_{(1)} = A^{\dagger} X_{(1)} \left( (C \otimes B)^T \right)^{\dagger}    (44)

with $X^{\dagger} = X^T (X X^T)^{-1}$. Here we simply move all the unrelated factors to the other side of the equation while taking their pseudo-inverses.

Example 9 (PARATUCK2 model). After the factorization process ends, we might consider factorizing the latent tensors further. PARATUCK2 [18,7,2] is a nice example to illustrate this. After inserting the scalar value $1^k_k$, the PARATUCK2 model can be matricized as

\hat{X}_{i,j,k} = \sum_{p,q} A_{i,p} B_{j,q} F_{p,q,k} \;\Rightarrow\; \hat{X}_{(1)} = [A^{p}_{i} (B^T)^{j}_{q} F^{qp}_{k} 1^{k}_{k}] = A F_{(1)} (1 \otimes B^T)    (45)

Then, we consider further factorization of F as below, where at the last step we transpose twice:

F^{qp}_{k} \equiv G^{p}_{q}\, (C_a)^{p}_{k} (C_b)^{q}_{k} = ((C_a \odot G^T)^T)^{qp}_{k} (C_b^T)^{q}_{k} 1^{1}_{k} = (((C_a \odot G^T)^T) \odot (C_b^T))^{kq}_{p} = ((C_a \odot G^T) \odot C_b)^T    (46)

Algorithm 1. Matlab implementation of the TUCKER3 MUR_MAP update for the β-divergence.

    % Input:  X (observed tensor), M (mask), N0, Z0 (cell arrays of hyperparameters),
    %         Z (cell array of factors: Z{1}=A, Z{2}=B, Z{3}=C, Z{4}=G),
    %         MAXITER, p (divergence parameter), dX (1/dispersion of X), N (number of factors, here 4).
    % Output: Z (latent factors)
    for e = 1:MAXITER
      for a = 1:N
        % current prediction Xhat, folded back to the size of X
        X2 = reshape(Z{1} * nshape(Z{4}, 1) * kron(Z{3}, Z{2})', size(X));
        Q1 = dX * M .* X .* X2.^(-p);          % Q1 = phi_x * M .* X .* Xhat.^(-p)
        Q2 = dX * M .* X2.^(1-p);              % Q2 = phi_x * M .* Xhat.^(1-p)
        if a == 1                              % factor A (the code for B (a=2) and C (a=3) is analogous)
          UP = nshape(Q1, a) * kron(Z{3}, Z{2}) * nshape(Z{4}, a)';
          DW = nshape(Q2, a) * kron(Z{3}, Z{2}) * nshape(Z{4}, a)';
        elseif a == 4                          % core tensor G
          UP = Z{1}' * nshape(Q1, 1) * kron(Z{3}, Z{2});
          DW = Z{1}' * nshape(Q2, 1) * kron(Z{3}, Z{2});
          UP = reshape(UP, size(Z{4}));        % fold back to the size of the core
          DW = reshape(DW, size(Z{4}));
        end
        UP   = N0{a} .* Z0{a} + Z{a}.^p .* UP;
        DW   = N0{a} + Z{a}.^(p-1) .* DW;
        Z{a} = UP ./ DW;                       % the MUR_MAP update, cf. (32) and (43)
      end
    end

5. Discussion

In this paper, we have developed a probabilistic framework for multiway analysis of high dimensional datasets. By exploiting the link between graphical models and tensor factorization models, we cast any arbitrary tensor factorization problem, including many popular models such as CP or TUCKER3, as inference, where tensor factorization reduces to a parameter estimation problem. We first illustrated our approach for the conditionally Poisson case and employed the EM algorithm for the optimization. We also extended the probability model to include conjugate priors and obtained the corresponding update equations. One main saving in our framework appears in the computation of $\Delta_\alpha(\cdot)$, which is computationally equivalent to computing expectations under probability distributions that factorize according to a given graph structure. As is the case with graphical models, this quantity can be computed via message passing: algebraically, we distribute the summation over all $\bar{v}_\alpha$ and compute the sum in stages.

We also sketched a straightforward matricization procedure that converts element-wise equations into matrix form, to ease implementation and allow a compact representation. Matricization is simple and powerful: without any manual matrix algebra, the update equations can be derived mechanically in their corresponding matrix forms.

Perhaps the most important contribution of the paper is the generalization of the TF problem to a point that allows one to 'invent' new factorization models appropriate to one's applications. Pedagogically, the framework guides the building of new models as well as the derivation of update equations for the β-divergence, which unifies the popular cost functions. Indeed, results scattered in the literature can be derived in a straightforward manner. Model selection can also be automated; however, due to space constraints, we will cover model selection issues in a future publication.

Acknowledgments

This work is funded by the TUBITAK grant number 110E292, Bayesian matrix and tensor factorisations (BAYTEN) and Bogazici University research fund BAP5723.

Appendix A. Common distributions

Gaussian distribution:

\mathcal{N}(x; a, b) = \exp\left( -\frac{(x-a)^2}{2b} + \frac{1}{2} \log \frac{1}{2\pi b} \right)

Poisson distribution:

\mathcal{PO}(x; a) = \exp\left( -a + x \log a - \log x! \right)

Gamma distribution:

\mathcal{G}(x; a, b) = \exp\left( (a-1)\log x - b x + a \log b - \log \Gamma(a) \right)

Appendix B. Gaussian based modelling of IS cost

The IS cost can also be modelled via a zero-mean Gaussian with unknown variance [12], as $S(v) \sim \mathcal{N}(S(v); 0, \Lambda(v))$ and $X(v_0) \sim \mathcal{N}(X(v_0); 0, \hat{X}(v_0))$, where $X(v_0) = \sum_{\bar{v}_0} S(v)$. Taking the derivative of the log likelihood of the Gaussian with respect to
