
Algorithms for probabilistic latent tensor factorization

Y. Kenan Yılmaz*, A. Taylan Cemgil

Department of Computer Engineering, Boğaziçi University, 34342 Istanbul, Turkey

* Corresponding author. E-mail addresses: kenan@sibnet.com.tr (Y.K. Yılmaz), taylan.cemgil@boun.edu.tr (A.T. Cemgil).

Article info

Article history:
Received 1 March 2011
Received in revised form 13 July 2011
Accepted 16 September 2011

Keywords:
Tensor factorization
β-divergence
Exponential dispersion models
EM algorithm
Multiplicative update rules
Matricization

Abstract

We propose a general probabilistic framework for modelling multiway data. Our approach establishes a novel link between graphical representation of probability measures and tensor factorization models that allows us to design arbitrary tensor factorization models while retaining simplicity. Using an expectation-maximization (EM) approach for maximizing the likelihood of the exponential dispersion models (EDM), we obtain iterative update equations for Kullback–Leibler (KL), Euclidean (EU) or Itakura–Saito (IS) costs as special cases. Besides EM, we derive alternative algorithms with multiplicative update rules (MUR) and alternating projections. We also provide algorithms for MAP estimation with conjugate priors. All of the algorithms can be formulated as message passing algorithms on a graph where vertices correspond to indices and cliques represent factors of the tensor decomposition.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Advances in computing power, data acquisition and storage technologies have made it possible to collect and process huge amounts of data in many disciplines. In order to extract useful information, effective and efficient computational tools are needed. In this context, matrix factorization techniques have emerged as a useful paradigm [20,27].

Clustering [5], independent component analysis (ICA) [9,13], non-negative matrix factorization (NMF) [23,20], latent semantic indexing (LSI) [10], collaborative filtering [19], topic models [29] and many related methods can be expressed and understood as matrix factorization problems. Thinking of a matrix as the basic data structure maps well onto special purpose hardware (such as a graphics processing unit, GPU) to make algorithms run faster via parallelization. Moreover, matrix computations come with a toolbox of well understood algorithms with precise error analysis, performance guarantees and extremely efficient standard implementations (e.g., SVD).

A useful method in multiway analysis is tensor factorization (TF), used to extract hidden structure in data that consists of more than two entities [18]. However, since there are many more natural ways to factorize a multiway array, there exists a plethora of related models with distinct names, such as canonical decomposition (CP), PARAFAC and TUCKER, to name a few. These are discussed in detail in recent excellent tutorial reviews [18,1]. A recent book [7] outlines various optimization algorithms for non-negative TF for alpha and beta divergences. It is also informative to view matrix and tensor factorization from a statistical perspective, by viewing these as hierarchical probabilistic generative models. This approach is central in probabilistic matrix factorization (PMF) [25] or non-negative matrix factorization [6,12]. Statistical interpretations on similar lines for TF focus on a given factorization structure only, such as CP or TUCKER. For example, sparse non-negative TUCKER is discussed in [22], while probabilistic non-negative PARAFAC is discussed in [24,26].

The motivation behind probabilistic latent tensor factorization (PLTF) is to pave the way to a unifying framework where any arbitrary TF model can be constructed and the associated inference algorithm can be derived automatically using matrix computation primitives. This is very useful in many application domains such as audio processing, network analysis, collaborative filtering or vision, where it becomes necessary to design application specific, tailored factorizations.

The main contributions of this paper are: (1) the derivation of update equations for maximum likelihood (ML) estimation based on expectation maximization (EM), multiplicative update rules (MUR) and alternating projections methods for any arbitrary TF model, for a large class of statistical models (the so-called exponential dispersion models, EDMs); (2) a maximum a posteriori (MAP) estimation framework for EDMs via conjugate priors; (3) a novel representation of TF models that closely resembles probabilistic graphical models [14] (undirected graphs and factor graphs) and the development of a practical message passing framework that exploits this link; and (4) a practical implementation technique via matricization and tensorization.

This paper is organized as follows: Section 2 introduces the notation and the probability model for PLTF and derives the update equation for the special case of the KL divergence; the purpose of this section is to introduce the notation and the graphical representation. Section 3 generalizes the TF problem as the optimization of the likelihood of the exponential dispersion models, including the use of conjugate priors. Section 4 introduces a method for matricizing and tensorizing the element-wise update equations developed in earlier sections, along with examples. Finally, we conclude with a discussion section.

The Appendix contains technical details.

2. Latent tensor factorization (TF) model

We define a tensor $\Lambda$ as a multiway array with an index set $V = \{i_1, i_2, \ldots, i_N\}$, where each index $i_n = 1 \ldots |i_n|$ for $n = 1 \ldots N$. Here, $|i_n|$ denotes the cardinality of the index $i_n$. An element of the tensor $\Lambda$ is a scalar that we denote by $\Lambda(i_1, i_2, \ldots, i_N)$, $\Lambda_{i_1, i_2, \ldots, i_N}$ or, as a shorthand notation, by $\Lambda(v)$. Here, $v$ will be a particular configuration from the product space of all indices in $V$. For our purposes, it will be convenient to define a collection of tensors $Z_{1:N} = \{Z_\alpha\}$ for $\alpha = 1 \ldots N$, sharing a set of indices $V$. Here, each tensor $Z_\alpha$ has a corresponding index set $V_\alpha$ such that $\bigcup_{\alpha=1}^{N} V_\alpha = V$. Then, $v_\alpha$ denotes a particular configuration of the indices for $Z_\alpha$, while $\bar{v}_\alpha$ denotes a configuration of the complement $\bar{V}_\alpha = V / V_\alpha$.

A tensor contraction or marginalization is simply adding the elements of a tensor over a given index set, i.e., for two tensors $\Lambda$ and $\hat{X}$ with index sets $V$ and $V_0$ we write $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)$ or $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v_0, \bar{v}_0)$. To clarify our notation, consider as an example the ordinary matrix multiplication $\hat{X}(i,j) = \sum_k Z_1(i,k) Z_2(k,j)$, which is a tensor contraction operation. Although never done in practical computation, formally we can define $\Lambda(i,j,k) = Z_1(i,k) Z_2(k,j)$ and sum over the index $k$ to find the result. In our formalism, we define $V = \{i,j,k\}$, where $V_0 = \{i,j\}$, $V_1 = \{i,k\}$ and $V_2 = \{k,j\}$. Hence, $\bar{V}_0 = \{k\}$ and we write $\hat{X}(v_0) = \sum_{\bar{v}_0} Z_1(v_1) Z_2(v_2)$.

A tensor factorization (TF) model is the product of a collection of tensors $Z_{1:N} = \{Z_\alpha\}$ for $\alpha = 1 \ldots N$, each defined on the corresponding index set $V_\alpha$, collapsed over a set of indices $\bar{V}_0$. Given a particular TF model, the latent TF problem is to estimate the set of latent tensors $Z_{1:N}$ that solves

minimize $D(X \| \hat{X})$  subject to  $\hat{X}(v_0) = \sum_{\bar{v}_0} \prod_{\alpha} Z_\alpha(v_\alpha)$    (1)

where $X$ is an observed tensor and $\hat{X}$ is the 'prediction'. Here, both objects are defined over the same index set $V_0$ and are compared element-wise. The function $D(\cdot \| \cdot) \geq 0$ is a cost function. In this paper, we first use the Kullback–Leibler (KL) divergence as our cost. Later, in the generalization section, we introduce a form of the β-divergence that unifies the Euclidean (EU), KL and Itakura–Saito (IS) cost functions.

Example 1 (TUCKER3 factorization). The TUCKER3 factorization [17,18] aims to find $Z_\alpha$ for $\alpha = 1 \ldots 4$ that solve the following optimization problem:

minimize $D(X \| \hat{X})$  subject to  $\hat{X}_{i,j,k} = \sum_{p,q,r} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r} \quad \forall i,j,k$    (2)

Both for visualization and for efficient computation, it is useful to introduce a graphical notation to represent the factorization implied by a particular TF model. We define an undirected graph $G = (V, E)$ with vertex set V and edge set E, and associate each index with a vertex of G. For each pair of indices appearing in a factor index set $V_\alpha$, for $\alpha = 1 \ldots N$, we add an edge to the edge set of the graph. Consequently, each clique (fully connected subgraph) of G corresponds to a factor index set $V_\alpha$. Table 1 illustrates the representation of several popular models.

In the following section, we introduce a probabilistic model in which we cast the minimization problem as an equivalent maximum likelihood estimation problem, i.e., solving the TF problem (1) will be equivalent to maximization of $\log p(X | Z_{1:N})$ with respect to $Z_\alpha$ [8,12].

2.1. Probability model

The PLTF generative model, represented as the directed acyclic graph (DAG) in Table 1, is defined as

\Lambda(v) = \prod_{\alpha}^{N} Z_\alpha(v_\alpha)    (intensity)    (3)

S(v) \sim \mathcal{PO}(S; \Lambda(v))    (KL cost)    (4)

X(v_0) = \sum_{\bar{v}_0} S(v)    (observation)    (5)

\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)    (parameter)    (6)

M(v_0) = \begin{cases} 0 & X(v_0) \text{ is missing} \\ 1 & \text{otherwise} \end{cases}    (mask array)    (7)

Here, we refer to $\Lambda(v)$ as the intensity or the latent intensity field. Due to the reproductivity property [21, p. 217] of the Poisson density, the observation $X(v_0)$ has the same type of distribution as $S(v)$. Moreover, missing data is handled smoothly via the likelihood [25,6]

p(X, S | Z) = \prod_{v} \left( p(X(v_0) | S(v))\, p(S(v) | \Lambda(v)) \right)^{M(v_0)}    (8)
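For intuition, the generative model (3)-(5) can be simulated directly. The following Matlab/Octave sketch samples a small NMF-structured data set; it assumes the Statistics Toolbox functions gamrnd and poissrnd are available, and all sizes and variable names are illustrative rather than taken from the paper.

    % Sampling from the PLTF generative model (3)-(5) for the NMF case V = {i,j,r}.
    I = 20; J = 30; R = 3;                    % illustrative sizes
    Z1 = gamrnd(1, 1, I, R);                  % non-negative factors (any non-negative values work)
    Z2 = gamrnd(1, 1, R, J);
    S  = zeros(I, R, J);                      % latent tensor S(i,r,j)
    for r = 1:R
        Lam = Z1(:, r) * Z2(r, :);            % intensity Lambda(i,r,j) = Z1(i,r)*Z2(r,j)
        S(:, r, :) = reshape(poissrnd(Lam), I, 1, J);
    end
    X = squeeze(sum(S, 2));                   % observation X(i,j) = sum_r S(i,r,j)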

2.2. Fixed point update equation for the KL cost

The log likelihood $\mathcal{L}_{KL}$ is given as

\mathcal{L}_{KL} = \sum_{v} M(v_0) \left( S(v) \log \Lambda(v) - \Lambda(v) - \log(S(v)!) \right)    (9)

subject to the constraint $X(v_0) = \sum_{\bar{v}_0} S(v)$ whenever $M(v_0) = 1$.

We can easily optimize $\mathcal{L}_{KL}$ for $Z_\alpha$ by an EM algorithm. In the E-step we calculate the posterior expectation $\langle S(v) \rangle_{S(v) | X(v_0)}$ by identifying the posterior $p(S | X, Z)$ as a multinomial distribution [21,6]. In the M-step we solve the optimization problem $\partial \mathcal{L}_{KL} / \partial Z_\alpha(v_\alpha) = 0$ to get the fixed point update

(E-step)  \langle S(v) \rangle_{S(v) | X(v_0)} = \frac{X(v_0)}{\hat{X}(v_0)}\, \Lambda(v)    (10)

(M-step)  Z_\alpha(v_\alpha) = \frac{\sum_{\bar{v}_\alpha} M(v_0)\, \langle S(v) \rangle_{S(v) | X(v_0)}}{\sum_{\bar{v}_\alpha} M(v_0)\, \partial \Lambda(v) / \partial Z_\alpha(v_\alpha)}    (11)

with the following equalities:

\frac{\partial \Lambda(v)}{\partial Z_\alpha(v_\alpha)} = \partial_\alpha \Lambda(v) = \prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'}), \qquad \Lambda(v) = Z_\alpha(v_\alpha)\, \partial_\alpha \Lambda(v)    (12)

After substituting (10) in (11) and noting that $Z_\alpha(v_\alpha)$ is independent of the sum $\sum_{\bar{v}_\alpha}$, we obtain the following multiplicative fixed point iteration for $Z_\alpha$:

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} M(v_0)\, \frac{X(v_0)}{\hat{X}(v_0)}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} M(v_0)\, \partial_\alpha \Lambda(v)}    (13)

Example 2 (KL-NMF update). It is easy to see that the multiplicative NMF algorithm with KL cost appears as a special case:

\hat{X}_{i,j} = \sum_{r} Z^1_{i,r} Z^2_{r,j}, \qquad Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_j (M_{i,j} X_{i,j} / \hat{X}_{i,j})\, Z^2_{r,j}}{\sum_j Z^2_{r,j}}    (14)
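In code, the update (14) and its counterpart for $Z^2$ amount to a handful of matrix operations. The following Matlab/Octave lines are a minimal sketch using the masked form of (13); X, M, R and MAXITER are assumed inputs and all names are illustrative.

    % Minimal sketch of masked KL-NMF multiplicative updates, cf. (13) and (14).
    [I, J] = size(X);                          % X: non-negative data, M: binary mask of the same size
    Z1 = rand(I, R); Z2 = rand(R, J);          % random non-negative initialization
    for it = 1:MAXITER
        Xhat = Z1 * Z2;
        Z1 = Z1 .* (((M .* X ./ Xhat) * Z2') ./ (M * Z2'));
        Xhat = Z1 * Z2;
        Z2 = Z2 .* ((Z1' * (M .* X ./ Xhat)) ./ (Z1' * M));
    end

With M equal to the all-ones matrix, the denominators reduce to the plain column sums of (14).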

2.3. The priors and the constraints

In this section we incorporate the prior information into the PLTF. The conjugate prior for the Poisson observation model is the Gamma density ZaðvaÞ GðZaðvaÞ; AaðvaÞ, BaðvaÞÞ. We optimize the log posterior for ZaðvaÞas below noting that the first part is due to the log likelihood of the Poisson while the second part is due to the conjugate gamma prior

@L

@ZaðvaÞ¼ X

va

SðvÞLðvÞ LðvÞ @aLðvÞ 0

@

1

Aþ AaðvaÞ1 ZaðvaÞ BaðvaÞ

 

ð15Þ

ZaðvaÞ’

ðAaðvaÞ1Þ þ ZaðvaÞP

vaMðv0ÞXðv0Þ X ðv^ 0Þ@aLðvÞ BaðvaÞ þP

vaMðv0Þ@aLðvÞ ð16Þ

This update converges to the mode of the posterior distribution $p(Z_{1:N} | X)$. It simply computes the mode of the full conditional $p(Z_\alpha | X, Z_{-\alpha})$, which is a Gamma distribution for each element $Z_\alpha(v_\alpha)$. Here, $Z_{-\alpha}$ denotes all other factors $Z_{\alpha'}$ for $\alpha' = 1 \ldots N$ such that $\alpha' \neq \alpha$.

This prior can be used to impose sparsity: we can take the prior $\mathcal{G}(x; a, a/m)$ with mean $m$ and variance $m^2/a$. For small $a$, most of the elements of $Z$ are expected to be around zero, with only a few large ones.
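As an illustration of (16), the MAP update for the first factor of the NMF model reduces to a one-line matrix expression. The Matlab/Octave sketch below assumes element-wise Gamma hyperparameter arrays A1 and B1 of the same size as Z1, together with X, M, Z1 and Z2 as in the earlier NMF sketch; the names are ours.

    % MAP update (16) for Z1 in the NMF model Xhat = Z1*Z2 with Gamma(A1, B1) priors.
    Xhat = Z1 * Z2;
    Num  = (A1 - 1) + Z1 .* ((M .* X ./ Xhat) * Z2');   % (A-1) + Z * sum_j M (X/Xhat) dLambda
    Den  = B1 + M * Z2';                                % B + sum_j M dLambda
    Z1   = Num ./ Den;

Setting A1 = a and B1 = a/m element-wise with a small a gives the sparsity-inducing prior described above.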

2.4. Relation to graphical models

Table 1
Graphical representation of popular TF models (MF: matrix factorization, CP: PARAFAC or the canonical decomposition, and a TUCKER decomposition). Any TF model can be visualized using the semantics of undirected graphical models, where cliques (fully connected subgraphs) correspond to individual factors and vertices correspond to indices. The shaded vertices are latent, i.e., correspond to indices that are not elements of V_0, the index set of X. Whilst algebraically identical to a probabilistic undirected graphical model, our representation merely captures the factorization of the TF model and does not have a probabilistic semantics (we do not represent a probability measure and the indices are not random variables). The underlying probability model of PLTF in the conventional sense is given by the DAG on the left (as a Bayesian network model), where X is the observed multiway data and the Z_α are the model parameters. The latent tensor S allows us to treat the problem in a data augmentation setting and apply the EM algorithm.

                        MF                        CP                                   TUCKER3
 Model     V            {i,j,r}                   {i,j,k,r}                            {i,j,k,p,q,r}
 Observed  V_0          {i,j}                     {i,j,k}                              {i,j,k}
 Latent    V̄_0          {r}                       {r}                                  {p,q,r}
 Factors   V_1, V_2     {i,r}, {j,r}              {i,r}, {j,r}                         {i,p}, {j,q}
           V_3, V_4                               {k,r}                                {k,r}, {p,q,r}
 Estimate  X̂            Σ_r Z^1_{i,r} Z^2_{r,j}   Σ_r Z^1_{i,r} Z^2_{j,r} Z^3_{k,r}    Σ_{p,q,r} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}

An important observation that leads to computational savings and a compact representation of the element-wise update equation is that the update (13) consists of structurally similar terms in both the denominator and the numerator. For large TF models, this structure needs to be exploited for computational efficiency. Therefore, we define the following tensor valued function, where $|A|$ denotes the dimensionality of the object A.

Definition 1. We define a tensor valued function $\Delta_\alpha(Q): \mathbb{R}^{|Q|} \to \mathbb{R}^{|Z_\alpha|}$

\Delta_\alpha^{p}(Q) = \left[ \sum_{\bar{v}_\alpha} Q(v_0) \left( \partial_\alpha \Lambda(v) \right)^{p} \right]    (17)

where $\Delta_\alpha(Q)$ and the argument $Q$ are tensors with the same size as $Z_\alpha$ and the observation $X$, respectively. The non-negative integer $p$ on the right-hand side denotes an element-wise power, while on the left it should be interpreted as a parameter of the function. Using this notation, we rewrite the update (13) compactly as

Z_\alpha \leftarrow Z_\alpha \circ \frac{\Delta_\alpha(M \circ X / \hat{X})}{\Delta_\alpha(M)}    (18)

where $\circ$ and $/$ stand for element-wise multiplication and division, respectively. Later we develop the explicit matrix forms of these updates.

Interestingly, computing $\Delta_\alpha(Q)$ is equivalent to computing a tensor contraction, a marginal potential over the vertices $\bar{V}_\alpha$. More precisely, we construct a 'joint tensor' $P = X(v_0) \prod_\alpha Z_\alpha(v_\alpha)$. For each $\alpha$, we remove $Z_\alpha$ from $P$, which we denote as $P / Z_\alpha$, and compute the 'marginal tensor' by summing over $\bar{V}_\alpha$. An example for the TUCKER model is shown in Fig. 1. This operation can be viewed as the computation of certain marginal densities given an undirected graphical model [14,29].

3. Generalization of TF for the β-divergence

The development of the update equations for TF with the EU and IS costs can proceed similarly to Section 2 by considering the latent tensor field $S(v)$ as

S(v) \sim \mathcal{N}(S(v); \Lambda(v), 1)    (EU cost)    (19)

S(v) \sim \mathcal{N}(S(v); 0, \Lambda(v))    (IS cost)    (20)

However, we prefer a single development for the β-divergence that unifies the EU, KL and IS costs in the following expression for $p = \{0, 1, 2\}$, respectively [7]:

d_\beta(x, y) =
\begin{cases}
\frac{1}{(2-p)(1-p)} \left( x^{2-p} + (1-p)\, y^{2-p} - (2-p)\, x y^{1-p} \right) & p = 0 \ (\mathrm{EU}) \\
x \log\frac{x}{y} - x + y & p = 1 \ (\mathrm{KL}) \\
\frac{x}{y} - \log\frac{x}{y} - 1 & p = 2 \ (\mathrm{IS})
\end{cases}    (21)

where we keep $p$ in the first line since that expression is valid for a range of $p$, whereas $p = 0$ is only the special case corresponding to the EU cost. The β-divergence is associated with the Tweedie distributions, which belong to the exponential dispersion models (EDM) [15]. The dispersion models relate the variance of the distribution to some power $p$ of its mean $\Lambda(v)$ as $\mathrm{Var}(S(v)) = \varphi_s^{-1}\, v(\Lambda(v))$, with $\varphi_s^{-1}$ being the dispersion (or scale) parameter and $v(\Lambda(v)) = \Lambda(v)^p$ being the variance function [15].
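To make the three special cases in (21) concrete, the following small Matlab/Octave function evaluates the element-wise β-divergence for p = 0, 1, 2; it is only a sketch for monitoring costs during iterations, and the function name is ours.

    function d = beta_divergence(x, y, p)
    % Element-wise beta-divergence d_p(x, y) of (21) for p = 0 (EU), 1 (KL), 2 (IS).
    switch p
        case 0                      % Euclidean: (1/2)(x - y)^2
            d = 0.5 * (x - y).^2;
        case 1                      % Kullback-Leibler: x log(x/y) - x + y
            d = x .* log(x ./ y) - x + y;
        case 2                      % Itakura-Saito: x/y - log(x/y) - 1
            d = x ./ y - log(x ./ y) - 1;
        otherwise
            error('p must be 0, 1 or 2 in this sketch');
    end
    end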

This section follows a similar outline to the previous section: we obtain various update rules for TF models with the β-divergence via EM. Here we present the main results and leave the technical details to Appendix C. In addition, to keep the notation simpler, we will drop iteration indices from the update equations and ignore the missing mask array M, assuming all elements of X are observed.

3.1. EM updates for the ML estimate

Starting with the E-step, the posterior expectation $\langle S(v) | X(v_0) \rangle$ for the EDMs is identified as

(E-step)  \langle S(v) | X(v_0) \rangle = \Lambda(v) + \frac{\Lambda(v)^p}{\sum_{\bar{v}_0} \Lambda(v)^p} \left( X(v_0) - \hat{X}(v_0) \right)    for p \geq 0    (22)

(E-step)  \langle S(v) | X(v_0) \rangle = \Lambda(v) + \rho\, \frac{\Lambda(v)^p}{\hat{X}(v_0)^p} \left( X(v_0) - \hat{X}(v_0) \right)    for p = \{0, 1, 2\}    (23)

where $\rho$ is the ratio of the dispersion parameters $\varphi_s^{-1} / \varphi_x^{-1}$, set to $\{1/K,\ 1,\ \hat{X}(v_0)/\Lambda(v)\}$ for $p = \{0, 1, 2\}$. Here K is the cardinality of the invisible vertex set $\bar{V}_0$, i.e. $K = |\bar{V}_0|$, recalling $\hat{X}(v_0) = \sum_{\bar{v}_0} \Lambda(v)$. Note that for $p = \{0, 1, 2\}$, i.e. for the Gaussian, Poisson and gamma (for i.i.d. variables), both expressions are identical. Indeed, for $p = 1$ they recover $\langle S(v) | X(v_0) \rangle = (X(v_0)/\hat{X}(v_0))\, \Lambda(v)$, the posterior expectation of the Poisson identified in Section 2, and for $p = 0$ they recover the posterior expectation of the Gaussian given in Table 2. When assuming i.i.d. variables with the gamma distribution, the ratio $\rho$ is further parameterized as $\rho = K^{p-1}$.

In this section we use the posterior expectation in (23), which is the special case for the Gaussian, Poisson and gamma, since this expression simplifies better when plugged into the M-step than the general case (22), which carries an extra term $\sum_{\bar{v}_0} \Lambda(v)^p$. However, for the cases $p \notin \{0, 1, 2\}$, e.g. for $p \in (1, 2)$ (the compound Poisson) or for $p = 3$ (inverse Gaussian) [4], one can follow similar steps for the derivation of the update rules, i.e. plug (22) as the posterior expectation into the M-step.

Fig. 1. Undirected graphical models for the computation of the $\Delta_\alpha$ of the TUCKER3 decomposition. (a) Joint tensor P defined as $P = X \prod_\alpha Z_\alpha$; (b) graph for $P/Z_1$ used for the computation of $\Delta_1$, where $V_1 = \{i,p\}$; (c) graph for $P/Z_4$ used for $\Delta_4$, corresponding to the core tensor, where $V_4 = \{p,q,r\}$. Double-circled indices are not summed over. Rather than using a naive direct approach, the contraction can be computed efficiently by distributing the summation over the product, a procedure that is algebraically identical to variable elimination for inference in probabilistic graphical models.

The M-step with the expectation in (23) then gives the ML estimate

(M-step)  Z_\alpha(v_\alpha) = \frac{\sum_{\bar{v}_\alpha} \langle S(v) | X(v_0) \rangle\, \partial_\alpha \Lambda(v)^{1-p}}{\sum_{\bar{v}_\alpha} \partial_\alpha \Lambda(v)^{2-p}}    (24)

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^p\, \frac{\sum_{\bar{v}_\alpha} \rho\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \partial_\alpha \Lambda(v)^{2-p}}    (25)

For $p = 1$ this recovers the update for KL developed in Section 2. For $p = 2$ (IS cost) it can be compared to the Gaussian variance modelling of the IS cost given in Appendix B.
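As a concrete instance of (25), consider the NMF model $\hat{X} = Z^1 Z^2$ with the EU cost (p = 0), for which $\rho = 1/K$ and K equals R, the size of the latent index. A minimal Matlab/Octave sketch of one such sweep, continuing with the variables of the earlier NMF sketch, is:

    % One EM update (25) for Z1 in the NMF model with EU cost (p = 0), rho = 1/R.
    Xhat = Z1 * Z2;
    E    = X - Xhat;                                   % residual X - Xhat
    D    = repmat(sum(Z2.^2, 2)', size(Z1, 1), 1);     % sum_j dLambda^2 = sum_j Z2(r,j)^2, replicated over rows
    Z1   = Z1 + (E * Z2') ./ (R * D);                  % Z1 <- Z1 + sum_j (1/R)(X - Xhat) dLambda / sum_j dLambda^2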

3.2. Multiplicative updates for the ML estimate

Although the EM update can also be formulated in the multiplicative form as

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} \left( \rho\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p} + \Lambda(v)^{1-p} \right) \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \Lambda(v)^{1-p}\, \partial_\alpha \Lambda(v)}    (26)

in general this is different from the multiplicative update rule (MUR) in the sense of [20], due to the subtraction (potentially resulting in a negative value) in the numerator of (26). The MUR equations were popularized by [20] for the NMF and have been extensively analyzed [7,12]. They ensure non-negative parameter updates as long as the initial values are non-negative. For the PLTF models, the MUR update equation is given as

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (27)

While this coincides with the EM update for the KL case (p = 1), it differs for the general β-divergence. This equation successfully recovers the NMF updates in [20] and the update for CP for the β-divergence in [7, such as Eq. (7.59)]. The realization of the EM_ML and MUR_ML update equations for p = {0, 1, 2} is given in Table 2. Further details can be found in Appendix C.2.

The multiplicative update rules MUR_ML can be obtained by plugging the heuristic $\Lambda(v)^{1-p} = \rho\, \hat{X}(v_0)^{1-p}$ into the EM_ML update (26). A mathematical justification of this substitution would directly prove the convergence of MUR; however, while we always observe convergence in practice, it is an open problem to rigorously establish the convergence of MUR for the general β-divergence.

Example 3 (CP MUR update for Z^1 for KL and EU costs).

Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_{j,k} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^2_{j,r} Z^3_{k,r}}{\sum_{j,k} Z^2_{j,r} Z^3_{k,r}}, \qquad Z^1_{i,r} \leftarrow Z^1_{i,r}\, \frac{\sum_{j,k} X_{i,j,k}\, Z^2_{j,r} Z^3_{k,r}}{\sum_{j,k} \hat{X}_{i,j,k}\, Z^2_{j,r} Z^3_{k,r}}    (28)
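The KL update in (28) can be run directly on the mode-1 unfolding using a Khatri–Rao product, as in the following Matlab/Octave sketch for a fully observed third-order array X with latent dimension R. It is an illustration with our own variable names; the Khatri–Rao product is formed explicitly rather than with a toolbox routine.

    % CP/PARAFAC MUR update (28), KL cost, for the first factor Z1 (I x R).
    [I, J, K] = size(X);
    X1 = reshape(X, I, J*K);                   % mode-1 unfolding, column index (k-1)*J + j
    KR = zeros(J*K, R);                        % Khatri-Rao product of Z3 (K x R) and Z2 (J x R)
    for r = 1:R
        KR(:, r) = kron(Z3(:, r), Z2(:, r));   % column-wise Kronecker product
    end
    Xhat1 = Z1 * KR';                          % prediction, mode-1 unfolded
    Z1 = Z1 .* (((X1 ./ Xhat1) * KR) ./ (ones(I, J*K) * KR));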

Example 4 (TUCKER3 MUR update for Z^1 and Z^4 (core tensor) for EU cost).

Z^1_{i,p} \leftarrow Z^1_{i,p}\, \frac{\sum_{j,k,q,r} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}}{\sum_{j,k,q,r} Z^2_{j,q} Z^3_{k,r} Z^4_{p,q,r}}, \qquad Z^4_{p,q,r} \leftarrow Z^4_{p,q,r}\, \frac{\sum_{i,j,k} (X_{i,j,k}/\hat{X}_{i,j,k})\, Z^1_{i,p} Z^2_{j,q} Z^3_{k,r}}{\sum_{i,j,k} Z^1_{i,p} Z^2_{j,q} Z^3_{k,r}}    (29)

3.3. Direct solution via alternating projections

The method of alternating projections attempts to solve the estimation problem by optimizing one variable while keeping the others fixed [7, Chapter 4]. It is a direct solution obtained by equating the gradient to zero and solving directly as

\sum_{\bar{v}_\alpha} X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v) = \sum_{\bar{v}_\alpha} \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)    (30)

Table 2
(Up) EM update equations and posterior expectations. Note that the variance function v(Λ(v)) is some power of the expectation Λ(v). (Down) The EM update (multiplicative form) and the MUR update are compared. All sums run over v̄_α, and the arguments (v), (v_0) and (v_α) are suppressed for brevity.

(Up)
 β (general p):  v(Λ) = Λ^p;   EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ^{1-p} / Σ ∂_αΛ^{2-p};   ρ = φ_x/φ_s;   ⟨S|X⟩ = Λ + ρ (Λ^p/X̂^p)(X - X̂)
 EU (p = 0):     v(Λ) = 1;     EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ / Σ ∂_αΛ^2;             ρ = 1/K;       ⟨S|X⟩ = Λ + (1/K)(X - X̂)
 KL (p = 1):     v(Λ) = Λ;     EM: Z_α = Σ ⟨S|X⟩ / Σ ∂_αΛ;                    ρ = 1;         ⟨S|X⟩ = Λ X/X̂
 IS (p = 2):     v(Λ) = Λ^2;   EM: Z_α = Σ ⟨S|X⟩ ∂_αΛ^{-1} / Σ 1;             ρ = X̂/Λ;       ⟨S|X⟩ = Λ + (Λ/X̂)(X - X̂)

(Down)
 β:   EM:  Z_α ← Z_α Σ (ρ(X - X̂)X̂^{-p} + Λ^{1-p}) ∂_αΛ / Σ Λ^{1-p} ∂_αΛ
      MUR: Z_α ← Z_α Σ X X̂^{-p} ∂_αΛ / Σ X̂^{1-p} ∂_αΛ
 EU:  EM:  Z_α ← Σ (Λ + (1/K)(X - X̂)) ∂_αΛ / Σ ∂_αΛ^2
      MUR: Z_α ← Z_α Σ X ∂_αΛ / Σ X̂ ∂_αΛ
 KL:  EM:  Z_α ← Z_α Σ X X̂^{-1} ∂_αΛ / Σ ∂_αΛ
      MUR: Z_α ← Z_α Σ X X̂^{-1} ∂_αΛ / Σ ∂_αΛ
 IS:  EM:  Z_α ← Z_α Σ ((X - X̂)X̂^{-1} + 1) / Σ 1
      MUR: Z_α ← Z_α Σ X X̂^{-2} ∂_αΛ / Σ X̂^{-1} ∂_αΛ

In the matricization section we show how this is further simplified for the EU cost (p = 0) by use of the pseudo-inverse operation, where it turns into alternating least squares (ALS).

3.4. Update rules for MAP estimation

We can incorporate prior belief in the form of a conjugate prior $p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha))$, with $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$ being the hyperparameters for $Z_\alpha(v_\alpha)$. As specified in [11], $N^0_\alpha(v_\alpha)$ might be thought of as a prior sample size, while $Z^0_\alpha(v_\alpha)$ is a prior expectation of the mean parameter. The exact definitions of $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$ can be identified for each distribution separately by following the definition of a conjugate prior given in Appendix C.3. The EM_MAP estimate is

Z_\alpha(v_\alpha) = \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \langle S(v) | X(v_0) \rangle\, \varphi_s\, \partial_\alpha \Lambda(v)^{1-p}}{N^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \varphi_s\, \partial_\alpha \Lambda(v)^{2-p}}

or, in multiplicative form,

Z_\alpha(v_\alpha) \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha) \sum_{\bar{v}_\alpha} \left\{ \varphi_s\, \partial_\alpha \Lambda(v)^{1-p} + Z_\alpha(v_\alpha)^{p-1}\, \varphi_x\, (X(v_0) - \hat{X}(v_0))\, \hat{X}(v_0)^{-p} \right\} \partial_\alpha \Lambda(v)}{N^0_\alpha(v_\alpha) + \sum_{\bar{v}_\alpha} \varphi_s\, \partial_\alpha \Lambda(v)^{2-p}}    (31)

while the multiplicative update rule MUR_MAP becomes

Z_\alpha(v_\alpha) \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^p \sum_{\bar{v}_\alpha} \varphi_x\, X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{N^0_\alpha(v_\alpha) + Z_\alpha(v_\alpha)^{p-1} \sum_{\bar{v}_\alpha} \varphi_x\, \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (32)

Prior belief can always be specified in terms of $N^0_\alpha(v_\alpha)$ and $Z^0_\alpha(v_\alpha)$. However, if the prior distribution is to be specified explicitly, they can be tied to the prior parameters similarly to [16].

Example 5 (Conjugate prior for the Poisson (p = 1)). The Gamma distribution is the conjugate prior for the Poisson $p(S | Z)$, as $p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)) = \mathcal{G}(Z_\alpha(v_\alpha); A_\alpha(v_\alpha), B_\alpha(v_\alpha))$ with

N^0_\alpha(v_\alpha) = B_\alpha(v_\alpha), \qquad N^0_\alpha(v_\alpha)\, Z^0_\alpha(v_\alpha) = A_\alpha(v_\alpha) - 1    (33)

With these, for p = 1, EM_MAP and MUR_MAP successfully recover the update for KL (16) in Section 2.

Example 6 (Conjugate prior for the Gaussian mean (p = 0)). The Gaussian distribution is the conjugate prior for the Gaussian $p(S | Z, \Sigma)$ with unknown mean and known variance $\Sigma$, with

p(Z_\alpha(v_\alpha) | N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)) = \mathcal{N}(Z_\alpha(v_\alpha); A_\alpha(v_\alpha), B_\alpha(v_\alpha))    (34)

N^0_\alpha(v_\alpha) = \Sigma / B_\alpha(v_\alpha), \qquad Z^0_\alpha(v_\alpha) = A_\alpha(v_\alpha)    (35)

3.5. Missing data and tensor forms

It is straightforward to handle missing data in the EM and MUR updates. We recall that the scalar value $M(v_0)$ is simply a multiplier inside the sum $\sum_{\bar{v}_\alpha}$ of the initial M-step equation $\sum_{\bar{v}_\alpha} M(v_0) \left( \langle S(v) | X(v_0) \rangle - \Lambda(v) \right) \varphi_s\, \Lambda(v)^{-p}\, \partial_\alpha \Lambda(v)$. Hence we simply put it back inside the sums $\sum_{\bar{v}_\alpha}$. For example, the MUR_ML update becomes

Z_\alpha(v_\alpha) \leftarrow Z_\alpha(v_\alpha)\, \frac{\sum_{\bar{v}_\alpha} M(v_0)\, X(v_0)\, \hat{X}(v_0)^{-p}\, \partial_\alpha \Lambda(v)}{\sum_{\bar{v}_\alpha} M(v_0)\, \hat{X}(v_0)^{1-p}\, \partial_\alpha \Lambda(v)}    (36)

As we did in Section 2, we represent all the update equations in tensor form by use of the $\Delta_\alpha$ abstraction. In Table 3 we give a summary of the general update rules in tensor form. For alternating projections, considering the least squares approximation, it is natural to set p = 0 and make use of the pseudo-inverse: we need to solve $\Delta_\alpha(X) = \Delta_\alpha(\hat{X})$. This equation leads to a linear matrix equation of the form $L X R = L \hat{X} R$, where L and R are structured matrices determined by the form of the TF model. It can be solved via pseudo-inverses in a least squares sense when there is no missing data (i.e. M = 1). For general M, we could not derive a compact equation without considering tedious reindexings.

Table 3
Updates in tensor form via the Δ_α(·) function. All multiplications and divisions are element-wise.

 EM:   Q_1 = M ∘ (X - X̂) ∘ X̂^{-p},   Q_2 = M
   ML:   Z_α ← Z_α + Z_α^p ∘ Δ_α(ρ Q_1) / Δ_α^{2-p}(Q_2)
   MAP:  Z_α ← [N^0_α ∘ Z^0_α + Z_α ∘ Δ_α^{2-p}(φ_s Q_2) + Z_α^p ∘ Δ_α(φ_x Q_1)] / [N^0_α + Δ_α^{2-p}(φ_s Q_2)]
 MUR:  Q_1 = M ∘ X ∘ X̂^{-p},   Q_2 = M ∘ X̂^{1-p}
   ML:   Z_α ← Z_α ∘ Δ_α(Q_1) / Δ_α(Q_2)
   MAP:  Z_α ← [N^0_α ∘ Z^0_α + Z_α^p ∘ Δ_α(φ_x Q_1)] / [N^0_α + Z_α^{p-1} ∘ Δ_α(φ_x Q_2)]
 Alternating projection:
   ML:   solve Δ_α(X) = Δ_α(X̂), assuming M = 1

4. Representation and implementation

For the tensorized forms of the update equations, such as the KL update $Z_\alpha = Z_\alpha \circ \Delta_\alpha(M \circ X / \hat{X}) / \Delta_\alpha(M)$, the main task is the computation of the $\Delta_\alpha(\cdot)$ function. Below, we define a matricization procedure that converts $\Delta_\alpha(\cdot)$ (indeed, any element-wise equation) into matrix form in terms of matrix multiplication primitives such as the Kronecker product and the Khatri–Rao product.

4.1. Matricization

Matricization as defined originally in [17,18] is the operation of converting a multiway array into a matrix by reordering the column fibers. In this paper we refer to this definition as 'unfolding', and refer to matricization as the procedure of converting an element-wise equation into a corresponding matrix form. We use Einstein's summation convention, where repeated indices are summed over. The conversion rules are given in Table 4. Our notation is best illustrated with an example: consider a matrix $X_{i,j}$ with row index i and column index j. If we assume a column-by-column memory layout, we refer to the vectorization vec X (vertical concatenation of columns) as $X^{ji}$, adopting a 'the faster index last' convention, and we drop the comma. Here i is the faster index, since when traversing the elements of the matrix X in sequence, i changes more rapidly. With this, we arrive at the following definition.

Definition 2. Consider a multiway array $X \in \mathbb{R}^{I_1 \times \cdots \times I_M}$ with a generic element denoted by $X_{i_1, i_2, \ldots, i_M}$. The mode-n unfolding of X is the matrix $X_{(n)} \in \mathbb{R}^{I_n \times \prod_{k \neq n} I_k}$ with row index $i_n$, written as

X_{(n)} \equiv X^{\,i_M \ldots i_{n+1} i_{n-1} \ldots i_2 i_1}_{\,i_n}    (37)

where the fastest index is in the order $i_1, i_2, \ldots, i_M$.

Here we follow the natural ordering: for the mode-1 unfolding, mode-2 rows are placed before mode-3 rows; similarly, for the mode-2 unfolding, mode-1 rows are placed before mode-3 rows, and so on. Hence, during matricization we start with the scalar terms and, while sticking the desired (outcome) indices up, we may freely reorder the terms inside the sum, transpose them, unfold them, and join them using the Kronecker and Khatri–Rao products. In addition, if needed, we may introduce a ones matrix 1 of any size when the indices of the terms are insufficient, as in $Y^1_q = \sum_p Y^p_q = \sum_p 1^1_p Y^p_q = (1Y)^1_q$, where the index p is marginalized out.
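A compact way to realize the mode-n unfolding of Definition 2 in Matlab/Octave is via permute and reshape; the sketch below follows the natural ordering described above. The function name is ours and is not the N-way Toolbox routine nshape, whose ordering should be checked before substituting one for the other.

    function Xn = unfold(X, n)
    % Mode-n unfolding of a multiway array X (Definition 2): rows indexed by i_n,
    % columns by the remaining indices in the natural (column-major) ordering.
    sz    = size(X);
    order = [n, setdiff(1:ndims(X), n)];   % bring mode n to the front
    Xn    = reshape(permute(X, order), sz(n), []);
    end

For a three-way array, unfold(X, 1) coincides with reshape(X, size(X,1), []), since mode 1 is already the fastest index in column-major storage.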

Below are examples of matricization and of the MUR and ALS updates; Table 5 summarizes the updates for well-known models. To get rid of the annoying subindices, here we relax the notation by using A, B, C, G for the factors $Z_\alpha$.

Example 7 (Matricization of TUCKER3 factorization). For the mode-1 unfolding $\hat{X}_{(1)}$:

\hat{X}_{i,j,k} = \sum_{p,q,r} G_{p,q,r}\, A_{i,p} B_{j,q} C_{k,r}    (start with the element-wise equation)    (38)

(\hat{X}_{(1)})^{kj}_{i} = (G_{(1)})^{rq}_{p}\, A^{p}_{i} B^{q}_{j} C^{r}_{k}    (place row/column indices)
 = (A G_{(1)})^{rq}_{i}\, (C \otimes B)^{rq}_{kj}    (reorder and form the Kronecker product)
 = ((A G_{(1)})(C \otimes B)^T)^{kj}_{i}    (transpose and drop matching indices)

\hat{X}_{(1)} = A G_{(1)} (C \otimes B)^T    (remove remaining indices)    (39)

Example 8 (Update equations of TUCKER3 factorization). The functions $\Delta_A(Q)$ and $\Delta_G(Q)$ are

\Delta_A(Q) = (Q_{(1)})^{kj}_{i}\, B^{q}_{j} C^{r}_{k} G^{rq}_{p} = Q_{(1)} (C \otimes B) G_{(1)}^T    (40)

\Delta_G(Q) = (Q_{(1)})^{kj}_{i}\, A^{p}_{i} B^{q}_{j} C^{r}_{k} = A^T Q_{(1)} (C \otimes B)    (41)

For MUR_ML, the general form of the update is $Z_\alpha \leftarrow Z_\alpha \circ \Delta_\alpha(Q_1) / \Delta_\alpha(Q_2)$ with $Q_1 = M \circ X \circ \hat{X}^{-p}$ and $Q_2 = M \circ \hat{X}^{1-p}$. Then

A \leftarrow A \circ \frac{Q_{1(1)} (C \otimes B) G_{(1)}^T}{Q_{2(1)} (C \otimes B) G_{(1)}^T}, \qquad G_{(1)} \leftarrow G_{(1)} \circ \frac{A^T Q_{1(1)} (C \otimes B)}{A^T Q_{2(1)} (C \otimes B)}    (42)

where, for example, for KL (p = 1) we evaluate $Q_1 = M \circ (X / \hat{X})$ and $Q_2 = M$.

Table 4
Index notation used to matricize an element-wise equation into matrix form. Following the Einstein convention, duplicate indices are summed over. The Khatri–Rao product and mode-n unfolding are implemented in the N-way Toolbox [3].

 Equivalence                           Matlab         Remark
 X^j_i ≡ X                             X              Matrix notation
 X^{kj}_i ≡ X_(1)                      nshape(X,1)    Array (mode-1 unfolding)
 X^j_i ≡ (X^T)^i_j                     X'             Transpose
 vec X ≡ X^1_{ji}                      X(:)           Vectorize
 X^j_i Y^p_j = (XY)^p_i                X*Y            Matrix product
 X^p_i Y^p_j = (X ⊙ Y)^p_{ij}          krb(X,Y)       Khatri–Rao product
 X^p_i Y^q_j = (X ⊗ Y)^{pq}_{ij}       kron(X,Y)      Kronecker product
 X^j_i X^j_i = (X ∘ X)^j_i             X.*X           Hadamard product
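Since the Khatri–Rao product appears repeatedly in Table 5, a self-contained Matlab/Octave implementation is sketched below for readers without the N-way Toolbox; the function name is ours.

    function P = khatri_rao(A, B)
    % Column-wise Kronecker (Khatri-Rao) product: A is (I x R), B is (J x R), P is (I*J x R).
    R = size(A, 2);
    P = zeros(size(A, 1) * size(B, 1), R);
    for r = 1:R
        P(:, r) = kron(A(:, r), B(:, r));
    end
    end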

Table 5
Model output, the Δ_α function, and the MUR and ALS updates for the typical components of the NMF, CP, TUCKER3 (T3) and PARATUCK2 (PT2) models. For PARATUCK2, F is given as F_(1) = ((C_a ⊙ G^T) ⊙ C_b)^T. The symbols ⊗, ⊙ and ∘ denote the Kronecker, Khatri–Rao and Hadamard products, respectively; division is element-wise and (·)^† denotes the pseudo-inverse.

 NMF:  X̂ = AB
       Δ_A(Q) = Q B^T
       MUR: A = A ∘ (Q_1 B^T) / (Q_2 B^T)
       ALS (p = 0): A = X B^†

 CP:   X̂_(1) = A (C ⊙ B)^T
       Δ_A(Q) = Q_(1) (C ⊙ B)
       MUR: A = A ∘ (Q_{1(1)} (C ⊙ B)) / (Q_{2(1)} (C ⊙ B))
       ALS (p = 0): A = X_(1) ((C ⊙ B)^T)^†

 T3:   X̂_(1) = A G_(1) (C ⊗ B)^T
       Δ_A(Q) = Q_(1) (C ⊗ B) G_(1)^T
       MUR: A = A ∘ (Q_{1(1)} (C ⊗ B) G_(1)^T) / (Q_{2(1)} (C ⊗ B) G_(1)^T)
       ALS (p = 0): A = X_(1) (G_(1) (C ⊗ B)^T)^†
       Δ_G(Q) = A^T Q_(1) (C ⊗ B)
       MUR: G_(1) = G_(1) ∘ (A^T Q_{1(1)} (C ⊗ B)) / (A^T Q_{2(1)} (C ⊗ B))
       ALS (p = 0): G_(1) = A^† X_(1) ((C ⊗ B)^T)^†

 PT2:  X̂_(1) = A F_(1) (1 ⊗ B^T)
       Δ_A(Q) = Q_(1) (1 ⊗ B) F^T
       MUR: A = A ∘ (Q_{1(1)} (1 ⊗ B) F^T) / (Q_{2(1)} (1 ⊗ B) F^T)

If MUR_MAP, the update for the factor A with the hyperparameters $N^0_\alpha(v_\alpha), Z^0_\alpha(v_\alpha)$ is as below. Here $Q_1, Q_2$ are given by $Q_1 = M \circ X \circ \hat{X}^{-p}$ and $Q_2 = M \circ \hat{X}^{1-p}$, as for MUR_ML. A sample implementation is given in Algorithm 1.

A \leftarrow \frac{N^0_\alpha(v_\alpha) Z^0_\alpha(v_\alpha) + A^{p} \circ \left( \varphi_x\, Q_{1(1)} (C \otimes B) G_{(1)}^T \right)}{N^0_\alpha(v_\alpha) + A^{p-1} \circ \left( \varphi_x\, Q_{2(1)} (C \otimes B) G_{(1)}^T \right)}    (43)



If ALS (for p = 0) and assuming no missing values, we solve $X = \hat{X}$ for the core tensor G:

X_{(1)} = \hat{X}_{(1)} = A G_{(1)} (C \otimes B)^T \;\Rightarrow\; G_{(1)} = A^{\dagger} X_{(1)} \left( (C \otimes B)^T \right)^{\dagger}    (44)

with $X^{\dagger} = X^T (X X^T)^{-1}$. Here we simply move all the unrelated factors to the other side of the equation while taking their pseudo-inverses.

Example 9 (PARATUCK2 model). After the factorization process ends, we might consider factorizing the latent tensors further. PARATUCK2 [18,7,2] is a nice example to illustrate this. After inserting the scalar value $1^k_k$, the PARATUCK2 model can be matricized as

\hat{X}_{i,j,k} = \sum_{p,q} A_{i,p} B_{j,q} F_{p,q,k} \;\Rightarrow\; \hat{X}_{(1)} = [A^{p}_{i} (B^T)^{j}_{q} F^{qp}_{k} 1^{k}_{k}] = A F_{(1)} (1 \otimes B^T)    (45)

Then, we consider further factorization of F as below, where at the last step we transpose twice:

F^{qp}_{k} \equiv G^{p}_{q}\, (C_a)^{p}_{k} (C_b)^{q}_{k} = ((C_a \odot G^T)^T)^{qp}_{k} (C_b^T)^{q}_{k} 1^{1}_{k} = (((C_a \odot G^T)^T) \odot (C_b^T))^{kq}_{p} = ((C_a \odot G^T) \odot C_b)^T    (46)

Algorithm 1. Matlab implementation of the TUCKER3 MUR_MAP update for the β-divergence.

    % Input:  X (observed tensor), M (mask), N0, Z0 (cell arrays of hyperparameters),
    %         Z (cell array of factors: Z{1}=A, Z{2}=B, Z{3}=C, Z{4}=G),
    %         MAXITER, p (divergence parameter), dX (1/dispersion of X), N (number of factors, here 4).
    % Output: Z (latent factors)
    for e = 1:MAXITER
      for a = 1:N
        % current prediction Xhat, folded back to the size of X
        X2 = reshape(Z{1} * nshape(Z{4}, 1) * kron(Z{3}, Z{2})', size(X));
        Q1 = dX * M .* X .* X2.^(-p);          % Q1 = phi_x * M .* X .* Xhat.^(-p)
        Q2 = dX * M .* X2.^(1-p);              % Q2 = phi_x * M .* Xhat.^(1-p)
        if a == 1                              % factor A (the code for B (a=2) and C (a=3) is analogous)
          UP = nshape(Q1, a) * kron(Z{3}, Z{2}) * nshape(Z{4}, a)';
          DW = nshape(Q2, a) * kron(Z{3}, Z{2}) * nshape(Z{4}, a)';
        elseif a == 4                          % core tensor G
          UP = Z{1}' * nshape(Q1, 1) * kron(Z{3}, Z{2});
          DW = Z{1}' * nshape(Q2, 1) * kron(Z{3}, Z{2});
          UP = reshape(UP, size(Z{4}));        % fold back to the size of the core
          DW = reshape(DW, size(Z{4}));
        end
        UP   = N0{a} .* Z0{a} + Z{a}.^p .* UP;
        DW   = N0{a} + Z{a}.^(p-1) .* DW;
        Z{a} = UP ./ DW;                       % the MUR_MAP update, cf. (32) and (43)
      end
    end

5. Discussion

In this paper, we have developed a probabilistic framework for multiway analysis of high dimensional datasets. By exploiting the link between graphical models and tensor factorization models, we cast any arbitrary tensor factorization problem, including many popular models such as CP or TUCKER3, as inference, where tensor factorization reduces to a parameter estimation problem. We first illustrated our approach for the conditionally Poisson case and employed the EM algorithm for the optimization. We also extended the probability model to include conjugate priors and obtained the corresponding update equations. One main saving in our framework appears in the computation of $\Delta_\alpha(\cdot)$, which is computationally equivalent to computing expectations under probability distributions that factorize according to a given graph structure. As is the case with graphical models, this quantity can be computed via message passing: algebraically, we distribute the summation over all $\bar{v}_\alpha$ and compute the sum in stages.

We also sketched a straightforward matricization procedure that converts element-wise equations into matrix form, to ease implementation and allow a compact representation. Matricization is simple and powerful: without any manual matrix algebra, the update equations can be derived mechanically in their corresponding matrix forms.

Perhaps the most important contribution of the paper is the generalization of the TF problem to a point that allows one to 'invent' new factorization models appropriate to one's applications. Pedagogically, the framework guides the building of new models as well as the derivation of update equations for the β-divergence, which unifies the popular cost functions. Indeed, results scattered in the literature can be derived in a straightforward manner. Model selection can also be automated; however, due to space constraints, we will cover model selection issues in a future publication.

Acknowledgments

This work is funded by the TUBITAK grant number 110E292, Bayesian matrix and tensor factorisations (BAYTEN) and Bogazici University research fund BAP5723.

Appendix A. Common distributions

Gaussian distribution:

\mathcal{N}(x; a, b) = \exp\left( -\frac{(x-a)^2}{2b} + \frac{1}{2} \log \frac{1}{2\pi b} \right)

Poisson distribution:

\mathcal{PO}(x; a) = \exp\left( -a + x \log a - \log x! \right)

Gamma distribution:

\mathcal{G}(x; a, b) = \exp\left( (a-1)\log x - b x + a \log b - \log \Gamma(a) \right)

Appendix B. Gaussian based modelling of IS cost

The IS cost can also be modelled via a zero-mean Gaussian with unknown variance [12], as $S(v) \sim \mathcal{N}(S(v); 0, \Lambda(v))$ and $X(v_0) \sim \mathcal{N}(X(v_0); 0, \hat{X}(v_0))$, where $X(v_0) = \sum_{\bar{v}_0} S(v)$. Taking the derivative of the log likelihood of the Gaussian with respect to
