Contents lists available at ScienceDirect
Digital Signal Processing
www.elsevier.com/locate/dsp
Non-negative tensor factorization models for Bayesian audio processing
Umut Şimşekli a,∗, Tuomas Virtanen b, Ali Taylan Cemgil a
a Department of Computer Engineering, Boğaziçi University, 34342, Bebek, İstanbul, Turkey
b Department of Signal Processing, Tampere University of Technology, 33720 Tampere, Finland
Article history:
Available online xxxx

Keywords:
Non-negative matrix and tensor factorization
Coupled factorization
Bayesian audio modeling
Bayesian inference

Abstract
We provide an overview of matrix and tensor factorization methods from a Bayesian perspective, with emphasis on both inference methods and modeling techniques. Factorization-based models and their many extensions, such as tensor factorizations, have proved useful in a broad range of applications, supporting a practical and computationally tractable framework for modeling. Especially in audio processing, tensor models facilitate, in a unified manner, the use of prior knowledge about signals, the data generation processes, as well as available data from different modalities. After a general review of tensor models, we describe the general statistical framework, give examples of several audio applications, and describe modeling strategies for key problems such as deconvolution, source separation, and transcription.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
With the recent advances in sensor and communication technologies, the cost of data acquisition and storage has been significantly reduced. Consequently, the last decade has witnessed a dramatic increase in the amount of data that can be easily collected. One important facet of data processing is extracting meaningful information from highly structured datasets that can be of interest for scientific, financial, or technological purposes.
The key to exploiting the potential of large datasets lies in developing computational techniques that can efficiently extract meaningful information. These computational methods must be scalable and tailored to the specifics of an application, but still be versatile enough to be useful in several scenarios. In this paper, we will focus on audio processing and review one particular class of such models, which provides a favorable balance between high modeling accuracy, ease of implementation, and ease of management of the required computational resources. This class of models, coined under the name of tensor factorization models, along with their Bayesian interpretations, will be the focus of this tutorial paper. The mathematical setup may look somewhat abstract at first sight, but the generic nature of the approach makes tensors suitable for a broad range of applications where complicated
∗ Corresponding author. E-mail addresses: umut.simsekli@boun.edu.tr (U. Şimşekli), tuomas.virtanen@tut.fi (T. Virtanen), taylan.cemgil@boun.edu.tr (A.T. Cemgil).
structured datasets need to be analyzed. In particular, we will show examples in the domain of audio processing, where significant progress has been achieved using tensor methods. While the modeling and inference strategies can be applied in the broader context of general audio and other non-stationary time series analysis, the hierarchical Bayesian nature of the framework makes the approach particularly suitable for the analysis of acoustical signals.
In audio processing, an increasing number of applications are being developed that can handle challenging acoustical conditions and highly variable sound sources. Here, one needs to exploit the inherent structure of acoustic signals to address some of the key problems such as denoising, restoration, interpolation, source separation, transcription, bandwidth extension, upmixing, coding, event recognition, and classification. Not surprisingly, many different modeling techniques have been developed for these purposes.
However, as is the case for computational modeling of all physical phenomena, we face here a tradeoff between accuracy and computational tractability: a physically realistic and accurate model may be too complex to meet the demands of a given application and thus be useful in practice.
Typically, there is a lot of a-priori knowledge available for acoustic signals. This includes knowledge of the physical or cognitive mechanisms by which sounds are generated or perceived, as well as the hierarchical nature by which they are organized in an acoustical scene. In more specific domains, such as music transcription or audio event recognition, more specialized assumptions about the hierarchical organization are needed. Yet, the resulting models often possess complex statistical structure, and highly adaptive and powerful computational techniques are needed to perform inference.
http://dx.doi.org/10.1016/j.dsp.2015.03.011
Factorization-based modeling has been useful in addressing the modeling accuracy versus computational requirement tradeoff in various domains beyond audio signal processing [1], with prominent examples such as text processing [2], bioinformatics [3], computer vision [4], social media analysis [5], and network traffic analysis [6]. The aim in such modeling strategies is to decompose an observed matrix or tensor (multidimensional array) into semantically meaningful factors in order to obtain useful predictions. Meanwhile, the factors themselves also provide a useful feature representation of the specifics of the domain.
In this paper, we review tensor-based statistical models and associated inference methods developed recently for audio and music processing, and describe various extensions and applications of these models. In Section 2, we illustrate the ideas of factorization-based modeling, and then in Section 3 we describe a probabilistic interpretation of these models. The probabilistic interpretation opens up the way for a full Bayesian treatment via Bayesian hierarchical modeling. This leads to a very natural means for unification, allowing the formulation of highly structured probabilistic models for audio data at various levels of abstraction, as we will illustrate in Section 6. The paper concludes with remarks on future research directions.
2. Factorization-based data modeling
In this section, we will describe the basics of factorization-based modeling and cover extensions such as coupled tensor factorizations and non-negative decompositions. This section introduces the main structure and the notation.
In many applications, data can be represented as a matrix, for example, the spectrogram of an audio signal (frequency vs time), a dataset of images (pixel coordinates vs instances), word frequencies among different documents (words vs documents), and the adjacency structure of a graph (nodes vs nodes), to name a few. Here the indices of the matrix correspond to the entities, and the matrix elements describe a relation between the two entities. Matrix Factorization (MF) models are one of the most widely used methods for analyzing data that involve two entities [7–10].
The goal in these models is to calculate a factorization of the form:

X_1(i,j) ≈ X̂_1(i,j) = ∑_k Z_1(i,k) Z_2(k,j)    (1)

where X_1 is the given data matrix, X̂_1 is an approximation to X_1, and Z_1 and Z_2 are the factor matrices to be estimated. Even though we have a single observed matrix in this model, we use a subscript in X_1 since we will consider factorization models that involve more than one observed matrix or tensor later in this section. Here, X_1 is expressed as the product of Z_1 and Z_2, where Z_1 is considered as the dictionary matrix and Z_2 contains the corresponding weights. From another perspective, X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2, as illustrated at the top of Fig. 2. Note that, if Z_1 were fixed, the problem would be equivalent to basis regression, where the weights (expansion coefficients) Z_2 are estimated [11]. In contrast, in matrix factorization the dictionary (the set of basis vectors) is estimated along with the coefficients.
This modeling strategy has been shown to be successful in various fields including signal processing, finance, bioinformatics, and natural language processing [8].
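As a concrete illustration, the model of Eq. (1) can be written in a few lines of NumPy; the dimensions and the random non-negative factors below are arbitrary placeholders, not values from the paper:

```python
import numpy as np

# Sketch of the MF model of Eq. (1): X1_hat(i, j) = sum_k Z1(i, k) Z2(k, j).
# All sizes and factor values are illustrative only.
rng = np.random.default_rng(0)
I, J, K = 5, 4, 2

Z1 = rng.random((I, K))                    # dictionary matrix
Z2 = rng.random((K, J))                    # weight matrix
X1_hat = np.einsum('ik,kj->ij', Z1, Z2)    # model output; identical to Z1 @ Z2
```

Each term Z1[:, k] (a column) times Z2[k, :] (a row) is a rank-one outer-product contribution, and X1_hat is their sum over k.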
Matrix factorization models are applicable when the observed data encapsulate the relation of two different entities (e.g., i and j in Eq. (1)). However, when the data involve multiple entities of interest, such as ternary or higher-order relations, they cannot be represented without loss of structure by using matrices. For example, a multichannel sound library of several instances may be conveniently represented in the time-frequency domain as an object with several entities, say the power at each (frequency, time, channel, instance). One could in principle 'concatenate' each spectrogram across time and instances to obtain a big matrix, say (frequency × channel, time × instance), but this representation would obscure important structural information; compare simply with representing a matrix with a column vector. Hence one naturally needs multiway tables, the so-called tensors, where each element is denoted by T(i,j,k,...). Here, T is the tensor and the indices i,j,k,... are the entities. The number of distinct entities dictates the mode of a tensor. Hence a vector and a matrix are tensors of mode one and two, respectively. Tensors are illustrated in Fig. 1, and we will give a more precise and compact definition in Section 3.

Fig. 1. Illustration of a) a vector X(i): an array with one index, b) a matrix X(i,j): an array with two indices, c) a tensor X(i,j,k): an array with three or more indices. In this study, we refer to vectors as tensors with one mode and to matrices as tensors with two modes.
For modeling multiway arrays with more than two entities, the canonical polyadic decomposition [12,13] (also referred to as CP, PARAFAC, or CANDECOMP) is one of the most popular factorization models. The model, for three entities, is defined as follows:

X_2(i,m,r) ≈ X̂_2(i,m,r) = ∑_k Z_1(i,k) Z_3(m,k) Z_4(r,k)    (2)

where the observed tensor X_2 is decomposed as a product of three different matrices. Analogous to MF models, this model approximates X_2 as the sum of outer products of the columns of Z_1, Z_3, and Z_4, as illustrated at the bottom of Fig. 2. This model has been shown to be useful in chemometrics [14], psychometrics [12], and signal processing [8].
The Tucker model [15] is another important model for analyzing tensors with three modes, and is a generalization of the PARAFAC model. The model is defined as follows:

X_3(i,j,k) ≈ X̂_3(i,j,k) = ∑_p ∑_q ∑_r Z_1(i,p) Z_2(j,q) Z_3(k,r) Z_4(p,q,r)    (3)

where X_3 is expressed as the product of three matrices (Z_{1:3}) and a 'core tensor' (Z_4). When the core tensor Z_4 is chosen as superdiagonal (Z_4(p,q,r) = 0 unless p = q = r), the Tucker decomposition reduces to PARAFAC.
2.1. Coupled factorization models
In certain applications, information from different sources is available and needs to be combined to obtain more accurate predictions [16–20]. In musical audio processing, one example is having a large collection of annotated audio data and a collection of symbolic music scores as side information. Similarly, in product recommendation systems, a customer–product rating matrix can be enhanced with connectivity information from a social network and demographic information about the customers. For these problems, a single factorization model would not be sufficient for exploiting all the information in the data, and we need to develop more comprehensive modeling strategies in order to be able to combine different data sources in a single factorization model. Such models are called coupled tensor factorization methods, where the aim is to simultaneously factorize multiple observed tensors that share a set of latent factors.

Fig. 2. Coupled MF-PARAFAC illustration. The observed matrix X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2. Similarly, X_2 is approximated as the sum of outer products of the columns of Z_1, Z_3, and Z_4. The overall model is coupled since the matrix Z_1 is shared in both factorizations. Here K denotes the size of the index k: k ∈ {1,...,K}.
Let us consider an example coupled matrix–tensor factorization model where two observed tensors X_1 and X_2 are collectively decomposed as

X_1(i,j) ≈ X̂_1(i,j) = ∑_k Z_1(i,k) Z_2(k,j)
X_2(i,m,r) ≈ X̂_2(i,m,r) = ∑_k Z_1(i,k) Z_3(m,k) Z_4(r,k)    (4)

where X_1 is decomposed by using an MF model and X_2 is decomposed by using a PARAFAC model. The factor Z_1 is the 'shared factor' in both decompositions, making the overall model coupled.
Fig. 2 illustrates this model. In Section 6, we will illustrate the usefulness of various factorization models in audio processing applications.
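A minimal sketch of the coupled model in Eq. (4): the two model outputs are computed from four factors, with Z_1 shared between them (all sizes and values below are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M, R, K = 4, 5, 3, 2, 6   # illustrative sizes

Z1 = rng.random((I, K))   # the shared factor
Z2 = rng.random((K, J))
Z3 = rng.random((M, K))
Z4 = rng.random((R, K))

X1_hat = np.einsum('ik,kj->ij', Z1, Z2)           # MF part of Eq. (4)
X2_hat = np.einsum('ik,mk,rk->imr', Z1, Z3, Z4)   # PARAFAC part of Eq. (4)
```

Because Z1 appears in both contractions, any change to Z1 during inference affects both X1_hat and X2_hat; this is exactly what couples the two factorizations.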
2.2. Non-negative tensor factorizations
So far, we have described some of the most important matrix and tensor factorization models. However, even if two factorization models have the exact same topology (e.g., both are MF models with the same number of parameters), the factorizations might have completely different interpretations depending on the constraints placed on the latent factors. For instance, in MF models, one option is to leave the factors unconstrained. On the other hand, we can impose orthogonality constraints on the factors, in which case the factorization turns into principal component analysis.
Even if we place highly restrictive constraints such as orthogonality, the estimated factors would be dense, and their physical interpretations in applications would be quite limited as long as their elements are allowed to take both positive and negative values. In this study, we will consider non-negative factorization models [8], where we restrict all the elements of the factors to be non-negative. Here, the non-negativity constraint implicitly imposes an additive structure on the model, where the contributions of the latent factors are always added, since there will not be any cancellations due to negative values, as opposed to the aforementioned cases. Therefore, this strategy promotes sparsity on the factors, since most of the entries in the factors must be close to zero in order for the model to be able to fit the data. More importantly, the estimated factors will have physical interpretations that might be essential in many fields, such as audio processing.
On the other hand, as we will describe in more detail in Sections 3 and 4.2, by modeling tensors with probabilistic tensor factorization models, we essentially decompose the parameters of a probabilistic model that are non-negative by definition (e.g., the intensity of a Poisson distribution or the mean of a gamma distribution) and are constructed as the sum of non-negative sources [9]. In this modeling strategy, the non-negativity constraint on the factors is a necessity rather than an option.
In audio processing, which is the main application focus of this study, we model the energies of signals in the time-frequency domain, known as the magnitude or power spectra. For modeling these spectra, the non-negativity constraint turns out to be very natural, since realistic sounds can be viewed as being composed of purely additive components, as will be detailed in Section 5. For example, music signals consist of multiple instruments, and the signal of each instrument consists of multiple notes played by the instrument. Speech signals consist of basic units such as phonemes and words. Cancellation of sounds happens only intentionally and in very specific scenarios, for example in echo cancellation systems.
3. Probabilistic modeling of non-negative tensor factorizations
In applications, as we will demonstrate in Section 6, we often need to come up with custom model topologies, where either the observed tensors or the latent factors involve multiple entities and cannot be represented by using matrices without loss of important structural information. In order to be able to model real-world datasets that might consist of several tensors and require custom factorization models, we need to handle a broad variety of model topologies.

The Generalized Coupled Tensor Factorization (GCTF) framework [21] is a generalization of matrix and tensor factorization models to jointly factorize multiple tensors. The formal definition of the GCTF framework is as follows:
X_ν(u_ν) ≈ X̂_ν(u_ν) = ∑_{ū_ν} ∏_α Z_α(v_α)^{R_{ν,α}}    (5)

where ν = 1,...,N_x is the observed tensor index and α = 1,...,N_z is the factor index. In this framework, the goal is to compute an approximate factorization of the given observed tensors X_ν in terms of a product of individual factors Z_α, some of which are possibly shared. Here, we define

V as the set of all indices found in a model,
U_ν as the set of visible indices of the tensor X_ν,
V_α as the set of indices in Z_α,
Ū_ν = V \ U_ν as the set of invisible indices that are not present in X_ν.

We use small letters such as v_α to refer to a particular setting of indices in V_α (similarly u_ν ∈ U_ν and ū_ν ∈ Ū_ν). Furthermore, R is a coupling matrix that is defined as follows: R_{ν,α} = 1 if X_ν and Z_α are connected, and R_{ν,α} = 0 otherwise. In other words, the coupling matrix R_{ν,α} specifies the factors Z_α that affect the observed tensor X_ν. As the product ∏_α Z_α(v_α) is collapsed over a set of indices, the factorization is latent. Here, we consider non-negative observations and factors: X_ν(u_ν) ≥ 0 and Z_α(v_α) ≥ 0. This rather abstract notation is needed to express increasingly complicated tensor models without being limited to a particular topology.

Table 1
Illustration of different factorization models in the GCTF notation. Here N_x is the number of observed tensors, N_z is the number of latent factors, V is the set of all indices in the model, U_ν are the sets of indices of X_ν, V_α are the sets of indices of Z_α, and R is the coupling matrix.

Model | N_x | N_z | V | U_ν | V_α | R
MF (Eq. (1)) | 1 | 2 | {i,j,k} | {i,j} | {i,k}, {k,j} | [1,1]
PARAFAC (Eq. (2)) | 1 | 3 | {i,j,k,r} | {i,j,k} | {i,r}, {j,r}, {k,r} | [1,1,1]
TUCKER (Eq. (3)) | 1 | 4 | {i,j,k,p,q,r} | {i,j,k} | {i,p}, {j,q}, {k,r}, {p,q,r} | [1,1,1,1]
MF-PARAFAC (Eq. (4)) | 2 | 4 | {i,j,k,m,r} | {i,j}, {i,m,r} | {i,k}, {k,j}, {m,k}, {r,k} | [1,1,0,0; 1,0,1,1]

Table 2
Tweedie distributions with corresponding normalizing constants and divergence forms. The general form of the distribution is given in Eq. (7).

p | Divergence | Distribution | Normalizing constant K(x,φ,p) | Divergence form d_p(x||x̂)
2−β | β-divergence | Tweedie | K(x,φ,p) | d_p(x||x̂)
0 | Euclidean | Gaussian | (2πφ)^{1/2} | (1/2)(x − x̂)²
1 | Kullback–Leibler | Poisson | e^{x/φ} Γ(x/φ + 1) / (x/φ)^{x/φ} | x log(x/x̂) − x + x̂
2 | Itakura–Saito | Gamma | Γ(1/φ)(eφ)^{1/φ} x | x/x̂ − log(x/x̂) − 1
3 | – | Inverse Gaussian | (2πx³φ)^{1/2} | (1/2)(x − x̂)² / (x x̂²)
In order to illustrate the framework, we define the coupled PARAFAC-MF model of Eq. (4) in the GCTF notation as follows. The observed index sets are given as U_1 = {i,j} and U_2 = {i,m,r}. The index sets of the factors are given as V_1 = {i,k}, V_2 = {k,j}, V_3 = {m,k}, V_4 = {r,k}. The coupling matrix is given as R = [1 1 0 0; 1 0 1 1] and indicates that X̂_1 is a function of Z_1 and Z_2, and X̂_2 is a function of Z_1, Z_3, and Z_4. Table 1 lists all the factorization models described earlier in the paper as specific instances of the GCTF notation.
The GCTF framework assumes the following probabilistic model over each scalar element of the observed tensors [21]:

X_ν(u_ν) ∼ TW_{p_ν}(X_ν(u_ν); X̂_ν(u_ν), φ_ν),  ν = 1,...,N_x    (6)

where X̂_{1:N_x} are the model output tensors that are defined in Eq. (5) and TW denotes the so-called Tweedie distribution. Tweedie densities TW_p(x; x̂, φ) can be written in the following moment form:
P(x; x̂, φ, p) = (1 / K(x, φ, p)) exp( −(1/φ) d_p(x||x̂) )    (7)
where x̂ is the mean, φ is the dispersion, p is the power parameter, and d_p(·) denotes the β-divergence defined as follows:

d_p(x||x̂) = x^{2−p} / ((1−p)(2−p)) − x x̂^{1−p} / (1−p) + x̂^{2−p} / (2−p)    (8)

The Tweedie distribution is an important special case of exponential dispersion models, characterized by three parameters: the mean, dispersion, and power. This model is also known as the power variance model, since the variance of the data has the following form: var(x) = φ x̂^p.
By taking appropriate limits, we can verify that the rather arbitrary looking divergence d_p(·) defined in Eq. (8) yields much more familiar divergence functions, namely the squared Euclidean distance, the Kullback–Leibler (KL) divergence, and the Itakura–Saito (IS) divergence for p = 0, 1, 2, respectively. Using a suitable divergence is important in applications; for example, the IS divergence is scale invariant: the divergence between two points x and y does not change when both points are multiplied by the same constant, i.e., d_2(x||y) = d_2(ax||ay) [9]. Physically, this translates to our intuitive notion that the discrepancy between two sound signals should not change by just turning up their volume.
From the probabilistic perspective, different choices of p yield important distributions such as the Gaussian (p = 0), Poisson (p = 1), compound Poisson (1 < p < 2), gamma (p = 2), and inverse Gaussian (p = 3) distributions. Due to a rather technical condition, no Tweedie model exists for the interval 0 < p < 1, but for all other values of p one obtains the very rich family of Tweedie stable distributions [22]. Table 2 illustrates the Tweedie distribution for p ∈ {0,1,2,3}.

An important property of the Tweedie models is that the normalizing constant K(·) does not depend on the mean parameter x̂. Therefore, provided that p and φ are given, it is easy to see that solving a maximum likelihood problem for x̂ is equivalent to minimization of the β-divergence. For instance, for the Gaussian case (see Table 2), the divergence function is the squared Euclidean distance and the dispersion is simply the variance. For all possible p values, we have a similar form; the Tweedie models generalize the established theory of least squares linear regression to more general noise models.
For regularization and incorporating prior knowledge, we place prior distributions over the latent factors. Depending on the application, we might consider different prior distributions on the latent factors. For example, we can choose an exponential prior over the latent factors, which can be used for all cases of the power parameter p:

Z_α(v_α) ∼ E(Z_α(v_α); A_α(v_α))
where E denotes the exponential distribution. We can also choose more sophisticated priors for individual cases of p. For instance, in the case of Poisson observations (p = 1), we can choose a gamma prior model

Z_α(v_α) ∼ G(Z_α(v_α); A_α(v_α), B_α(v_α))

where G denotes the gamma distribution. The definitions of the distributions that are mentioned in this paper are given in Appendix A.
4. Inference
Once we observe the tensors X_{1:N_x}, we would like to make inference in the probabilistic model defined in Eq. (6). Depending on the application, we might be interested in obtaining

• Point estimates, such as maximum likelihood (ML) or maximum a-posteriori (MAP) estimates:
  – ML: Z_{1:N_z} = argmax_{Z_{1:N_z}} log P(X_{1:N_x} | Z_{1:N_z})
  – MAP: Z_{1:N_z} = argmax_{Z_{1:N_z}} log [P(X_{1:N_x} | Z_{1:N_z}) P(Z_{1:N_z})]
• The full posterior distribution: P(Z_{1:N_z} | X_{1:N_x})

where X_{1:N_x} and Z_{1:N_z} denote all the observed tensors and all the latent factors, respectively.
In this section, we will explain inference algorithms for each of these problems. Firstly, we will explain a gradient-based inference algorithm for maximum likelihood or maximum a-posteriori estimation. Then we will explain a Markov Chain Monte Carlo procedure, namely the Gibbs sampler, for sampling the posterior distribution over the latent factors.
4.1. Maximum likelihood and maximum a-posteriori estimation
Given the dispersion φ_ν and power parameters p_ν, ML estimation of the factors Z_{1:N_z} reduces to the problem of minimizing the β-divergence between the observations and the product of the latent factors, given as follows:

minimize ∑_ν ∑_{u_ν} (1/φ_ν) d_{p_ν}( X_ν(u_ν) || ∑_{ū_ν} ∏_α Z_α(v_α)^{R_{ν,α}} )
subject to Z_α(v_α) ≥ 0, ∀α ∈ [N_z], v_α ∈ V_α    (9)

where the power parameter p_ν determines the cost function to be used for X_ν, and the dispersion parameter φ_ν determines the relative weight of the approximation error to X_ν.

ML estimation of the latent factors Z_α can be achieved via iterative methods, by fixing all factors Z_{α'} for α' ≠ α except one Z_α and updating in an alternating fashion. For non-negative data and factors, we present multiplicative update rules that have the following form [21]:

Z_α ← Z_α ∘ [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) ] / [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{1−p_ν}) ]    (10)

where ∘ is the element-wise product and the division operator is also element-wise. The key quantity in the above update equation is the Δ_{α,ν} function, defined as follows:

Δ_{α,ν}(A) = [ ∑_{u_ν ∩ v̄_α} A(u_ν) ∑_{ū_ν ∩ v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} ]    (11)

For updating Z_α, we need to compute this function twice, for the arguments A = X̂_ν^{−p_ν} ∘ X_ν and A = X̂_ν^{1−p_ν}. Even though this function seems complicated, Δ_{α,ν}(X̂_ν^{1−p_ν}) − Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) is nothing but the gradient of d_β(X_ν || X̂_ν) with respect to Z_α, as detailed in Appendix B. Intuitively, the terms (X̂_ν^{−p_ν} ∘ X_ν) and X̂_ν^{1−p_ν} come from the derivative of the β-divergence, and the term ∑_{ū_ν ∩ v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} is just the derivative of X̂_ν with respect to Z_α, where the product is over all the factors but Z_α (i.e., α' ≠ α). The monotonicity of these update rules for N_x = 1 is analyzed in [23].

For MAP inference, the objective given in Eq. (9) will have an additional regularization term that involves Z_α. The resulting algorithm for MAP inference turns out to be similar to the ML scheme.
For exponential priors over the factors, the update equation becomes a simple modification:

Z_α ← Z_α ∘ [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) ] / [ A_α + ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{1−p_ν}) ]    (12)

For other conjugate priors, the update rules have similar forms [24]. The benefit of the multiplicative updates is that they can be applied to any tensor factorization model and are rather simple to implement. The downside of the multiplicative updates is that they typically require a large number of iterations to converge. Alternative optimization methods based on second-order optimization [25–27] and active-set methods [28,27] can also be applied for specific types of models.
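For the simplest configuration (a single observed matrix, the MF model of Eq. (1), KL divergence, i.e. p = 1 and φ = 1), the general update of Eq. (10) reduces to the classical multiplicative NMF updates of Lee and Seung. The following sketch, with arbitrary synthetic data, verifies that the divergence is non-increasing under the updates:

```python
import numpy as np

# Multiplicative updates of Eq. (10) specialized to Nx = 1, MF, p = 1:
# for this case X_hat^{-p} o X = X / X_hat and X_hat^{1-p} is all ones,
# so the Delta terms become matrix products with the fixed factor.
rng = np.random.default_rng(0)
I, J, K = 20, 30, 5
X = rng.random((I, J)) + 1e-3       # non-negative "data" (illustrative)

Z1 = rng.random((I, K)) + 1e-3
Z2 = rng.random((K, J)) + 1e-3

def kl_div(x, x_hat):
    return np.sum(x * np.log(x / x_hat) - x + x_hat)

losses = []
for _ in range(200):
    X_hat = Z1 @ Z2
    Z1 *= ((X / X_hat) @ Z2.T) / Z2.sum(axis=1)          # update Z1
    X_hat = Z1 @ Z2
    Z2 *= (Z1.T @ (X / X_hat)) / Z1.sum(axis=0)[:, None]  # update Z2
    losses.append(kl_div(X, Z1 @ Z2))
```

As stated for N_x = 1, the KL divergence is monotonically non-increasing here: losses[-1] <= losses[0].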
4.2. Full Bayesian inference via the Gibbs sampler
Maximum likelihood and maximum a-posteriori estimation methods provide useful and practical tools that can be used in various applications. However, in certain cases they are prone to over-fitting, since they fall short of capturing the uncertainties that arise in the inference process. Instead of aiming to obtain single point estimates of the latent variables, full Bayesian inference aims to characterize the full posterior distribution over the latent variables. Apart from being able to handle the uncertainties and therefore being robust to over-fitting, full Bayesian inference has many advantages over the point estimation methods in various tasks such as the model selection problem. In this section, we will develop a Monte Carlo method for characterizing the full posterior over the latent factors.
Monte Carlo methods are a set of numerical techniques to estimate expectations of the form:

⟨ϕ(x)⟩_{π(x)} = ∫ ϕ(x) π(x) dx ≈ (1/N) ∑_{i=1}^{N} ϕ(x^{(i)})    (13)

where ⟨ϕ(x)⟩_{π(x)} denotes the expectation of the function ϕ(x) under the distribution π(x), and the x^{(i)} are independent samples drawn from the target distribution π(x), which will be the posterior distribution in our case. Under mild conditions on the test function ϕ, this estimate converges to the true expectation as N goes to infinity. The challenge here is obtaining independent samples from a nonstandard target density π.

Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel T, that is, one generates x^{(i+1)} conditioned on x^{(i)} as follows:

x^{(i+1)} ∼ T(x | x^{(i)})    (14)

Fig. 3. Graphical representation of the GCTF framework. The nodes represent the random variables and the arrows represent the conditional independence structure.

The transition kernel T does not need to be formed explicitly in practice; we only need a procedure that samples a new configuration given the previous one. Perhaps surprisingly, even though these subsequent samples are correlated, Eq. (13) remains valid, and the estimated expectations converge to their true values as the number of samples goes to infinity, provided that T satisfies certain ergodicity conditions. In order to design a transition kernel T such that the desired distribution is the stationary distribution, that is, π(x) = ∫ T(x|x') π(x') dx', various strategies can be applied, the most popular one being the Metropolis–Hastings (MH) algorithm [29]. One particularly convenient and simple MH strategy is the Gibbs sampler, where one samples each block of variables from the so-called full conditional distributions. Deriving a single Gibbs sampler for all cases of the power parameter p is not straightforward; therefore one needs to focus on individual cases of the Tweedie family. For the Poisson model (p = 1), we define the following augmented generative model:
Z_α(v_α) ∼ G(Z_α(v_α); A_α(v_α), B_α(v_α))      (factor priors)
λ_ν(v) = ∏_α Z_α(v_α)^{R_{ν,α}}                  (intensities)
S_ν(v) | Z_{1:N_z} ∼ PO(S_ν(v); λ_ν(v))          (sources)
X_ν(u_ν) = ∑_{ū_ν} S_ν(v)                        (outputs)

In the model defined in Eq. (6), the observed tensors X_ν directly depend on the latent factors Z_α. Here, we form a so-called composite model where we augment the model in Eq. (6) by defining the intensity tensors λ_ν and the source tensors S_ν as intermediate layers; in this model, the observed tensors are deterministic functions of the sources. Fig. 3 illustrates this model. Note that the Tweedie models with p = 0 and p = 2 can also be represented as composite models [30]; however, the other cases have not been explored in the literature.
The Gibbs sampler for the GCTF model with Poisson observations can be formed by iteratively drawing samples from the full conditional distributions as follows:

S_ν^{(i+1)} ∼ p(S | Z_{1:N_z}^{(i)}, X_ν, Θ),  ν = 1,...,N_x    (15)
Z_α^{(i+1)} ∼ p(Z_α | S_{1:N_x}^{(i)}, Z_{¬α}, X_{1:N_x}, Θ),  α = 1,...,N_z    (16)

where Z_{¬α} denotes the most recent values of all the factors but Z_α, Θ denotes the prior distribution parameters {A_α, B_α}_{α=1}^{N_z}, and the full conditionals are defined as:

p(S_ν | ·) = ∏_{u_ν} M( S_ν(u_ν, Ū_ν); X_ν(u_ν), λ_ν(u_ν, Ū_ν) / X̂_ν(u_ν) )
p(Z_α | ·) = ∏_{v_α} G( Z_α(v_α); Â_α(v_α), B̂_α(v_α) )

where the gamma posterior parameters (denoted here by Â_α and B̂_α) are

Â_α(v_α) = A_α(v_α) + [ ∑_ν R_{ν,α} ( ∑_{v̄_α} S_ν(v) ) ]
B̂_α(v_α) = B_α(v_α) + [ ∑_ν R_{ν,α} ( ∑_{v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} ) ]

Here, M denotes the multinomial distribution. Verbally, given a particular instance of the observed indices u_ν, the full conditional of S_ν is a multinomial distribution over all the latent indices Ū_ν. Note that the algorithms presented in [10] and [31] are special cases of the presented sampling algorithm.
The marginal likelihood of the observed data under a tensor factorization model, p(X_{1:N_x}), is often necessary for certain problems such as model selection. By using the samples generated by the Gibbs sampler, the marginal likelihood P(X_{1:N_x}) can be estimated by using Chib's method [32]. Moreover, more efficient samplers can be formed by using space alternating data augmentation [33,31].
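For the Poisson MF case (Eqs. (15)–(16) with N_x = 1 and the model of Eq. (1)), the sampler alternates between multinomial draws of the sources and gamma draws of the factors. The sketch below is a toy instance with illustrative hyperparameters and synthetic count data, not the general GCTF sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 10, 12, 3
A, B = 1.0, 1.0                        # gamma prior shape and rate (illustrative)

X = rng.poisson(5.0, size=(I, J))      # synthetic count data
Z1 = rng.gamma(A, 1.0 / B, size=(I, K))
Z2 = rng.gamma(A, 1.0 / B, size=(K, J))

for _ in range(50):
    # Eq. (15): given the factors, the sources S(i, j, :) follow a
    # multinomial over the latent index k with total count X(i, j)
    P = np.einsum('ik,kj->ijk', Z1, Z2)
    P /= P.sum(axis=2, keepdims=True)
    S = np.array([[rng.multinomial(X[i, j], P[i, j]) for j in range(J)]
                  for i in range(I)])                     # shape (I, J, K)

    # Eq. (16): given the sources, the factors have gamma full conditionals
    Z1 = rng.gamma(A + S.sum(axis=1), 1.0 / (B + Z2.sum(axis=1)))
    Z2 = rng.gamma(A + S.sum(axis=0).T, 1.0 / ((B + Z1.sum(axis=0))[:, None]))
```

By construction, the sampled sources always satisfy the deterministic output constraint: S.sum(axis=2) equals X exactly at every iteration.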
5. Audio processing applications
In addition to their rich theoretical structure, non-negative tensor factorizations have plenty of practical applications. Because of their ability to efficiently model the high-level structure of audio signals, tensor factorizations have been used to tackle grand signal processing challenges such as source separation and robust pattern recognition. They have also provided completely new solutions to problems that are more specific to audio signals, such as audio bandwidth extension, dereverberation, and audio upmixing. We will first describe the audio representations that are used in tensor factorizations, and then describe different audio processing applications where tensor factorizations have been successfully applied.
5.1. Time-frequency representation
An audio signal represents the air pressure as a function of time and contains both positive and negative values. In the simplest scenario the signal is recorded with only one microphone, and the signal is therefore one-dimensional. In order to enable modeling these kinds of audio signals with non-negative tensor factorization, we typically represent them using a spectrogram that characterizes the intensity of sound as a function of time and frequency. Fig. 4 shows an example audio signal and its spectrogram.
Unlike raw audio signals, which can show huge variations, spectrogram representations of sounds typically possess easily perceivable patterns. For example, in the example figure, one can see that the notes are harmonic (there are large amplitudes at regular frequency intervals), and the spectrum of each note is rather static over time. This allows modeling them efficiently with non-negative tensor factorizations. Sound types with more diverse characteristics can also be modeled with tensor factorizations, provided that an appropriate model that takes into account the structure of the sound is used.
A standard procedure to calculate the spectrogram consists of the following steps:

1. Segment the signal into fixed-length frames. Typical frame lengths range between 10 ms and 100 ms. Adjacent frames typically overlap by 50% or 25%.
2. Window each frame with a window function, such as the Hamming window, to smooth discontinuities at frame boundaries.
3. Calculate the spectrum within each frame by applying the discrete Fourier transform.
4. Calculate the magnitude spectrum by taking the absolute value of each entry of the spectrum. The spectrum can also be decimated, e.g., by integrating magnitudes within Mel bands.
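Steps 1–4 can be sketched directly in NumPy; the frame length, hop size, and the 440 Hz test tone below are arbitrary choices for illustration:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Steps 1-4 above: frame, window, DFT, magnitude."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)                         # Step 2
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])          # Step 1
    spectra = np.fft.rfft(frames * window, axis=1)         # Step 3
    return np.abs(spectra).T                               # Step 4: (frequency, frame)

# Example: a 440 Hz tone sampled at 8 kHz; the energy concentrates
# around DFT bin 440 / 8000 * 1024 = 56.3
fs = 8000
t = np.arange(fs) / fs
X = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
```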
Fig. 4. An example audio signal (top panel) consisting of the note sequence C4, G4, C4+G4 played on the piano, and its spectrogram (bottom panel).
The resulting data matrix X(i,j) is indexed by frequency index i and frame index j. In the case of multichannel audio where multiple microphones are available, the above procedure is applied separately to each channel, resulting in a 3-D data tensor indexed by time, frequency, and channel.
In some applications, such as signal classification [34] or polyphonic music transcription [35], it suffices to apply tensor factorizations on the data matrices, without the need to reconstruct time-domain audio signals. However, in many applications, such as signal denoising and source separation [36,37], the target output is an audio signal, and there is a need to reconstruct an audio signal based on the factor matrices. Steps 1–3 above are invertible, but Step 4 discards the phase information and is therefore not invertible. There exist methods for reconstructing signals from magnitude spectrograms [38,39], but a simple and efficient approach simply uses the phases of the complex spectrogram obtained at Step 3 and assigns them to the output data matrices.
5.2. Source separation and signal denoising
Many real-world audio signals are mixtures consisting of multiple sources. For example, in music recordings there are typically multiple instruments playing, and speech signals captured by mobile phones contain interfering sources from the background. Many signal processing algorithms (classification, coding, etc.) assume a single source, and therefore there is a large need for source separation algorithms that extract the signal produced by an individual source from a mixture.
In the context of tensor factorizations, the mixture data matrix X(i,j) is modeled as the sum of sources S(i,j,n) (see Section 4.2) as

X(i,j) = ∑_{n=1}^{N} S(i,j,n)    (17)

where n is the source index and N is the number of sources. For example, in the case of speech denoising, we would have N = 2, and source n = 1 would correspond to speech while n = 2 would correspond to noise [40].
Each source is modeled with a tensor factorization. For example, the NMF factorization for source n is given as

S(i,j,n) ≈ Ŝ(i,j,n) = ∑_k Z_1(i,k,n) Z_2(k,j,n)    (18)

By combining the above equations, we can conveniently write the model for the mixture data matrix also as a tensor factorization:
X(i,j) ≈ X̂(i,j) = ∑_{n=1}^{N} ∑_k Z_1(i,k,n) Z_2(k,j,n)    (19)

Provided that the data matrices of the individual sources are correctly estimated, each source can be reconstructed separately according to Eq. (18).
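Once the factors are estimated, a common way to recover the sources from the model of Eqs. (17)–(19) is Wiener-style masking of the mixture spectrogram, which guarantees that the reconstructions sum exactly to the observed mixture. The sketch below uses random placeholders for the estimated factors:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, N = 6, 8, 3, 2               # illustrative sizes

Z1 = rng.random((I, K, N))            # per-source dictionaries (placeholders)
Z2 = rng.random((K, J, N))            # per-source activations (placeholders)
X = rng.random((I, J)) + 1e-6         # mixture magnitude spectrogram

S_hat = np.einsum('ikn,kjn->ijn', Z1, Z2)          # Eq. (18) for every n
X_hat = S_hat.sum(axis=2)                          # Eq. (19)
S_rec = S_hat / X_hat[:, :, None] * X[:, :, None]  # masked reconstruction
```

Each time-frequency cell of the mixture is distributed among the sources in proportion to the model's per-source energies, so S_rec.sum(axis=2) reproduces X exactly.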
Depending on how much a priori knowledge about the sources is utilized, a source separation problem can be termed unsupervised, supervised, or semi-supervised:

• Unsupervised: all the factor matrices are estimated using the mixture signal only [37], and no training data is used to estimate them.
• Supervised: isolated training data of each source exists and has been used to obtain the factor matrices Z_1(i,k,n) representing the source spectra at a training stage; these are then kept fixed [41,40].
• Semi-supervised: isolated training data exists for at least one of the sources and is used to estimate the entries of Z_1(i,k,n) for those n for which training data is available. However, isolated training data does not exist for all the sources, and Z_1(i,k,n) for the rest of the sources needs to be estimated from the mixture data matrix [42,43]. Furthermore, in some cases there might be side information available for some sources, such as a collection of symbolic music data that is related to a certain source. Incorporating such side information into the estimation process via coupled factorization models can further improve the separation accuracy [44].
5.3. Dereverberation
In natural environments the signal recorded by a microphone is a convolution of a source signal and the impulse response from the source to the microphone. Large amounts of reverberation make the signal less intelligible and difficult to process and analyze with algorithms that assume undistorted source signals. Therefore there is a need for algorithms that either dereverberate the signal or that are able to process and analyze reverberant signals. Tensor factorizations can be used for both purposes.
The data matrix X(i,j) of a reverberant signal can be modeled as

X(i,j) ≈ X̂(i,j) = ∑_t U(i,t) H(i, j−t)
              = ∑_t ∑_d U(i,t) H(i,d) δ(d − j + t)
              = ∑_{t,d} U(i,t) H(i,d) Z_3(d,j,t)    (20)

where δ(x) = 1 if x = 0 and δ(x) = 0 otherwise, U(i,t) is the data matrix of the unreverberant, dry signal, and H(i,d) is the convolution filter in the time-frequency domain [45,46]. The tensor Z_3(d,j,t) = δ(d − j + t) is fixed.
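The last line of Eq. (20) can be checked numerically: with the fixed delta tensor, the einsum contraction reproduces the per-frequency convolution. All sizes and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
I, T, D = 5, 12, 4          # frequencies, frames, filter taps (illustrative)
J = T                       # output frames, truncated to the input length

U = rng.random((I, T))      # dry spectrogram (placeholder)
H = rng.random((I, D))      # per-frequency filter (placeholder)

# Fixed tensor Z3(d, j, t) = delta(d - j + t) of Eq. (20)
Z3 = np.zeros((D, J, T))
for d in range(D):
    for j in range(J):
        for t in range(T):
            if d - j + t == 0:
                Z3[d, j, t] = 1.0

X_hat = np.einsum('it,id,djt->ij', U, H, Z3)

# Equivalently, X_hat(i, j) = sum_t U(i, t) H(i, j - t):
direct = np.zeros((I, J))
for j in range(J):
    for t in range(T):
        if 0 <= j - t < D:
            direct[:, j] += U[:, t] * H[:, j - t]
```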
Note that above we assume that the unreverberant signal matrix U(i,t) and the convolution filter response matrix H(i,d) are