Contents lists available at ScienceDirect
Digital Signal Processing
www.elsevier.com/locate/dsp
Non-negative tensor factorization models for Bayesian audio processing
Umut Şimşekli a,∗, Tuomas Virtanen b, Ali Taylan Cemgil a
a Department of Computer Engineering, Boğaziçi University, 34342, Bebek, İstanbul, Turkey
b Department of Signal Processing, Tampere University of Technology, 33720 Tampere, Finland
Article history:
Available online xxxx

Keywords:
Non-negative matrix and tensor factorization
Coupled factorization
Bayesian audio modeling
Bayesian inference

Abstract
We provide an overview of matrix and tensor factorization methods from a Bayesian perspective, with emphasis on both inference methods and modeling techniques. Factorization-based models and their many extensions, such as tensor factorizations, have proved useful in a broad range of applications, supporting a practical and computationally tractable framework for modeling. Especially in audio processing, tensor models facilitate, in a unified manner, the use of prior knowledge about signals, the data generation processes, as well as available data from different modalities. After a general review of tensor models, we describe the general statistical framework, give examples of several audio applications, and describe modeling strategies for key problems such as deconvolution, source separation, and transcription.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
With the recent advances in sensor and communication technologies, the cost of data acquisition and storage has been significantly reduced. Consequently, the last decade has witnessed a dramatic increase in the amount of data that can be easily collected. One important facet of data processing is extracting meaningful information from highly structured datasets that can be of interest for scientific, financial, or technological purposes.
The key to exploiting the potential of large datasets lies in developing computational techniques that can efficiently extract meaningful information. These computational methods must be scalable and tailored to the specifics of an application, but still be versatile enough to be useful in several scenarios. In this paper, we will focus on audio processing and review one particular class of such models, which provides a favorable balance between high modeling accuracy, ease of implementation, and ease of management of the required computational resources. This class of models, coined under the name of tensor factorization models, along with their Bayesian interpretations, will be the focus of this tutorial paper. The mathematical setup may look somewhat abstract at first sight, but the generic nature of the approach makes tensors suitable for a broad range of applications where complicated
∗ Corresponding author. E-mail addresses: umut.simsekli@boun.edu.tr (U. Şimşekli), tuomas.virtanen@tut.fi (T. Virtanen), taylan.cemgil@boun.edu.tr (A.T. Cemgil).
structured datasets need to be analyzed. In particular, we will show examples in the domain of audio processing, where significant progress has been achieved using tensor methods. While the modeling and inference strategies can be applied in the broader context of general audio and other non-stationary time series analysis, the hierarchical Bayesian nature of the framework makes the approach particularly suitable for the analysis of acoustical signals.
In audio processing, an increasing number of applications are being developed that can handle challenging acoustical conditions and highly variable sound sources. Here, one needs to exploit the inherent structure of acoustic signals to address some of the key problems such as denoising, restoration, interpolation, source separation, transcription, bandwidth extension, upmixing, coding, event recognition, and classification. Not surprisingly, many different modeling techniques have been developed for these purposes.
However, as is the case for computational modeling of all physical phenomena, we face here a tradeoff between accuracy and computational tractability: a physically realistic and accurate model may be too complex to meet the demands of a given application and thus be useful in practice.
Typically, there is a lot of a-priori knowledge available for acoustic signals. This includes knowledge of the physical or cognitive mechanisms by which sounds are generated or perceived, as well as the hierarchical nature by which they are organized in an acoustical scene. In more specific domains, such as music transcription or audio event recognition, more specialized assumptions about the hierarchical organization are needed. Yet, the resulting models often possess complex statistical structure, and highly adaptive and powerful computational techniques are needed to perform inference.
http://dx.doi.org/10.1016/j.dsp.2015.03.011
Factorization-based modeling has been useful in addressing the modeling accuracy versus computational requirement tradeoff in various domains beyond audio signal processing [1], with prominent examples such as text processing [2], bioinformatics [3], computer vision [4], social media analysis [5], and network traffic analysis [6]. The aim in such modeling strategies is to decompose an observed matrix or tensor (multidimensional array) into semantically meaningful factors in order to obtain useful predictions. Meanwhile, the factors themselves also provide a useful feature representation of the specifics of the domain.
In this paper, we review tensor-based statistical models and associated inference methods developed recently for audio and music processing, and describe various extensions and applications of these models. In Section 2, we illustrate the ideas of factorization-based modeling, and then in Section 3 we describe a probabilistic interpretation of these models. The probabilistic interpretation opens up the way for a full Bayesian treatment via Bayesian hierarchical modeling. This leads to a very natural means for unification, allowing the formulation of highly structured probabilistic models for audio data at various levels of abstraction, as we will illustrate in Section 6. The paper concludes with remarks on future research directions.
2. Factorization-based data modeling
In this section, we will describe the basics of factorization-based modeling and cover extensions such as coupled tensor factorizations and non-negative decompositions. This section introduces the main structure and the notation.
In many applications, data can be represented as a matrix, for example, the spectrogram of an audio signal (frequency vs time), a dataset of images (pixel coordinates vs instances), word frequencies among different documents (words vs documents), and the adjacency structure of a graph (nodes vs nodes), to name a few. Here the indices of the matrix correspond to the entities, and the matrix elements describe a relation between the two entities. Matrix Factorization (MF) models are one of the most widely used methods for analyzing data that involve two entities [7–10].
The goal in these models is to calculate a factorization of the form:

X_1(i,j) ≈ X̂_1(i,j) = ∑_k Z_1(i,k) Z_2(k,j)    (1)

where X_1 is the given data matrix, X̂_1 is an approximation to X_1, and Z_1 and Z_2 are the factor matrices to be estimated. Even though we have a single observed matrix in this model, we use a subscript in X_1 since we will consider factorization models that involve more than one observed matrix or tensor later in this section. Here, X_1 is expressed as the product of Z_1 and Z_2, where Z_1 is considered as the dictionary matrix and Z_2 contains the corresponding weights. From another perspective, X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2, as illustrated at the top of Fig. 2. Note that, if Z_1 were fixed, the problem would be equivalent to basis regression, where the weights (expansion coefficients) Z_2 are estimated [11]. In contrast, in matrix factorization the dictionary (the set of basis vectors) is estimated along with the coefficients.
This modeling strategy has been shown to be successful in various fields including signal processing, finance, bioinformatics, and natural language processing [8].
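As a concrete illustration, the model of Eq. (1) can be written in a few lines of NumPy; the dimensions and the random non-negative factors below are arbitrary placeholders, not values from the paper:

```python
import numpy as np

# Sketch of the MF model of Eq. (1): X1_hat(i, j) = sum_k Z1(i, k) Z2(k, j).
# All sizes and factor values are illustrative only.
rng = np.random.default_rng(0)
I, J, K = 5, 4, 2

Z1 = rng.random((I, K))                    # dictionary matrix
Z2 = rng.random((K, J))                    # weight matrix
X1_hat = np.einsum('ik,kj->ij', Z1, Z2)    # model output; identical to Z1 @ Z2
```

Each term Z1[:, k] (a column) times Z2[k, :] (a row) is a rank-one outer-product contribution, and X1_hat is their sum over k.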
Matrix factorization models are applicable when the observed data encapsulate the relation of two different entities (e.g., i and j in Eq. (1)). However, when the data involve multiple entities of interest, such as ternary or higher-order relations, they cannot be represented without loss of structure by using matrices. For example, a multichannel sound library of several instances may be conveniently represented in the time-frequency domain as an object with several entities, say the power at each (frequency, time, channel, instance). One could in principle 'concatenate' each spectrogram across time and instances to obtain a big matrix, say (frequency × channel, time × instance), but this representation would obscure important structural information; compare simply with representing a matrix with a column vector. Hence one naturally needs multiway tables, the so-called tensors, where each element is denoted by T(i,j,k,...). Here, T is the tensor and the indices i,j,k,... are the entities. The number of distinct entities dictates the mode of a tensor. Hence a vector and a matrix are tensors of mode one and two, respectively. Tensors are illustrated in Fig. 1, and we will give a more precise and compact definition in Section 3.

Fig. 1. Illustration of a) a vector X(i): an array with one index, b) a matrix X(i,j): an array with two indices, c) a tensor X(i,j,k): an array with three or more indices. In this study, we refer to vectors as tensors with one mode and to matrices as tensors with two modes.
For modeling multiway arrays with more than two entities, the canonical polyadic decomposition [12,13] (also referred to as CP, PARAFAC, or CANDECOMP) is one of the most popular factorization models. The model, for three entities, is defined as follows:

X_2(i,m,r) ≈ X̂_2(i,m,r) = ∑_k Z_1(i,k) Z_3(m,k) Z_4(r,k)    (2)

where the observed tensor X_2 is decomposed as a product of three different matrices. Analogous to MF models, this model approximates X_2 as the sum of outer products of the columns of Z_1, Z_3, and Z_4, as illustrated at the bottom of Fig. 2. This model has been shown to be useful in chemometrics [14], psychometrics [12], and signal processing [8].
The Tucker model [15] is another important model for analyzing tensors with three modes, and is a generalization of the PARAFAC model. The model is defined as follows:

X_3(i,j,k) ≈ X̂_3(i,j,k) = ∑_p ∑_q ∑_r Z_1(i,p) Z_2(j,q) Z_3(k,r) Z_4(p,q,r)    (3)

where X_3 is expressed as the product of three matrices (Z_{1:3}) and a 'core tensor' (Z_4). When the core tensor Z_4 is chosen as superdiagonal (Z_4(p,q,r) = 0 unless p = q = r), the Tucker decomposition reduces to PARAFAC.
2.1. Coupled factorization models
In certain applications, information from different sources is available and needs to be combined to obtain more accurate predictions [16–20]. In musical audio processing, one example is having a large collection of annotated audio data and a collection of symbolic music scores as side information. Similarly, in product recommendation systems, a customer–product rating matrix can be enhanced with connectivity information from a social network and demographic information about the customers. For these problems, a single factorization model would not be sufficient for exploiting all the information in the data, and we need to develop more comprehensive modeling strategies in order to be able to combine different data sources in a single factorization model. Such models are called coupled tensor factorization methods, where the aim is to simultaneously factorize multiple observed tensors that share a set of latent factors.

Fig. 2. Coupled MF-PARAFAC illustration. The observed matrix X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2. Similarly, X_2 is approximated as the sum of outer products of the columns of Z_1, Z_3, and Z_4. The overall model is coupled since the matrix Z_1 is shared in both factorizations. Here K denotes the size of the index k: k ∈ {1,...,K}.
Let us consider an example coupled matrix–tensor factorization model where two observed tensors X_1 and X_2 are collectively decomposed as

X_1(i,j) ≈ X̂_1(i,j) = ∑_k Z_1(i,k) Z_2(k,j)
X_2(i,m,r) ≈ X̂_2(i,m,r) = ∑_k Z_1(i,k) Z_3(m,k) Z_4(r,k)    (4)

where X_1 is decomposed by using an MF model and X_2 is decomposed by using a PARAFAC model. The factor Z_1 is the 'shared factor' in both decompositions, making the overall model coupled.
Fig. 2 illustrates this model. In Section 6, we will illustrate the usefulness of various factorization models in audio processing applications.
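A minimal sketch of the coupled model in Eq. (4): the two model outputs are computed from four factors, with Z_1 shared between them (all sizes and values below are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M, R, K = 4, 5, 3, 2, 6   # illustrative sizes

Z1 = rng.random((I, K))   # the shared factor
Z2 = rng.random((K, J))
Z3 = rng.random((M, K))
Z4 = rng.random((R, K))

X1_hat = np.einsum('ik,kj->ij', Z1, Z2)           # MF part of Eq. (4)
X2_hat = np.einsum('ik,mk,rk->imr', Z1, Z3, Z4)   # PARAFAC part of Eq. (4)
```

Because Z1 appears in both contractions, any change to Z1 during inference affects both X1_hat and X2_hat; this is exactly what couples the two factorizations.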
2.2. Non-negative tensor factorizations
So far, we have described some of the most important matrix and tensor factorization models. However, even if two factorization models have the exact same topology (e.g., both are MF models with the same number of parameters), the factorizations might have completely different interpretations depending on the constraints placed on the latent factors. For instance, in MF models, one option is to leave the factors unconstrained. On the other hand, we can impose orthogonality constraints on the factors, in which case the factorization turns into principal component analysis.
Even if we place highly restrictive constraints such as orthogonality, the estimated factors would be dense, and their physical interpretations in applications would be quite limited as long as their elements are allowed to take both positive and negative values. In this study, we will consider non-negative factorization models [8], where we restrict all the elements of the factors to be non-negative. Here, the non-negativity constraint implicitly imposes an additive structure on the model, where the contributions of the latent factors are always added, since there will not be any cancellations due to negative values, as opposed to the aforementioned cases. Therefore, this strategy promotes sparsity on the factors, since most of the entries in the factors must be close to zero in order for the model to be able to fit the data. More importantly, the estimated factors will have physical interpretations that might be essential in many fields, such as audio processing.
On the other hand, as we will describe in more detail in Sections 3 and 4.2, by modeling tensors with probabilistic tensor factorization models, we essentially decompose the parameters of a probabilistic model that are non-negative by definition (e.g., the intensity of a Poisson distribution or the mean of a gamma distribution) and are constructed as the sum of non-negative sources [9]. In this modeling strategy, the non-negativity constraint on the factors is a necessity rather than an option.
In audio processing, which is the main application focus of this study, we model the energies of signals in the time-frequency domain, known as the magnitude or power spectra. For modeling these spectra, the non-negativity constraint turns out to be very natural, since realistic sounds can be viewed as being composed of purely additive components, as will be detailed in Section 5. For example, music signals consist of multiple instruments, and the signal of each instrument consists of multiple notes played by the instrument. Speech signals consist of basic units such as phonemes and words. Cancellation of sounds happens only intentionally and in very specific scenarios, for example in echo cancellation systems.
3. Probabilistic modeling of non-negative tensor factorizations
In applications, as we will demonstrate in Section 6, we often need to come up with custom model topologies, where either the observed tensors or the latent factors involve multiple entities and cannot be represented by using matrices without loss of important structural information. In order to be able to model real-world datasets that might consist of several tensors and require custom factorization models, we need to handle a broad variety of model topologies.

The Generalized Coupled Tensor Factorization (GCTF) framework [21] is a generalization of matrix and tensor factorization models to jointly factorize multiple tensors. The formal definition of the GCTF framework is as follows:
X_ν(u_ν) ≈ X̂_ν(u_ν) = ∑_{ū_ν} ∏_α Z_α(v_α)^{R_{ν,α}}    (5)

where ν = 1,...,N_x is the observed tensor index and α = 1,...,N_z is the factor index. In this framework, the goal is to compute an approximate factorization of the given observed tensors X_ν in terms of a product of individual factors Z_α, some of which are possibly shared. Here, we define

V as the set of all indices found in a model,
U_ν as the set of visible indices of the tensor X_ν,
V_α as the set of indices in Z_α,
Ū_ν = V \ U_ν as the set of invisible indices that are not present in X_ν.

We use small letters such as v_α to refer to a particular setting of indices in V_α (similarly u_ν ∈ U_ν and ū_ν ∈ Ū_ν). Furthermore, R is a coupling matrix that is defined as follows: R_{ν,α} = 1 if X_ν and Z_α are connected, and R_{ν,α} = 0 otherwise. In other words, the coupling matrix R_{ν,α} specifies the factors Z_α that affect the observed tensor X_ν. As the product ∏_α Z_α(v_α) is collapsed over a set of indices, the factorization is latent. Here, we consider non-negative observations and factors: X_ν(u_ν) ≥ 0 and Z_α(v_α) ≥ 0. This rather abstract notation is needed to express increasingly complicated tensor models without being limited to a particular topology.

Table 1
Illustration of different factorization models in the GCTF notation. Here N_x is the number of observed tensors, N_z is the number of latent factors, V is the set of all indices in the model, U_ν are the sets of indices of X_ν, V_α are the sets of indices of Z_α, and R is the coupling matrix.

Model | N_x | N_z | V | U_ν | V_α | R
MF (Eq. (1)) | 1 | 2 | {i,j,k} | {i,j} | {i,k}, {k,j} | [1,1]
PARAFAC (Eq. (2)) | 1 | 3 | {i,j,k,r} | {i,j,k} | {i,r}, {j,r}, {k,r} | [1,1,1]
TUCKER (Eq. (3)) | 1 | 4 | {i,j,k,p,q,r} | {i,j,k} | {i,p}, {j,q}, {k,r}, {p,q,r} | [1,1,1,1]
MF-PARAFAC (Eq. (4)) | 2 | 4 | {i,j,k,m,r} | {i,j}, {i,m,r} | {i,k}, {k,j}, {m,k}, {r,k} | [1,1,0,0; 1,0,1,1]

Table 2
Tweedie distributions with corresponding normalizing constants and divergence forms. The general form of the distribution is given in Eq. (7).

p | Divergence | Distribution | Normalizing constant K(x,φ,p) | Divergence form d_p(x||x̂)
2−β | β-divergence | Tweedie | K(x,φ,p) | d_p(x||x̂)
0 | Euclidean | Gaussian | (2πφ)^{1/2} | (1/2)(x − x̂)²
1 | Kullback–Leibler | Poisson | e^{x/φ} Γ(x/φ + 1) / (x/φ)^{x/φ} | x log(x/x̂) − x + x̂
2 | Itakura–Saito | Gamma | Γ(1/φ)(eφ)^{1/φ} x | x/x̂ − log(x/x̂) − 1
3 | – | Inverse Gaussian | (2πx³φ)^{1/2} | (1/2)(x − x̂)² / (x x̂²)
In order to illustrate the framework, we define the coupled PARAFAC-MF model of Eq. (4) in the GCTF notation as follows. The observed index sets are given as U_1 = {i,j} and U_2 = {i,m,r}. The index sets of the factors are given as V_1 = {i,k}, V_2 = {k,j}, V_3 = {m,k}, V_4 = {r,k}. The coupling matrix is given as R = [1 1 0 0; 1 0 1 1] and indicates that X̂_1 is a function of Z_1 and Z_2, and X̂_2 is a function of Z_1, Z_3, and Z_4. Table 1 lists all the factorization models described earlier in the paper as specific instances of the GCTF notation.
The GCTF framework assumes the following probabilistic model over each scalar element of the observed tensors [21]:

X_ν(u_ν) ∼ TW_{p_ν}(X_ν(u_ν); X̂_ν(u_ν), φ_ν),  ν = 1,...,N_x    (6)

where X̂_{1:N_x} are the model output tensors that are defined in Eq. (5) and TW denotes the so-called Tweedie distribution. Tweedie densities TW_p(x; x̂, φ) can be written in the following moment form:
P(x; x̂, φ, p) = (1 / K(x, φ, p)) exp( −(1/φ) d_p(x||x̂) )    (7)
where x̂ is the mean, φ is the dispersion, p is the power parameter, and d_p(·) denotes the β-divergence defined as follows:

d_p(x||x̂) = x^{2−p} / ((1−p)(2−p)) − x x̂^{1−p} / (1−p) + x̂^{2−p} / (2−p)    (8)

The Tweedie distribution is an important special case of exponential dispersion models, characterized by three parameters: the mean, dispersion, and power. This model is also known as the power variance model, since the variance of the data has the following form: var(x) = φ x̂^p.
By taking appropriate limits, we can verify that the rather arbitrary looking divergence d_p(·) defined in Eq. (8) yields much more familiar divergence functions, namely the squared Euclidean distance, the Kullback–Leibler (KL) divergence, and the Itakura–Saito (IS) divergence for p = 0, 1, 2, respectively. Using a suitable divergence is important in applications; for example, the IS divergence is scale invariant: the divergence between two points x and y does not change when both points are multiplied by the same constant, i.e., d_2(x||y) = d_2(ax||ay) [9]. Physically, this translates to our intuitive notion that the discrepancy between two sound signals should not change by just turning up their volume.
From the probabilistic perspective, different choices of p yield important distributions such as the Gaussian (p = 0), Poisson (p = 1), compound Poisson (1 < p < 2), gamma (p = 2), and inverse Gaussian (p = 3) distributions. Due to a rather technical condition, no Tweedie model exists for the interval 0 < p < 1, but for all other values of p one obtains the very rich family of Tweedie stable distributions [22]. Table 2 illustrates the Tweedie distribution for p ∈ {0,1,2,3}.

An important property of the Tweedie models is that the normalizing constant K(·) does not depend on the mean parameter x̂. Therefore, provided that p and φ are given, it is easy to see that solving a maximum likelihood problem for x̂ is equivalent to minimization of the β-divergence. For instance, for the Gaussian case (see Table 2), the divergence function is the squared Euclidean distance and the dispersion is simply the variance. For all possible p values, we have a similar form; the Tweedie models generalize the established theory of least squares linear regression to more general noise models.
For regularization and incorporating prior knowledge, we place prior distributions over the latent factors. Depending on the application, we might consider different prior distributions on the latent factors. For example, we can choose an exponential prior over the latent factors, which can be used for all cases of the power parameter p:

Z_α(v_α) ∼ E(Z_α(v_α); A_α(v_α))
where E denotes the exponential distribution. We can also choose more sophisticated priors for individual cases of p. For instance, in the case of Poisson observations (p = 1), we can choose a gamma prior model

Z_α(v_α) ∼ G(Z_α(v_α); A_α(v_α), B_α(v_α))

where G denotes the gamma distribution. The definitions of the distributions that are mentioned in this paper are given in Appendix A.
4. Inference
Once we observe the tensors X_{1:N_x}, we would like to make inference in the probabilistic model defined in Eq. (6). Depending on the application, we might be interested in obtaining

• Point estimates, such as maximum likelihood (ML) or maximum a-posteriori (MAP) estimates:
  – ML: Z_{1:N_z} = argmax_{Z_{1:N_z}} log P(X_{1:N_x} | Z_{1:N_z})
  – MAP: Z_{1:N_z} = argmax_{Z_{1:N_z}} log [P(X_{1:N_x} | Z_{1:N_z}) P(Z_{1:N_z})]
• The full posterior distribution: P(Z_{1:N_z} | X_{1:N_x})

where X_{1:N_x} and Z_{1:N_z} denote all the observed tensors and all the latent factors, respectively.
In this section, we will explain inference algorithms for each of these problems. Firstly, we will explain a gradient-based inference algorithm for maximum likelihood or maximum a-posteriori estimation. Then we will explain a Markov Chain Monte Carlo procedure, namely the Gibbs sampler, for sampling the posterior distribution over the latent factors.
4.1. Maximum likelihood and maximum a-posteriori estimation
Given the dispersion φ_ν and power parameters p_ν, ML estimation of the factors Z_{1:N_z} reduces to the problem of minimizing the β-divergence between the observations and the product of the latent factors, given as follows:

minimize ∑_ν ∑_{u_ν} (1/φ_ν) d_{p_ν}( X_ν(u_ν) || ∑_{ū_ν} ∏_α Z_α(v_α)^{R_{ν,α}} )
subject to Z_α(v_α) ≥ 0, ∀α ∈ [N_z], v_α ∈ V_α    (9)

where the power parameter p_ν determines the cost function to be used for X_ν, and the dispersion parameter φ_ν determines the relative weight of the approximation error to X_ν.

ML estimation of the latent factors Z_α can be achieved via iterative methods, by fixing all factors Z_{α'} for α' ≠ α except one Z_α and updating in an alternating fashion. For non-negative data and factors, we present multiplicative update rules that have the following form [21]:

Z_α ← Z_α ∘ [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) ] / [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{1−p_ν}) ]    (10)

where ∘ is the element-wise product and the division operator is also element-wise. The key quantity in the above update equation is the Δ_{α,ν} function, defined as follows:

Δ_{α,ν}(A) = [ ∑_{u_ν ∩ v̄_α} A(u_ν) ∑_{ū_ν ∩ v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} ]    (11)

For updating Z_α, we need to compute this function twice, for the arguments A = X̂_ν^{−p_ν} ∘ X_ν and A = X̂_ν^{1−p_ν}. Even though this function seems complicated, Δ_{α,ν}(X̂_ν^{1−p_ν}) − Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) is nothing but the gradient of d_β(X_ν || X̂_ν) with respect to Z_α, as detailed in Appendix B. Intuitively, the terms (X̂_ν^{−p_ν} ∘ X_ν) and X̂_ν^{1−p_ν} come from the derivative of the β-divergence, and the term ∑_{ū_ν ∩ v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} is just the derivative of X̂_ν with respect to Z_α, where the product is over all the factors but Z_α (i.e., α' ≠ α). The monotonicity of these update rules for N_x = 1 is analyzed in [23].

For MAP inference, the objective given in Eq. (9) will have an additional regularization term that involves Z_α. The resulting algorithm for MAP inference turns out to be similar to the ML scheme.
For exponential priors over the factors, the update equation becomes a simple modification:

Z_α ← Z_α ∘ [ ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{−p_ν} ∘ X_ν) ] / [ A_α + ∑_ν R_{ν,α} φ_ν^{−1} Δ_{α,ν}(X̂_ν^{1−p_ν}) ]    (12)

For other conjugate priors, the update rules have similar forms [24]. The benefit of the multiplicative updates is that they can be applied to any tensor factorization model and are rather simple to implement. The downside of the multiplicative updates is that they typically require a large number of iterations to converge. Alternative optimization methods based on second-order optimization [25–27] and active-set methods [28,27] can also be applied for specific types of models.
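For the simplest configuration (a single observed matrix, the MF model of Eq. (1), KL divergence, i.e. p = 1 and φ = 1), the general update of Eq. (10) reduces to the classical multiplicative NMF updates of Lee and Seung. The following sketch, with arbitrary synthetic data, verifies that the divergence is non-increasing under the updates:

```python
import numpy as np

# Multiplicative updates of Eq. (10) specialized to Nx = 1, MF, p = 1:
# for this case X_hat^{-p} o X = X / X_hat and X_hat^{1-p} is all ones,
# so the Delta terms become matrix products with the fixed factor.
rng = np.random.default_rng(0)
I, J, K = 20, 30, 5
X = rng.random((I, J)) + 1e-3       # non-negative "data" (illustrative)

Z1 = rng.random((I, K)) + 1e-3
Z2 = rng.random((K, J)) + 1e-3

def kl_div(x, x_hat):
    return np.sum(x * np.log(x / x_hat) - x + x_hat)

losses = []
for _ in range(200):
    X_hat = Z1 @ Z2
    Z1 *= ((X / X_hat) @ Z2.T) / Z2.sum(axis=1)          # update Z1
    X_hat = Z1 @ Z2
    Z2 *= (Z1.T @ (X / X_hat)) / Z1.sum(axis=0)[:, None]  # update Z2
    losses.append(kl_div(X, Z1 @ Z2))
```

As stated for N_x = 1, the KL divergence is monotonically non-increasing here: losses[-1] <= losses[0].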
4.2. Full Bayesian inference via the Gibbs sampler
Maximum likelihood and maximum a-posteriori estimation methods provide useful and practical tools that can be used in various applications. However, in certain cases they are prone to over-fitting, since they fall short of capturing the uncertainties that arise in the inference process. Instead of aiming to obtain single point estimates of the latent variables, full Bayesian inference aims to characterize the full posterior distribution over the latent variables. Apart from being able to handle the uncertainties and therefore being robust to over-fitting, full Bayesian inference has many advantages over the point estimation methods in various tasks such as the model selection problem. In this section, we will develop a Monte Carlo method for characterizing the full posterior over the latent factors.
Monte Carlo methods are a set of numerical techniques to estimate expectations of the form:

⟨ϕ(x)⟩_{π(x)} = ∫ ϕ(x) π(x) dx ≈ (1/N) ∑_{i=1}^{N} ϕ(x^{(i)})    (13)

where ⟨ϕ(x)⟩_{π(x)} denotes the expectation of the function ϕ(x) under the distribution π(x), and the x^{(i)} are independent samples drawn from the target distribution π(x), which will be the posterior distribution in our case. Under mild conditions on the test function ϕ, this estimate converges to the true expectation as N goes to infinity. The challenge here is obtaining independent samples from a nonstandard target density π.

Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel T, that is, one generates x^{(i+1)} conditioned on x^{(i)} as follows:

x^{(i+1)} ∼ T(x | x^{(i)})    (14)

Fig. 3. Graphical representation of the GCTF framework. The nodes represent the random variables and the arrows represent the conditional independence structure.

The transition kernel T does not need to be formed explicitly in practice; we only need a procedure that samples a new configuration given the previous one. Perhaps surprisingly, even though these subsequent samples are correlated, Eq. (13) remains valid, and the estimated expectations converge to their true values as the number of samples goes to infinity, provided that T satisfies certain ergodicity conditions. In order to design a transition kernel T such that the desired distribution is the stationary distribution, that is, π(x) = ∫ T(x|x') π(x') dx', various strategies can be applied, the most popular one being the Metropolis–Hastings (MH) algorithm [29]. One particularly convenient and simple MH strategy is the Gibbs sampler, where one samples each block of variables from the so-called full conditional distributions. Deriving a single Gibbs sampler for all cases of the power parameter p is not straightforward; therefore one needs to focus on individual cases of the Tweedie family. For the Poisson model (p = 1), we define the following augmented generative model:
Z_α(v_α) ∼ G(Z_α(v_α); A_α(v_α), B_α(v_α))      (factor priors)
λ_ν(v) = ∏_α Z_α(v_α)^{R_{ν,α}}                  (intensities)
S_ν(v) | Z_{1:N_z} ∼ PO(S_ν(v); λ_ν(v))          (sources)
X_ν(u_ν) = ∑_{ū_ν} S_ν(v)                        (outputs)

In the model defined in Eq. (6), the observed tensors X_ν directly depend on the latent factors Z_α. Here, we form a so-called composite model where we augment the model in Eq. (6) by defining the intensity tensors λ_ν and the source tensors S_ν as intermediate layers; in this model, the observed tensors are deterministic functions of the sources. Fig. 3 illustrates this model. Note that the Tweedie models with p = 0 and p = 2 can also be represented as composite models [30]; however, the other cases have not been explored in the literature.
The Gibbs sampler for the GCTF model with Poisson observations can be formed by iteratively drawing samples from the full conditional distributions as follows:

S_ν^{(i+1)} ∼ p(S | Z_{1:N_z}^{(i)}, X_ν, Θ),  ν = 1,...,N_x    (15)
Z_α^{(i+1)} ∼ p(Z_α | S_{1:N_x}^{(i)}, Z_{¬α}, X_{1:N_x}, Θ),  α = 1,...,N_z    (16)

where Z_{¬α} denotes the most recent values of all the factors but Z_α, Θ denotes the prior distribution parameters {A_α, B_α}_{α=1}^{N_z}, and the full conditionals are defined as:

p(S_ν | ·) = ∏_{u_ν} M( S_ν(u_ν, Ū_ν); X_ν(u_ν), λ_ν(u_ν, Ū_ν) / X̂_ν(u_ν) )
p(Z_α | ·) = ∏_{v_α} G( Z_α(v_α); Â_α(v_α), B̂_α(v_α) )

where the gamma posterior parameters (denoted here by Â_α and B̂_α) are

Â_α(v_α) = A_α(v_α) + [ ∑_ν R_{ν,α} ( ∑_{v̄_α} S_ν(v) ) ]
B̂_α(v_α) = B_α(v_α) + [ ∑_ν R_{ν,α} ( ∑_{v̄_α} ∏_{α'≠α} Z_{α'}(v_{α'})^{R_{ν,α'}} ) ]

Here, M denotes the multinomial distribution. Verbally, given a particular instance of the observed indices u_ν, the full conditional of S_ν is a multinomial distribution over all the latent indices Ū_ν. Note that the algorithms presented in [10] and [31] are special cases of the presented sampling algorithm.
The marginal likelihood of the observed data under a tensor factorization model, p(X_{1:N_x}), is often necessary for certain problems such as model selection. By using the samples generated by the Gibbs sampler, the marginal likelihood P(X_{1:N_x}) can be estimated by using Chib's method [32]. Moreover, more efficient samplers can be formed by using space alternating data augmentation [33,31].
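For the Poisson MF case (Eqs. (15)–(16) with N_x = 1 and the model of Eq. (1)), the sampler alternates between multinomial draws of the sources and gamma draws of the factors. The sketch below is a toy instance with illustrative hyperparameters and synthetic count data, not the general GCTF sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 10, 12, 3
A, B = 1.0, 1.0                        # gamma prior shape and rate (illustrative)

X = rng.poisson(5.0, size=(I, J))      # synthetic count data
Z1 = rng.gamma(A, 1.0 / B, size=(I, K))
Z2 = rng.gamma(A, 1.0 / B, size=(K, J))

for _ in range(50):
    # Eq. (15): given the factors, the sources S(i, j, :) follow a
    # multinomial over the latent index k with total count X(i, j)
    P = np.einsum('ik,kj->ijk', Z1, Z2)
    P /= P.sum(axis=2, keepdims=True)
    S = np.array([[rng.multinomial(X[i, j], P[i, j]) for j in range(J)]
                  for i in range(I)])                     # shape (I, J, K)

    # Eq. (16): given the sources, the factors have gamma full conditionals
    Z1 = rng.gamma(A + S.sum(axis=1), 1.0 / (B + Z2.sum(axis=1)))
    Z2 = rng.gamma(A + S.sum(axis=0).T, 1.0 / ((B + Z1.sum(axis=0))[:, None]))
```

By construction, the sampled sources always satisfy the deterministic output constraint: S.sum(axis=2) equals X exactly at every iteration.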
5. Audio processing applications
In addition to their rich theoretical structure, non-negative tensor factorizations have plenty of practical applications. Because of their ability to efficiently model the high-level structure of audio signals, tensor factorizations have been used to tackle grand signal processing challenges such as source separation and robust pattern recognition. They have also provided completely new solutions to problems that are more specific to audio signals, such as audio bandwidth extension, dereverberation, and audio upmixing. We will first describe the audio representations that are used in tensor factorizations, and then describe different audio processing applications where tensor factorizations have been successfully applied.
5.1. Time-frequency representation
An audio signal represents the air pressure as a function of time and contains both positive and negative values. In the simplest scenario the signal is recorded with only one microphone, and the signal is therefore one-dimensional. In order to enable modeling these kinds of audio signals with non-negative tensor factorization, we typically represent them using a spectrogram that characterizes the intensity of sound as a function of time and frequency. Fig. 4 shows an example audio signal and its spectrogram.
Unlike raw audio signals, which can show huge variations, spectrogram representations of sounds typically possess easily perceivable patterns. For example, in the example figure, one can see that the notes are harmonic (there are large amplitudes at regular frequency intervals), and the spectrum of each note is rather static over time. This allows modeling them efficiently with non-negative tensor factorizations. Sound types with more diverse characteristics can also be modeled with tensor factorizations, provided that an appropriate model that takes into account the structure of the sound is used.
A standard procedure to calculate the spectrogram consists of the following steps:

1. Segment the signal into fixed-length frames. Typical frame lengths range between 10 ms and 100 ms. Adjacent frames typically overlap by 50% or 25%.
2. Window each frame with a window function, such as the Hamming window, to smooth discontinuities at frame boundaries.
3. Calculate the spectrum within each frame by applying the discrete Fourier transform.
4. Calculate the magnitude spectrum by taking the absolute value of each entry of the spectrum. The spectrum can also be decimated, e.g., by integrating magnitudes within Mel bands.
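Steps 1–4 can be sketched directly in NumPy; the frame length, hop size, and the 440 Hz test tone below are arbitrary choices for illustration:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Steps 1-4 above: frame, window, DFT, magnitude."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)                         # Step 2
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])          # Step 1
    spectra = np.fft.rfft(frames * window, axis=1)         # Step 3
    return np.abs(spectra).T                               # Step 4: (frequency, frame)

# Example: a 440 Hz tone sampled at 8 kHz; the energy concentrates
# around DFT bin 440 / 8000 * 1024 = 56.3
fs = 8000
t = np.arange(fs) / fs
X = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
```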
Fig. 4. An example audio signal (top panel) consisting of the note sequence C4, G4, C4+G4 played on the piano, and its spectrogram (bottom panel).
The resulting data matrix X(i,j) is indexed by frequency index i and frame index j. In the case of multichannel audio where multiple microphones are available, the above procedure is applied separately to each channel, resulting in a 3-D data tensor indexed by time, frequency, and channel.
In some applications, such as signal classification [34] or polyphonic music transcription [35], it suffices to apply tensor factorizations on the data matrices, without the need to reconstruct time-domain audio signals. However, in many applications, such as signal denoising and source separation [36,37], the target output is an audio signal, and there is a need to reconstruct an audio signal based on the factor matrices. Steps 1–3 above are invertible, but Step 4 discards the phase information and is therefore not invertible. There exist methods for reconstructing signals from magnitude spectrograms [38,39], but a simple and efficient approach simply uses the phases of the complex spectrogram obtained at Step 3 and assigns them to the output data matrices.
5.2. Source separation and signal denoising
Many real-world audio signals are mixtures consisting of multiple sources. For example, in music recordings there are typically multiple instruments playing, and speech signals captured by mobile phones contain interfering sources from the background. Many signal processing algorithms (classification, coding, etc.) assume a single source, and therefore there is a large need for source separation algorithms that extract the signal produced by an individual source from a mixture.
In the context of tensor factorizations, the mixture data matrix X(i,j) is modeled as the sum of sources S(i,j,n) (see Section 4.2) as

X(i,j) = ∑_{n=1}^{N} S(i,j,n)    (17)

where n is the source index and N is the number of sources. For example, in the case of speech denoising, we would have N = 2, and source n = 1 would correspond to speech while n = 2 would correspond to noise [40].
Each source is modeled with a tensor factorization. For example, the NMF factorization for source n is given as

S(i,j,n) ≈ Ŝ(i,j,n) = ∑_k Z_1(i,k,n) Z_2(k,j,n)    (18)

By combining the above equations, we can conveniently write the model for the mixture data matrix also as a tensor factorization:
X(i,j) ≈ X̂(i,j) = ∑_{n=1}^{N} ∑_k Z_1(i,k,n) Z_2(k,j,n)    (19)

Provided that the data matrices of the individual sources are correctly estimated, each source can be reconstructed separately according to Eq. (18).
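Once the factors are estimated, a common way to recover the sources from the model of Eqs. (17)–(19) is Wiener-style masking of the mixture spectrogram, which guarantees that the reconstructions sum exactly to the observed mixture. The sketch below uses random placeholders for the estimated factors:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, N = 6, 8, 3, 2               # illustrative sizes

Z1 = rng.random((I, K, N))            # per-source dictionaries (placeholders)
Z2 = rng.random((K, J, N))            # per-source activations (placeholders)
X = rng.random((I, J)) + 1e-6         # mixture magnitude spectrogram

S_hat = np.einsum('ikn,kjn->ijn', Z1, Z2)          # Eq. (18) for every n
X_hat = S_hat.sum(axis=2)                          # Eq. (19)
S_rec = S_hat / X_hat[:, :, None] * X[:, :, None]  # masked reconstruction
```

Each time-frequency cell of the mixture is distributed among the sources in proportion to the model's per-source energies, so S_rec.sum(axis=2) reproduces X exactly.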
Depending on how much a priori knowledge about the sources is utilized, a source separation problem can be termed unsupervised, supervised, or semi-supervised:

• Unsupervised: all the factor matrices are estimated using the mixture signal only [37], and no training data is used to estimate them.
• Supervised: isolated training data of each source exists and has been used to obtain the factor matrices Z_1(i,k,n) representing the source spectra at a training stage; these are then kept fixed [41,40].
• Semi-supervised: isolated training data exists for at least one of the sources and is used to estimate the entries of Z_1(i,k,n) for those n for which training data is available. However, isolated training data does not exist for all the sources, and Z_1(i,k,n) for the rest of the sources needs to be estimated from the mixture data matrix [42,43]. Furthermore, in some cases there might be side information available for some sources, such as a collection of symbolic music data that is related to a certain source. Incorporating such side information into the estimation process via coupled factorization models can further improve the separation accuracy [44].
5.3. Dereverberation
In natural environments the signal recorded by a microphone is a convolution of a source signal and the impulse response from the source to the microphone. Large amounts of reverberation make the signal less intelligible and difficult to process and analyze with algorithms that assume undistorted source signals. Therefore there is a need for algorithms that either dereverberate the signal or that are able to process and analyze reverberant signals. Tensor factorizations can be used for both purposes.
The data matrix X(i,j) of a reverberant signal can be modeled as

X(i,j) ≈ X̂(i,j) = ∑_t U(i,t) H(i, j−t)
              = ∑_t ∑_d U(i,t) H(i,d) δ(d − j + t)
              = ∑_{t,d} U(i,t) H(i,d) Z_3(d,j,t)    (20)

where δ(x) = 1 if x = 0 and δ(x) = 0 otherwise, U(i,t) is the data matrix of the unreverberant, dry signal, and H(i,d) is the convolution filter in the time-frequency domain [45,46]. The tensor Z_3(d,j,t) = δ(d − j + t) is fixed.
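The last line of Eq. (20) can be checked numerically: with the fixed delta tensor, the einsum contraction reproduces the per-frequency convolution. All sizes and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
I, T, D = 5, 12, 4          # frequencies, frames, filter taps (illustrative)
J = T                       # output frames, truncated to the input length

U = rng.random((I, T))      # dry spectrogram (placeholder)
H = rng.random((I, D))      # per-frequency filter (placeholder)

# Fixed tensor Z3(d, j, t) = delta(d - j + t) of Eq. (20)
Z3 = np.zeros((D, J, T))
for d in range(D):
    for j in range(J):
        for t in range(T):
            if d - j + t == 0:
                Z3[d, j, t] = 1.0

X_hat = np.einsum('it,id,djt->ij', U, H, Z3)

# Equivalently, X_hat(i, j) = sum_t U(i, t) H(i, j - t):
direct = np.zeros((I, J))
for j in range(J):
    for t in range(T):
        if 0 <= j - t < D:
            direct[:, j] += U[:, t] * H[:, j - t]
```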
Note that above we assume that the unreverberant signal matrix U(i,t) and the convolution filter response matrix H(i,d) are