
Non-negative tensor factorization models for Bayesian audio processing

Umut Şimşekli (a,*), Tuomas Virtanen (b), Ali Taylan Cemgil (a)

a Department of Computer Engineering, Boğaziçi University, 34342, Bebek, İstanbul, Turkey
b Department of Signal Processing, Tampere University of Technology, 33720 Tampere, Finland

Article history: Available online xxxx

Keywords: Non-negative matrix and tensor factorization; Coupled factorization; Bayesian audio modeling; Bayesian inference

Abstract

We provide an overview of matrix and tensor factorization methods from a Bayesian perspective, with emphasis on both inference methods and modeling techniques. Factorization-based models and their many extensions, such as tensor factorizations, have proved useful in a broad range of applications, supporting a practical and computationally tractable framework for modeling. Especially in audio processing, tensor models enable, in a unified manner, the use of prior knowledge about signals and the data generation processes, as well as available data from different modalities. After a general review of tensor models, we describe the general statistical framework, give examples of several audio applications, and describe modeling strategies for key problems such as deconvolution, source separation, and transcription.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

With the recent advances in sensor and communication technologies, the cost of data acquisition and storage has been significantly reduced. Consequently, the last decade has witnessed a dramatic increase in the amount of data that can be easily collected. One important facet of data processing is extracting meaningful information from highly structured datasets that can be of interest for scientific, financial, or technological purposes.

The key to exploiting the potential of large datasets lies in developing computational techniques that can efficiently extract meaningful information. These computational methods must be scalable and tailored to the specifics of an application, but still be versatile enough to be useful in several scenarios. In this paper, we will focus on audio processing and review one particular class of such models, which provides a favorable balance between high modeling accuracy, ease of implementation, and ease of management of the required computational resources. This class of models, coined under the name of tensor factorization models, along with their Bayesian interpretations, will be the focus of this tutorial paper. The mathematical setup may look somewhat abstract at first sight, but the generic nature of the approach makes tensors suitable for a broad range of applications where complicated structured datasets need to be analyzed. In particular, we will show examples in the domain of audio processing, where significant progress has been achieved using tensor methods. While the modeling and inference strategies can be applied in the broader context of general audio and other non-stationary time series analysis, the hierarchical Bayesian nature of the framework makes the approach particularly suitable for the analysis of acoustical signals.

* Corresponding author.
E-mail addresses: umut.simsekli@boun.edu.tr (U. Şimşekli), tuomas.virtanen@tut.fi (T. Virtanen), taylan.cemgil@boun.edu.tr (A.T. Cemgil).

In audio processing, an increasing number of applications are developed that can handle challenging acoustical conditions and highly variable sound sources. Here, one needs to exploit the inherent structure of acoustic signals to address some of the key problems such as denoising, restoration, interpolation, source separation, transcription, bandwidth extension, upmixing, coding, event recognition, and classification. Not surprisingly, many different modeling techniques have been developed for these purposes. However, as is the case for computational modeling of all physical phenomena, we face here a tradeoff: accuracy versus computational tractability. A physically realistic and accurate model may be too complex to meet the demands of a given application and be useful in practice.

Typically, there is a lot of a-priori knowledge available for acoustic signals. This includes knowledge of the physical or cognitive mechanisms by which sounds are generated or perceived, as well as the hierarchical nature by which they are organized in an acoustical scene. In more specific domains, such as music transcription or audio event recognition, more specialized assumptions about the hierarchical organization are needed.


Yet, the resulting models often possess complex statistical structure, and highly adaptive and powerful computational techniques are needed to perform inference.

Factorization-based modeling has been useful in addressing the modeling accuracy versus computational requirement tradeoff in various domains beyond audio signal processing [1], with prominent examples such as text processing [2], bioinformatics [3], computer vision [4], social media analysis [5], and network traffic analysis [6]. The aim in such modeling strategies is to decompose an observed matrix or tensor (multidimensional array) into semantically meaningful factors in order to obtain useful predictions. Meanwhile, the factors themselves also provide a useful feature representation of the specifics of the domain.

In this paper, we review tensor-based statistical models and associated inference methods developed recently for audio and music processing, and describe various extensions and applications of these models. In Section 2, we illustrate the ideas of factorization-based modeling, and then in Section 3 we describe a probabilistic interpretation of these models. The probabilistic interpretation opens up the way for a full Bayesian treatment via Bayesian hierarchical modeling. This leads to a very natural means for unification, allowing the formulation of highly structured probabilistic models for audio data at various levels of abstraction, as we will illustrate in Section 6. The paper concludes with remarks on future research directions.

2. Factorization-based data modeling

In this section, we will describe the basics of factorization-based modeling and introduce extensions such as coupled tensor factorizations and non-negative decompositions, along with the main structure and notation.

In many applications, data can be represented as a matrix, for example, the spectrogram of an audio signal (frequency vs time), a dataset of images (pixel coordinates vs instances), word frequencies among different documents (words vs documents), and the adjacency structure of a graph (nodes vs nodes), to name a few. Here the indices of the matrix correspond to the entities, and the matrix elements describe a relation between the two entities. Matrix Factorization (MF) models are one of the most widely used methods for analyzing data that involve two entities [7–10]. The goal in these models is to calculate a factorization of the form:

$$X_1(i,j) \approx \hat{X}_1(i,j) = \sum_k Z_1(i,k)\, Z_2(k,j) \tag{1}$$

where X_1 is the given data matrix, \hat{X}_1 is an approximation to X_1, and Z_1 and Z_2 are factor matrices to be estimated. Even though we have a single observed matrix in this model, we use a subscript in X_1 since we will consider factorization models that involve more than one observed matrix or tensor later in this section. Here, X_1 is expressed as the product of Z_1 and Z_2, where Z_1 is considered the dictionary matrix and Z_2 contains the corresponding weights. From another perspective, X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2, as illustrated at the top of Fig. 2. Note that if Z_1 were fixed, the problem would be equivalent to basis regression, where the weights (expansion coefficients) Z_2 are estimated [11]. In contrast, in matrix factorization the dictionary (the set of basis vectors) is estimated along with the coefficients. This modeling strategy has been shown to be successful in various fields including signal processing, finance, bioinformatics, and natural language processing [8].
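To make Eq. (1) concrete, the following NumPy sketch (array names and sizes are arbitrary, not from the paper) builds \hat{X}_1 both as a matrix product and as the equivalent sum of K rank-one terms:

```python
import numpy as np

I, J, K = 40, 30, 5
rng = np.random.default_rng(0)
Z1 = rng.random((I, K))    # dictionary matrix Z1(i, k)
Z2 = rng.random((K, J))    # weight matrix Z2(k, j)

# Eq. (1): Xhat1(i, j) = sum_k Z1(i, k) Z2(k, j), i.e. a matrix product.
Xhat1 = Z1 @ Z2

# Equivalent view: the sum of K rank-one terms, one per column of Z1
# paired with the corresponding row of Z2 (top of Fig. 2).
Xhat1_rank1 = sum(np.outer(Z1[:, k], Z2[k, :]) for k in range(K))
assert np.allclose(Xhat1, Xhat1_rank1)
```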

Matrix factorization models are applicable when the observed data encapsulate the relation of two different entities (e.g., i and j in Eq. (1)). However, when the data involve multiple entities

Fig. 1. Illustration of a) a vector X(i): an array with one index, b) a matrix X(i,j): an array with two indices, c) a tensor X(i,j,k): an array with three or more indices. In this study, we refer to vectors as tensors with one mode and matrices as tensors with two modes.

of interest, such as ternary or higher-order relations, they cannot be represented without loss of structure by using matrices. For example, a multichannel sound library of several instances may be represented in the time-frequency domain conveniently as an object with several entities, say the power at each (frequency, time, channel, instance) tuple. One could in principle 'concatenate' each spectrogram across time and instances to obtain a big matrix, say (frequency × channel, time × instance), but this representation would obscure important structural information; compare simply with representing a matrix with a column vector. Hence one needs naturally multiway tables, the so-called tensors, where each element is denoted by T(i,j,k,...). Here, T is the tensor and the indices i,j,k,... are the entities. The number of distinct entities dictates the mode of a tensor. Hence a vector and a matrix are tensors of mode one and two, respectively. Tensors are illustrated in Fig. 1, and we will give a more precise and compact definition in Section 3.

For modeling multiway arrays with more than two entities, the canonical polyadic decomposition [12,13] (also referred to as CP, PARAFAC, or CANDECOMP) is one of the most popular factorization models. The model, for three entities, is defined as follows:

$$X_2(i,m,r) \approx \hat{X}_2(i,m,r) = \sum_k Z_1(i,k)\, Z_3(m,k)\, Z_4(r,k) \tag{2}$$

where the observed tensor X_2 is decomposed as a product of three different matrices. Analogous to MF models, this model approximates X_2 as the sum of 'inner products' of the columns of Z_1, Z_3, and Z_4, as illustrated at the bottom of Fig. 2. This model has been shown to be useful in chemometrics [14], psychometrics [12], and signal processing [8].
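As a sketch of Eq. (2), the PARAFAC reconstruction is a single index contraction; the einsum subscripts below mirror the notation Z_1(i,k) Z_3(m,k) Z_4(r,k) (all sizes are illustrative):

```python
import numpy as np

I, M, R_, K = 20, 15, 10, 4
rng = np.random.default_rng(1)
Z1 = rng.random((I, K))
Z3 = rng.random((M, K))
Z4 = rng.random((R_, K))

# Eq. (2): Xhat2(i, m, r) = sum_k Z1(i, k) Z3(m, k) Z4(r, k).
Xhat2 = np.einsum('ik,mk,rk->imr', Z1, Z3, Z4)
print(Xhat2.shape)  # (20, 15, 10)
```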

The Tucker model [15] is another important model for analyzing tensors with three modes, and is a generalization of the PARAFAC model. The model is defined as follows:

$$X_3(i,j,k) \approx \hat{X}_3(i,j,k) = \sum_p \sum_q \sum_r Z_1(i,p)\, Z_2(j,q)\, Z_3(k,r)\, Z_4(p,q,r) \tag{3}$$

where X_3 is expressed as the product of three matrices (Z_{1:3}) and a 'core tensor' (Z_4). When the core tensor Z_4 is chosen to be superdiagonal (Z_4(p,q,r) ≠ 0 only if p = q = r), the Tucker decomposition reduces to PARAFAC.
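The Tucker reconstruction of Eq. (3) can be sketched the same way; the last lines verify numerically that a superdiagonal core recovers the PARAFAC model (all sizes are illustrative):

```python
import numpy as np

I, J, K_, P = 12, 10, 8, 3
rng = np.random.default_rng(2)
Z1 = rng.random((I, P))
Z2 = rng.random((J, P))
Z3 = rng.random((K_, P))
Z4 = rng.random((P, P, P))     # core tensor Z4(p, q, r)

# Eq. (3): Xhat3(i,j,k) = sum_{p,q,r} Z1(i,p) Z2(j,q) Z3(k,r) Z4(p,q,r).
Xhat3 = np.einsum('ip,jq,kr,pqr->ijk', Z1, Z2, Z3, Z4)

# A superdiagonal core (nonzero only when p == q == r) recovers PARAFAC.
core = np.zeros((P, P, P))
core[np.arange(P), np.arange(P), np.arange(P)] = 1.0
parafac = np.einsum('ip,jq,kr,pqr->ijk', Z1, Z2, Z3, core)
assert np.allclose(parafac, np.einsum('ip,jp,kp->ijk', Z1, Z2, Z3))
```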

2.1. Coupled factorization models

In certain applications, information from different sources is available and needs to be combined to obtain more accurate predictions [16–20]. In musical audio processing, one example is having a large collection of annotated audio data and a collection of symbolic music scores as side information. Similarly, in product recommendation systems, a customer–product rating matrix


Fig. 2. Coupled MF-PARAFAC illustration. The observed matrix X_1 is approximated as the sum of outer products of the columns of Z_1 and the rows of Z_2. Similarly, X_2 is approximated as the sum of 'inner products' of the columns of Z_1, Z_3, and Z_4. The overall model is coupled since the matrix Z_1 is shared in both factorizations. Here K denotes the size of the index k: k ∈ {1,...,K}.

can be enhanced with connectivity information from a social network and demographic information about the customers. For these problems, a single factorization model would not be sufficient for exploiting all the information in the data, and we need to develop more comprehensive modeling strategies in order to be able to combine different data sources in a single factorization model. Such models are called coupled tensor factorization methods, where the aim is to simultaneously factorize multiple observed tensors that share a set of latent factors.

Let us consider an example coupled matrix–tensor factorization model where two observed tensors X_1 and X_2 are collectively decomposed as

$$X_1(i,j) \approx \hat{X}_1(i,j) = \sum_k Z_1(i,k)\, Z_2(k,j)$$
$$X_2(i,m,r) \approx \hat{X}_2(i,m,r) = \sum_k Z_1(i,k)\, Z_3(m,k)\, Z_4(r,k) \tag{4}$$

where X_1 is decomposed by using an MF model and X_2 is decomposed by using a PARAFAC model. The factor Z_1 is the 'shared factor' in both decompositions, making the overall model coupled. Fig. 2 illustrates this model. In Section 6, we will illustrate the usefulness of various factorization models on audio processing applications.

2.2. Non-negative tensor factorizations

So far, we have described some of the most important matrix and tensor factorization models. However, even if two factorization models have the exact same topology (e.g., both models are MF models with the same number of parameters), depending on the constraints placed over the latent factors, the factorizations might have completely different interpretations. For instance, in MF models, one option is to leave the factors unconstrained. On the other hand, we can place orthogonality constraints on the factors, in which case the factorization turns into principal component analysis.

Even if we place highly restrictive constraints such as orthogonality, the estimated factors will be dense, and their physical interpretations in applications will be quite limited as long as their elements are allowed to take both positive and negative values. In this study, we will consider non-negative factorization models [8], where we restrict all the elements of the factors to be non-negative. Here, the non-negativity constraint implicitly imposes an additive structure on the model, where the contributions of the latent factors are always added, since there will not be any cancellations due to negative values, as opposed to the aforementioned cases. Therefore, this strategy promotes sparsity in the factors, since most of the entries in the factors will be close to zero in order for the model to be able to fit the data; more importantly, the estimated factors will have physical interpretations that can be essential in many fields, such as audio processing. On the other hand, as we will describe in more detail in Sections 3 and 4.2, by modeling tensors with probabilistic tensor factorization models, we essentially decompose the parameters of a probabilistic model that are non-negative by definition (e.g., the intensity of a Poisson distribution or the mean of a gamma distribution) and are constructed as the sum of non-negative sources [9]. In this modeling strategy, the non-negativity constraint on the factors is a necessity rather than an option.

In audio processing, which is the main application focus of this study, we model the energies of signals in the time-frequency domain, known as the magnitude or power spectra. For modeling these spectra, the non-negativity constraint turns out to be very natural, since realistic sounds can be viewed as being composed of purely additive components, as will be detailed in Section 5. For example, music signals consist of multiple instruments, and the signal of each instrument consists of multiple notes played by the instrument. Speech signals consist of basic units such as phonemes and words. Cancellation of sounds happens only intentionally and in very specific scenarios, for example in echo cancellation systems.

3. Probabilistic modeling of non-negative tensor factorizations

In applications, as we will demonstrate in Section 6, we often need to come up with custom model topologies, where either the observed tensors or the latent factors involve multiple entities and cannot be represented by using matrices without loss of important structural information. In order to be able to model real-world datasets that might consist of several tensors and require custom factorization models, we need to handle a broad variety of model topologies.

The Generalized Coupled Tensor Factorization (GCTF) framework [21] is a generalization of matrix and tensor factorization models to jointly factorize multiple tensors. The formal definition of the GCTF framework is as follows:

$$X_\nu(u_\nu) \approx \hat{X}_\nu(u_\nu) = \sum_{\bar{u}_\nu} \prod_\alpha Z_\alpha(v_\alpha)^{R_{\nu,\alpha}} \tag{5}$$

where ν = 1, ..., N_x is the observed tensor index and α = 1, ..., N_z is the factor index.


Table 1
Illustration of different factorization models in the GCTF notation. Here N_x is the number of observed tensors, N_z is the number of latent factors, V is the set of all indices in the model, U_ν is the set of indices of X_ν, V_α is the set of indices of Z_α, and R is the coupling matrix.

Model                | N_x | N_z | V               | U_ν             | V_α                           | R
MF (Eq. (1))         | 1   | 2   | {i,j,k}         | {i,j}           | {i,k}, {k,j}                  | [1, 1]
PARAFAC (Eq. (2))    | 1   | 3   | {i,j,k,r}       | {i,j,k}         | {i,r}, {j,r}, {k,r}           | [1, 1, 1]
TUCKER (Eq. (3))     | 1   | 4   | {i,j,k,p,q,r}   | {i,j,k}         | {i,p}, {j,q}, {k,r}, {p,q,r}  | [1, 1, 1, 1]
MF-PARAFAC (Eq. (4)) | 2   | 4   | {i,j,k,m,r}     | {i,j}, {i,m,r}  | {i,k}, {k,j}, {m,k}, {r,k}    | [1 1 0 0; 1 0 1 1]

Table 2
Tweedie distributions with corresponding normalizing constants and divergence forms. The general form of the distribution is given in Eq. (7).

p     | Divergence        | Distribution     | Normalizing constant K(x, φ, p)   | Divergence form d_p(x||x̂)
2 − β | β-divergence      | Tweedie          | K(x, φ, p)                        | d_p(x||x̂)
0     | Euclidean         | Gaussian         | (2πφ)^{1/2}                       | (1/2)(x − x̂)²
1     | Kullback–Leibler  | Poisson          | e^{x/φ} Γ(x/φ + 1) (x/φ)^{−x/φ}   | x log(x/x̂) − x + x̂
2     | Itakura–Saito     | Gamma            | Γ(1/φ)(eφ)^{1/φ} x                | x/x̂ − log(x/x̂) − 1
3     |                   | Inverse Gaussian | (2πx³φ)^{1/2}                     | (x − x̂)²/(2x̂²x)

In this framework, the goal is to compute an approximate factorization of the given observed tensors X_ν in terms of a product of individual factors Z_α, some of which are possibly shared. Here, we define

- V as the set of all indices found in a model,
- U_ν as the set of visible indices of the tensor X_ν,
- V_α as the set of indices in Z_α,
- Ū_ν = V \ U_ν as the set of invisible indices that are not present in X_ν.

We use small letters such as v_α to refer to a particular setting of indices in V_α (similarly u_ν ∈ U_ν and ū_ν ∈ Ū_ν). Furthermore, R is a coupling matrix that is defined as follows: R_{ν,α} = 1 if X_ν and Z_α are connected, and R_{ν,α} = 0 otherwise. In other words, the coupling matrix R_{ν,α} specifies the factors Z_α that affect the observed tensor X_ν. As the product Π_α Z_α(v_α)^{R_{ν,α}} is collapsed over a set of indices, the factorization is latent. Here, we consider non-negative observations and factors: X_ν(u_ν) ≥ 0 and Z_α(v_α) ≥ 0. This rather abstract notation is needed to express increasingly complicated tensor models without limiting one to a particular topology.

In order to illustrate the framework, we define the coupled PARAFAC-MF model of Eq. (4) in the GCTF notation as follows. The observed index sets are given as U_1 = {i,j} and U_2 = {i,m,r}. The index sets of the factors are given as V_1 = {i,k}, V_2 = {k,j}, V_3 = {m,k}, V_4 = {r,k}. The coupling matrix is given as R = [1 1 0 0; 1 0 1 1] and indicates that X̂_1 is a function of Z_1 and Z_2, and X̂_2 is a function of Z_1, Z_3, and Z_4. Table 1 lists all the factorization models described earlier in the paper as specific instances of the GCTF notation.
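As a rough illustration (not the authors' implementation), the coupled model of Eq. (4) can be encoded with a coupling matrix R and per-tensor index contractions, echoing Eq. (5) and Table 1; all names and sizes below are ours:

```python
import numpy as np

I, J, M, R_, K = 10, 8, 6, 5, 3
rng = np.random.default_rng(3)
Z = [rng.random((I, K)), rng.random((K, J)),
     rng.random((M, K)), rng.random((R_, K))]   # Z1 ... Z4

# Coupling matrix R(nu, alpha) of Table 1: row nu marks the factors
# that feed the observed tensor X_nu.
R = np.array([[1, 1, 0, 0],    # Xhat1 uses Z1, Z2
              [1, 0, 1, 1]])   # Xhat2 uses Z1, Z3, Z4

# Eq. (5) for this model: each output is a contraction over the
# invisible indices; einsum subscripts play the role of U_nu and V_alpha.
Xhat1 = np.einsum('ik,kj->ij', Z[0], Z[1])
Xhat2 = np.einsum('ik,mk,rk->imr', Z[0], Z[2], Z[3])
```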

The GCTF framework assumes the following probabilistic model over each scalar element of the observed tensors [21]:

$$X_\nu(u_\nu) \sim \mathcal{TW}\big(X_\nu(u_\nu);\, \hat{X}_\nu(u_\nu),\, \phi_\nu\big), \qquad \nu = 1, \ldots, N_x \tag{6}$$

where X̂_{1:N_x} are the model output tensors defined in Eq. (5) and TW denotes the so-called Tweedie distribution. Tweedie densities TW_p(x; x̂, φ) can be written in the following moment form:

$$P(x; \hat{x}, \phi, p) = \frac{1}{K(x, \phi, p)} \exp\!\left(-\frac{1}{\phi}\, d_p(x \,\|\, \hat{x})\right) \tag{7}$$

where x̂ is the mean, φ is the dispersion, p is the power parameter, and d_p(·) denotes the β-divergence, defined as follows:

$$d_p(x \,\|\, \hat{x}) = \frac{x^{2-p}}{(1-p)(2-p)} - \frac{x\, \hat{x}^{1-p}}{1-p} + \frac{\hat{x}^{2-p}}{2-p} \tag{8}$$

The Tweedie distribution is an important special case of exponential dispersion models, characterized by three parameters: the mean, dispersion, and power. This model is also known as the power variance model, since the variance of the data has the following form: var(x) = φ x̂^p.

By taking appropriate limits, we can verify that the rather arbitrary-looking divergence d_p(·) defined in Eq. (8) yields more familiar divergence functions such as the squared Euclidean distance, Kullback–Leibler (KL) divergence, and Itakura–Saito (IS) divergence for p = 0, 1, 2, respectively. Using a suitable divergence is important in applications; for example, the IS divergence is scale invariant: the divergence between two points x and y does not change when both points are multiplied by the same constant, i.e., d_2(x||y) = d_2(ax||ay) [9]. Physically, this translates to our intuitive notion that the discrepancy between two sound signals should not change by just turning up their volume.

From the probabilistic perspective, different choices of p yield important distributions such as the Gaussian (p = 0), Poisson (p = 1), compound Poisson (1 < p < 2), gamma (p = 2), and inverse Gaussian (p = 3) distributions. Due to a rather technical condition, no Tweedie model exists for the interval 0 < p < 1, but for all other values of p one obtains the very rich family of Tweedie stable distributions [22]. Table 2 illustrates the Tweedie distribution for p ∈ {0, 1, 2, 3}.

An important property of the Tweedie models is that the normalizing constant K(·) does not depend on the mean parameter x̂. Therefore, provided that p and φ are given, it is easy to see that solving a maximum likelihood problem for x̂ is equivalent to minimizing the β-divergence. For instance, in the Gaussian case (see Table 2), the divergence function is the squared Euclidean distance and the dispersion is simply the variance. For all possible p values we have a similar form; the Tweedie models generalize the established theory of least squares linear regression to more general noise models.
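A small sketch of Eq. (8), assuming the usual limiting forms at p = 1 and p = 2 listed in Table 2 (the function name is ours):

```python
import numpy as np

def beta_divergence(x, xhat, p):
    """Elementwise d_p(x || xhat) of Eq. (8), with the limiting forms
    at p = 1 (Kullback-Leibler) and p = 2 (Itakura-Saito); see Table 2."""
    x, xhat = np.asarray(x, float), np.asarray(xhat, float)
    if p == 1:
        return x * np.log(x / xhat) - x + xhat
    if p == 2:
        return x / xhat - np.log(x / xhat) - 1.0
    return (x**(2 - p) / ((1 - p) * (2 - p))
            - x * xhat**(1 - p) / (1 - p)
            + xhat**(2 - p) / (2 - p))

print(beta_divergence(3.0, 1.0, 0))    # 2.0 = (1/2)(3 - 1)^2, Euclidean case
print(beta_divergence(3.0, 1.0, 2),
      beta_divergence(30.0, 10.0, 2))  # equal: the IS divergence is scale invariant
```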

For regularization and incorporating prior knowledge, we place prior distributions over the latent factors. Depending on the application, we might consider different prior distributions on the latent factors. For example, we can choose an exponential prior over the latent factors, which can be used for all cases of the power parameter p:


$$Z_\alpha(v_\alpha) \sim \mathcal{E}\big(Z_\alpha(v_\alpha);\, A_\alpha(v_\alpha)\big)$$

where E denotes the exponential distribution. We can also choose more sophisticated priors for individual cases of p. For instance, in the case of Poisson observations (p = 1), we can choose a gamma prior model

$$Z_\alpha(v_\alpha) \sim \mathcal{G}\big(Z_\alpha(v_\alpha);\, A_\alpha(v_\alpha),\, B_\alpha(v_\alpha)\big)$$

where G denotes the gamma distribution. The definitions of the distributions mentioned in this paper are given in Appendix A.

4. Inference

Once we observe the tensors X_{1:N_x}, we would like to perform inference in the probabilistic model defined in Eq. (6). Depending on the application, we might be interested in obtaining

- Point estimates, such as maximum likelihood (ML) or maximum a-posteriori (MAP):
  - ML: $Z^\ast_{1:N_z} = \arg\max_{Z_{1:N_z}} \log P(X_{1:N_x} \,|\, Z_{1:N_z})$
  - MAP: $Z^\ast_{1:N_z} = \arg\max_{Z_{1:N_z}} \log \big[P(X_{1:N_x} \,|\, Z_{1:N_z})\, P(Z_{1:N_z})\big]$
- The full posterior distribution: $P(Z_{1:N_z} \,|\, X_{1:N_x})$

where X_{1:N_x} and Z_{1:N_z} denote all the observed tensors and all the latent factors, respectively.

In this section, we will explain inference algorithms for each of these problems. First, we will explain a gradient-based inference algorithm for maximum likelihood and maximum a-posteriori estimation. Then we will explain a Markov Chain Monte Carlo procedure, namely the Gibbs sampler, for sampling from the posterior distribution over the latent factors.

4.1. Maximum likelihood and maximum a-posteriori estimation

Given the dispersion φ_ν and power parameters p_ν, ML estimation of the factors Z_{1:N_z} reduces to the problem of minimizing the β-divergence between the observations and the product of the latent factors, given as follows:

$$\underset{Z_{1:N_z}}{\text{minimize}} \;\; \sum_\nu \frac{1}{\phi_\nu}\, d_{p_\nu}\!\Bigg(X_\nu(u_\nu)\, \Big\Vert\, \sum_{\bar u_\nu} \prod_\alpha Z_\alpha(v_\alpha)^{R_{\nu,\alpha}}\Bigg) \quad \text{subject to} \;\; Z_\alpha(v_\alpha) \ge 0,\; \alpha \in [N_z],\; v_\alpha \in V_\alpha \tag{9}$$

where the power parameter p_ν determines the cost function to be used for X_ν, and the dispersion parameter φ_ν determines the relative weight of the approximation error for X_ν.

ML estimation of the latent factors Z_α can be achieved via iterative methods, by fixing all factors Z_{α'} for α' ≠ α except one Z_α and updating in an alternating fashion. For non-negative data and factors, we present multiplicative update rules that have the following form [21]:

$$Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R_{\nu,\alpha}\, \phi_\nu^{-1}\, \Delta_{\alpha,\nu}\big(\hat X_\nu^{-p_\nu} \circ X_\nu\big)}{\sum_\nu R_{\nu,\alpha}\, \phi_\nu^{-1}\, \Delta_{\alpha,\nu}\big(\hat X_\nu^{\,1-p_\nu}\big)} \tag{10}$$

where ∘ is the element-wise product and the division operator is also element-wise. The key quantity in the above update equation is the Δ_{α,ν} function, defined as follows:

$$\Delta_{\alpha,\nu}(A) = \Bigg[\sum_{u_\nu \cap \bar v_\alpha} A(u_\nu) \sum_{\bar u_\nu \cap \bar v_\alpha} \prod_{\alpha' \ne \alpha} Z_{\alpha'}(v_{\alpha'})^{R_{\nu,\alpha'}}\Bigg] \tag{11}$$

For updating Z_α, we need to compute this function twice, for the arguments A = X̂_ν^{-p_ν} ∘ X_ν and A = X̂_ν^{1-p_ν}. Even though this function seems complicated, Δ_{α,ν}(X̂_ν^{1-p_ν}) − Δ_{α,ν}(X̂_ν^{-p_ν} ∘ X_ν) is nothing but the gradient of d_{p_ν}(X_ν || X̂_ν) with respect to Z_α, as detailed in Appendix B. Intuitively, the terms (X̂_ν^{-p_ν} ∘ X_ν) and X̂_ν^{1-p_ν} come from the derivative of the β-divergence, and the term Σ_{ū_ν ∩ v̄_α} Π_{α' ≠ α} Z_{α'}(v_{α'})^{R_{ν,α'}} is just the derivative of X̂_ν with respect to Z_α, where the product is over all the factors but Z_α (i.e., α' ≠ α). The monotonicity of these update rules for N_x = 1 is analyzed in [23].

For MAP inference, the objective given in Eq. (9) will have an additional regularization term that involves Z_α. The resulting algorithm for MAP inference turns out to be similar to the ML scheme. For exponential priors over the factors, the update equation becomes a simple modification:

$$Z_\alpha \leftarrow Z_\alpha \circ \frac{\sum_\nu R_{\nu,\alpha}\, \phi_\nu^{-1}\, \Delta_{\alpha,\nu}\big(\hat X_\nu^{-p_\nu} \circ X_\nu\big)}{A_\alpha + \sum_\nu R_{\nu,\alpha}\, \phi_\nu^{-1}\, \Delta_{\alpha,\nu}\big(\hat X_\nu^{\,1-p_\nu}\big)} \tag{12}$$

For other conjugate priors, the update rules have similar forms [24].

The benefit of the multiplicative updates is that they can be applied to any tensor factorization model and are rather simple to implement. The downside of the multiplicative updates is that they typically require a large number of iterations to converge. Alternative optimization methods based on second-order optimization [25–27] and active-set methods [28,27] can also be applied for specific types of models.
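As an illustration of Eq. (10), the following sketch specializes the multiplicative updates to the single-matrix KL case (N_x = 1, p = 1, φ = 1), which recovers the classical NMF update rules; it is a minimal reference implementation under those assumptions, not the authors' code:

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates of Eq. (10), specialized to a single
    observed matrix with p = 1 (KL divergence) and phi = 1."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    Z1 = rng.random((I, K)) + eps
    Z2 = rng.random((K, J)) + eps
    for _ in range(n_iter):
        Xhat = Z1 @ Z2
        # Numerator: Delta(Xhat^{-1} o X); denominator: Delta(Xhat^{0}) = Delta(1).
        Z1 *= ((X / Xhat) @ Z2.T) / (np.ones_like(X) @ Z2.T)
        Xhat = Z1 @ Z2
        Z2 *= (Z1.T @ (X / Xhat)) / (Z1.T @ np.ones_like(X))
    return Z1, Z2

# Usage on synthetic non-negative data of latent rank 4:
rng = np.random.default_rng(1)
X = rng.random((30, 4)) @ rng.random((4, 25))
Z1, Z2 = kl_nmf(X, K=4)
```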

4.2. Full Bayesian inference via the Gibbs sampler

Maximum likelihood and maximum a-posteriori estimation methods provide useful and practical tools that can be used in various applications. However, in certain cases, they are prone to over-fitting, since they fall short of capturing the uncertainties that arise in the inference process. Instead of aiming to obtain single point estimates of the latent variables, full Bayesian inference aims to characterize the full posterior distribution over the latent variables. Apart from being able to handle the uncertainties and therefore being robust to over-fitting, full Bayesian inference has many advantages over point estimation methods in various tasks such as the model selection problem. In this section, we will develop a Monte Carlo method for characterizing the full posterior over the latent factors.

Monte Carlo methods are a set of numerical techniques to estimate expectations of the form:

$$\langle \varphi(x) \rangle_{\pi(x)} = \int \varphi(x)\, \pi(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N \varphi(x^{(i)}) \tag{13}$$

where ⟨φ(x)⟩_{π(x)} denotes the expectation of the function φ(x) under the distribution π(x), and x^{(i)} are independent samples drawn from the target distribution π(x), which will be the posterior distribution in our case. Under mild conditions on the test function φ, this estimate converges to the true expectation as N goes to infinity. The challenge here is obtaining independent samples from a nonstandard target density π.

Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel T; that is, one generates x^{(i+1)} conditioned on x^{(i)} as follows:

$$x^{(i+1)} \sim T(x \,|\, x^{(i)}) \tag{14}$$

The transition kernel T does not need to be formed explicitly in practice; we only need a procedure that samples a new configuration given the previous one. Perhaps surprisingly, even though


Fig. 3. Graphical representation of the GCTF framework. The nodes represent the random variables and the arrows represent the conditional independence structure.

these subsequent samples are correlated, Eq. (13) remains valid, and the estimated expectations converge to their true values as the number of samples i goes to infinity, provided that T satisfies certain ergodicity conditions. In order to design a transition kernel T such that the desired distribution is the stationary distribution, that is, π(x) = ∫ T(x|x') π(x') dx', various strategies can be applied, the most popular one being the Metropolis–Hastings (MH) algorithm [29]. One particularly convenient and simple MH strategy is the Gibbs sampler, where one samples each block of variables from the so-called full conditional distributions.

Deriving a single Gibbs sampler for all cases of the power parameter p is not straightforward; therefore, one needs to focus on individual cases of the Tweedie family. For the Poisson model (p = 1), we define the following augmented generative model:

$$Z_\alpha(v_\alpha) \sim \mathcal{G}\big(Z_\alpha(v_\alpha);\, A_\alpha(v_\alpha),\, B_\alpha(v_\alpha)\big) \qquad \text{(factor priors)}$$
$$\Lambda_\nu(v) = \prod_\alpha Z_\alpha(v_\alpha)^{R_{\nu,\alpha}} \qquad \text{(intensities)}$$
$$S_\nu(v)\,|\,Z_{1:N_z} \sim \mathcal{PO}\big(S_\nu(v);\, \Lambda_\nu(v)\big) \qquad \text{(sources)}$$
$$X_\nu(u_\nu) = \sum_{\bar u_\nu} S_\nu(v) \qquad \text{(outputs)}$$

In the model defined in Eq. (6), the observed tensors X_ν directly depend on the latent factors Z_α. Here, we form a so-called composite model, where we augment the model in Eq. (6) by defining the intensity tensors Λ_ν and the source tensors S_ν as intermediate layers; in this model, the observed tensors are deterministic functions of the sources. Fig. 3 illustrates this model. Note that the Tweedie models with p = 0 and p = 2 can also be represented as composite models [30]; however, the other cases have not been explored in the literature.

The Gibbs sampler for the GCTF model with Poisson observations can be formed by iteratively drawing samples from the full conditional distributions as follows:

$$S_\nu^{(i+1)} \sim p\big(S_\nu \,|\, Z_{1:N_z}^{(i)},\, X_\nu,\, \Theta\big), \qquad \nu = 1, \ldots, N_x \tag{15}$$
$$Z_\alpha^{(i+1)} \sim p\big(Z_\alpha \,|\, S_{1:N_x}^{(i)},\, Z_{\neg\alpha},\, X_{1:N_x},\, \Theta\big), \qquad \alpha = 1, \ldots, N_z \tag{16}$$

where Z_{¬α} denotes the most recent values of all the factors but Z_α, Θ denotes the prior distribution parameters {A_α, B_α}_{α=1}^{N_z}, and the full conditionals are defined as:

$$p(S_\nu \,|\, \cdot) = \prod_{u_\nu} \mathcal{M}\Big(S_\nu(u_\nu, \bar u_\nu);\, X_\nu(u_\nu),\, \Lambda_\nu(u_\nu, \bar u_\nu)\,/\,\hat X_\nu(u_\nu)\Big)$$
$$p(Z_\alpha \,|\, \cdot) = \prod_{v_\alpha} \mathcal{G}\Big(Z_\alpha(v_\alpha);\, \hat A_\alpha(v_\alpha),\, \hat B_\alpha(v_\alpha)\Big)$$

where

$$\hat A_\alpha(v_\alpha) = A_\alpha(v_\alpha) + \sum_\nu R_{\nu,\alpha} \Big(\sum_{\bar v_\alpha} S_\nu(v)\Big)$$
$$\hat B_\alpha(v_\alpha) = B_\alpha(v_\alpha) + \sum_\nu R_{\nu,\alpha} \Big(\sum_{\bar v_\alpha} \prod_{\alpha' \ne \alpha} Z_{\alpha'}(v_{\alpha'})^{R_{\nu,\alpha'}}\Big)$$

Here, M denotes the multinomial distribution. In words, given a particular instance of observed indices u_ν, the full conditional of S_ν is a multinomial distribution over all the latent indices Ū_ν. Note that the algorithms presented in [10] and [31] are special cases of the presented sampling algorithm.

The marginal likelihood of the observed data under a tensor factorization model, p(X_{1:N_x}), is often necessary for certain problems such as model selection. By using the samples generated by the Gibbs sampler, the marginal likelihood P(X_{1:N_x}) can be estimated by using Chib's method [32]. Moreover, more efficient samplers can be formed by using space alternating data augmentation [33,31].
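The following is a minimal sketch of the sampler of Eqs. (15)–(16) for the single-matrix Poisson model (the KL-NMF special case), assuming integer count data and Gamma(a, b) priors; it is meant to illustrate the multinomial and gamma full conditionals, not to be an efficient implementation:

```python
import numpy as np

def gibbs_poisson_nmf(X, K, n_iter=100, a=1.0, b=1.0, seed=0):
    """Gibbs sampler for X(i,j) = sum_k S(i,k,j), with
    S(i,k,j) ~ PO(Z1(i,k) Z2(k,j)) and Gamma(a, b) priors
    (shape a, rate b). X is assumed to contain integer counts."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    Z1 = rng.gamma(a, 1.0 / b, (I, K))
    Z2 = rng.gamma(a, 1.0 / b, (K, J))
    samples = []
    for _ in range(n_iter):
        # Multinomial full conditional of the sources: allocate each count
        # X(i,j) to the K components with probabilities Z1(i,k)Z2(k,j)/Xhat.
        P = Z1[:, :, None] * Z2[None, :, :]          # (I, K, J) intensities
        P /= P.sum(axis=1, keepdims=True)
        S = np.zeros((I, K, J))
        for i in range(I):
            for j in range(J):
                S[i, :, j] = rng.multinomial(int(X[i, j]), P[i, :, j])
        # Gamma full conditionals: shape += allocated counts,
        # rate += sum of the complementary factor.
        Z1 = rng.gamma(a + S.sum(axis=2), 1.0 / (b + Z2.sum(axis=1)))
        Z2 = rng.gamma(a + S.sum(axis=0), 1.0 / (b + Z1.sum(axis=0)[:, None]))
        samples.append((Z1.copy(), Z2.copy()))
    return samples
```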

5. Audio processing applications

In addition to their rich theoretical structure, non-negative tensor factorizations have plenty of practical applications. Because of their ability to efficiently model the high-level structure of audio signals, tensor factorizations have been used to address grand signal processing challenges such as source separation and robust pattern recognition. They have also provided completely new solutions to problems that are more specific to audio signals, such as audio bandwidth extension, dereverberation, and audio upmixing. We will first describe the audio representations that are used in tensor factorizations, and then describe different audio processing applications where tensor factorizations have been successfully applied.

5.1. Time-frequency representation

An audio signal represents the air pressure as a function of time and contains both positive and negative values. In the simplest scenario the signal is recorded with only one microphone, and the signal is therefore one-dimensional. In order to enable modeling these kinds of audio signals with non-negative tensor factorization, we typically represent them using a spectrogram, which characterizes the intensity of sound as a function of time and frequency. Fig. 4 presents an example audio signal and its spectrogram.

Unlike raw audio signals, which can show huge variations, spectrogram representations of sounds typically possess easily perceivable patterns. For example, in the example figure, one can see that the notes are harmonic (there are large amplitudes at regular frequency intervals), and the spectrum of each note is rather static over time. This allows modeling them efficiently with non-negative tensor factorizations. Sound types with more diverse characteristics can also be modeled with tensor factorizations, provided that an appropriate model that takes into account the structure of the sound is used.

A standard procedure to calculate the spectrogram consists of the following steps (a code sketch follows the list):

1. Segment the signal into fixed-length frames. Typical frame lengths range between 10 ms and 100 ms. Adjacent frames typically overlap by 50% or 25%.

2. Window each frame with a window function such as the Hamming window to smooth discontinuities at frame boundaries.

3. Calculate the spectrum within each frame by applying the discrete Fourier transform.

4. Calculate the magnitude spectrum by taking the absolute value of each entry of the spectrum. The spectrum can also be decimated, e.g., by integrating magnitudes within Mel bands.
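A minimal sketch of Steps 1–4 (the function name and parameter values are ours; a 1024-sample frame is about 23 ms at 44.1 kHz):

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=1024, hop=512):
    """Steps 1-4 above: frame (50% overlap), Hamming window,
    DFT, magnitude."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)   # complex spectra (Step 3)
    return np.abs(spectra).T, spectra       # X(i, j): frequency x frame

# Usage on a synthetic 440 Hz tone, one second at 44.1 kHz:
fs = 44100
t = np.arange(fs) / fs
X, spectra = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(X.shape)  # (513, 85)
```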


Fig. 4. An example audio signal (top panel) consisting of the note sequence C4, G4, C4+G4 played by a piano, and its spectrogram (bottom panel).

The resulting data matrix X(i,j) is indexed by frequency index i and frame index j. In the case of multichannel audio, where multiple microphones are available, the above procedure would be applied separately to each channel, resulting in a 3D data tensor indexed by time, frequency, and channel.

In some applications such as signal classification [34] or polyphonic music transcription [35], it suffices to apply tensor factorizations to the data matrices, without the need to reconstruct time-domain audio signals. However, in many applications such as signal denoising and source separation [36,37], the target output is an audio signal, and there is a need to reconstruct an audio signal based on the factor matrices. Steps 1–3 above are invertible, but Step 4 discards the phase information and is therefore not invertible. There exist methods for reconstructing signals from magnitude spectrograms [38,39], but a simple and efficient approach reuses the phases of the complex spectrogram obtained at Step 3, assigning them to the output data matrices.

5.2. Source separation and signal denoising

Many real-world audio signals are mixtures consisting of multiple sources. For example, in music recordings there are typically multiple instruments playing, and speech signals captured by mobile phones contain interfering sources from the background. Many signal processing algorithms (classification, coding, etc.) assume a single source, and therefore there is a large need for source separation algorithms that extract the signal produced by an individual source from a mixture.

In the context of tensor factorizations, the mixture data matrix X(i,j) is modeled as the sum of sources S(i,j,n) (see Section 4.2) as

$$X(i,j) = \sum_{n=1}^{N} S(i,j,n) \tag{17}$$

where n is the source index and N is the number of sources. For example, in the case of speech denoising, we would have N = 2, where source n = 1 would correspond to speech and n = 2 would correspond to noise [40].

Each source is modeled with a tensor factorization. For example, the NMF factorization for source n is given as

$$S(i,j,n) \approx \hat{S}(i,j,n) = \sum_k Z_1(i,k,n)\, Z_2(k,j,n) \tag{18}$$

By combining the above equations, we can conveniently write the model for the mixture data matrix also as a tensor factorization:

$$X(i,j) \approx \hat{X}(i,j) = \sum_{n=1}^{N} \sum_k Z_1(i,k,n)\, Z_2(k,j,n) \tag{19}$$

Provided that the data matrices of the individual sources are correctly estimated, each source can be reconstructed separately according to Eq. (18).

Depending on how much a priori knowledge about the sources is utilized, a source separation problem can be termed unsupervised, supervised, or semi-supervised (a code sketch follows the list):

- Unsupervised: all the factor matrices are estimated using the mixture signal only [37], and no training data is used to estimate them.

- Supervised: isolated training data of each source exists and has been used to obtain factor matrices Z_1(i,k,n) representing the source spectra at a training stage; these are then kept fixed [41,40].

- Semi-supervised: isolated training data exists for at least one of the sources and is used to estimate the entries of Z_1(i,k,n) for those n for which training data is available. However, isolated training data does not exist for all the sources, and Z_1(i,k,n) for the rest of the sources needs to be estimated from the mixture data matrix [42,43]. Furthermore, in some cases there might be side information available for some sources, such as a collection of symbolic music data related to a certain source. Incorporating such side information into the estimation process via coupled factorization models can further improve the separation accuracy [44].
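A hedged sketch of the supervised setting (function and variable names are ours): per-source dictionaries Z_1(·,·,n), assumed trained beforehand on isolated data, are stacked and kept fixed; only the weights Z_2 are estimated from the mixture with KL multiplicative updates, and the per-source estimates of Eq. (18) are turned into separated magnitude spectra by Wiener-style masking. The time-domain signals would then be resynthesized using the mixture phases, as in Section 5.1.

```python
import numpy as np

def separate(X, dictionaries, n_iter=200, eps=1e-9, seed=0):
    """Supervised separation sketch: `dictionaries` is a list of fixed
    per-source dictionary matrices (frequency x K_n); only the weights
    are estimated from the mixture magnitude spectrogram X."""
    rng = np.random.default_rng(seed)
    Z1 = np.concatenate(dictionaries, axis=1)   # frequency x (sum of K_n)
    Z2 = rng.random((Z1.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):                     # KL updates, Z1 held fixed
        Xhat = Z1 @ Z2 + eps
        Z2 *= (Z1.T @ (X / Xhat)) / (Z1.T @ np.ones_like(X))
    # Eq. (18): per-source magnitude estimates, then Wiener-style masks.
    Xhat = Z1 @ Z2 + eps
    sources, k0 = [], 0
    for D in dictionaries:
        k1 = k0 + D.shape[1]
        Shat = Z1[:, k0:k1] @ Z2[k0:k1, :]
        sources.append(X * Shat / Xhat)         # masked mixture magnitudes
        k0 = k1
    return sources
```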

5.3. Dereverberation

In natural environments, the signal recorded by a microphone is a convolution of a source signal and the impulse response from the source to the microphone. Large amounts of reverberation make the signal less intelligible and difficult to process and analyze with algorithms that assume undistorted source signals. Therefore there is a need for algorithms that either dereverberate the signal, or that are able to process and analyze reverberant signals. Tensor factorizations can be used for both purposes.

The data matrix X(i,j) of a reverberant signal can be modeled as

$$X(i,j) \approx \hat{X}(i,j) = \sum_t U(i,t)\, H(i, j-t)$$
$$= \sum_t \sum_d U(i,t)\, H(i,d)\, \delta(d - j + t)$$
$$= \sum_{t,d} U(i,t)\, H(i,d)\, Z_3(d,j,t) \tag{20}$$

where δ(x) = 1 if x = 0 and δ(x) = 0 otherwise, U(i,t) is the data matrix of the unreverberant, dry signal, and H(i,d) is the convolution filter in the time-frequency domain [45,46]. The tensor Z_3(d,j,t) = δ(d − j + t) is fixed.
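The first line of Eq. (20) amounts to an independent convolution along the frame axis for each frequency bin; a minimal sketch (the function name is ours, and the output is truncated to the length of U):

```python
import numpy as np

def reverberant_model(U, H):
    """Eq. (20): Xhat(i, j) = sum_t U(i, t) H(i, j - t), i.e. a
    per-frequency convolution of the dry spectrogram U with the
    time-frequency filter H."""
    I, T = U.shape
    _, D = H.shape
    Xhat = np.zeros((I, T))
    for d in range(min(D, T)):
        # Frames delayed by d, weighted by the filter tap H(i, d).
        Xhat[:, d:] += U[:, : T - d] * H[:, d : d + 1]
    return Xhat
```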

Note that above we assume that the unreverberant signal matrix U(i,t) and the convolution filter response matrix H(i,d) are
