
Contents lists available at ScienceDirect

Digital Signal Processing

www.elsevier.com/locate/dsp

Multiresolution alignment for multiple unsynchronized audio sequences using sequential Monte Carlo samplers

Dogac Basaran a,*, Ali Taylan Cemgil b, Emin Anarim c

a Signal and Image Processing Department, Telecom-Paristech University, 46 Rue Barrault, Paris, France
b Computer Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey
c Electrical and Electronics Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey

Article info

Article history:
Available online xxxx

Keywords:
Multiple audio alignment
Multiresolution alignment
Audio fingerprint
Bayesian inference
Sequential Monte Carlo samplers
Sequential alignment

Abstract

It is increasingly common that an occasion is recorded by multiple individuals with the proliferation of recording devices such as smartphones. When properly aligned, these recordings may provide several audio and visual perspectives to a scene, which leads to several applications in restoring, remastering and remixing frameworks in various fields. In this work, we propose a multiresolution alignment algorithm for aligning multiple unsynchronized audio sequences using Sequential Monte Carlo samplers. We employ a model based approach and a score function analogous to similarity based methods. The optimum alignments are obtained in a coarse-to-fine structure with multiresolution sampling and a heuristic sequential search method. The proposed method is evaluated with a real-life dataset from Jiku Mobile Video Datasets. The simulation results suggest that our method is competitive with the baseline methods in terms of accuracy with a suitable choice of parameters.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

With the proliferation of recording devices and applications for user generated content sharing, an increasing number of people regularly capture audio and video in special occasions like concerts, conferences and sports competitions. As a consequence, a single event can be simultaneously recorded by multiple individuals (such as using smartphones), creating wide coverage and multiple visual and listening perspectives to a scene. If privacy is not a concern, these user generated multimedia (audio/video) data are typically made accessible through social media sharing sites, however in unorganized form. Temporally aligning and combining such data could lead to a wide range of applications. In [1], audience generated video clips from a concert event are aligned using audio features to obtain a full clip of one song. In [2], over 700 YouTube videos related to a U.S. presidential inauguration are used to restore the U.S. President's speech. An application of automatic video remixing can be found in [3], where the system automatically creates remixes from videos recorded by mobile devices. Multimedia alignment also has applications in the forensics field [4], and lately 360-degree video creation has become very popular [5]. There are also commercially available video synchronization tools such as PluralEyes [6] and DualEyes.

* Corresponding author.
E-mail addresses: dogac.basaran@telecom-paristech.fr (D. Basaran), taylan.cemgil@boun.edu.tr (A.T. Cemgil), anarim@boun.edu.tr (E. Anarim).

1.1. Problem statement

We describe the problem setup with the following example. Imagine you attend a concert of your favorite band. During the concert, several people from the audience record some parts of the concert with their smartphones, from different perspectives and independently of each other. These recordings do not necessarily contain the same audio/visual content (for instance, they may be recordings of entirely different songs), and probably none of the recordings covers the entire concert. The quality of each recording might differ depending on the hardware, compression distortions and the environmental contaminating noise. In this setting, the aim is to temporally synchronize these multimedia recordings relative to each other on a common time line utilizing the audio content.

We formally define the alignment problem as follows. There is a dataset of $K$ user generated, unsynchronized recordings denoted as $x = \{x_k\}_{k=1}^{K}$. We denote the offsets of the sequences, referenced to a universal (generic) time line, as $r = \{r_k\}_{k=1}^{K}$. If a pair of sequences $(x_i, x_j)$ overlap on the universal time line, we call these sequences connected (connected and overlapping are used interchangeably in the text). The sets of connected sequences form the clusters $C = \{C_m\}_{m=1}^{M}$, i.e., if $x_i$ and $x_j$ are connected and $x_j$ and $x_l$ are connected, then they form the cluster $C_m = \{x_i, x_j, x_l\}$. There are $1 \le M \le K$ disjoint (not connected) clusters. The alignment problem is then to determine each connected pair of sequences $(x_i, x_j)$ in the dataset and each disjoint cluster, and further to determine the relative time offsets $r = \{r_{ij} = |r_i - r_j|\}_{i,j}$ for all connected pairs $(x_i, x_j)$.

https://doi.org/10.1016/j.dsp.2017.10.024

1051-2004/© 2017 Elsevier Inc. All rights reserved.

Note that the sequences $x$ are usually discrete time–frequency representations of raw audio signals, such as the short-time Fourier transform (STFT). Each sequence is then represented as $x_k = \{x_{fn}\}_{f=1,n=1}^{F,N_k}$, where $f$ denotes the frequency band index, $n$ denotes the frame index, $F$ denotes the number of frequency bands and $N_k$ denotes the length of the sequence $x_k$ (in frames). Accordingly, the offsets $r$ of the sequences are also expressed in frames instead of seconds.

1.2. Related work

In the state of the art, the audio alignment problem is tackled in a twofold manner: first, the audio signals are represented with features that are robust against various types of noise and distortions; then search algorithms are employed over those representations to find the best offset setting of the sequences, usually utilizing exact hash (fingerprint) matches, similarity or cost functions.

Audio representations in existing methods mostly involve audio fingerprints [7,8,1,9–14] and transform domain (time–frequency) features such as chroma [15,16], spectral flatness [17], spectral energy [18] and audio onsets [19,20].

Although audio fingerprints, as compact signature representations of audio, were originally developed for music identification services, query-by-example based indexing schemes for audio identification are utilized for audio alignment purposes in [1,9,11,12,14,20,21]. An advantage of these methods is that they usually require low computational time. The usual approach is to extract hashes (fingerprints) from the audio sequences and to use the number of exact hash matches between sequences to determine if there is a match. Then, for each matching pair, the relative offset is computed¹ and connected sequences are grouped to form clusters [14]. A similar query-by-example method is proposed in [15], where audio chroma features are used instead of binary fingerprints. A descriptor is defined as a classifier to find matching sequences.

One major drawback of the fingerprinting approach is that in real-life conditions, two matching audio sequences may have very few or no exactly matching fingerprints under some distortions and low SNR conditions. Besides that, a matching decision between two audio sequences is made by thresholding the number of exact fingerprint matches, and this threshold is hard to set for a global solution.

A straightforward way to tackle the alignment problem is to utilize similarity measures such as cross-correlation [9,17,13] or Hamming distance [13]. Methods utilizing such similarity measures usually apply matching for each pair of sequences by using thresholding methods. Then, for the matching sequences, the relative offset with the highest value (most similar) is accepted as the best estimate.

In [13], the dataset is pre-classified into classes such as silence, music, speech and noise before aligning with cross-correlation.

These measures are easy to implement, fast and robust; however, they have two major drawbacks. First of all, similar to fingerprinting based approaches, matching sequences are estimated using ad hoc thresholds that depend highly on the data. Secondly, these methods do not provide a way of measuring the similarity of a sequence against a cluster of pre-aligned sequences, i.e., each pair of sequences has to be visited to find matchings. To overcome this problem, a greedy merging method is applied in [9] to form clusters of aligned sequences so that another sequence can be matched with the cluster. In [18], scoring functions similar to cross-correlation and Hamming distance are proposed that solve how to align a sequence against a cluster and how to determine matching sequences automatically.

1 Most of the audio fingerprints have time-stamp information; hence the relative time difference can be computed.

In this paper, we propose a coarse-to-fine, multiresolution audio alignment scheme that can be applied to an arbitrary number of sequences ($K > 2$) using Sequential Monte Carlo (SMC) samplers. Note that the initial phase of this work is presented in [22], where the multiresolution scheme is considered for aligning pairs of sequences. The main intuition of multiresolution alignment is to align the sequences at a coarser level with low computational effort and then sequentially refine the estimated alignment.

Here, we extend the idea of multiresolution alignment via SMC samplers to a multiple audio alignment setting where we draw samples (alignment estimates) from a multidimensional, multimodal likelihood surface defined in [18] that penalizes the alignment of $K$ sequences. The SMC sampler particularly fits the multiresolution setting because it samples the target surface sequentially through a sequence of intermediate distributions, each distribution being known up to a normalizing constant [23]. The SMC method is based on sequential importance sampling [24–26] and it is flexible in design, i.e., intermediate distributions can have different resolutions.

The main contributions of this work can be listed as follows:

• To our knowledge, this is the first study that proposes a multiresolution alignment method for aligning multiple user generated multimedia content.

• An SMC sampler mechanism is defined for the multiple audio alignment setting that is able to sample from the likelihood of any alignment setting of $K$ sequences.

The proposed method is evaluated with a real-life dataset from the Jiku Mobile Video Dataset [27]. The results are compared to a fingerprinting based baseline method in terms of well-known metrics.

The accompanying software architecture and impact are given in the joint manuscript [28] and the software is available online.²

The rest of the paper is organized as follows: In Section 2, we briefly explain the score function in [18] and the probabilistic model that it is derived from, as well as its interpretation in a multiresolution setup. Then in Section 3, multiresolution alignment using SMC samplers is explained and extended to a multiple audio alignment setting. In Section 4, the experimental setup, implementation issues, the evaluation results and discussion are given. Then in Section 5, conclusions are given and some future directions are discussed.

2. Model based approach

In this section, we explain the probabilistic modeling approach [18,29] and how it can be used in a multiresolution fashion in the multiple audio alignment setting.

2.1. Model

In addition to the definitions given in Section 1.1, we further define a random variable $\lambda = \{\lambda_{f\tau}\}_{f=1,\tau=1}^{F,T}$, where $f$ is the frequency bin index, $\tau$ is the frame index (on the generic time line) and $T$ is the length of the sequence in frames. Here, $\lambda$ denotes an unobserved parameter sequence that is common to the observed sequences. The central theme of the model is as follows: given the correct alignment $r$, the observed feature sequences (fingerprints) $x$ are noisy realizations from a common but unobserved parameter sequence $\lambda$. To clarify the idea, assume a scenario where the unsynchronized recordings are gathered from the audience in a concert event as in [1]. Then $\lambda$ would be the features of the unobserved (hidden) clean recording of the concert. Intuitively, aligned and connected sequences should show some resemblance to each other at the overlapping parts due to the common unobserved source $\lambda$.

Fig. 1. Illustration of low resolution representations.

2 https://github.com/dogacbasaran/Multiple-Audio-Alignment.

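The generative model is given in (1). Here, we choose the prior $p(\lambda_{f\tau})$ as a Bernoulli distribution ($\mathcal{BE}$), and the observation model $p(x^{(k)}_{fn} \mid r_k, \lambda_{f\tau})$ is chosen as a conditional Bernoulli distribution ($\mathcal{P}$) as given in (2). $\delta[\cdot]$ is the Kronecker delta function, which is equal to one only if the expression inside is zero. This model states that a coefficient $x^{(k)}_{fn}$ is generated by $\lambda_{f\tau}$ if and only if the $n$th frame of sequence $x_k$ is aligned to the $\tau$th frame of the hidden sequence $\lambda$. In this work, we assume $\alpha_\lambda = 0.5$ and the alignments $r$ to be uniformly distributed since, in general, no information is available about the offsets of the sequences.³ Following the observation model, we choose binary fingerprints as features for the sequences.

$$\lambda_{f\tau} \sim \mathcal{BE}(\lambda_{f\tau}; \alpha_\lambda)$$

$$x^{(k)}_{fn} \mid r_k, \lambda_{f\tau} \sim \prod_{\tau=1}^{T} \mathcal{P}\big(x^{(k)}_{fn}; \lambda_{f\tau}, w\big)^{\delta[n-(\tau-r_k)]} \tag{1}$$

$$\mathcal{P}\big(x^{(k)}_{fn}; \lambda_{f\tau}, w\big) = \begin{cases} \mathcal{BE}\big(x^{(k)}_{fn}; w\big) & \text{if } \lambda_{f\tau} = 1 \\ \mathcal{BE}\big(x^{(k)}_{fn}; 1-w\big) & \text{if } \lambda_{f\tau} = 0 \end{cases} \tag{2}$$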

Note that the $\lambda_{f\tau}$ are assumed to be a priori independent of each other; hence the joint distribution $p(\lambda)$ can be factorized as the product of the marginal distributions $p(\lambda_{f\tau})$, i.e., $p(\lambda) = \prod_{f=1}^{F} \prod_{\tau=1}^{T} p(\lambda_{f\tau})$.

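Defining $\mathcal{L}(r) = \log p(x \mid r)$, one can obtain the optimum alignment $r^*$ by a maximum likelihood (ML) solution as in (3). Following (4), $\mathcal{L}(r)$ is derived in [18] as in (5), benefiting from conjugate prior distributions. Note that the joint distribution of the model $p(x, r, \lambda)$ can be factorized as the integrand in (4).

$$r^* = \arg\max_{r} \mathcal{L}(r) \tag{3}$$

$$p(x \mid r) \propto p(x, r) = \int d\lambda \; p(x \mid \lambda, r)\, p(\lambda)\, p(r) \tag{4}$$

$$\mathcal{L}(r) = \sum_{\tau=1}^{T} \sum_{f=1}^{F} \log \Big( (1-w)^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\, \delta[x^{(k)}_{fn}-0]} \; w^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\, \delta[x^{(k)}_{fn}-1]} + w^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\, \delta[x^{(k)}_{fn}-0]} \; (1-w)^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\, \delta[x^{(k)}_{fn}-1]} \Big) \tag{5}$$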
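The score in (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `alignment_score`, the default `w = 0.9` and the array layout are assumptions. The sketch also adds the constant prior term $\log 0.5$ (from $\alpha_\lambda = 0.5$) per time–frequency cell, which only shifts (5) by a constant for a fixed $T$ but makes empty time slices contribute exactly zero.

```python
import numpy as np

def alignment_score(seqs, offsets, w=0.9):
    """Score an alignment of binary feature sequences, following the form of (5).

    seqs    : list of K arrays of shape (F, N_k) with entries in {0, 1}
    offsets : list of K integer offsets r_k (in frames) on the generic time line
    w       : precision parameter, 0.5 < w < 1 (default value is an assumption)
    """
    F = seqs[0].shape[0]
    start = min(offsets)
    T = max(r + x.shape[1] for x, r in zip(seqs, offsets)) - start
    ones = np.zeros((F, T))   # per cell: number of aligned coefficients equal to 1
    zeros = np.zeros((F, T))  # per cell: number of aligned coefficients equal to 0
    for x, r in zip(seqs, offsets):
        sl = slice(r - start, r - start + x.shape[1])
        ones[:, sl] += x
        zeros[:, sl] += 1 - x
    # Log-domain version of the two terms inside the logarithm of (5).
    a = zeros * np.log(1 - w) + ones * np.log(w)
    b = zeros * np.log(w) + ones * np.log(1 - w)
    # log(0.5) is the prior weight of each value of lambda (assumed, see lead-in).
    return float(np.sum(np.logaddexp(a, b) + np.log(0.5)))
```

With this convention, two identical sequences score highest when their offsets coincide, and any two non-overlapping placements of the same pair receive the same score.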

3 If there exist audio sequences that are recorded by the same device, then the $r_k$ have some constraints (see Section 4.1).

The $\mathcal{L}(r)$ function can be interpreted as an advanced similarity function with two main advantages over simple similarity metrics such as correlation:

• It is possible to score the quality of the alignment of $K$ sequences with $K > 2$ using the $\mathcal{L}(r)$ function.

• The $\mathcal{L}(r)$ function is able to give a score for alignments where sequences are not matching (not overlapping).

It is also important to mention that for two unconnected sequences, or in general for a sequence $x_k$ not connected to a cluster $C_j$, $\mathcal{L}(\hat{r}_k, r_{C_j})$ gives the same score for all non-overlapping settings. This is due to the fact that the time slices where no sequence exists on the generic time line do not contribute to the score, which can be observed from (5).

For a more detailed description of the model, please see [18].

2.2. Multiresolution modeling

Before describing the multiresolution setting of the model, we first describe the low resolution representations and how they are obtained. We define $\dot{x}^{(k)}_{n} = \{x^{(k)}_{fn}\}_{f=1}^{F}$ as the coefficient vector of sequence $x_k$ at time frame $n$. The low resolution representation is then obtained by reshaping $x_k$, grouping $L$ consecutive coefficient vectors $\dot{x}^{(k)}_{n}$ into a single column vector. In the sequel, the low resolution representation of a sequence $x_k$ is denoted as $\tilde{x}_{k,L} = \{\tilde{\dot{x}}^{(k)}_{\tilde{n}_L}\}_{\tilde{n}_L=1}^{\tilde{N}_{k,L}}$, where $1/L$ represents the resolution, and $\tilde{N}_{k,L}$ and $\tilde{n}_L$ represent the length and the frame index of sequence $\tilde{x}_{k,L}$, respectively. We also denote by $\tilde{r}_{k,L}$ and $\tilde{\tau}_L$ the offset of the sequence $\tilde{x}_{k,L}$ and the generic frame index for resolution $L$, respectively. A toy example is given in Fig. 1, where the grouping of coefficients and the low resolution representations are illustrated for $L = 1$, $L = 2$ and $L = 4$. The sequence index $k$ and the resolution $L$ are dropped for ease of representation. Note that grouping the data in such a way decreases the resolution, i.e., the length of the data decreases, however without losing any information.⁴
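The grouping of $L$ consecutive coefficient vectors can be illustrated with a short NumPy sketch. The function name `to_low_resolution` is hypothetical, and zero padding of the last group when the length is not a multiple of $L$ is an assumption of this sketch; the text does not say how such a remainder is handled.

```python
import numpy as np

def to_low_resolution(x, L):
    """Group L consecutive frame vectors of x (shape (F, N)) into single columns.

    Returns an array of shape (F * L, ceil(N / L)). Each output column stacks
    L consecutive frame vectors. Zero padding of the last group is an
    assumption of this sketch.
    """
    F, N = x.shape
    n_groups = -(-N // L)  # ceiling division
    padded = np.zeros((F, n_groups * L), dtype=x.dtype)
    padded[:, :N] = x
    # (F, groups, L) -> (groups, L, F) -> (groups, L * F) -> (L * F, groups)
    grouped = padded.reshape(F, n_groups, L).transpose(1, 2, 0)
    return grouped.reshape(n_groups, L * F).T
```

For a toy sequence with $F = 2$ and four frames, `to_low_resolution(x, 2)` yields two columns of length 4, each holding two consecutive frame vectors, so no coefficient is discarded.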

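Having described the low resolution representation of sequences, the generative model for resolution $L$ and the corresponding log-likelihood $\mathcal{L}_L(\tilde{r}_L)$ are given in (6) and (7), respectively. The representation of the unobserved variable $\lambda_{f\tau}$ in resolution $L$ is denoted as $\tilde{\lambda}_{f\tilde{\tau}_L,L}$.

$$\tilde{\lambda}_{f\tilde{\tau}_L,L} \sim \mathcal{BE}(\tilde{\lambda}_{f\tilde{\tau}_L,L}; \alpha_\lambda)$$

$$\tilde{x}^{(k)}_{f\tilde{n}_L,L} \mid \tilde{r}_{k,L}, \tilde{\lambda}_{f\tilde{\tau}_L,L} \sim \prod_{\tilde{\tau}_L=1}^{\tilde{T}} \mathcal{P}\big(\tilde{x}^{(k)}_{f\tilde{n}_L,L}; \tilde{\lambda}_{f\tilde{\tau}_L,L}, w\big)^{\delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]} \tag{6}$$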

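$$\mathcal{L}_L(\tilde{r}_L) = \sum_{\tilde{\tau}_L=1}^{\tilde{T}} \sum_{f=1}^{F} \log \Big( (1-w)^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\, \delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-0]} \; w^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\, \delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-1]} + w^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\, \delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-0]} \; (1-w)^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\, \delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-1]} \Big) \tag{7}$$

In the model, the $\delta[\tilde{n}_L - (\tilde{\tau}_L - \tilde{r}_{k,L})]$ expression in the observation model indicates that if the group of $L$ consecutive coefficients of sequence $\tilde{x}_{k,L}$ at time $\tilde{n}_L$, i.e., $\tilde{\dot{x}}^{(k)}_{\tilde{n}_L}$, is aligned to time $\tilde{\tau}_L$, then these $L$ coefficients are generated from the unobserved coefficient $\tilde{\lambda}_{\tilde{\tau}_L,L}$. At the original resolution ($L = 1$), there is no grouping of coefficients; hence $x_k = \tilde{x}_{k,L}$ and $\mathcal{L}(r) = \mathcal{L}_L(\tilde{r}_L)$.

Fig. 2. Illustration of moving a sample from resolution $L$ to $L/2$ with the forward kernel. The dashed and red-solid windows both represent candidate positions for moving the sample to the one-level-higher intermediate distribution. The red-solid window represents the window that the sample is moved to. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4 It is also possible to apply subsampling, or to change the hop size and window size in the feature extraction, to decrease the resolution, however at the expense of information loss.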

3. Multiresolution alignment using SMC samplers in multiple audio alignment setting

In this section, we introduce an SMC sampler based solution for the multiple audio alignment problem that uses the low resolution $\mathcal{L}_L(\tilde{r}_L)$ functions as bridges.

Ideally, for aligning $K$ unsynchronized audio sequences, a search mechanism should be defined on the $K$-dimensional discrete $\mathcal{L}(r)$ surface (log-likelihood) to find the optimum alignments $r_{1:K}$. However, our preliminary experiments with batch methods such as Gibbs sampling have not proven very effective due to the multimodal and very rough structure of the surface [29]. Therefore, we resort to a more advanced approximate inference method, SMC samplers. A previous work for the pairwise case is presented in [22].

3.1. SMC samplers

SMC samplers can be used for sampling from densities that are otherwise difficult to sample from. The method samples from a sequence of intermediate distributions, denoted by $\gamma_i$ [23]. At each step, samples are drawn from the next intermediate distribution, and in the last step the algorithm samples from the target distribution, which is $\mathcal{L}(r)$ in our case. The main idea behind such a mechanism is that if consecutive intermediate distributions are chosen to be close to each other, they act like a bridge and guide the samples through the modes of the target density [22].

A forward Markov transition kernel $K_{i+1}(r^{s}_{i+1}, r^{s}_{i})$ is applied to draw samples, where $r^{s}_{i}$ represents the $s$th sample at the $i$th step. The importance of each sample is decided according to its weight, which is computed as in (8) [23]:

$$w_i(r^{s}_{1:i}) = w_{i-1}(r^{s}_{1:i-1}) \, \frac{B_{i-1}(r^{s}_{i}, r^{s}_{i-1}) \, \gamma_i(r^{s}_{i})}{K_i(r^{s}_{i}, r^{s}_{i-1}) \, \gamma_{i-1}(r^{s}_{i-1})} \tag{8}$$

where $B_{i-1}(r^{s}_{i}, r^{s}_{i-1})$ is a backward Markov kernel. A resampling stage can be applied when the weights become very unevenly distributed, i.e., when only a few samples end up having non-negligible weight. A common criterion to measure this degeneracy is the effective sample size (ESS) [23,30,31].
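The degeneracy check can be sketched as follows. The text names the ESS criterion but gives no formula or threshold, so the usual estimate $1 / \sum_s \bar{w}_s^2$ over normalized weights, the threshold of half the number of samples, and multinomial resampling are all assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np

def effective_sample_size(weights):
    """Usual ESS estimate: 1 / sum of squared normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample_if_degenerate(samples, weights, threshold=0.5, rng=None):
    """Multinomial resampling when ESS drops below threshold * number of samples.

    The threshold value and the resampling scheme are assumptions of this sketch.
    Returns the (possibly resampled) samples and their normalized weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    if effective_sample_size(w) >= threshold * n:
        return list(samples), w
    idx = rng.choice(n, size=n, p=w)
    # After resampling, all samples carry equal weight.
    return [samples[i] for i in idx], np.full(n, 1.0 / n)
```

Equal weights give an ESS equal to the number of samples, while a single dominant weight gives an ESS close to one, which is the situation where resampling is triggered.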

3.2. Forward Markov transition kernel design

We choose the intermediate distributions as the low resolution score functions $\mathcal{L}_L(\tilde{r}_{k,L})$ where $L = 2^{l}$, $l = 11, 10, \ldots, 0$. Note that the length of each $\mathcal{L}_{L/2}(\tilde{r}_{L/2})$ is twice the length of the one-step-lower resolution $\mathcal{L}_L(\tilde{r}_L)$, i.e., $\mathcal{L}_2(\tilde{r}_2)$ is half the length of $\mathcal{L}_1(\tilde{r}_1)$. In addition, the samples are going to be moved from a lower resolution ($2L$) to a higher resolution ($L$) through some smoothed distributions of $\mathcal{L}_L$. Hence, one needs to design a forward kernel such that samples are moved from lower resolution to higher resolution as well as within the same resolution, depending on the step index. In the SMC sampler framework, the design of the forward and backward kernels is flexible so that any proposal mechanism is possible at any step of the algorithm, i.e., $K_i(\cdot)$ does not have to be equal to $K_j(\cdot)$.

To obtain smooth intermediate distributions, a sparse smoothing kernel $Q$ is applied several times to $\mathcal{L}_L(\cdot)$, i.e., $Q^{n}\mathcal{L}_L, Q^{n-1}\mathcal{L}_L, \ldots, Q\mathcal{L}_L, \mathcal{L}_L$. Note that by choosing a sparse smoothing kernel, an explicit computation of all values in $Q^{n}\mathcal{L}_L$ is not needed. We applied an averaging kernel for smoothing purposes, and the backward kernel is chosen to be equal to the forward kernel in the weight update. In Fig. 2, we illustrate how a sample is moved from a low resolution $L$ to the one-step-higher resolution $L/2$ using the forward kernel. We assume that the samples would be around $2r - 1$ due to the doubled resolution. Accordingly, there are 4 proposed values in the higher resolution $L/2$, i.e., $2r-1$, $2r$, $2r+1$, $2r+2$ in this example. At each step, the score values are averaged over two overlapping windows. Then, following the forward kernel, the sample is moved to one of the windows with probabilities proportional to the window averages. The red-solid window represents the window that the sample is moved to. As an example, in the first step (from $\mathcal{L}_L$ to $Q^{2}\mathcal{L}_{L/2}$), the score values are averaged over two overlapping windows of size 3. Following the forward kernel, the sample $r$ in $\mathcal{L}_L$ is moved to the right window. In the last step (from $Q\mathcal{L}_{L/2}$ to $\mathcal{L}_{L/2}$), the kernel proposes two values and one of them is chosen with a probability proportional to its score value. Note that, due to the flexibility of the forward kernel design, one can choose a different window length and number of smoothed distributions in moving from a lower resolution to a higher one. See [28], Section 2.1.2, for implementation details of this method.
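One possible reading of the move illustrated in Fig. 2 is sketched here; the actual implementation is described in [28] and may differ. The sketch assumes that the higher resolution scores have already been mapped to non-negative values `p_high` (for example by exponentiating shifted log-scores), that the four candidates are $2r-1, \ldots, 2r+2$, and that each stage halves the choice by comparing the averages of two overlapping windows. The function name and the handling of boundary candidates are hypothetical.

```python
import numpy as np

def move_to_higher_resolution(r, p_high, rng):
    """Move a sample at offset r (resolution L) to an offset at resolution L/2.

    p_high : non-negative, unnormalized scores at resolution L/2 (assumption:
             log-scores were already converted to non-negative values)
    Stages : 4 candidates -> window of 3 -> window of 2 -> single value,
             each time choosing a window with probability proportional to
             its average score.
    """
    p_high = np.asarray(p_high, dtype=float)
    lo = 2 * r - 1
    # Candidates outside the valid index range are dropped (assumption).
    cand = [c for c in range(lo, lo + 4) if 0 <= c < len(p_high)]
    while len(cand) > 1:
        left, right = cand[:-1], cand[1:]  # two overlapping windows
        m_left, m_right = p_high[left].mean(), p_high[right].mean()
        total = m_left + m_right
        p_left = 0.5 if total == 0 else m_left / total
        cand = left if rng.random() < p_left else right
    return cand[0]
```

Averaging over wide windows first and narrow windows last plays the role of the smoothed intermediate distributions: early stages only commit to a neighbourhood, and the final stage picks between two adjacent offsets.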

3.3. Initial computation of samples

The initial target distribution $\gamma_1$ is equal to $\mathcal{L}_L(\tilde{r}_L)$, where $L$ is the lowest resolution. In the SMC sampler, the first distribution is usually chosen as a distribution that is simple enough to draw samples from [23]; however, at this stage, instead of drawing samples from $\mathcal{L}_L(\tilde{r}_L)$, we choose all possible alignments for $\tilde{r}_L$ as separate samples or particles, and the initial weight of each sample is initialized as its respective probability. As an example, assume two sequences with lengths $N_{L,1}$ and $N_{L,2}$ at resolution $L$ are to be aligned. Similar to correlation, there are $N_{L,1} + N_{L,2} - 1$ lags for overlapping alignments, which is set as the number of samples.

We apply two criteria to determine the initial resolution, which also directly affects the number of samples: the length of each sequence to be aligned has to be larger than one, i.e., $N_{L,k_1} > 1$ and $N_{L,k_2} > 1$, and the number of samples has to be larger than a predefined threshold value $T_s$, i.e., $N_{L,k_1} + N_{L,k_2} - 1 > T_s$.
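The two criteria can be turned into a small search over the levels $L = 2^{l}$. The sketch below picks the coarsest level that satisfies both conditions, which is an assumed reading of how the initial resolution is chosen; the function name is hypothetical, and taking the length at resolution $L$ as $\lceil N / L \rceil$ is also an assumption.

```python
import math

def initial_resolution(n1, n2, ts, max_level=11):
    """Return the largest L = 2**level (coarsest resolution) such that both
    low resolution lengths exceed one and the number of overlapping lags
    exceeds ts. Falls back to L = 1 if no coarser level qualifies.

    n1, n2 : sequence lengths in frames at the original resolution
    ts     : minimum number of samples T_s
    """
    for level in range(max_level, -1, -1):
        L = 2 ** level
        m1, m2 = math.ceil(n1 / L), math.ceil(n2 / L)  # assumed length at resolution L
        if m1 > 1 and m2 > 1 and m1 + m2 - 1 > ts:
            return L
    return 1
```

For two sequences of 1000 frames each and $T_s = 100$, this search stops at $L = 16$, where each sequence has 63 grouped frames and there are 125 overlapping lags.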

Note that we exclude the non-overlapping alignments from the samples in the procedure. Besides the fact that including them would make the search space infinite (see Section 2), it is not necessary to compute the non-overlapping alignment score at the intermediate levels. Computing it at the highest resolution, one can make the overlapping/non-overlapping decision by comparing it with the scores of the sample alignments.

3.4. Extending multiresolution SMC sampler to multiple audio alignment setting

For aligning multiple audio sequences, searching over all possible alignment settings of the sequences on $\mathcal{L}(r)$ is not feasible since the search space is huge. In [18], an ad hoc sequential search method is proposed where we take advantage of the fact that a sequence can be aligned against a group of pre-aligned sequences using the $\mathcal{L}(r)$ function.

More formally, starting with $K' = 2$ sequences and sequentially aligning one sequence at a time, we solve the maximization problem $r^{*}_{1:K'} = \arg\max_{r_{1:K'}} \mathcal{L}(r_{1:K'})$ where $K' < K$ (a smaller number of sequences). At each epoch, the method scans through all not-yet-aligned sequences, groups the sequences that match with each other (forms clusters) and freezes the relative shifts.

More specifically, the sequential algorithm visits each not-yet-aligned sequence one by one and aligns the sequence against the cluster of already aligned sequences. We denote the sequence to be aligned by $x_k$, the cluster of pre-aligned sequences by $C$ and their respective alignments by $r_C$. Then at each step, the method solves the maximization problem in (9).

$$r_k^{*} = \arg\max_{r_k} \mathcal{L}(r_k, r_C) \tag{9}$$

Here the maximization is achieved via a multiresolution SMC sampler which is designed to sample from the score function $\mathcal{L}(r_k, r_C)$. The low resolution functions $\mathcal{L}_L(\tilde{r}_k, \tilde{r}_C)$ act as intermediate distributions that the samples are moved through, as explained in Section 3.2. The best alignment $r_k^{*}$ is then estimated by simply choosing, among the samples at the original resolution, the one that satisfies the maximization (9).

Note that there is no information available for the time positions on the generic time line; therefore we assume the pre-aligned group of sequences is aligned to $\tau = N_k + 1$, where $N_k$ represents the length of the sequence to be aligned, $x_k$. In this way, $r_k = 1$ represents a non-overlapping alignment and $2 \le r_k \le N_k + N_C$ represents all the overlapping alignments, where $N_C$ represents the length of the cluster. This restricts the search space of the alignment of sequence $k$ to the interval $[1, N_k + N_C]$.

In the SMC sampler framework, intermediate distributions are usually annealed so that they become more similar [23]. As explained in Section 3.2, the samples move from lower to higher resolutions through smoothed intermediate distributions, which acts as an annealing process. It is possible to change the number of intermediate distributions between resolutions and the length of the averaging kernel applied to each intermediate distribution due to the flexibility of the forward kernel. Besides moving samples through smoothed intermediate distributions between resolutions, we also adjust the precision parameter for different resolutions as a second method of annealing. When the sequences are to be aligned in lower resolutions, the resulting score function becomes a coarse version of the score function in higher resolution. This is a similar situation to aligning sequences with high noise, where the precision value should be chosen small [18]. Hence the precision value is chosen small for low resolutions and gradually increased as the sampler reaches the high resolutions.

Table 1
Characteristics of the GT_090912 set of Jiku Mobile Video Datasets [27].

                        Camera 1   Camera 2   Camera 3   Camera 4   Total
Number of recordings    19         15         8          8          50
Length                  58.93 m    53.68 m    49.15 m    55.2 m     3 h 37 m

Table 2
Feature extraction parameters.

Parameter                    Value
Sampling rate (Fs)           16000 Hz
Mono/Stereo                  Mono
Window length                0.064 sec
Window type                  Hamming
Hop length                   0.032 sec
Subband division             Logarithmic
Min–Max frequency            100 Hz–8000 Hz
Number of bits per window    32

4. Experimental results

4.1. Experimental setup and dataset

The accuracy of the multiresolution SMC sampler method is evaluated on the GT_090912 event of the Jiku mobile video dataset [27]. The event is recorded by the audience using mobile devices of different quality and under different noise conditions. The characteristics of the dataset are given in Table 1. The binary features for each audio recording are obtained following the fingerprinting scheme given in [7]. Note that this binary feature is inherently immune to energy differences, i.e., volume changes, between different sequences due to the differencing procedure on the STFT over time. A preprocessing step is applied for resampling, silence removal and normalization of the signals before extracting features. The choice of parameters for the extraction procedure is given in Table 2. The ground truth synchronization of the data is given in [32], where the offset of each recording is obtained by manual listening tests and represented on a common time line. Note that the ground truth in [32] is not directly compatible with our evaluation system, so we further convert the data to another format.

Note that, as given in Table 1, each recording device is used to record more than one recording in the GT_090912 event. This situation imposes additional constraints on the alignments. To explain precisely, we replace the sequence index $k$ with a pair of indices $(c, l)$, where $c$ represents the camera index and $l$ represents the recording index for camera $c$. $N_{c,l}$ represents the length of the sequence in the feature domain.

Fig. 3. Analysis of proposed SMC sampler based method w.r.t. precision value $w_{max}$.

Constraint 1, given in (10), basically states that the recordings of the same camera, i.e., $x_{c,i}$, $x_{c,j}$ for $i \neq j$, cannot overlap on the generic time line. Constraint 2, given in (11), states that not only can two recordings $(x_{c,l}, x_{c,l-m})$ of the same camera not overlap, but there also has to be a certain amount of distance between their offsets. This amount is equal to the sum of all sequence lengths in between $x_{c,l}$ and $x_{c,l-m}$. On the other hand, these constraints do not prevent the sequences with the same camera index $c$ from being aligned in the same cluster; they only require a non-overlapping alignment with each other.

$$r_{c,l} > r_{c,l-1} + N_{c,l-1} \tag{10}$$

$$r_{c,l} > r_{c,l-m} + \sum_{j=l-m}^{l-1} N_{c,j} \quad \text{where } m > 0 \tag{11}$$
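The same-camera constraints can be checked with a few lines of Python. The sketch assumes that the recordings of one camera are listed in recording order and reads the required distance as the sum of the lengths of all recordings from the earlier one up to, but not including, the later one, so that the case $m = 1$ reduces to (10). The function name is hypothetical.

```python
def satisfies_camera_constraints(offsets, lengths):
    """Check the same-camera ordering constraints for one camera.

    offsets : offsets r_{c,l} of the camera's recordings, in recording order
    lengths : lengths N_{c,l} of the same recordings, in frames
    Returns True if every later recording starts after every earlier one by
    more than the summed lengths of the recordings in between (assumed
    reading of constraints (10) and (11)).
    """
    for l in range(1, len(offsets)):
        for m in range(1, l + 1):
            required_gap = sum(lengths[l - m:l])
            if not offsets[l] > offsets[l - m] + required_gap:
                return False
    return True
```

A check of this kind can be applied to a candidate offset before scoring it, so that alignments that would make two recordings of the same camera overlap are never proposed.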

As a baseline, the fingerprinting based alignment algorithm in [12] is applied with the implementation in [10] of the fingerprinting scheme of [8]. The query-by-example fingerprinting implementation in [10] is particularly appropriate for the alignment task because the extracted landmarks also have time-stamp information relative to the offset of the file. The alignment algorithm in [12] can be summarized as follows:

1. Generate the reference database of fingerprints using all audio files in the dataset.

2. For each query file, compute the number of exact matching fingerprints between the query file and the database.

3. Find the most similar sequences from the database by thresholding the number of matching fingerprints.

4. Compute the relative offsets of similar sequences by using the time stamps in the fingerprints.
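Steps 2 to 4 can be sketched for a single pair of files as follows. This is not the implementation of [10] or [12]: the representation of fingerprints as `(hash, time)` pairs, the function name, the default threshold and the use of a vote over time differences to pick the offset are all assumptions made for illustration.

```python
from collections import Counter

def estimate_offset(query, reference, min_matches=5):
    """Estimate the offset of a query file relative to a reference file.

    query, reference : lists of (hash, time) pairs, time in frames
    min_matches      : threshold on the number of exact hash matches that
                       agree on one offset (hypothetical default)
    Returns the most voted offset, or None if the pair is not a match.
    """
    ref_times = {}
    for h, t in reference:
        ref_times.setdefault(h, []).append(t)
    votes = Counter()
    for h, t in query:
        for t_ref in ref_times.get(h, []):
            votes[t_ref - t] += 1  # each exact match votes for a time difference
    if not votes:
        return None
    offset, count = votes.most_common(1)[0]
    return offset if count >= min_matches else None
```

Because each pair is treated independently, a wrong decision for one pair does not affect the others, but the result depends directly on the chosen threshold.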

Note that the alignment of two matching sequences is computed twice in this scheme, and for some pairs the two results are not equal. In this case, we choose the alignment with the higher landmark hit count. In addition, we do not include the sequences with the same camera index $c$ in the similar sequence list.

It is also important to mention that the choice of the threshold for eliminating less similar sequences depends on the target dataset. Here, we tuned the threshold to obtain the highest accuracy.

4.2. Evaluation criteria

We utilize accuracy, precision, recall and F-measure as evaluation metrics. While we use the formal definitions of those metrics, we exclude the results for the pairs of sequences that are recorded by the same camera because they are always correctly aligned (see Section 4.1). We then define the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) measures as follows:

• TP – True Positive: Two sequences overlap in the ground truth and are estimated as overlapping with a true relative offset.

• TN – True Negative: Two sequences do not overlap in the ground truth and are estimated as not overlapping.

• FP – False Positive: Two sequences do not overlap in the ground truth but are estimated as overlapping.

• FN – False Negative: Two sequences overlap in the ground truth but are estimated as not overlapping, or two sequences overlap in the ground truth and are also estimated as overlapping but their relative offset is not true.

4.3. Evaluation results

In this section, we give the evaluation results for the proposed SMC sampler based multiresolution multiple audio alignment method and compare the results with the baseline fingerprinting based alignment method in [12].

As explained in Section 3.4, annealing is applied by gradually increasing the precision parameter $w$ through the intermediate distributions. Here, we start with $w = 0.51$ at the lowest resolution and increase the precision up to $w_{max}$. To analyze the effect of $w_{max}$ on the performance of the proposed method, we evaluate the method for various values of $w_{max}$, and the results are given in Fig. 3.

An important parameter of the SMC sampler is the number of samples in the procedure, which is actually based on the initial resolution. One of the criteria to determine the initial resolution is the minimum number of samples $T_s$. Here we analyze the effect of the minimum number of samples $T_s$ by evaluating the proposed method for various values of $T_s$, and the results are given in Fig. 4.

The best results for the SMC based multiresolution method are obtained with $w_{max} = 0.64$ and $T_s = 100$. To determine the best threshold parameter for the fingerprinting based alignment method [12], we applied a grid search, for which the results are given in Fig. 5. The best results for both the proposed and the baseline methods are given in Table 3 and Table 4. The computation time for the fingerprinting based experiments is around 6 minutes on average (for each threshold). On the other hand, the computation time for the SMC sampler based system is around 3.5 hours on average.

4.4. Discussion

The results in Table 4 show that with a proper configuration of parameters, the SMC sampler based multiple audio alignment system is able to outperform the baseline fingerprinting system [12] in accuracy.

Fig. 4. Analysis of proposed SMC sampler based method w.r.t. minimum number of samples $T_s$.

Fig. 5. Analysis of fingerprinting based alignment method [12] w.r.t. threshold.

Table 3
Evaluation results: FP, FN, TP and TN for the proposed SMC sampler based system and the baseline fingerprinting based system [12].

                                    FP    FN    TP    TN
SMC sampler based method            0     47    40    806
Fingerprinting based method [12]    0     63    24    806

Table 4
Evaluation results: Precision, Recall, F-measure and Accuracy for the proposed SMC sampler based system and the baseline fingerprinting based system [12].

                                    Precision   Recall   F-measure   Accuracy
SMC sampler based method            1.0         0.4598   0.6299      0.947
Fingerprinting based method [12]    1.0         0.2758   0.4324      0.929

For both systems, the precision is computed as 1.0, since there are no FP errors for either method. A score of 1.0 can be interpreted as follows: if the system estimates an alignment between two sequences, which suggests that the sequences match with each other with an estimated amount of overlap, then this estimate is consistent with the ground truth. Note that, for a pair of sequences that does not match in the ground truth, an alignment estimate that is not an FP is a TN. In other words, all pairs of sequences that do not match or do not overlap in the ground truth are successfully estimated as not overlapping by both systems.

In practice, the SMC based multiple audio alignment system is more sensitive to FP type errors than the baseline system. Since the optimization is done sequentially by aligning one sequence against a cluster at a time, any false positive alignment could cause an error to propagate through the rest of the sequences. Constraints on alignments (see section 4.1) cause the same effect: if a sequence from camera c is wrongly aligned during the sequential procedure, the rest of the sequences from the same camera are aligned accordingly, and hence the error again propagates. Such error propagation does not happen for the baseline system, since each pair of sequences is searched independently.

An immediate solution to prevent or minimize FP errors in the proposed system is to choose the precision parameter w of the model high enough so that only similar sequences are estimated as matching. Similarly, choosing a threshold high enough leads to fewer FP errors for the baseline system, i.e., the number of exact fingerprint matches between sequences has to be high (the sequences have to be more similar) for them to be matched.

On the other hand, the recall of the SMC based alignment system is higher than that of the baseline method. A higher recall value suggests that the SMC based alignment estimates are more accurate than those of the baseline method for pairs of sequences that are known to match with a certain amount of overlap in the ground truth. This can also be observed from the FN results in Table 3, where the number of FN errors for the baseline is higher than for the SMC based alignment system. Note that the sum of FN and TP gives the total number of pairs of sequences that match in the ground truth. Therefore, if an alignment estimate for such sequences is not a TP, then it has to be an FN. A higher value of FN directly results in a lower recall value.
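The values in Table 4 can be recomputed from the counts in Table 3. The following is a minimal sketch, assuming the standard definitions of precision, recall, F-measure (harmonic mean of precision and recall) and accuracy; the class and function names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Pairwise alignment outcomes, in the column order of Table 3."""
    fp: int
    fn: int
    tp: int
    tn: int


def precision(c: Counts) -> float:
    # Share of estimated matches that are matches in the ground truth.
    # Assumes at least one estimated match (tp + fp > 0).
    return c.tp / (c.tp + c.fp)


def recall(c: Counts) -> float:
    # Share of ground-truth matches that were found; FN + TP is the
    # total number of matching pairs in the ground truth.
    return c.tp / (c.tp + c.fn)


def f_measure(c: Counts) -> float:
    # Harmonic mean of precision and recall (assumed definition).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


def accuracy(c: Counts) -> float:
    # Share of all sequence pairs that were classified correctly.
    return (c.tp + c.tn) / (c.fp + c.fn + c.tp + c.tn)


# Counts copied from Table 3.
smc = Counts(fp=0, fn=47, tp=40, tn=806)
baseline = Counts(fp=0, fn=63, tp=24, tn=806)

for name, c in (("SMC sampler", smc), ("Fingerprinting [12]", baseline)):
    print(f"{name}: P={precision(c):.4f} R={recall(c):.4f} "
          f"F={f_measure(c):.4f} Acc={accuracy(c):.4f}")
```

Under these assumed definitions the recomputed values agree with Table 4 up to the last printed digit; the baseline recall 24/87 rounds to 0.2759, whereas the table lists 0.2758, which suggests truncation rather than rounding.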

We observe from Fig. 3 that for high values of w, FP errors decrease but FN errors increase, because only sequences with high similarity are estimated as matching. This results in a high precision value but a low recall value. On the other hand, for low values of w, sequences with less similarity can be estimated as matching; hence FN errors decrease but FP errors increase. There is therefore a trade-off between the precision and recall measures in choosing the precision parameter w of the model. The F-measure decreases quickly if either precision or recall dominates the other. Hence the F-measure can be utilized for tuning the precision parameter of the SMC based alignment system. A similar behavior can be observed for the baseline system in Fig. 5: as the threshold increases, the recall decreases, since the number of FN errors increases.
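Tuning by F-measure can be sketched as a selection over candidate parameter values. The example below is only an illustration: the function names are hypothetical, and the (FP, FN, TP) counts are made-up numbers chosen to mimic the trade-off described above. They are not results from the paper.

```python
def f_measure(fp: int, fn: int, tp: int) -> float:
    """F-measure from error counts; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_setting(counts_by_param: dict[float, tuple[int, int, int]]) -> float:
    """Return the parameter value whose (FP, FN, TP) counts maximize F-measure."""
    return max(counts_by_param, key=lambda p: f_measure(*counts_by_param[p]))


# Hypothetical (FP, FN, TP) counts per precision parameter value.
# These numbers are invented for illustration and are NOT from the paper.
illustrative = {
    0.55: (60, 20, 67),  # low w: few missed matches, many false matches
    0.65: (2, 30, 57),   # intermediate w: both error types moderate
    0.75: (0, 70, 17),   # high w: no false matches, many missed matches
}

print(best_setting(illustrative))
```

With these invented counts the intermediate value is selected, because the F-measure penalizes both the low-precision and the low-recall extremes. The same selection could be applied to the threshold of the baseline system.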

The evaluation results in Fig. 4 reveal that the minimum number of samples T_s does not have a critical effect on the performance. This might be due to the fact that the other criterion for determining the resolution, i.e., that the length of each sequence should be higher than 1 at low resolution, is more dominant during the sequential procedure. The two longest sequences and the shortest sequence in the Jiku dataset have lengths N_1 = 33655, N_2 = 18155 and N_3 = 499 frames (17 min 57 s, 9 min 41 s and 16 s, respectively). Assuming T_s = 100, for aligning the longest and the shortest sequences, the lowest resolution is chosen as L = 7, so the lengths of the sequences become N_{7,1} = 262 and N_{7,3} = 3 frames and the number of samples is equal to 262 + 3 - 1 = 264. Thus, the lengths of the sequences determine the resolution level rather than T_s in this case. However, for the two longest sequences, the lowest resolution is chosen as L = 9, so the lengths of the sequences become N_{9,1} = 66 and N_{9,2} = 35 frames and the number of samples is equal to 66 + 35 - 1 = 100. For these two sequences, the sequence lengths would allow going up to L = 10; however, the resolution is determined by T_s in this case.
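The sample counts quoted above can be checked with a short calculation. This sketch assumes that the number of samples for a pair of sequences equals the number of relative offsets with at least one overlapping frame, which is consistent with the two sums in the text; the function name is illustrative, and the low-resolution lengths are copied from the text rather than recomputed, since the downsampling rule is not restated here.

```python
def num_alignment_samples(len_a: int, len_b: int) -> int:
    """Number of relative offsets at which two sequences overlap by >= 1 frame."""
    return len_a + len_b - 1


T_S = 100  # minimum number of samples used in the example in the text

# Low-resolution sequence lengths quoted in the text (not recomputed).
pairs = {
    "longest vs. shortest, L = 7": (262, 3),
    "two longest, L = 9": (66, 35),
}

for label, (n_a, n_b) in pairs.items():
    samples = num_alignment_samples(n_a, n_b)
    # Report the sample count and whether it still satisfies T_S.
    print(f"{label}: {samples} samples, meets T_s: {samples >= T_S}")
```

For the two longest sequences the count is exactly T_s, so any coarser level would fall below the minimum; for the longest and shortest pair the count is well above T_s, so the limit comes from the sequence length instead.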

5. Conclusion and future directions

5.1. Conclusion

In this work, an SMC sampler based multiresolution multiple audio sequence alignment scheme is proposed. In the design of the system, we benefit from the flexible options of the SMC sampler for the forward kernel and from the discrete score function that can be computed in different resolutions. The results show that the SMC sampler based system can outperform a fingerprint based baseline system with a proper choice of parameters, i.e., precision, number of samples, low resolution levels and length of the smoothing kernel.

One of the main challenges of the alignment problem is the variation of noise among the recordings that match with each other on the timeline. State-of-the-art fingerprinting based alignment methods count the number of exactly matching fingerprints between sequences as a measure of similarity. Thus, under low SNR conditions this approach tends to fail. On the other hand, with the proposed method it is possible to tune the model using the precision parameter w to determine the similarity between two sequences.

The challenge for alignment methods that are based on similarity functions such as correlation is that it is not clear how to use these measures for multiple audio alignment, since they are defined for measuring the similarity of a pair of sequences. The proposed method solves this problem by sequentially aligning a sequence to a cluster using an SMC based multiresolution sampling method.

From a practical point of view, some applications of audio alignment, such as audio restoration, camera synchronization and source separation, require high precision rather than high accuracy. For example, in an audio restoration application, FP errors might cause severe degradation in the restored audio, and in a camera synchronization application, FP errors might join clusters that are unrelated. For such applications, one can maximize precision by tuning the parameters accordingly, even though the overall system accuracy decreases. The proposed system can be adjusted to such situations simply by increasing the precision parameter w of the model.

5.2. Future directions

A drawback of the proposed system is the computation time. Even though the SMC sampler is a fast and efficient mechanism, the algorithm still suffers from the sequential alignment of the sequences. An ideal optimization method for the model based approach would be a sampling mechanism that samples directly from the multiresolution distribution of r, which is yet to be researched. Such a method would also overcome the error propagation problem of the sequential process.

One straightforward solution to further reduce the computation time is to use parallel processing. Note that each sample in the SMC sampler moves independently of the others; thus the system is completely parallelizable. In addition, parallelization strategies such as dividing the sequences into groups and running the alignment procedure on the groups in parallel could be used.

Another issue that affects the computation time is the initial ordering of the sequences. With no prior information about the alignments of the sequences, ordering according to sequence length is feasible. However, for some datasets, certain attributes of the audio data could be exploited, such as tempo information for music signals. This information could further be used for pre-classification or pre-grouping of sequences as a pre-processing step to the proposed system, similar to [13].

By incorporating the precision parameter w into the Bayesian setting, it is possible to jointly estimate w with the alignments r_{1:K}. This way, the optimum parameter setting would be achieved automatically with respect to the input data, which can further improve the performance of the proposed system.

References

[1] L. Kennedy, M. Naaman, Less talk, more rock: automated organization of community-contributed collections of concert videos, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 311–320.

[2] C.V. Cotton, D.P. Ellis, Audio fingerprinting to identify multiple videos of an event, in: 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), IEEE, 2010, pp. 2386–2389.

[3] J. Ojala, S. Mate, I.D.D. Curcio, A. Lehtiniemi, K. Väänänen-Vainio-Mattila, Automated creation of mobile video remixes: user trial in three event contexts, in: Proceedings of the 13th International Conference on Mobile and Ubiquitous Multimedia, MUM '14, ACM, New York, NY, USA, 2014, pp. 170–179.

[4] A. Alexander, O. Forth, D. Tunstall, Music and noise fingerprinting and reference cancellation applied to forensic audio enhancement, in: Audio Engineering Society Conference: 46th International Conference: Audio Forensics, 2012.

[5] Y. Mizushina, W. Fujiwara, T. Sudou, C.L. Fernando, K. Minamizawa, S. Tachi, Interactive instant replay: sharing sports experience using 360-degrees spherical images and haptic sensation based on the coupled body motion,
