

Contents lists available at ScienceDirect

Digital Signal Processing

www.elsevier.com/locate/dsp

Multiresolution alignment for multiple unsynchronized audio sequences using sequential Monte Carlo samplers

Dogac Basaran^a,∗, Ali Taylan Cemgil^b, Emin Anarim^c

a Signal and Image Processing Department, Telecom-Paristech University, 46 Rue Barrault, Paris, France

b Computer Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey

c Electrical and Electronics Engineering Department, Bogazici University, 34342 Bebek, Istanbul, Turkey

Article info

Article history:

Available online xxxx

Keywords:

Multiple audio alignment
Multiresolution alignment
Audio fingerprint
Bayesian inference
Sequential Monte Carlo samplers
Sequential alignment

Abstract

It is increasingly common for an occasion to be recorded by multiple individuals, given the proliferation of recording devices such as smartphones. When properly aligned, these recordings may provide several audio and visual perspectives on a scene, which leads to several applications in restoring, remastering and remixing frameworks in various fields. In this work, we propose a multiresolution alignment algorithm for aligning multiple unsynchronized audio sequences using Sequential Monte Carlo samplers. We employ a model based approach and a score function analogous to similarity based methods. The optimum alignments are obtained in a coarse-to-fine structure with multiresolution sampling and a heuristic sequential search method. The proposed method is evaluated with a real-life dataset from Jiku Mobile Video Datasets. The simulation results suggest that our method is competitive with the baseline methods in terms of accuracy with a suitable choice of parameters.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

With the proliferation of recording devices and applications for user generated content sharing, an increasing number of people regularly capture audio and video at special occasions like concerts, conferences and sports competitions. As a consequence, a single event can be simultaneously recorded by multiple individuals (for example using smartphones), creating wide coverage and multiple visual and listening perspectives on a scene. If privacy is not a concern, these user generated multimedia (audio/video) data are typically made accessible through social media sharing sites, however in unorganized form. Temporally aligning and combining such data could lead to a wide range of applications. In [1], audience generated video clips from a concert event are aligned using audio features to obtain a full clip of one song. In [2], over 700 YouTube videos related to a U.S. presidential inauguration are used to restore the U.S. President's speech. An application of automatic video remixing can be found in [3], where the system automatically creates remixes from videos recorded by mobile devices. Multimedia alignment also has applications in the forensics field [4], and lately 360-degree video creation has become very popular [5]. There are also commercially available video synchronization tools such as PluralEyes [6] and DualEyes.

* Corresponding author.

E-mail addresses: [email protected] (D. Basaran), [email protected] (A.T. Cemgil), [email protected] (E. Anarim).

1.1. Problem statement

We describe the problem setup with the following example.

Imagine you attend a concert of your favorite band. During the concert, several people in the audience record some parts of the concert with their smartphones, from different perspectives and independently of each other. These recordings do not necessarily contain the same audio/visual content, i.e., they may be recordings of entirely different songs, and probably none of the recordings covers the entire concert. The quality of each recording might differ depending on the hardware and compression distortions of the recording device as well as the contaminating environmental noise. In this setting, the aim is to temporally synchronize these multimedia recordings relative to each other on a common time line utilizing the audio content.

We formally define the alignment problem as follows. There is a dataset of $K$ user generated, unsynchronized recordings denoted as $x = \{x_k\}_{k=1}^{K}$. We denote the offsets of the sequences, referenced to a universal (generic) time line, as $r = \{r_k\}_{k=1}^{K}$. If a pair of sequences $(x_i, x_j)$ overlap on the universal time line, we call these sequences connected (connected and overlapping are used interchangeably in the text). Sets of connected sequences form the clusters $C = \{C_m\}_{m=1}^{M}$, i.e., if $x_i$ and $x_j$ are connected and $x_j$ and $x_l$ are connected, then they form the cluster $C_m = \{x_i, x_j, x_l\}$. There are $1 \le M \le K$ disjoint (not connected) clusters. The alignment problem is then to determine each connected pair of sequences $(x_i, x_j)$ in the dataset and each disjoint cluster, and further to determine the relative time offsets $r = \{r_{ij} = |r_i - r_j|\}_{i,j}$ for all connected pairs $(x_i, x_j)$.

https://doi.org/10.1016/j.dsp.2017.10.024

1051-2004/© 2017 Elsevier Inc. All rights reserved.

Note that the sequences $x$ are usually discrete time–frequency representations of raw audio signals such as the short-time Fourier transform (STFT). Then each sequence is represented as $x_k = \{x_{fn}\}_{f=1,n=1}^{F,N_k}$ where $f$ denotes the frequency band index, $n$ denotes the frame index, $F$ denotes the number of frequency bands and $N_k$ denotes the length of the sequence $x_k$ (in frames). In this manner, the offsets of the sequences $r$ are also expressed in frames instead of in seconds.

1.2. Related work

In the state of the art, the audio alignment problem is tackled in a twofold manner: first, the audio signals are represented with features that are robust against various types of noise and distortion; then, search algorithms are employed over those representations to find the best offset setting of the sequences, usually utilizing exact hash (fingerprint) matches, similarity functions or cost functions.

Audio representations in existing methods mostly involve audio fingerprints [7,8,1,9–14], transform domain (time–frequency) features such as chroma [15,16], spectral flatness [17], spectral energy [18] and audio onsets [19,20].

Although audio fingerprints, as compact signature representations of audio, were originally developed for music identification services, query-by-example based indexing schemes for audio identification are utilized for audio alignment purposes in [1,9,11,12,14,20,21]. These methods usually require low computational time, which is an advantage. The usual approach is to extract hashes (fingerprints) from the audio sequences and to use the number of exact hash matches between sequences to determine whether there is a match. Then, for each matching pair, the relative offset is computed¹ and connected sequences are grouped to form clusters [14]. A similar query-by-example method is proposed in [15] where audio chroma features are used instead of binary fingerprints. A descriptor is defined as a classifier to find matching sequences.

One major drawback of the fingerprinting approach is that, in real-life conditions, two matching audio sequences may have very few or no exactly matching fingerprints under some distortions and low SNR conditions. Besides that, a matching decision between two audio sequences is made by thresholding the number of exact fingerprint matches, and this threshold is hard to set for a global solution.

A straightforward way to tackle the alignment problem is to utilize similarity measures such as cross-correlation [9,17,13] or Hamming distance [13]. Methods utilizing such similarity measures usually apply matching to each pair of sequences by using thresholding methods. Then, for the matching sequences, the relative offset with the highest similarity value is accepted as the best estimate.
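The pairwise similarity approach described above can be sketched as follows. This is a minimal illustration and not the implementation of any of the cited methods: the function name `best_lag_by_cross_correlation` and the mean removal are assumptions made here, and the sequences are taken to be one-dimensional feature tracks.

```python
import numpy as np

def best_lag_by_cross_correlation(a, b):
    """Return (lag, peak) where lag is the frame index in `a` at which `b` starts.

    Hypothetical helper for illustration. Both inputs are 1-D feature
    sequences; their means are removed so that a constant bias does not
    dominate the correlation.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    # Full cross-correlation: entry k corresponds to sum_n a[n + k] * b[n].
    corr = np.correlate(a, b, mode="full")
    lags = np.arange(-(len(b) - 1), len(a))
    best = int(np.argmax(corr))
    return int(lags[best]), float(corr[best])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    long_seq = rng.normal(size=200)
    excerpt = long_seq[50:120]
    print(best_lag_by_cross_correlation(long_seq, excerpt))
```

A matching decision would still require thresholding the returned peak value, which is exactly the data-dependent step criticized in the text.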

In [13], the dataset is pre-classified into classes such as silence, music, speech and noise before aligning with cross-correlation.

These measures are easy to implement, fast and robust; however, they have two major drawbacks. First of all, similar to fingerprinting based approaches, matching sequences are estimated using ad-hoc thresholds that depend highly on the data. Secondly, these methods do not provide a way to measure the similarity of a sequence against a cluster of pre-aligned sequences, i.e., each pair of sequences has to be visited to find matchings. To overcome this problem, a greedy merging method is applied in [9] to form clusters of aligned sequences so that another sequence can be matched with the cluster. In [18], scoring functions similar to cross-correlation and Hamming distance are proposed that solve how to align a sequence against a cluster and how to determine matching sequences automatically.

1 Most audio fingerprints carry time-stamp information, hence the relative time difference can be computed.

In this paper, we propose a coarse-to-fine, multiresolution audio alignment scheme that can be applied to an arbitrary number of sequences (K > 2) using Sequential Monte Carlo (SMC) samplers. Note that the initial phase of this work is presented in [22], where the multiresolution scheme is considered for aligning pairs of sequences. The main intuition of multiresolution alignment is to align the sequences at a coarser level with low computational effort and then to refine the estimated alignment sequentially.

Here, we extend the idea of multiresolution alignment via SMC samplers to a multiple audio alignment setting, where we draw samples (alignment estimates) from a multidimensional, multimodal likelihood surface defined in [18] that penalizes the alignment of K sequences. The SMC sampler is particularly suited to the multiresolution setting because it samples the target surface sequentially through a sequence of intermediate distributions, each distribution being known up to a normalizing constant [23]. The SMC method is based on sequential importance sampling [24–26] and it is flexible in design, i.e., intermediate distributions can have different resolutions.

The main contributions of this work can be listed as follows:

• To our knowledge, this is the first study that proposes a multiresolution alignment method for aligning multiple user generated multimedia content.

• A SMC sampler mechanism is defined for the multiple audio alignment setting that is able to sample from the likelihood of any alignment setting of K sequences.

The proposed method is evaluated with a real-life dataset from the Jiku Mobile Video Dataset [27]. The results are compared to a fingerprinting based baseline method in terms of well-known metrics.

The accompanying software architecture and impact are given in the joint manuscript [28] and the software is available online.²

The rest of the paper is organized as follows. In Section 2, we briefly explain the score function in [18] and the probabilistic model that it is derived from, as well as its interpretation in a multiresolution setup. Then, in Section 3, the multiresolution alignment using SMC samplers is explained and extended to a multiple audio alignment setting. In Section 4, the experimental setup, implementation issues, the evaluation results and a discussion are given. Finally, in Section 5, conclusions are given and some future directions are discussed.

2. Model based approach

In this section, we explain the probabilistic modeling approach [18,29] and how it can be used in a multiresolution fashion in the multiple audio alignment setting.

2.1. Model

In addition to the definitions given in Section 1.1, we further define a random variable $\lambda = \{\lambda_{f\tau}\}_{f=1,\tau=1}^{F,T}$ where $f$ is the frequency bin index, $\tau$ is the frame index (on the generic time line) and $T$ is the length of the sequence in frames. Here, $\lambda$ denotes an unobserved parameter sequence that is common to the observed sequences. The central theme of the model is as follows: given the correct alignment $r$, the observed feature sequences (fingerprints) $x$ are noisy realizations of a common but unobserved parameter sequence $\lambda$. To clarify the idea, assume a scenario where the unsynchronized recordings are gathered from the audience at a concert event as in [1]. Then $\lambda$ would be the features of the unobserved (hidden) clean recording of the concert. Intuitively, aligned and connected sequences should show some resemblance to each other in the overlapping parts due to the common unobserved source $\lambda$.

Fig. 1. Illustration of low resolution representations.

2 https://github.com/dogacbasaran/Multiple-Audio-Alignment.

The generative model is given in (1). Here, we choose the prior $p(\lambda_{f\tau})$ as a Bernoulli distribution ($\mathcal{BE}$), and the observation model $p(x^{(k)}_{fn} \mid r_k, \lambda_{f\tau})$ is chosen as a conditional Bernoulli distribution ($\mathcal{P}$) as given in (2). $\delta[\cdot]$ is the Kronecker delta function, which is equal to one only if the expression inside is zero. This model states that a coefficient $x^{(k)}_{fn}$ is generated by $\lambda_{f\tau}$ if and only if the $n$th frame of sequence $x_k$ is aligned to the $\tau$th frame of the hidden sequence $\lambda$. In this work, we assume $\alpha_\lambda = 0.5$ and the alignments $r$ to be uniformly distributed since, in general, no information is available about the offsets of the sequences.³ Following the observation model, we choose binary fingerprints as features for the sequences.

$$\lambda_{f\tau} \sim \mathcal{BE}(\lambda_{f\tau}; \alpha_\lambda)$$

$$x^{(k)}_{fn} \mid r_k, \lambda_{f\tau} \sim \prod_{\tau=1}^{T} \mathcal{P}\big(x^{(k)}_{fn}; \lambda_{f\tau}, w\big)^{\delta[n-(\tau-r_k)]} \tag{1}$$

$$\mathcal{P}\big(x^{(k)}_{fn}; \lambda_{f\tau}, w\big) = \begin{cases} \mathcal{BE}\big(x^{(k)}_{fn}; w\big) & \text{if } \lambda_{f\tau} = 1 \\ \mathcal{BE}\big(x^{(k)}_{fn}; 1-w\big) & \text{if } \lambda_{f\tau} = 0 \end{cases} \tag{2}$$

Note that the $\lambda_{f\tau}$ are assumed to be a priori independent of each other, hence the joint distribution $p(\lambda)$ can be factorized as the product of the marginal distributions $p(\lambda_{f\tau})$, i.e., $p(\lambda) = \prod_{f=1}^{F} \prod_{\tau=1}^{T} p(\lambda_{f\tau})$.

Defining $\mathcal{L}(r) = \log p(x \mid r)$, one can obtain the optimum alignment $r^*$ by a maximum likelihood (ML) solution as in (3). $\mathcal{L}(r)$ is derived following (4) in [18] as in (5), benefiting from conjugate prior distributions. Note that the joint distribution of the model, $p(x, r, \lambda)$, can be factorized as the integrand in (4).

$$r^* = \arg\max_{r} \mathcal{L}(r) \tag{3}$$

$$p(x \mid r) \propto p(x, r) = \int d\lambda \; p(x \mid \lambda, r)\, p(\lambda)\, p(r) \tag{4}$$

$$\mathcal{L}(r) = \sum_{\tau=1}^{T} \sum_{f=1}^{F} \log \Big( (1-w)^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\,\delta[x^{(k)}_{fn}-0]} \; w^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\,\delta[x^{(k)}_{fn}-1]} + w^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\,\delta[x^{(k)}_{fn}-0]} \; (1-w)^{\sum_{k=1}^{K} \delta[n-(\tau-r_k)]\,\delta[x^{(k)}_{fn}-1]} \Big) \tag{5}$$
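To make the structure of the score in (5) concrete, the following sketch evaluates it for binary fingerprint matrices. It is an illustration under stated assumptions, not the authors' implementation: the function name `log_likelihood_score`, the 0-based non-negative offsets (frame n of sequence k is placed at generic frame n + r_k), and the explicit prior weight log(0.5) per cell are choices made here. The prior weight is kept so that cells of the generic time line that are covered by no sequence contribute exactly zero.

```python
import numpy as np

def log_likelihood_score(seqs, offsets, w=0.6):
    """Evaluate a score of the form (5) for binary (F, N_k) arrays.

    Hypothetical helper: `offsets` are non-negative integers, 0-based,
    and `w` is the precision parameter of the observation model.
    """
    F = seqs[0].shape[0]
    T = max(r + x.shape[1] for x, r in zip(seqs, offsets))
    ones = np.zeros((F, T))   # number of 1-coefficients aligned to each cell
    zeros = np.zeros((F, T))  # number of 0-coefficients aligned to each cell
    for x, r in zip(seqs, offsets):
        n = x.shape[1]
        ones[:, r:r + n] += (x == 1)
        zeros[:, r:r + n] += (x == 0)
    # Log of the two terms inside the logarithm of (5).
    term_a = zeros * np.log(1 - w) + ones * np.log(w)
    term_b = zeros * np.log(w) + ones * np.log(1 - w)
    # log(0.5) is the Bernoulli prior weight (alpha_lambda = 0.5), an
    # assumption added here so that empty cells score exactly zero.
    return float(np.sum(np.logaddexp(term_a, term_b) + np.log(0.5)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=(8, 20))
    print(log_likelihood_score([x, x], [0, 0]))   # perfectly overlapping copies
    print(log_likelihood_score([x, x], [0, 20]))  # non-overlapping placement
```

With this normalization, two identical sequences score higher when overlapped than when placed apart, and every non-overlapping placement receives the same score.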

3 If there exist some audio sequences that are recorded by the same device, then the $r_k$ have some constraints (see Section 4.2).

The $\mathcal{L}(r)$ function can be interpreted as an advanced similarity function with two main advantages over simple similarity metrics such as correlation:

• It is possible to score the quality of the alignment of K sequences with K > 2 with the $\mathcal{L}(r)$ function.

• The $\mathcal{L}(r)$ function is able to give a score for alignments where the sequences are not matching (not overlapping).

It is also important to mention that for two unconnected sequences, or in general for a sequence $x_k$ that is not connected to a cluster $C_j$, $\mathcal{L}(\hat{r}_k, r_{C_j})$ gives the same score for all non-overlapping settings. This is due to the fact that the time slices where no sequence exists on the generic time line do not contribute to the score, which can be observed from (5).

For a more detailed description of the model, please see [18].

2.2. Multiresolution modeling

Before describing the multiresolution setting of the model, we first describe the low resolution representations and how they are obtained. We define $\dot{x}^{(k)}_{n} = \{x^{(k)}_{fn}\}_{f=1}^{F}$ as the coefficient vector of sequence $x_k$ at time frame $n$. The low resolution representation is then obtained by reshaping $x_k$, grouping $L$ consecutive coefficient vectors $\dot{x}^{(k)}_{n}$ into a single column vector. In the sequel, the low resolution representation of a sequence $x_k$ is denoted as $\tilde{x}_{k,L} = \{\tilde{\dot{x}}^{(k)}_{\tilde{n}_L}\}_{\tilde{n}_L=1}^{\tilde{N}_{k,L}}$ where $1/L$ represents the resolution, and $\tilde{N}_{k,L}$ and $\tilde{n}_L$ represent the length and the frame index of sequence $\tilde{x}_{k,L}$ respectively. We also denote by $\tilde{r}_{k,L}$ and $\tilde{\tau}_L$ the offset of the sequence $\tilde{x}_{k,L}$ and the generic frame index for resolution $L$ respectively. A toy example is given in Fig. 1, where the grouping of coefficients and the low resolution representations are illustrated for $L = 1$, $L = 2$ and $L = 4$. The sequence index $k$ and the resolution $L$ are dropped for ease of representation.

Note that grouping the data in such a way decreases the resolution, i.e., the length of the data decreases, however without losing any information.⁴
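The grouping of L consecutive coefficient vectors into one column can be sketched with a reshape. This is a minimal illustration, not the accompanying software: the function name `to_low_resolution` is hypothetical, and zero-padding the last group when the length is not a multiple of L is an assumption, since the text does not say how a trailing partial group is handled.

```python
import numpy as np

def to_low_resolution(x, L):
    """Group L consecutive frames (columns) of an (F, N) array.

    Returns an (F * L, ceil(N / L)) array whose column g stacks the
    frames g*L, g*L + 1, ..., g*L + L - 1 of the input. The last group
    is zero-padded if needed (an assumption made for this sketch).
    """
    F, N = x.shape
    n_groups = -(-N // L)  # ceil(N / L)
    padded = np.zeros((F, n_groups * L), dtype=x.dtype)
    padded[:, :N] = x
    # (F, n_groups, L) -> (L, F, n_groups) -> (L * F, n_groups)
    grouped = padded.reshape(F, n_groups, L).transpose(2, 0, 1)
    return grouped.reshape(L * F, n_groups)

if __name__ == "__main__":
    x = np.arange(12).reshape(2, 6)
    print(to_low_resolution(x, 2))
```

Because the coefficients are only rearranged, the original sequence can be recovered from the low resolution representation, in line with the remark that no information is lost.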

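Having described the low resolution representation of sequences, the generative model for resolution $L$ and the corresponding log-likelihood $\mathcal{L}_L(\tilde{r})$ are given in (6) and (7) respectively. The representation of the unobserved variable $\lambda_{f\tau}$ in resolution $L$ is denoted as $\tilde{\lambda}_{f\tilde{\tau}_L,L}$.

$$\tilde{\lambda}_{f\tilde{\tau}_L,L} \sim \mathcal{BE}(\tilde{\lambda}_{f\tilde{\tau}_L,L}; \alpha_\lambda)$$

$$\tilde{x}^{(k)}_{f\tilde{n}_L,L} \mid \tilde{r}_{k,L}, \tilde{\lambda}_{f\tilde{\tau}_L,L} \sim \prod_{\tilde{\tau}_L=1}^{\tilde{T}} \mathcal{P}\big(\tilde{x}^{(k)}_{f\tilde{n}_L,L}; \tilde{\lambda}_{f\tilde{\tau}_L,L}, w\big)^{\delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]} \tag{6}$$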

4 It is also possible to decrease the resolution by subsampling or by changing the hop size and window size in the feature extraction, however at the expense of information loss.


Fig. 2. Illustration of moving a sample from resolution L to L/2 with the forward kernel. The dashed and red-solid windows both represent candidate positions for moving the sample to the one level higher intermediate distribution. The red-solid window represents the window that the sample is moved to. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

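$$\mathcal{L}_L(\tilde{r}_L) = \sum_{\tilde{\tau}_L=1}^{\tilde{T}} \sum_{f=1}^{F} \log \Big( (1-w)^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\,\delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-0]} \; w^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\,\delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-1]} + w^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\,\delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-0]} \; (1-w)^{\sum_{k=1}^{K} \delta[\tilde{n}_L-(\tilde{\tau}_L-\tilde{r}_{k,L})]\,\delta[\tilde{x}^{(k)}_{f\tilde{n}_L,L}-1]} \Big) \tag{7}$$

In the model, the $\delta[\tilde{n}_L - (\tilde{\tau}_L - \tilde{r}_{k,L})]$ expression in the observation model indicates that if the group of $L$ consecutive coefficients of sequence $\tilde{x}_{k,L}$ at time $\tilde{n}_L$, i.e., $\tilde{\dot{x}}^{(k)}_{\tilde{n}_L}$, is aligned to time $\tilde{\tau}_L$, then these $L$ coefficients are generated from the unobserved coefficient $\tilde{\lambda}_{\tilde{\tau}_L,L}$. At the original resolution ($L = 1$), there is no grouping of coefficients, hence $x_k = \tilde{x}_{k,L}$ and $\mathcal{L}(r) = \mathcal{L}_L(\tilde{r}_L)$.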

3. Multiresolution alignment using SMC samplers in multiple audio alignment setting

In this section, we introduce an SMC sampler based solution for the multiple audio alignment problem that uses the low resolution $\mathcal{L}_L(\tilde{r}_L)$ functions as bridges.

Ideally, for aligning K unsynchronized audio sequences, a search mechanism should be defined on the K dimensional discrete $\mathcal{L}(r)$ surface (log-likelihood) to find the optimum alignments $r^*_{1:K}$. However, our preliminary experiments with batch methods such as Gibbs sampling have not proven very effective due to the multimodal and very rough structure of the surface [29]. For this reason, we resort to a more advanced approximate inference method, SMC samplers. A previous work for the pairwise case is presented in [22].

3.1. SMC samplers

SMC samplers can be used for sampling from densities that are otherwise difficult to sample from. The method samples from a sequence of intermediate distributions, denoted by $\gamma_i$ [23]. At each step, samples are drawn from the next intermediate distribution, and in the last step the algorithm samples from the target distribution, which is $\mathcal{L}(r)$ in our case. The main idea behind such a mechanism is that if consecutive intermediate distributions are chosen to be close to each other, they act like a bridge and guide the samples through the modes of the target density [22].

A forward Markov transition kernel $K_{i+1}(r_s^{(i+1)}, r_s^{(i)})$ is applied to draw samples, where $r_s^{(i)}$ represents the $s$th sample at the $i$th step. The importance of each sample is decided according to its weight, which is computed as in (8) [23]:

$$w_i\big(r_s^{(1:i)}\big) = w_{i-1}\big(r_s^{(1:i-1)}\big)\, \frac{B_{i-1}\big(r_s^{(i)}, r_s^{(i-1)}\big)\, \gamma_i\big(r_s^{(i)}\big)}{K_i\big(r_s^{(i)}, r_s^{(i-1)}\big)\, \gamma_{i-1}\big(r_s^{(i-1)}\big)} \tag{8}$$

where $B_{i-1}(r_s^{(i)}, r_s^{(i-1)})$ is a backward Markov kernel. A resampling stage can be applied when the weights become very unevenly distributed, i.e., when only a few samples end up having non-negligible weight. A common criterion to measure this degeneracy is the effective sample size (ESS) [23,30,31].
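The degeneracy check mentioned above can be sketched with the usual effective sample size estimate, one over the sum of squared normalized weights. This is a generic illustration rather than the paper's implementation: the function names, the multinomial resampling scheme and the threshold of half the number of samples are assumptions made here.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS estimate 1 / sum(w_norm^2) for non-negative weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample_if_degenerate(samples, weights, threshold_ratio=0.5, rng=None):
    """Multinomial resampling when ESS drops below threshold_ratio * S.

    Hypothetical helper: returns (samples, normalized_weights). After
    resampling, all weights are reset to 1 / S.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if effective_sample_size(w) >= threshold_ratio * len(w):
        return list(samples), w
    idx = rng.choice(len(w), size=len(w), p=w)
    return [samples[i] for i in idx], np.full(len(w), 1.0 / len(w))

if __name__ == "__main__":
    print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))
    print(resample_if_degenerate([10, 20, 30, 40], [1.0, 0.0, 0.0, 0.0]))
```

Uniform weights give an ESS equal to the number of samples, while a single dominant weight drives the ESS towards one and triggers resampling.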

3.2. Forward Markov transition kernel design

We choose the intermediate distributions as the low resolution score functions $\mathcal{L}_L(\tilde{r}_{k,L})$ where $L = 2^l$, $l = 11, 10, \dots, 0$. Note that the length of each $\mathcal{L}_{L/2}(\tilde{r}_{L/2})$ is twice the length of the one step lower resolution $\mathcal{L}_L(\tilde{r}_L)$, i.e., $\mathcal{L}_2(\tilde{r}_2)$ is half the length of $\mathcal{L}_1(\tilde{r}_1)$. In addition, the samples are moved from a lower resolution ($2L$) to a higher resolution ($L$) through some smoothed distributions of $\mathcal{L}_L$. Hence, one needs to design a forward kernel such that samples are moved from lower resolution to higher resolution as well as within the same resolution, depending on the step index. In the SMC sampler framework, the design of the forward and backward kernels is flexible so that any proposal mechanism is possible at any step of the algorithm, i.e., $K_i(\cdot)$ does not have to be equal to $K_j(\cdot)$.

To obtain smooth intermediate distributions, a sparse smoothing kernel $Q$ is applied several times to $\mathcal{L}_L(\cdot)$, i.e., $Q^n\mathcal{L}_L, Q^{n-1}\mathcal{L}_L, \dots, Q\mathcal{L}_L, \mathcal{L}_L$. Note that by choosing a sparse smoothing kernel, an explicit computation of all values in $Q^n\mathcal{L}_L$ is not needed. We applied an averaging kernel for smoothing purposes, and the backward kernel is chosen to be equal to the forward kernel in the weight update. In Fig. 2, we illustrate how a sample is moved from a low resolution $L$ to the one step higher resolution $L/2$ using the forward kernel. We assume that the sample would be around $2r - 1$ due to the doubled resolution. Accordingly, there are 4 proposed values in the higher resolution $L/2$, i.e., $2r - 1$, $2r$, $2r + 1$, $2r + 2$ in this example. At each step, the score values are averaged over two overlapping windows. Then, following the forward kernel, the sample is moved to one of the windows with probabilities proportional to the window averages. The red-solid window represents the window that the sample is moved to. As an example, in the first step (from $\mathcal{L}_L$ to $Q^2\mathcal{L}_{L/2}$), the score values are averaged over two overlapping windows of size 3. Following the forward kernel, the sample $r$ in $\mathcal{L}_L$ is moved to the right window. In the last step (from $Q\mathcal{L}_{L/2}$ to $\mathcal{L}_{L/2}$), the kernel proposes two values and one of them is chosen with a probability proportional to its score value. Note that, due to the flexibility of the forward kernel design, one can choose a different window length and number of smoothed distributions when moving from a lower resolution to a higher one. See [28], Section 2.1.2, for implementation details of this method.
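The window based move described above can be sketched as follows. This is only one possible reading of the procedure, written under several assumptions: the names `choose_window` and `move_sample` are hypothetical, the candidate values 2r − 1 to 2r + 2 are used directly as 0-based array indices, the scores are given as non-negative unnormalized probabilities, the same score array is used at every step (whereas the text uses differently smoothed versions), and the window sizes shrink as 3, 2, 1. The actual implementation is described in [28].

```python
import numpy as np

def choose_window(candidates, probs, window, rng):
    """Pick one sliding window with probability proportional to its mean score."""
    windows = [candidates[i:i + window]
               for i in range(len(candidates) - window + 1)]
    means = np.array([np.mean([probs[c] for c in win]) for win in windows])
    total = means.sum()
    # Fall back to a uniform choice if every window has zero mass.
    p = means / total if total > 0 else np.full(len(windows), 1.0 / len(windows))
    return windows[int(rng.choice(len(windows), p=p))]

def move_sample(r, probs_high, rng=None):
    """Move a sample at index r (resolution L) to an index at resolution L/2."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(probs_high)
    # Four proposed positions around 2r, clipped to the valid index range.
    candidates = [c for c in (2 * r - 1, 2 * r, 2 * r + 1, 2 * r + 2) if 0 <= c < n]
    for window in (3, 2, 1):
        if len(candidates) > window:
            candidates = choose_window(candidates, probs_high, window, rng)
    return candidates[0]

if __name__ == "__main__":
    probs = np.zeros(20)
    probs[9] = 1.0
    print(move_sample(4, probs, np.random.default_rng(0)))
```

Narrowing the candidate set through averaged windows lets a sample drift towards the high-scoring region without evaluating the full smoothed surface.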

3.3. Initial computation of samples

The initial target distribution, $\gamma_1$, is equal to $\mathcal{L}_L(\tilde{r}_L)$ where $L$ is the lowest resolution. In the SMC sampler, the first distribution is usually chosen as a distribution that is simple enough to draw samples from [23]; however, at this stage, instead of drawing samples from $\mathcal{L}_L(\tilde{r}_L)$, we take all possible alignments for $\tilde{r}_L$ as separate samples or particles, and the initial weight of each sample is initialized as its respective probability. As an example, assume two sequences are to be aligned, with lengths $N_{L,1}$ and $N_{L,2}$ respectively at resolution $L$. Similar to correlation, there are $N_{L,1} + N_{L,2} - 1$ lags for overlapping alignments, which is set as the number of samples.

We apply two criteria to determine the initial resolution, which also directly affects the number of samples: the length of each sequence to be aligned has to be larger than one, i.e., $N_{L,k_1} > 1$, $N_{L,k_2} > 1$, and the number of samples has to be larger than a predefined threshold value $T_s$, i.e., $N_{L,k_1} + N_{L,k_2} - 1 > T_s$.
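The two criteria can be turned into a small search over the resolution levels. The sketch below is an interpretation and not the authors' code: the function name `initial_resolution` is hypothetical, the low resolution length is taken as ceil(N / L), and the lowest resolution (largest L = 2^l, starting from l = 11) that satisfies both criteria is selected.

```python
import math

def initial_resolution(n1, n2, ts, max_level=11):
    """Return (L, number_of_samples) for two sequences of n1 and n2 frames.

    Scans L = 2**max_level, ..., 2, 1 and returns the first (coarsest)
    L for which both low resolution lengths exceed one and the number
    of overlapping lags exceeds the threshold ts.
    """
    for level in range(max_level, -1, -1):
        L = 2 ** level
        m1, m2 = math.ceil(n1 / L), math.ceil(n2 / L)
        if m1 > 1 and m2 > 1 and m1 + m2 - 1 > ts:
            return L, m1 + m2 - 1
    # No level satisfies the criteria: fall back to the original resolution.
    return 1, n1 + n2 - 1

if __name__ == "__main__":
    print(initial_resolution(10000, 8000, ts=100))
```

A larger threshold pushes the starting point to a finer resolution and therefore to more particles, which is the trade-off examined later through the parameter T_s.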

Note that we exclude the non-overlapping alignments from the samples in the procedure. Besides the fact that including them makes the search space infinite (see Section 2), it is not necessary to compute the non-overlapping alignment score at the intermediate levels. By computing it at the highest resolution, one can make the overlapping/non-overlapping decision by comparing it with the scores of the sample alignments.

3.4. Extending multiresolution SMC sampler to multiple audio alignment setting

For aligning multiple audio sequences, searching over all possible alignment settings of the sequences on $\mathcal{L}(r)$ is not feasible since the search space is huge. In [18], an ad-hoc sequential search method is proposed where we take advantage of the fact that a sequence can be aligned against a group of pre-aligned sequences using the $\mathcal{L}(r)$ function.

More formally, starting with $K' = 2$ sequences and sequentially aligning one sequence at a time, we solve the maximization problem $r^*_{1:K'} = \arg\max_{r_{1:K'}} \mathcal{L}(r_{1:K'})$ where $K' < K$ (a smaller number of sequences). At each epoch, the method scans through all not-yet-aligned sequences, groups the sequences that match each other (forms clusters) and freezes the relative shifts.

More specifically, the sequential algorithm visits each sequence that is not yet aligned one by one and aligns the sequence against the cluster of already aligned sequences. Assume we denote the sequence to be aligned by $x_k$, the cluster of pre-aligned sequences by $C$ and their respective alignments by $r_C$. Then, at each step, the method solves the maximization problem in (9).

$$r^*_k = \arg\max_{r_k} \mathcal{L}(r_k, r_C) \tag{9}$$

Here the maximization is achieved via a multiresolution SMC sampler which is designed to sample from the score function $\mathcal{L}(r_k, r_C)$. The low resolution functions $\mathcal{L}_L(\tilde{r}_k, \tilde{r}_C)$ act as intermediate distributions that the samples are moved through, as explained in Section 3.2. The best alignment $r^*_k$ is then estimated by simply choosing, among the samples at the original resolution, the one that maximizes (9).

Note that there is no information available about the time positions on the generic time line, therefore we assume the pre-aligned group of sequences is aligned to $\tau = N_k + 1$ where $N_k$ represents the length of the sequence to be aligned, $x_k$. In this way, $r_k = 1$ represents a non-overlapping alignment and $2 \le r_k \le N_k + N_C$ represents all the overlapping alignments, where $N_C$ represents the length of the cluster. This restricts the search space of the alignment of sequence $k$ to the $[1, N_k + N_C]$ interval.

In the SMC sampler framework, intermediate distributions are usually annealed so that they become more similar [23]. As explained in Section 3.2, the samples move from lower to higher resolutions through smoothed intermediate distributions, which acts as an annealing process. It is possible to change the number of intermediate distributions between resolutions and the length of the averaging kernel applied to each intermediate distribution due to the flexibility of the forward kernel. Besides moving samples through smoothed intermediate distributions between resolutions, we also adjust the precision parameter for different resolutions as a second method of annealing. When the sequences are aligned at lower resolutions, the resulting score function becomes a coarse version of the score function at higher resolution. This is similar to aligning sequences with high noise, where the precision value should be chosen small [18]. Hence the precision value is chosen small for low resolutions and gradually increased as the sampler reaches the high resolutions.

Table 1

Characteristics of the GT_090912 set of Jiku Mobile Video Datasets [27].

| | Camera 1 | Camera 2 | Camera 3 | Camera 4 | Total |
|---|---|---|---|---|---|
| Number of recordings | 19 | 15 | 8 | 8 | 50 |
| Length | 58.93 m | 53.68 m | 49.15 m | 55.2 m | 3 h 37 m |

Table 2

Feature extraction parameters.

| Parameter | Value |
|---|---|
| Sampling rate (Fs) | 16000 Hz |
| Mono/Stereo | Mono |
| Window length | 0.064 sec |
| Window type | Hamming |
| Hop length | 0.032 sec |
| Subband division | Logarithmic |
| Min–Max frequency | 100 Hz–8000 Hz |
| Number of bits per window | 32 |

4. Experimental results

4.1. Experimental setup and dataset

The accuracy of the multiresolution SMC sampler method is evaluated on the GT_090912 event of the Jiku mobile video dataset [27]. The event was recorded by the audience using mobile devices of different quality and under different noise conditions. The characteristics of the dataset are given in Table 1. The binary features for each audio recording are obtained following the fingerprinting scheme given in [7]. Note that this binary feature is inherently immune to energy differences, i.e., volume changes, between different sequences due to the differencing procedure on the STFT over time. A preprocessing step is applied for resampling, silence removal and normalization of the signals before extracting features. The choice of parameters for the extraction procedure is given in Table 2. The ground truth synchronization of the data is given in [32], where the offset of each recording is obtained by manual listening tests and represented on a common time line. Note that the ground truth in [32] is not directly compatible with our evaluation system, so we further convert the data to another format.

Note that, as given in Table 1, each recording device is used to record more than one recording in the GT_090912 event. This situation imposes additional constraints on the alignments. To explain precisely, we replace the sequence index $k$ with a pair of indices $(c, l)$ where $c$ represents the camera index and $l$ represents the recording index for the camera $c$. $N_{c,l}$ represents the time length of the sequence in the feature domain.

Constraint 1, given in (10), basically states that the recordings of the same camera, i.e., $x_{c,i}, x_{c,j}$ for $i \neq j$, cannot overlap on the generic time line. Constraint 2, as given in (11), states that not only can two recordings $(x_{c,l}, x_{c,l-m})$ of the same camera not overlap, but there also has to be a certain distance between their offsets. This distance is equal to the sum of all sequence lengths in between $x_{c,l}$ and $x_{c,l-m}$. On the other hand, these constraints do not prevent the sequences with the same camera index $c$ from being aligned in the same cluster; they only require a non-overlapping alignment with each other.

Fig. 3. Analysis of proposed SMC sampler based method w.r.t. precision value $w_{max}$.

$$r_{c,l} > r_{c,l-1} + N_{c,l-1} \tag{10}$$

$$r_{c,l} > r_{c,l-m} + \sum_{j=1}^{m} N_{c,l-j} \quad \text{where } m > 0 \tag{11}$$
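The same-camera constraints can be checked directly on a candidate set of offsets. The sketch below is illustrative only: the function names are hypothetical, the lists hold the offsets and lengths of one camera's recordings in recording order (0-based list index standing in for l), and constraint (11) is read as requiring a gap of at least the summed lengths of the recordings from l − m up to l − 1.

```python
def satisfies_constraint_10(offsets, lengths):
    """Check r_l > r_{l-1} + N_{l-1} for consecutive recordings of one camera."""
    return all(offsets[l] > offsets[l - 1] + lengths[l - 1]
               for l in range(1, len(offsets)))

def satisfies_constraint_11(offsets, lengths):
    """Check r_l > r_{l-m} + sum of the lengths of recordings l-m, ..., l-1."""
    for l in range(1, len(offsets)):
        for m in range(1, l + 1):
            gap = sum(lengths[l - j] for j in range(1, m + 1))
            if not offsets[l] > offsets[l - m] + gap:
                return False
    return True

if __name__ == "__main__":
    offsets = [0, 120, 300]   # frames, one camera, in recording order
    lengths = [100, 150, 80]
    print(satisfies_constraint_10(offsets, lengths),
          satisfies_constraint_11(offsets, lengths))
```

Under this reading, chaining the consecutive constraint over all recordings of a camera already implies the m-step constraint, so the two checks agree on any input.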

As a baseline, the fingerprinting based alignment algorithm in [12] is applied with an implementation of the fingerprinting of [8] given in [10]. The query-by-example fingerprinting implementation in [10] is particularly appropriate for the alignment task because the extracted landmarks also carry time-stamp information relative to the start of the file. The alignment algorithm in [12] can be summarized as follows:

1. Generate the reference database of fingerprints using all audio files in the dataset.

2. For each query file, compute the number of exactly matching fingerprints between the query file and the database.

3. Find the most similar sequences from the database by thresholding the number of matching fingerprints.

4. Compute the relative offsets of similar sequences by using the time stamps in the fingerprints.
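The steps above can be sketched for a single query/reference pair as follows. This is a generic landmark-voting illustration, not the implementation of [10] or [12]: the function name `match_and_offset` is hypothetical, fingerprints are represented as (hash, time) pairs, and the threshold is applied to the number of matches that agree on the most common time difference, which is an assumption about how the thresholding is done.

```python
from collections import Counter

def match_and_offset(query, reference, min_matches=5):
    """Return (offset, votes) if the pair matches, otherwise None.

    `query` and `reference` are lists of (hash, time) pairs. The offset
    is the most common value of reference_time - query_time among
    exactly matching hashes.
    """
    # Step 1: index the reference fingerprints by hash value.
    ref_index = {}
    for h, t in reference:
        ref_index.setdefault(h, []).append(t)
    # Step 2: count exact hash matches, grouped by implied time offset.
    votes = Counter()
    for h, t in query:
        for t_ref in ref_index.get(h, []):
            votes[t_ref - t] += 1
    if not votes:
        return None
    # Steps 3 and 4: threshold the match count and read off the offset.
    offset, count = votes.most_common(1)[0]
    return (offset, count) if count >= min_matches else None

if __name__ == "__main__":
    reference = [(h, h) for h in range(100)]
    query = [(h, h - 30) for h in range(30, 60)]
    print(match_and_offset(query, reference))
```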

Note that the alignment of two matching sequences is computed twice in this scheme and, for some pairs, the two results are not equal. In this case, we choose the alignment with the higher number of landmark hits. In addition, we do not include sequences with the same camera index $c$ in the list of similar sequences.

It is also important to mention that the choice of the threshold for eliminating less similar sequences depends on the target dataset. Here, we tuned the threshold to obtain the highest accuracy.

4.2. Evaluation criteria

We utilize accuracy, precision, recall and F-measure as evaluation metrics. While we use the formal definitions of those metrics, we exclude the results for pairs of sequences that are recorded by the same camera because they are always correctly aligned (see Section 4.1). We then define the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) measures as follows:

• TP – True Positive: Two sequences overlap in the ground truth and are estimated as overlapping with a true relative offset.

• TN – True Negative: Two sequences do not overlap in the ground truth and are estimated as not overlapping.

• FP – False Positive: Two sequences do not overlap in the ground truth but are estimated as overlapping.

• FN – False Negative: Two sequences overlap in the ground truth but are estimated as not overlapping, or two sequences overlap in the ground truth and are also estimated as overlapping but their relative offset is not true.
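The four definitions above translate into a small decision function. This is an illustrative sketch: the function name `classify_pair`, the use of `None` to denote a non-overlapping pair and the `tolerance` parameter (in frames) are assumptions made here, since the text does not state how close an estimated offset must be to count as true.

```python
def classify_pair(gt_offset, est_offset, tolerance=0):
    """Label one pair of sequences as "TP", "TN", "FP" or "FN".

    Offsets are relative offsets in frames; None means that the pair
    does not overlap (in the ground truth or in the estimate).
    """
    if gt_offset is None:
        # Not overlapping in the ground truth.
        return "TN" if est_offset is None else "FP"
    if est_offset is None:
        # Overlapping in the ground truth but missed by the estimate.
        return "FN"
    # Overlap detected: it only counts if the relative offset is right.
    return "TP" if abs(gt_offset - est_offset) <= tolerance else "FN"

if __name__ == "__main__":
    print(classify_pair(120, 120), classify_pair(None, None),
          classify_pair(None, 40), classify_pair(120, 300))
```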

4.3. Evaluation results

In this section, we give the evaluation results for the proposed SMC sampler based multiresolution multiple audio alignment method and compare the results with the baseline fingerprinting based alignment method in [12].

As explained in Section 3.4, annealing is applied by gradually increasing the precision parameter $w$ through the intermediate distributions. Here, we start with $w = 0.51$ at the lowest resolution and increase the precision up to $w_{max}$. To analyze the effect of $w_{max}$ on the performance of the proposed method, we evaluate the method for various values of $w_{max}$ and the results are given in Fig. 3.

An important parameter of the SMC sampler is the number of samples in the procedure, which is determined by the initial resolution. One of the criteria to determine the initial resolution is the minimum number of samples $T_s$. Here we analyze the effect of the minimum number of samples $T_s$ by evaluating the proposed method for various values of $T_s$, and the results are given in Fig. 4.

The best results for the SMC based multiresolution method are obtained with $w_{max} = 0.64$ and $T_s = 100$. To determine the best threshold parameter for the fingerprinting based alignment method [12], we applied a grid search, for which the results are given in Fig. 5. The best results for both the proposed and the baseline methods are given in Table 3 and Table 4. The computation time for the fingerprinting based experiments is around 6 minutes on average (for each threshold). On the other hand, the computation time for the SMC sampler based system is around 3.5 hours on average.

4.4. Discussion

The results in Table 4 show that, with a proper configuration of parameters, the SMC sampler based multiple audio alignment system is able to outperform the baseline fingerprinting system [12] in accuracy.

Fig. 4. Analysis of the proposed SMC sampler based method w.r.t. the minimum number of samples T_s.

Fig. 5. Analysis of the fingerprinting based alignment method [12] w.r.t. the threshold.

Table 3
Evaluation results: FP, FN, TP and TN for the proposed SMC sampler based system and the baseline fingerprinting based system [12].

Method                              FP    FN    TP    TN
SMC sampler based method             0    47    40    806
Fingerprinting based method [12]     0    63    24    806

Table 4
Evaluation results: Precision, Recall, F-measure and Accuracy for the proposed SMC sampler based system and the baseline fingerprinting based system [12].

Method                              Precision    Recall    F-measure    Accuracy
SMC sampler based method            1.0          0.4598    0.6299       0.947
Fingerprinting based method [12]    1.0          0.2758    0.4324       0.929
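The figures in Table 4 can be recomputed from the counts in Table 3. The sketch below assumes the standard definitions of precision, recall, F-measure and accuracy, which reproduce the tabulated values up to rounding; the function name `alignment_metrics` is introduced here for illustration only.

```python
def alignment_metrics(fp: int, fn: int, tp: int, tn: int) -> dict:
    """Standard detection metrics from the four outcome counts.

    No guard against empty denominators is included; the counts in
    Table 3 do not need one.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Counts taken from Table 3.
smc = alignment_metrics(fp=0, fn=47, tp=40, tn=806)
baseline = alignment_metrics(fp=0, fn=63, tp=24, tn=806)

for name, scores in (("SMC sampler", smc), ("Fingerprinting", baseline)):
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

Both systems share the same 87 matching pairs (FN + TP) and the same 806 true negatives, so the difference in accuracy comes entirely from the 16 additional pairs that the SMC sampler based system aligns correctly.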

For both systems, the precision is computed as 1.0, since there are no FP errors for either method. A score of 1.0 can be interpreted as follows: if the system estimates an alignment between two sequences, which suggests that the sequences match each other with an estimated amount of overlap, then this estimation is consistent with the ground truth. Note that, for a pair of sequences that does not overlap in the ground-truth, if an alignment estimate is not an FP then it is a TN. In other words, all pairs of sequences that do not match or do not overlap in the ground-truth are successfully estimated as not overlapping by both systems.

In practice, the SMC based multiple audio alignment system is more sensitive to FP type errors than the baseline system. Since the optimization is done sequentially by aligning one sequence against a cluster at a time, any false positive alignment could cause an error that propagates through the rest of the sequences. Constraints on alignments (see section 4.1) cause the same effect, because if a sequence from camera c is wrongly aligned during the sequential procedure, the rest of the sequences from the same camera are aligned accordingly, hence the error again propagates. Such error propagation does not happen for the baseline system since each pair of sequences is searched independently.

An immediate solution to prevent or to minimize FP errors in the proposed system is to choose the precision parameter w of the model high enough so that only similar sequences are estimated as matching. Similarly, choosing a threshold high enough leads to fewer FP errors for the baseline system, i.e., the number of exact fingerprint matches between sequences has to be high (more similar) for them to be matched.

On the other hand, the recall of the SMC based alignment system is higher than that of the baseline method. A higher recall value suggests that the SMC based alignment estimates are more accurate than those of the baseline method for pairs of sequences that are known to match with a certain amount of overlap in the ground-truth. This can also be observed from the FN results in Table 3, where the number of FN for the baseline is higher than for the SMC based alignment system. Note that the sum of FN and TP gives the total number of pairs of sequences that match in the ground-truth. Therefore, if an alignment estimate for such sequences is not a TP then it has to be an FN. A higher value of FN directly results in a lower recall value.

We observe from Fig. 3 that for high values of w, FP errors decrease but FN errors increase, because only sequences with high similarity would be estimated as matching. This results in a high precision value but a low recall value. On the other hand, for low values of w, sequences with less similarity could be estimated as matching, hence FN errors decrease but FP errors increase. Hence there is a trade-off between the precision and recall measures in choosing the precision parameter w of the model. The F-measure decreases quickly if precision or recall is dominant against the other. Hence the F-measure can be utilized for tuning the precision parameter of the SMC based alignment system. A similar behavior can be observed for the baseline system in Fig. 5. As the threshold increases, the recall decreases since the number of FN errors increases.
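The sensitivity of the F-measure to an imbalance between the two measures can be made explicit. The bound below assumes the standard harmonic-mean definition; the symbols P (precision) and R (recall) are introduced here only for illustration.

```latex
F = \frac{2PR}{P + R},
\qquad
\min(P, R) \;\le\; F \;\le\; 2\,\min(P, R).
```

The F-measure can therefore never exceed twice the smaller of the two measures, however large the other one is. For the baseline in Table 4, P = 1.0 and R = 0.2758 give F = 2R / (1 + R) ≈ 0.4324, which agrees with the tabulated value.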

The evaluation results in Fig. 4 reveal that the minimum number of samples T_s does not have a critical effect on the performance. This might be due to the fact that the other criterion in determining the resolution, i.e., that the length of each sequence should be higher than 1 in low resolution, is more dominant during the sequential procedure. The two longest and the shortest sequences are of length N_1 = 33655, N_2 = 18155 and N_3 = 499 frames (17 m 57 s, 9 m 41 s and 16 s respectively) in the Jiku dataset. Assuming T_s = 100, for aligning the longest and the shortest sequences, the lowest resolution is chosen as L = 7, so the lengths of the sequences become N_{7,1} = 262 and N_{7,3} = 3 frames and the number of samples is equal to 262 + 3 − 1 = 264. Thus, the lengths of the sequences determine the resolution level rather than T_s in this case. However, for the two longest sequences, the lowest resolution is chosen as L = 9, so the lengths of the sequences become N_{9,1} = 66 and N_{9,2} = 35 frames and the number of samples is equal to 66 + 35 − 1 = 100. For these two sequences, it is possible to increase the resolution up to L = 10, however the resolution is determined by T_s in this case.
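The sample counts in this paragraph can be checked with a few lines. One reading of the arithmetic, assumed here, is that there is one sample per candidate relative offset at which the two downsampled sequences overlap by at least one frame. The function name `num_relative_offsets` is hypothetical, and the downsampled lengths are copied from the text and not recomputed, since the rounding used when halving the sequences is not stated in this section.

```python
def num_relative_offsets(len_a: int, len_b: int) -> int:
    """Number of shifts at which two sequences share at least one frame."""
    return len_a + len_b - 1


T_S = 100  # minimum number of samples used in the example above

# Downsampled lengths quoted in the text.
cases = {
    "longest vs. shortest (L = 7)": (262, 3),
    "two longest (L = 9)": (66, 35),
}

counts = {name: num_relative_offsets(*lengths) for name, lengths in cases.items()}
for name, count in counts.items():
    print(f"{name}: {count} samples, meets T_s: {count >= T_S}")
```

In the first case the count is well above T_s, so the bound on the sequence length is the active constraint; in the second case the count sits exactly at T_s, so T_s is the active constraint.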

5. Conclusion and future directions

5.1. Conclusion

In this work, an SMC sampler based multiresolution multiple audio sequence alignment scheme is proposed. In the design of the system, we benefit from the flexible options of the SMC sampler for the forward kernel and from the discrete score function that can be computed in different resolutions. The results show that the SMC sampler based system can outperform a fingerprint based baseline system with a proper choice of parameters, i.e., precision, number of samples, low resolution levels and length of the smoothing kernel.

One of the main challenges of the alignment problem is the variation of noise among the recordings that match with each other on the timeline. State of the art fingerprinting based alignment methods count the number of exactly matching fingerprints between sequences as a measure of similarity. Thus, under low SNR conditions this approach tends to fail. On the other hand, with the proposed method it is possible to tune the model using the precision parameter w to determine the similarity between two sequences.

The challenge for the alignment methods that are based on similarity functions such as correlation is that it is not clear how to use these measures for multiple audio alignment, since they are defined for measuring the similarity of a pair of sequences. The proposed method solves this problem by sequentially aligning a sequence to a cluster with an SMC based multiresolution sampling method.

From a practical point of view, some applications of audio alignment, such as audio restoration, camera synchronization and source separation, require high precision rather than high accuracy. For example, in an audio restoration application, FP errors might cause severe degradation in the restored audio, and in a camera synchronization application, FP errors might join clusters that are unrelated. For such applications, one can maximize precision by tuning the parameters accordingly, even though the overall system accuracy decreases. The proposed system is adjustable to such situations simply by increasing the precision parameter w of the model.

5.2. Future directions

A drawback of the proposed system is the computation time. Even though the SMC sampler is a fast and efficient mechanism, the algorithm still suffers from the sequential alignment of the sequences. An ideal optimization method for the model based approach would be a sampling mechanism that directly samples from the multiresolution distribution of the alignments r; such a mechanism is yet to be researched. Such a method would also overcome the error propagation problem of the sequential process.

One straightforward solution to further reduce the computation time is to use parallel processing. Note that each sample in the SMC sampler moves independently of the others, thus the system is completely parallelizable. In addition, parallelization strategies such as dividing the sequences into groups and running the alignment procedure on the groups in parallel could be applied.
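Because the samples move independently, the move step maps naturally onto a worker pool. The following is only a toy sketch of that idea: the move used here (a random shift of an integer offset by -1, 0 or +1) is a stand-in and not the forward kernel of the proposed sampler, and the names `move_particle` and `move_all` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def move_particle(args):
    """Toy move: shift one integer offset by -1, 0 or +1 (stand-in kernel)."""
    offset, seed = args
    rng = random.Random(seed)  # private generator, so moves do not interact
    return offset + rng.choice((-1, 0, 1))


def move_all(offsets, seeds, workers=4):
    """Move every particle independently, spreading the work over a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of its inputs.
        return list(pool.map(move_particle, zip(offsets, seeds)))


offsets = [0, 10, 20, 30]
seeds = [1, 2, 3, 4]
parallel = move_all(offsets, seeds)
serial = [move_particle(pair) for pair in zip(offsets, seeds)]
print(parallel == serial)
```

A thread pool keeps the sketch portable. For CPU-bound moves in CPython, `ProcessPoolExecutor` from the same module offers the same `map` interface and would be the more likely choice.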

Another issue that affects the computation time is the initial ordering of the sequences. With no prior information about the alignments of the sequences, ordering according to sequence length is feasible. However, for some datasets, certain attributes of the audio data could be exploited, such as tempo information for music signals. This information could further be used for pre-classification or pre-grouping of sequences as a pre-processing step to the proposed system, similar to [13].

By incorporating the precision parameter w into the Bayesian setting, it is possible to jointly estimate w with the alignments r_{1:K}. This way the optimum parameter setting would be achieved automatically with respect to the input data, which can further improve the performance of the proposed system.

References

[1] L. Kennedy, M. Naaman, Less talk, more rock: automated organization of community-contributed collections of concert videos, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 311–320.

[2] C.V. Cotton, D.P. Ellis, Audio fingerprinting to identify multiple videos of an event, in: 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), IEEE, 2010, pp. 2386–2389.

[3] J. Ojala, S. Mate, I.D.D. Curcio, A. Lehtiniemi, K. Väänänen-Vainio-Mattila, Automated creation of mobile video remixes: user trial in three event contexts, in: Proceedings of the 13th International Conference on Mobile and Ubiquitous Multimedia, MUM '14, ACM, New York, NY, USA, 2014, pp. 170–179.

[4] Anil Alexander, Oscar Forth, Donald Tunstall, Music and noise fingerprinting and reference cancellation applied to forensic audio enhancement, in: Audio Engineering Society Conference: 46th International Conference: Audio Forensics, 2012.

[5] Y. Mizushina, W. Fujiwara, T. Sudou, C.L. Fernando, K. Minamizawa, S. Tachi, Interactive instant replay: sharing sports experience using 360-degrees spherical images and haptic sensation based on the coupled body motion,
