
A simple yet time-optimal and linear-space algorithm for shortest unique substring queries


Contents lists available at ScienceDirect

Theoretical Computer Science

www.elsevier.com/locate/tcs


Atalay Mert İleri (a), M. Oğuzhan Külekci (b), Bojian Xu (c, *, 1)

(a) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, MA 02139, USA
(b) Department of Biomedical Engineering, Istanbul Medipol University, Turkey
(c) Department of Computer Science, Eastern Washington University, WA 99004, USA

Article info

Article history:
Received 30 March 2014
Accepted 7 November 2014
Available online 13 November 2014
Communicated by G. Ausiello

Keywords:
Unique substring
Shortest unique substring
Repetitiveness
Regularity

Abstract

We revisit the problem of finding shortest unique substring (SUS) proposed recently by Pei et al. (2013) [12]. We propose an optimal $O(n)$ time and space algorithm that can find an SUS for every location of a string of size $n$ and thus significantly improve their $O(n^2)$ time complexity. Our method also supports finding all the SUSes covering every location, whereas theirs can find only one SUS for every location. Further, our solution is simpler and easier to implement and is more space efficient in practice, since we only use the inverse suffix array and the longest common prefix array of the string, while their algorithm uses the suffix tree of the string and other auxiliary data structures. Our theoretical results are validated by an empirical study with real-world data that shows our method is at least 8 times faster and uses at least 20 times less memory. The speedup gained by our method against Pei et al.'s can become even more significant when the string size increases due to their quadratic time complexity. We also have compared our method with the recent Tsuruta et al.'s (2014) [14] proposal, another independent $O(n)$ time and space algorithm for SUS finding. The empirical study shows that both methods have nearly the same processing speed. However, ours uses at least 4 times less memory for finding one SUS and at least 2 times less memory for finding all SUSes, both covering every string location.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Repetitive structure and regularity finding [2,1,13] has received much attention in stringology due to its comprehensive applications in different fields, especially in computational biology and bioinformatics research [11,10]. Finding shortest unique substrings (SUS) can be an indirect way for finding repetitive structures of a string, because any proper substring of a shortest unique substring occurs multiple times in the string and thus is a repeat [7]. Shortest unique substrings have been previously used in comparing DNA sequences [3]. However, an efficient method for finding the shortest unique substring covering a given string location was not studied, until recently it was proposed by Pei et al. [12]. As pointed out in [12],

Author names are listed in alphabetical order. A preliminary version of this article appeared at [5]. Part of this work was done while all authors were with TÜBİTAK-BİLGEM-UEKAE of Turkey in Summer 2013.

* Corresponding author.

E-mail addresses: atalay@mit.edu (A.M. İleri), okulekci@medipol.edu.tr (M.O. Külekci), bojianxu@ewu.edu (B. Xu).

1 Supported in part by EWU's Faculty Grants for Research and Creative Works.

http://dx.doi.org/10.1016/j.tcs.2014.11.004
0304-3975/© 2014 Elsevier B.V. All rights reserved.


SUS finding also has its own other important usage in search engines and bioinformatics. We refer readers to [12] for its detailed discussion on the applications of SUS finding. Pei et al. proposed a solution that costs $O(n^2)$ time and $O(n)$ space to find an SUS for every location of a string of size $n$. In this paper, we propose an optimal $O(n)$ time and space algorithm for SUS finding. Our method uses simpler data structures that include the suffix array, the inverse suffix array, and the longest common prefix array of the given string, whereas the method in [12] is built upon the suffix tree data structure. Our algorithm also provides the functionality of finding all the SUSes covering every location, whereas the method of [12] searches for only one SUS for every location. Our method not only improves their results theoretically, the empirical study also shows that our method is more space saving by a factor of at least 20 and is faster by a factor of 4. The speedup gained by our method can become even more significant when the string becomes longer due to the quadratic time cost of [12]. Due to the very high memory consumption of [12], we were not able to run their method with massive data on our machine.

Independence of our work. After we posted an initial version of this proposal at arXiv [6], we were contacted via emails by the coauthors of [14] and [4], both of which solved the SUS finding using $O(n)$ time and space. By the time we communicated, article [14] had been accepted but had not been published, and [4] was still under review. We were also offered their paper drafts and the source code of [14]. The methods for SUS finding in both papers are based on the search for minimum unique substrings (MUS), as in [12]. Our algorithm takes a different approach and does not need to search for MUS. The problem studied by [4] is also more general, in that they want to find the SUS covering a given chunk of locations in the string, instead of a single location as considered by [12,14] and our work. So, by all means, our work is independent and presents a different optimal algorithm for SUS finding. We also have included the performance comparison with the algorithm of [14] in the empirical study. It shows that both methods have nearly the same processing speed, but our method uses at least 4 times less memory for finding one SUS for every string location and uses at least 2 times less memory for finding all SUSes for every string location. The algorithm from [4] cannot be empirically studied, as the author preferred not to release the code until their paper is accepted.

2. Preliminary

We consider a string $S[1 \ldots n]$, where each character $S[i]$ is drawn from an alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$. A substring $S[i \ldots j]$ of $S$ represents $S[i]S[i+1] \ldots S[j]$ if $1 \le i \le j \le n$, and is an empty string if $i > j$. String $S[i' \ldots j']$ is a proper substring of another string $S[i \ldots j]$ if $i \le i' \le j' \le j$ and $j' - i' < j - i$. The length of a non-empty substring $S[i \ldots j]$, denoted as $|S[i \ldots j]|$, is $j - i + 1$. We define the length of an empty string as zero. A prefix of $S$ is a substring $S[1 \ldots i]$ for some $i$, $1 \le i \le n$. A proper prefix $S[1 \ldots i]$ is a prefix of $S$ where $i < n$. A suffix of $S$ is a substring $S[i \ldots n]$ for some $i$, $1 \le i \le n$. A proper suffix $S[i \ldots n]$ is a suffix of $S$ where $i > 1$. We say the character $S[i]$ occupies the string location $i$. We say the substring $S[i \ldots j]$ covers the $k$th location of $S$, if $i \le k \le j$. For two strings $A$ and $B$, we write $A = B$ (and say $A$ is equal to $B$), if $|A| = |B|$ and $A[i] = B[i]$ for $i = 1, 2, \ldots, |A|$. We say $A$ is lexicographically smaller than $B$, denoted as $A < B$, if (1) $A$ is a proper prefix of $B$, or (2) $A[1] < B[1]$, or (3) there exists an integer $k > 1$ such that $A[i] = B[i]$ for all $1 \le i \le k - 1$ but $A[k] < B[k]$. A substring $S[i \ldots j]$ of $S$ is unique, if there does not exist another substring $S[i' \ldots j']$ of $S$, such that $S[i \ldots j] = S[i' \ldots j']$ but $i \ne i'$. A substring is a repeat if it is not unique.

Definition 2.1. For a particular string location $k \in \{1, 2, \ldots, n\}$, the shortest unique substring (SUS) covering location $k$, denoted as $\mathrm{SUS}_k$, is a unique substring $S[i \ldots j]$, such that (1) $i \le k \le j$, and (2) there is no other unique substring $S[i' \ldots j']$ of $S$, such that $i' \le k \le j'$ and $j' - i' < j - i$.

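For any string location $k$, $\mathrm{SUS}_k$ must exist, because the string $S$ itself can be $\mathrm{SUS}_k$ if none of the proper substrings of $S$ is $\mathrm{SUS}_k$. Also there might be multiple candidates for $\mathrm{SUS}_k$. For example, if $S = \texttt{abcbb}$, then $\mathrm{SUS}_2$ can be either $S[1, 2] = \texttt{ab}$ or $S[2, 3] = \texttt{bc}$.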

For a particular string location $k \in \{1, 2, \ldots, n\}$, the left-bounded shortest unique substring (LSUS) starting at location $k$, denoted as $\mathrm{LSUS}_k$, is a unique substring $S[k \ldots j]$, such that either $k = j$ or any proper prefix of $S[k \ldots j]$ is not unique.

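Note that $\mathrm{LSUS}_1 = \mathrm{SUS}_1$ always exists, because at least the whole string $S$ is unique. However, for an arbitrary location $k \ge 2$, $\mathrm{LSUS}_k$ may not exist. For example, if $S = \texttt{abcabc}$, then none of $\mathrm{LSUS}_4$, $\mathrm{LSUS}_5$, and $\mathrm{LSUS}_6$ exists. An up-to-$j$ extension of $\mathrm{LSUS}_k$, denoted as $\mathrm{LSUS}_k^j$, is the substring $S[k \ldots j]$, where $k + |\mathrm{LSUS}_k| \le j \le n$.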

The suffix array $\mathrm{SA}[1 \ldots n]$ of the string $S$ is a permutation of $\{1, 2, \ldots, n\}$, such that for any $i$ and $j$, $1 \le i < j \le n$, we have $S[\mathrm{SA}[i] \ldots n] < S[\mathrm{SA}[j] \ldots n]$. That is, $\mathrm{SA}[i]$ is the starting location of the $i$th suffix in the sorted order of all the suffixes of $S$. The rank array $\mathrm{Rank}[1 \ldots n]$ is the inverse of the suffix array. That is, $\mathrm{Rank}[i] = j$ iff $\mathrm{SA}[j] = i$. The longest common prefix (lcp) array $\mathrm{LCP}[1 \ldots n + 1]$ is an array of $n + 1$ integers, such that for $i = 2, 3, \ldots, n$, $\mathrm{LCP}[i]$ is the length of the lcp of the two suffixes $S[\mathrm{SA}[i - 1] \ldots n]$ and $S[\mathrm{SA}[i] \ldots n]$. We set $\mathrm{LCP}[1] = \mathrm{LCP}[n + 1] = 0$. In the literature, the lcp array is often defined as an array of $n$ integers. We include an extra zero at $\mathrm{LCP}[n + 1]$ just to simplify the description of our upcoming algorithms. Table 1 shows the suffix array and the lcp array of the example string mississippi.

The next Lemma 2.1 shows that, by using the rank array and the lcp array of the string $S$, it is easy to calculate any $\mathrm{LSUS}_i$ if it exists or to detect that it does not exist.


Table 1
The suffix array and the lcp array of an example string S = mississippi.

  i    LCP[i]   SA[i]   suffixes
  1    0        11      i
  2    1        8       ippi
  3    1        5       issippi
  4    4        2       ississippi
  5    0        1       mississippi
  6    0        10      pi
  7    1        9       ppi
  8    0        7       sippi
  9    2        4       sissippi
  10   1        6       ssippi
  11   3        3       ssissippi
  12   0        –       –

Lemma 2.1. For $i = 1, 2, \ldots, n$:

$$\mathrm{LSUS}_i = \begin{cases} S[i \ldots i + L_i], & \text{if } i + L_i \le n \\ \text{not existing}, & \text{otherwise} \end{cases}$$

where $L_i = \max\{\mathrm{LCP}[\mathrm{Rank}[i]], \mathrm{LCP}[\mathrm{Rank}[i] + 1]\}$.

Proof. Note that by the definition of the lcp array, $L_i$ is the length of the longest common prefix between the suffix $S[i \ldots n]$ and any other suffix of $S$. The value of $L_i$ can be any number from the set $\{0, 1, \ldots, n - i + 1\}$. If $i + L_i \le n$, i.e., $L_i < n - i + 1$, it means substring $S[i \ldots i + L_i]$ exists and is unique, while substring $S[i \ldots i + L_i - 1]$ is either empty or is a repeat. So, by the definition of LSUS, $S[i \ldots i + L_i]$ is $\mathrm{LSUS}_i$. On the other hand, if $i + L_i > n$, i.e., $L_i = n - i + 1$, it means $S[i \ldots i + L_i - 1]$ is indeed the suffix $S[i \ldots n]$ and is a repeat, so $\mathrm{LSUS}_i$ does not exist. □
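Lemma 2.1 can be tried out on the example string with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the arrays are built with a naive sort rather than a linear-time construction, indices are 0-based internally (so the condition $i + L_i \le n$ becomes `i + L < n`), and the function names `build_rank_lcp` and `lsus_all` are hypothetical.

```python
def build_rank_lcp(s):
    """Naively build Rank and LCP (0-based). LCP has n + 1 entries with 0 at both ends."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix sort, for illustration only
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * (n + 1)
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        # Count matching characters of the two lexicographically adjacent suffixes.
        while a + lcp[r] < n and b + lcp[r] < n and s[a + lcp[r]] == s[b + lcp[r]]:
            lcp[r] += 1
    return rank, lcp


def lsus_all(s):
    """Return LSUS_i for each start (list index i - 1), or None where it does not exist."""
    rank, lcp = build_rank_lcp(s)
    n = len(s)
    result = []
    for i in range(n):  # 0-based start location
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        result.append(s[i:i + L + 1] if i + L < n else None)
    return result


print(lsus_all("mississippi"))
```

For mississippi this yields `m`, `issis`, `ssis`, `sis`, `issip`, `ssip`, `sip`, `ip`, `pp`, `pi`, and no LSUS for the last location, which agrees with the values of $L_i$ read off Table 1.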

3. SUS finding for one location

In this section, we want to find the SUS covering a given location $k$ using $O(n)$ time and space. We start with finding the leftmost one if $k$ has multiple SUSes. In the end, we will show a trivial extension to find all the SUSes covering location $k$ with the same time and space complexities, if $k$ has multiple SUSes.

Lemma 3.1. Every SUS is either an LSUS or an extension of an LSUS.

Proof. Let's say we are looking at $\mathrm{SUS}_k$ for any $k \in \{1 \ldots n\}$. We know $\mathrm{SUS}_k$ exists for any $k$, so let's say $\mathrm{SUS}_k = S[i \ldots j]$, $1 \le i \le k \le j \le n$. If $S[i \ldots j]$ is neither $\mathrm{LSUS}_i$ nor an extension of $\mathrm{LSUS}_i$, it means $S[i \ldots j]$ is a proper prefix of $\mathrm{LSUS}_i$ and thus is a repeat, which contradicts the fact that $S[i \ldots j] = \mathrm{SUS}_k$ is unique. □

Example 1: $S = \texttt{abcbca}$, then $\mathrm{SUS}_2 = S[1, 2] = \texttt{ab}$, which is $\mathrm{LSUS}_1$. Example 2: $S = \texttt{abcbc}$, then $\mathrm{SUS}_2 = S[1, 2] = \texttt{ab}$, which is an extension of $\mathrm{LSUS}_1 = S[1]$ to location 2.

By Lemma 3.1, we know $\mathrm{SUS}_k$ is either an LSUS or an extension of an LSUS, and the starting location of that LSUS must be on or before location $k$. Then the algorithm for finding $\mathrm{SUS}_k$ for any given string location $k$ is simply to calculate $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$ if existing, using Lemma 2.1. During this calculation, if any LSUS does not cover the location $k$, we simply extend that LSUS up to location $k$. We will pick the shortest one among all the LSUSes or their up-to-$k$ extensions as $\mathrm{SUS}_k$. We resolve the tie by picking the leftmost one. This procedure can stop early if it finds that an LSUS does not exist, because that indicates all the other remaining LSUSes do not exist either. Algorithm 1 gives the pseudocode of this procedure, where we represent $\mathrm{SUS}_k$ by its two attributes: start and length, the starting location and the length of $\mathrm{SUS}_k$, respectively.

Lemma 3.2. Given a string location $k$ and the rank and the lcp array of the string $S$, Algorithm 1 can find $\mathrm{SUS}_k$ using $O(k)$ time. If there are multiple candidates for $\mathrm{SUS}_k$, the leftmost one is returned.

Proof. The procedure starts with the candidate $S[1 \ldots n]$, which is indeed unique (Line 1). Then the For loop calculates the $\mathrm{LSUS}_i$ for $i = 1, 2, \ldots, k$ (Lemma 2.1). If $\mathrm{LSUS}_i$ exists (Line 4) and the length of $\mathrm{LSUS}_i$ or its up-to-$k$ extension is less than the length of the current best candidate (Line 5), then we will pick that $\mathrm{LSUS}_i$ or its up-to-$k$ extension as the new candidate for $\mathrm{SUS}_k$. This also resolves the possible ties by picking the leftmost candidate. In the end of the procedure, we will have the shortest one among $\mathrm{LSUS}_1 \ldots \mathrm{LSUS}_k$ or their up-to-$k$ extensions, and that is $\mathrm{SUS}_k$. Early stop is made at Line 7 if the LSUS being calculated does not exist, because that means all the remaining LSUSes to be calculated do not exist either. Each step in the For loop costs $O(1)$ time and the loop executes no more than $k$ steps, so the procedure takes a total of $O(k)$ time.

Algorithm 1: Find SUS_k. Return the leftmost one if k has multiple SUSes.

Input: The location index k, and the rank array and the lcp array of the string S
Output: SUS_k. The leftmost one will be returned if k has multiple SUSes.

1  start ← 1; length ← n;   // Start location and length of the best candidate for SUS_k.
2  for i = 1, ..., k do
3      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
4      if i + L ≤ n then   // LSUS_i exists.
           /* Extend LSUS_i up to k if needed. Resolve the tie by picking the leftmost SUS. */
5          if max{L+1, k−i+1} < length then
6              start ← i; length ← max{L+1, k−i+1};
7      else break;   // Early stop.
8  Print SUS_k ← (start, length);

Algorithm 2: Find all the SUSes covering a given location k.

Input: The location index k, and the rank array and the lcp array of the string S
Output: All the SUSes covering location k.

1  start ← 1; length ← n;   // Start location and length of the best candidate for SUS_k.
   /* Find the length of SUS_k. */
2  for i = 1, ..., k do
3      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
4      if i + L ≤ n then   // LSUS_i exists.
5          if max{L+1, k−i+1} < length then   // Extend LSUS_i to location k if necessary.
6              start ← i; length ← max{L+1, k−i+1};
7      else break;   // Early stop.
   /* Find all SUSes covering location k. */
8  for i = 1, ..., k do
9      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
10     if i + L ≤ n then   // LSUS_i exists.
11         if max{L+1, k−i+1} = length then   // Extend LSUS_i to location k if necessary.
12             Print (i, max{L+1, k−i+1});
13     else break;   // Early stop.

Theorem 3.1. For any location $k$ in the string $S$, we can find $\mathrm{SUS}_k$ using $O(n)$ time and space. If there are multiple candidates for $\mathrm{SUS}_k$, the leftmost one is returned.

Proof. The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (for example, [9]). After the suffix array is constructed, the rank array (the inverse suffix array) can be trivially created using another $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [8]. Combining the time cost of Algorithm 1 (Lemma 3.2), the total time cost for finding $\mathrm{SUS}_k$ for any location $k$ in the string $S$ of size $n$ is $O(n)$ with a total of $O(n)$ space usage. If multiple candidates for $\mathrm{SUS}_k$ exist, the leftmost candidate will be returned, as is provided by Algorithm 1 (Lemma 3.2). □
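As an illustration of Algorithm 1, the sketch below transcribes its loop into Python and runs it on the arrays of Table 1. It is a hedged example rather than the authors' code: the function name `find_sus` is hypothetical, the arrays are padded with an unused entry at index 0 so that the 1-based indices of the paper carry over directly, and the construction of the arrays themselves is taken as given.

```python
# Table 1 data for S = "mississippi"; index 0 is padding so that the arrays are 1-based.
SA = [None, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
LCP = [None, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]  # LCP[1 .. n+1]


def inverse(sa):
    """Rank is the inverse permutation of SA (1-based, index 0 unused)."""
    rank = [None] * len(sa)
    for j in range(1, len(sa)):
        rank[sa[j]] = j
    return rank


def find_sus(k, rank, lcp, n):
    """Leftmost SUS covering location k, returned as (start, length), both 1-based."""
    start, length = 1, n  # the whole string is always a valid candidate
    for i in range(1, k + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:  # LSUS_i does not exist, and no later LSUS exists either
            break
        cand = max(L + 1, k - i + 1)  # |LSUS_i|, extended up to location k if needed
        if cand < length:  # strict comparison keeps the leftmost candidate on ties
            start, length = i, cand
    return start, length


rank = inverse(SA)
print(find_sus(4, rank, LCP, 11))
```

For $k = 4$ the call returns `(4, 3)`, i.e. the substring `sis`, the only unique substring of length 3 covering location 4 in mississippi.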

3.1. Extension: finding all SUSes for one location

It is trivial to extend Algorithm 1 to find all the SUSes covering a particular location $k$ as follows. We can first use Algorithm 1 to find the leftmost $\mathrm{SUS}_k$. Then we start over again to re-calculate $\mathrm{LSUS}_1 \ldots \mathrm{LSUS}_k$ or their up-to-$k$ extensions, and return all of those whose length is equal to the length of $\mathrm{SUS}_k$. Algorithm 2 shows the pseudocode. This procedure clearly costs an extra $O(k)$ time. Combining the results from Theorem 3.1, we get the following theorem.

Theorem 3.2. For any location $k$ in the string $S$, we can find all the SUSes covering location $k$ using $O(n)$ time and space.
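The two-pass idea behind Theorem 3.2 can be sketched in Python as follows. This is an illustrative variant, not the authors' code: instead of recomputing each candidate in a second loop, it stores the candidates from the first pass and filters them, which keeps the $O(k)$ cost. The name `find_all_sus` is hypothetical, and the arrays are the 1-based Table 1 arrays with an unused entry at index 0.

```python
# Table 1 data for S = "mississippi"; index 0 is padding so that the arrays are 1-based.
SA = [None, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
LCP = [None, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]  # LCP[1 .. n+1]


def find_all_sus(k, sa, lcp):
    """All SUSes covering location k, as (start, length) pairs ordered left to right."""
    n = len(sa) - 1
    rank = [None] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    candidates = []
    for i in range(1, k + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:  # LSUS_i does not exist: early stop
            break
        candidates.append((i, max(L + 1, k - i + 1)))
    # LSUS_1 always exists, so the candidate list is never empty for a valid k.
    best = min(length for _, length in candidates)
    return [c for c in candidates if c[1] == best]


print(find_all_sus(9, SA, LCP))
```

For $k = 9$ the result is `[(8, 2), (9, 2)]`, i.e. `ip` and `pp`, both of which are unique substrings of length 2 covering location 9.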

4. SUS finding for every location

In this section, we want to find $\mathrm{SUS}_k$ for every location $k = 1, 2, \ldots, n$. If $k$ has multiple SUSes, the leftmost one will be returned. In the end, we will show an extension to find all SUSes for every location.

A natural solution is to iteratively use Algorithm 1 as a subroutine to find every $\mathrm{SUS}_k$, for $k = 1, 2, \ldots, n$. However, the total time cost of this solution will be $O(n) + \sum_{k=1}^{n} O(k) = O(n^2)$, where $O(n)$ captures the time cost for the construction of the rank array and the lcp array and $\sum_{k=1}^{n} O(k)$ is the total time cost for the $n$ instances of Algorithm 1. We want to have a solution that costs a total of $O(n)$ time and space, which implies that the amortized cost for finding each SUS is $O(1)$.


By Lemma 3.1, we know that every SUS must be an LSUS or an extension of an LSUS. The next Lemma 4.1 further says that if $\mathrm{SUS}_k$ is an extension of an LSUS, it has some special properties and can be quickly obtained from $\mathrm{SUS}_{k-1}$.

Lemma 4.1. For any $k \in \{2, 3, \ldots, n\}$, if $\mathrm{SUS}_k$ is an extension of an LSUS, then (1) $\mathrm{SUS}_{k-1}$ must be a substring whose right boundary is the character $S[k - 1]$, and (2) $\mathrm{SUS}_k$ is the substring $\mathrm{SUS}_{k-1}$ appended by the character $S[k]$.

Proof. Because $\mathrm{SUS}_k$ is an extension of an LSUS, we have $\mathrm{SUS}_k = S[i \ldots k]$ for some $i < k$ and $\mathrm{LSUS}_i = S[i \ldots j]$ for some $j < k$. We also know $S[i \ldots k - 1]$ is unique, because the unique substring $S[i \ldots j]$ is a prefix of $S[i \ldots k - 1]$. Note that any substring starting from a location before $i$ and covering location $k - 1$ is longer than the unique substring $S[i \ldots k - 1]$, so $\mathrm{SUS}_{k-1}$ must be starting from a location between $i$ and $k - 1$, inclusive. Next, we show $\mathrm{SUS}_{k-1}$ actually must start at location $i$. The fact $\mathrm{SUS}_k = S[i \ldots k]$ tells us that $|\mathrm{LSUS}_t| \ge |\mathrm{SUS}_k| = k - i + 1$ for every $t = i + 1, i + 2, \ldots, k$; otherwise, any $\mathrm{LSUS}_t$ that is shorter than $k - i + 1$ would be a better candidate than $S[i \ldots k]$ as $\mathrm{SUS}_k$. That means any unique substring starting from $t = i + 1, i + 2, \ldots, k - 1$ has a length of at least $k - i + 1$. However, $|S[i \ldots k - 1]| = k - i < k - i + 1$ and $S[i \ldots k - 1]$ is unique already and covers location $k - 1$ as well, so $S[i \ldots k - 1]$ is the only candidate for $\mathrm{SUS}_{k-1}$. This also means $\mathrm{SUS}_k$ is indeed the substring $\mathrm{SUS}_{k-1}$ appended by $S[k]$. □

4.1. The overall strategy

We are ready to present the overall strategy for finding the SUS of every location, by using Lemmas 3.1 and 4.1. We will calculate all the SUSes in the order of $\mathrm{SUS}_1, \mathrm{SUS}_2, \ldots, \mathrm{SUS}_n$. That means when we want to calculate $\mathrm{SUS}_k$, $k \ge 2$, we have had $\mathrm{SUS}_{k-1}$ calculated already. Note that $\mathrm{SUS}_1 = \mathrm{LSUS}_1$, which is easy to calculate using Lemma 2.1. Now let's look at the calculation of a particular $\mathrm{SUS}_k$, $k \ge 2$. By Lemma 3.1, we know $\mathrm{SUS}_k$ is either an LSUS or an extension of an LSUS. By Lemma 4.1, we also know if $\mathrm{SUS}_k$ is an extension of an LSUS, then the right boundary of $\mathrm{SUS}_{k-1}$ must be $S[k - 1]$ and $\mathrm{SUS}_k$ is just $\mathrm{SUS}_{k-1}$ appended by the character $S[k]$. Suppose when we want to calculate $\mathrm{SUS}_k$, we have already calculated the shortest LSUS covering location $k$ or have known the fact that no LSUS covers location $k$. Then, by using $\mathrm{SUS}_{k-1}$, which has been calculated by then, and the shortest LSUS covering location $k$, we will be able to calculate $\mathrm{SUS}_k$ as follows:

Case 1: If the right boundary of $\mathrm{SUS}_{k-1}$ is not $S[k - 1]$, then we know $\mathrm{SUS}_k$ cannot be an extension of an LSUS (the contrapositive of Lemma 4.1). Thus, $\mathrm{SUS}_k$ is just the shortest LSUS covering location $k$, which must exist in this case.

Case 2: If the right boundary of $\mathrm{SUS}_{k-1}$ is $S[k - 1]$, then $\mathrm{SUS}_k$ may or may not be an extension of an LSUS. We will consider two possibilities: (1) If the shortest LSUS covering location $k$ exists, we will compare its length with $|\mathrm{SUS}_{k-1}| + 1$, and pick the shorter one as $\mathrm{SUS}_k$. If both have the same length, we resolve the tie by picking the one whose starting location index is smaller. (2) If no LSUS covers location $k$, $\mathrm{SUS}_k$ will just be $\mathrm{SUS}_{k-1}$ appended by $S[k]$.

Therefore, the real challenge here, by the time we want to calculate $\mathrm{SUS}_k$, $k \ge 2$, is to ensure that we would have already calculated the shortest LSUS covering location $k$ or we would have already known the fact that no LSUS covers location $k$. If there exist multiple shortest LSUSes covering location $k$, we would like to know the leftmost one.

4.2. Preparation

We now focus on the calculation of the leftmost shortest LSUS covering every string location $k$, denoted by $\mathrm{SLS}_k$. Let $\mathrm{Candidate}^k_i$ denote the leftmost shortest one among those of $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ that exist and cover location $i$. For an arbitrary $k$, $1 \le k \le n$, $\mathrm{SLS}_k$ may not exist, because the location $k$ may not be covered by any LSUS at all. For example, if $S = \texttt{abcabc}$, then locations 5 and 6 are not covered by any LSUS, and thus $\mathrm{SLS}_5$ and $\mathrm{SLS}_6$ do not exist. However, if $\mathrm{SLS}_k$ exists, by the definition of SLS and Candidate, we have the following fact.

Fact 4.1. $\mathrm{SLS}_k = \mathrm{Candidate}^k_k = \mathrm{Candidate}^{k+1}_k = \cdots = \mathrm{Candidate}^n_k$, if $\mathrm{SLS}_k$ exists.

Our goal is to ensure $\mathrm{SLS}_k$ will have been known when we want to calculate $\mathrm{SUS}_k$, so we calculate every $\mathrm{SLS}_k$ following the same order $k = 1, 2, \ldots, n$ in which we calculate all SUSes. Because we need to know every $\mathrm{LSUS}_i$, $i \le k$, in order to calculate $\mathrm{SLS}_k$ (Fact 4.1), we will walk through the string locations $k = 1, 2, \ldots, n$: at each walk step $k$, we calculate $\mathrm{LSUS}_k$ and maintain $\mathrm{Candidate}^k_i$ for every string location $i$ that has been covered by at least one of $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$. Note that $\mathrm{Candidate}^k_i = \mathrm{SLS}_i$ for every $i \le k$ (Fact 4.1). Those $\mathrm{Candidate}^k_i$ with $i \le k$ would have already been used as $\mathrm{SLS}_i$ in the calculation of $\mathrm{SUS}_i$. So, after each walk step $k$, we will only need to maintain the candidates for locations after $k$.

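Lemma 4.2. (1) $\mathrm{LSUS}_1$ always exists. (2) If $\mathrm{LSUS}_k$ exists, then $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$ all exist. (3) If $\mathrm{LSUS}_k$ does not exist, then none of $\mathrm{LSUS}_k, \mathrm{LSUS}_{k+1}, \ldots, \mathrm{LSUS}_n$ exist.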

Proof. (1) $\mathrm{LSUS}_1$ must exist, because the string $S$ can be $\mathrm{LSUS}_1$ if every proper prefix of $S$ is a repeat. (2) If $\mathrm{LSUS}_k$ exists, say $\mathrm{LSUS}_k = S[k \ldots \gamma_k]$, then $\mathrm{LSUS}_i$ exists for every $i \le k$, because at least $S[i \ldots \gamma_k]$ is unique due to the fact that $S[k \ldots \gamma_k]$ is unique and also is a suffix of $S[i \ldots \gamma_k]$. (3) If $\mathrm{LSUS}_k$ does not exist, it means $S[k \ldots n]$ is a repeat, and thus every suffix $S[i \ldots n]$ of $S[k \ldots n]$ for $i \ge k$ is also a repeat, i.e., $\mathrm{LSUS}_i$ does not exist for every $i \ge k$. □


The next lemma shows that the right boundary of $\mathrm{LSUS}_i$ will be on or after the right boundary of $\mathrm{LSUS}_{i-1}$, if $\mathrm{LSUS}_i$ exists.

Lemma 4.3. For each $i = 2, 3, \ldots, n$: $|\mathrm{LSUS}_i| \ge |\mathrm{LSUS}_{i-1}| - 1$.

Proof. We prove the lemma by contradiction. Suppose $\mathrm{LSUS}_{i-1} = S[i - 1 \ldots j]$ for some $j$, $i - 1 \le j \le n$. If $|\mathrm{LSUS}_i| < |\mathrm{LSUS}_{i-1}| - 1$, it means $\mathrm{LSUS}_i = S[i \ldots k]$, where $i \le k < j$. Because $S[i \ldots k]$ is unique, $S[i - 1 \ldots k]$ is also unique, and it is shorter than $S[i - 1 \ldots j]$. This is a contradiction because $S[i - 1 \ldots j]$ is already $\mathrm{LSUS}_{i-1}$. Thus, the claim in the lemma is true. □
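Lemma 4.3 can be sanity-checked on the example of Table 1. The snippet below is a small sketch under assumptions (the name `lsus_lengths` is hypothetical, and the arrays are 1-based with an unused entry at index 0): it computes $|\mathrm{LSUS}_i| = L_i + 1$ for every existing LSUS via Lemma 2.1 and checks that no length drops by more than one from one start location to the next.

```python
# Table 1 data for S = "mississippi"; index 0 is padding so that the arrays are 1-based.
SA = [None, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
LCP = [None, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]  # LCP[1 .. n+1]


def lsus_lengths(sa, lcp):
    """Lengths |LSUS_1|, |LSUS_2|, ... for all LSUSes that exist (Lemma 2.1)."""
    n = len(sa) - 1
    rank = [None] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lengths = []
    for i in range(1, n + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:  # by Lemma 4.2, no later LSUS exists either
            break
        lengths.append(L + 1)
    return lengths


lengths = lsus_lengths(SA, LCP)
print(lengths)
# Lemma 4.3: each length is at least the previous length minus one.
print(all(b >= a - 1 for a, b in zip(lengths, lengths[1:])))
```

Equivalently, the right boundaries $i + L_i$ never move left as $i$ grows, which is the property the linked-list maintenance in Section 4.3 relies on.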

Now let's look at the situation at the end of the $k$th walk step. By then, we have calculated $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$. By Lemma 4.2, we know that there exists some $k'$, $1 \le k' \le k$, such that $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$ all exist, but $\mathrm{LSUS}_{k'+1} \ldots \mathrm{LSUS}_k$ do not exist. If $k' = k$, that means $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ all exist. Let $\gamma_k$ denote the right boundary of $\mathrm{LSUS}_{k'}$, i.e., $\mathrm{LSUS}_{k'} = S[k' \ldots \gamma_k]$. By Lemma 4.3, we know $\gamma_k$ is also the right boundary of the string locations covered by $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$. So, every location $1, 2, \ldots, \gamma_k$ is covered by at least one LSUS from $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$. That is, at the end of the $k$th walk step: (1) every location $j = 1, \ldots, \gamma_k$ has its candidate $\mathrm{Candidate}^k_j$ calculated already. (2) If $\gamma_k < n$, every location $j = \gamma_k + 1, \ldots, n$ still does not have its candidate calculated, because every such location $j$ has not been covered by any LSUS from $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ that we have calculated at the end of the $k$th walk step.

Lemma 4.4. At the end of the $k$th walk step, if $\gamma_k > k$, then for any $i$ and $j$, $k \le i < j \le \gamma_k$, $\mathrm{Candidate}^k_j$ also covers location $i$.

Proof. $\mathrm{Candidate}^k_j$ is a substring starting somewhere on or before $k$ and going through the location $j$. Because $k \le i < j$, it is obvious that $\mathrm{Candidate}^k_j$ goes through location $i$. □

Lemma 4.5. At the end of the $k$th walk step, if $\gamma_k > k$, then

$$|\mathrm{Candidate}^k_k| \le |\mathrm{Candidate}^k_{k+1}| \le \cdots \le |\mathrm{Candidate}^k_{\gamma_k}|$$

Proof. By Lemma 4.4, we know $\mathrm{Candidate}^k_j$ also covers location $i$, for any $i$ and $j$, $k \le i < j \le \gamma_k$. Thus, if $|\mathrm{Candidate}^k_j| < |\mathrm{Candidate}^k_i|$, location $i$'s current candidate should be replaced by location $j$'s candidate, because that gives location $i$ a shorter candidate. However, the current candidate for location $i$ is already the shortest candidate. It is a contradiction. So, $|\mathrm{Candidate}^k_i| \le |\mathrm{Candidate}^k_j|$, which proves the lemma. □

4.3. Finding SLS for every location

Invariant. We calculate $\mathrm{SLS}_k$ for $k = 1, 2, \ldots, n$ by maintaining the following invariant at the end of every walk step $k$: (A) If $\gamma_k > k$, locations $\{k + 1, k + 2, \ldots, \gamma_k\}$ will be cut into chunks, such that: (A.1) All locations in one chunk have the same candidate. (A.2) Locations belonging to different chunks have different candidates. (A.3) Each chunk will be represented by a linked list node of four fields: ChunkStart, ChunkEnd, start, length, respectively representing the start and end location of the chunk and the start and length of the candidate shared by all locations of the chunk. (A.4) All nodes representing different chunks will be connected into a linked list, ordered by the string positions of the corresponding chunks. The linked list has a head and a tail, referring to the two nodes that represent the lowest positioned chunk and the highest positioned chunk. (B) If $\gamma_k \le k$, the linked list is empty.

Maintenance of the invariant. We describe in an inductive manner the procedure that maintains the invariant. Algorithm 3 shows the pseudocode of the procedure. We start with an empty linked list.

Base step: $k = 1$. We are walking the first step. We first calculate $\mathrm{LSUS}_1$ using Lemma 2.1. We know $\mathrm{LSUS}_1$ must exist. Let's say $\mathrm{LSUS}_1 = S[1 \ldots \gamma_1]$ for some $\gamma_1 \le n$. Then, $\mathrm{Candidate}^1_i = \mathrm{LSUS}_1$ for every $i = 1, 2, \ldots, \gamma_1$. We record all these candidates by using a single node $(1, \gamma_1, 1, \gamma_1)$. This is the only node in the linked list and is pointed to by both head and tail. We know $\mathrm{SLS}_1 = \mathrm{Candidate}^1_1$ (Fact 4.1), so we return $\mathrm{SLS}_1$ by returning (head.start, head.length) $= (1, \gamma_1)$. We then change head.ChunkStart from 1 to 2. If it turns out that head.ChunkEnd $= \gamma_1 < 2$, meaning $\mathrm{LSUS}_1$ really covers location 1 only, we delete the head node from the linked list, which will then become empty.

Inductive step: $k \ge 2$. We are walking the $k$th step. We first calculate $\mathrm{LSUS}_k$ using Lemma 2.1.

Case 1: $\mathrm{LSUS}_k$ does not exist. (1) If head does not exist, it means that location $k$ is covered neither by any of $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k-1}$ nor by $\mathrm{LSUS}_k$, so $\mathrm{SLS}_k$ simply does not exist and we will return (null, null). (2) If head exists, we return (head.start, head.length) as $\mathrm{SLS}_k$ and then remove the information about location $k$ from the head by setting head.ChunkStart $= k + 1$. If it turns out that head.ChunkEnd < head.ChunkStart, we will remove the head node.

Algorithm 3: The sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) returns SLS_1, SLS_2, ..., SLS_n, if the corresponding SLS exists; otherwise, null will be returned.

1  Construct Rank[1 ... n] and LCP[1 ... n+1] of the string S;
2  Initialize an empty List;   // Each node's 4 fields: ChunkStart, ChunkEnd, start, length.
3  head ← 0; tail ← 0;   // Reference to the head and tail node of the List
4  FindSLS(k)
       /* Process LSUS_k, if it exists. */
5      L ← max{LCP[Rank[k]], LCP[Rank[k]+1]};
6      if k + L ≤ n then   // LSUS_k exists.
           // Add a new list element at the tail, if necessary.
7          if head = 0 then List[1] ← (k, k+L, k, L+1); head ← 1; tail ← 1;   // List was empty.
8          else if k + L > List[tail].ChunkEnd then
9              tail++; List[tail] ← (List[tail−1].ChunkEnd+1, k+L, k, L+1);
           /* Update candidates and merge the nodes whose candidates can be shorter.
              Resolve the tie by picking the leftmost one. */
10         j ← tail;
11         while j ≥ head and List[j].length > L+1 do j−−;
12         List[j+1] ← (List[j+1].ChunkStart, List[tail].ChunkEnd, k, L+1); tail ← j+1;
13     if head ≠ 0 then SLS_k ← (head.start, head.length);   // The list is not empty.
14     else SLS_k ← (null, null);   // SLS_k does not exist.
       /* Discard the information about location k from the List. */
15     if head > 0 then   // List is not empty
16         if List[head].ChunkEnd ≤ k then
17             head++;   // Delete the current head node
18             if head > tail then head ← 0; tail ← 0;   // List becomes empty
19         else List[head].ChunkStart ← k+1;
20     return SLS_k

Case 2: $\mathrm{LSUS}_k$ exists and $\mathrm{LSUS}_k = S[k \ldots \gamma_k]$, $\gamma_k \le n$. By Lemma 4.2, we know $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$ all exist. Let $\gamma_{k-1}$ denote the right boundary of $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$. By Lemma 4.3, we know $\gamma_k \ge \gamma_{k-1}$ and $\gamma_{k-1}$ is also the right boundary of $\mathrm{LSUS}_{k-1}$, i.e., $\mathrm{LSUS}_{k-1} = S[k - 1 \ldots \gamma_{k-1}]$. Note that both $\gamma_{k-1} < k$ and $\gamma_{k-1} \ge k$ are possible.

1. If head does not exist, it means $\gamma_{k-1} < k$ and none of the locations $\{k \ldots \gamma_k\}$ is covered by any of $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$. We will insert a new node $(k, \gamma_k, k, \gamma_k - k + 1)$, which will be the only node in the linked list.

2. If head exists, it means $\gamma_{k-1} \ge k$. If $\gamma_k >$ tail.ChunkEnd $= \gamma_{k-1}$, we first insert at the tail side a new linked list node (tail.ChunkEnd $+ 1$, $\gamma_k$, $k$, $\gamma_k - k + 1$) to record the candidate information for locations in the chunk after $\gamma_{k-1}$ through $\gamma_k$.

Then, we will travel through the nodes in the linked list from the tail side toward the head. We stop when we meet a node whose candidate is shorter than or equal to $\mathrm{LSUS}_k$ or when we reach the head end of the linked list. We will merge all the nodes whose candidates are longer than $\mathrm{LSUS}_k$ into a new linked list node. The chunk covered by the new node is the union of the chunks covered by the merged nodes, and the candidate of the new node is $\mathrm{LSUS}_k$.

This travel and merge process is valid because of Lemma 4.5. This merge process ensures every location maintains its best (shortest) candidate by the end of every walk step. It also resolves the possible ties of multiple shortest LSUSes covering a particular location by picking the leftmost one as that location's candidate, because the merge process does not merge nodes whose candidates are of the same length.

We will return (head.start, head.length) as $\mathrm{SLS}_k$, since $\mathrm{Candidate}^k_k = \mathrm{SLS}_k$ (Fact 4.1). Finally, we will remove the information about location $k$ from the head by setting head.ChunkStart $= k + 1$. We will remove the head node if it turns out that head.ChunkEnd < head.ChunkStart.

Lemma 4.6. Given the lcp array and the rank array of $S$, the sequence of FindSLS(1), FindSLS(2), ..., FindSLS(n) function calls will return $\mathrm{SLS}_1, \mathrm{SLS}_2, \ldots, \mathrm{SLS}_n$ if existing. The amortized time cost of one FindSLS() function call is $O(1)$.

Proof. The correctness of Algorithm 3 is already given in the description of the procedure that maintains the invariant. All operations in an instance of the FindSLS() function call clearly take $O(1)$ time, except the while loop at Line 11, which is to merge linked list nodes whose candidates can be shorter. Thus, the lemma will be proved, if we can prove the amortized number of linked list nodes that will be merged via that while loop is also bounded by a constant. Note that no node in the linked list ever splits, due to Lemma 4.3. In the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n), there are at most $n$ linked list nodes to be merged. We know the number of merge operations in merging $n$ nodes into one node (in the worst case) is no more than $O(n)$. Therefore, the amortized time cost on merging the linked list nodes in one FindSLS() function call over the sequence of $n$ function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) is $O(1)$. This finishes the proof of the lemma. □

Algorithm 4: Finding the leftmost SUS_k, k = 1, ..., n.

1  for k ← 1 ... n do
2      (start, length) ← FindSLS(k);   // SLS_k; it is (null, null) if SLS_k does not exist.
3      if k = 1 then Print SUS_k ← (start, length);
4      else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k − 1 then Print SUS_k ← (start, length);
5      else if (start, length) = (null, null) then Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
6      else if length < SUS_{k−1}.length + 1 then Print SUS_k ← (start, length);
7      else   // Resolve the tie by picking the leftmost one.
8          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1)

4.4. Finding the leftmost SUS for every location

Once we are able to sequentially calculate every SLS_k or detect that it does not exist, we are ready to calculate every SUS_k by using the strategy described in Section 4.1. Algorithm 4 gives the pseudocode of the procedure. It calculates SUSes in the order of SUS_1, SUS_2, ..., SUS_n (Line 1). For each location k, the function call at Line 2 is to calculate SLS_k or to find that SLS_k does not exist. Line 3 handles the special case where SUS_1 = LSUS_1 = SLS_1. The condition at Line 4 shows that SUS_k cannot be an extension of an LSUS (Lemma 4.1), so SUS_k = SLS_k, which must exist in this case. Line 5 handles the case where SLS_k does not exist, so SUS_k must be SUS_{k−1} appended by S[k]. Line 6 handles the case where SLS_k is shorter than the one-character extension of SUS_{k−1}, so SUS_k is SLS_k. Lines 7–8 handle the case where SLS_k is longer than or equal to the one-character extension of SUS_{k−1}, so SUS_k is SUS_{k−1} appended by S[k]. This also resolves the tie by picking the leftmost one if k is covered by multiple SUSes.
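The case analysis of Algorithm 4 can be sketched in Python as below. This is a hedged illustration, not the authors' code. The function `brute_force_sls` is a slow stand-in for the linear-time FindSLS of Algorithm 3, and it assumes the definitions from the earlier sections: LSUS_i is the shortest unique substring starting at location i, and SLS_k is the shortest LSUS covering location k, leftmost on ties. All function names are hypothetical; positions are 1-indexed and spans are (start, length) pairs.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, length), start is 1-indexed


def is_unique(s: str, start: int, length: int) -> bool:
    """True if s[start .. start+length-1] occurs exactly once in s."""
    sub = s[start - 1:start - 1 + length]
    first = s.find(sub)
    return s.find(sub, first + 1) == -1


def brute_force_sls(s: str) -> List[Optional[Span]]:
    """Slow stand-in for FindSLS(1..n); None plays the role of (null, null)."""
    n = len(s)
    lsus: List[Optional[Span]] = []
    for i in range(1, n + 1):
        found = None
        for length in range(1, n - i + 2):
            if is_unique(s, i, length):
                found = (i, length)
                break
        lsus.append(found)
    sls: List[Optional[Span]] = []
    for k in range(1, n + 1):
        best = None
        for cand in lsus:  # increasing start, so strict '<' keeps the leftmost
            if cand is None:
                continue
            start, length = cand
            if start <= k <= start + length - 1 and (best is None or length < best[1]):
                best = cand
        sls.append(best)
    return sls


def leftmost_sus(sls: List[Optional[Span]]) -> List[Span]:
    """Combine SLS_1..SLS_n into the leftmost SUS_1..SUS_n (Algorithm 4, Lines 3-8)."""
    sus: List[Span] = []
    for k, cand in enumerate(sls, start=1):
        if k == 1:
            sus.append(cand)                      # Line 3: SUS_1 = SLS_1
            continue
        prev_start, prev_len = sus[-1]
        extension = (prev_start, prev_len + 1)    # SUS_{k-1} appended by S[k]
        if prev_start + prev_len - 1 > k - 1:
            sus.append(cand)                      # Line 4: SUS_k = SLS_k
        elif cand is None:
            sus.append(extension)                 # Line 5: SLS_k does not exist
        elif cand[1] < prev_len + 1:
            sus.append(cand)                      # Line 6: SLS_k is strictly shorter
        else:
            sus.append(extension)                 # Lines 7-8: tie or longer, pick leftmost
    return sus
```

For example, `leftmost_sus(brute_force_sls("abcbb"))` yields `[(1, 1), (1, 2), (3, 1), (3, 2), (4, 2)]`: location 2 is covered by both `ab` and `bc`, and the one-character extension `ab` is chosen because it is the leftmost.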

Theorem 4.1. Algorithm 4 finds SUS_1, SUS_2, ..., SUS_n of string S using a total of O(n) time and space. If any string location is covered by multiple SUSes, Algorithm 4 finds the leftmost one.

Proof. We can construct the suffix array of the string S in a total of O(n) time and space using existing algorithms (for example, [9]). The rank array is just the inverse suffix array and can be directly obtained from SA using O(n) time and space. Then we can obtain the lcp array from the suffix array and rank array using another O(n) time and space [8]. So the total time and space costs for preparing these auxiliary data structures are O(n).
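As a small illustration of these auxiliary structures, the sketch below builds the rank (inverse suffix) array from a suffix array and computes the lcp array with the linear-time method of Kasai et al. [8]. Several things here are assumptions made for the sake of a short example: indices are 0-based (the paper is 1-based), `lcp[r]` is taken to be the longest common prefix of the suffixes at suffix-array positions r − 1 and r with `lcp[0] = 0`, and `naive_suffix_array` simply sorts suffixes, so it is not the O(n) construction of [9].

```python
from typing import List


def naive_suffix_array(s: str) -> List[int]:
    """Stand-in construction by sorting suffixes; NOT linear time."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa: List[int]) -> List[int]:
    """Inverse suffix array: rank[i] is the position of suffix i in sa."""
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank


def kasai_lcp(s: str, sa: List[int], rank: List[int]) -> List[int]:
    """lcp[r] = length of the longest common prefix of suffixes sa[r-1] and sa[r]."""
    n = len(s)
    lcp = [0] * n
    h = 0  # current match length, decreases by at most one per step
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]  # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp
```

For the string `banana`, this gives `sa = [5, 3, 1, 0, 4, 2]`, `rank = [3, 2, 5, 1, 4, 0]`, and `lcp = [0, 1, 3, 0, 0, 2]`.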

Time cost. The amortized time cost for each FindSLS function call at Line 2 in the sequence of function calls FindSLS(1), ..., FindSLS(n) is O(1) (Lemma 4.6). The time cost for Lines 3–8 is also O(1). There are a total of n steps in the for loop, yielding a total of O(n) time cost.

Space usage. The only space usage (in addition to the auxiliary data structures such as the suffix array, rank array, and the lcp array, which cost a total of O(n) space) in our algorithm is the dynamic linked list, which however has no more than n nodes at any time. Each node costs O(1) space. Therefore, the linked list costs O(n) space. Adding the space usage of the auxiliary data structures, we get that the total space usage of finding every SUS is O(n).

Finding the leftmost SUS. For any particular location k, if one SUS covering location k is an extension of an LSUS, we know by Lemma 4.1 that this SUS must be the substring SUS_{k−1} appended by the letter S[k]. Clearly this SUS is the leftmost one among all the SUSes covering location k and is guaranteed to be returned by Lines 7–8 in Algorithm 4. If all SUSes covering location k are LSUSes, the leftmost one of those LSUSes is already guaranteed to be returned by Algorithm 3 (Lemma 4.6). □

4.5. Extension: finding all SUSes for every location

It is possible that a particular location can have multiple SUSes. For example, if S = abcbb, then SUS_2 can be either S[1, 2] = ab or S[2, 3] = bc. Algorithm 4 only returns one of them and resolves the tie by picking the leftmost one. However, it is easy to modify Algorithm 4 to return all the SUSes of every location, without changing Algorithm 3.

Suppose a particular location k is covered by multiple SUSes. We know, at the end of the kth walk step but before the linked list update (at the end of Line 14 in Algorithm 3), SLS_k returned by Algorithm 3 is recorded by the head node and is the leftmost one among all the SUSes that are LSUS and cover location k. Because every string location maintains its shortest candidate and due to Lemma 4.5, all the other SUSes that are LSUS and cover location k are being recorded by other linked list nodes that immediately follow the head node. This is because if those other SUSes were not being recorded, the location right after the head node's chunk would have a candidate longer than SUS_k or would not have a candidate calculated yet, although that location is indeed covered by an SUS_k at the end of the kth walk step. This is a contradiction. The same argument can be made for the other next neighboring locations that are covered by SUS_k.

Algorithm 5: Finding all choices of each SUS_k, for k = 1, ..., n.
1 for k ← 1 ... n do
2     flag ← 0; (start, length) ← FindSLS(k); // SLS_k; (null, null) if SLS_k does not exist.
3     if k = 1 then
4         Print SUS_k ← (start, length);
5     else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k − 1 then
6         Print SUS_k ← (start, length); flag ← 1;
7     else if (start, length) = (null, null) then
8         Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
9     else if length ≤ SUS_{k−1}.length + 1 then
10        Print SUS_k ← (start, length); flag ← 1;
11    else
12        Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
      /* Print out other SUSes that cover location k. */
13    if flag = 1 then
14        if SUS_{k−1}.length + 1 = SUS_k.length then
15            Print (SUS_{k−1}.start, SUS_{k−1}.length + 1);
16        j ← head;
17        while j > 0 and j ≤ tail do
              /* The List[j].start ≠ SUS_k.start condition checking is because the SUS from the head node may have been printed. */
18            if List[j].length = SUS_k.length and List[j].start ≠ SUS_k.start then
19                Print (List[j].start, List[j].length); j ← j + 1;
20            else if List[j].start ≠ SUS_k.start then
21                Break;

Therefore, finding all the SUSes covering location k becomes easy: simply go through the linked list nodes from the head node toward the tail node and report all the candidates whose lengths are equal to the length of SUS_k that we have found. If the rightmost character of SUS_{k−1} is S[k − 1] and the substring SUS_{k−1} appended by S[k] has the same length, that substring will be reported too. Algorithm 5 gives the pseudocode, where the flag is used to note in what cases it is possible to have multiple SUSes.

If flag is on, we will need to check the linked list nodes (Lines 17–21) as well as the one-letter extension of SUS_{k−1} (Lines 14–15). The overall time and space cost of maintaining the linked list data structure (the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n)) is still O(n). The time cost of reporting the SUSes covering a particular location becomes O(occ), where occ is the number of SUSes that cover that location. That gives us the following theorem.

Theorem 4.2. Algorithm 5 finds all SUSes covering every location of a string of size n using O(n) space and O(N) time, where $N = \sum_{k=1}^{n} occ_k$ and $occ_k \ge 1$ is the number of SUSes covering location k.
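To make the quantities occ_k and N concrete, the following sketch enumerates all SUSes covering a location by exhaustive search. It is a brute-force illustration only and is unrelated to the O(N)-time procedure of Algorithm 5; the function names are hypothetical, positions are 1-indexed, and spans are (start, length) pairs.

```python
from typing import List, Tuple


def occurs_once(s: str, start: int, length: int) -> bool:
    """True if s[start .. start+length-1] occurs exactly once in s."""
    sub = s[start - 1:start - 1 + length]
    first = s.find(sub)
    return s.find(sub, first + 1) == -1


def all_sus_covering(s: str, k: int) -> List[Tuple[int, int]]:
    """All shortest unique substrings covering location k, from left to right."""
    n = len(s)
    for length in range(1, n + 1):
        lo = max(1, k - length + 1)   # leftmost start that still covers k
        hi = min(k, n - length + 1)   # rightmost start that fits in s
        hits = [(start, length) for start in range(lo, hi + 1)
                if occurs_once(s, start, length)]
        if hits:
            return hits               # the first length with a hit is the shortest
    return []                         # unreachable for non-empty s: s itself is unique


def occurrence_counts(s: str) -> List[int]:
    """occ_k for k = 1..n; their sum is the N of Theorem 4.2."""
    return [len(all_sus_covering(s, k)) for k in range(1, len(s) + 1)]
```

For S = abcbb, `all_sus_covering("abcbb", 2)` returns `[(1, 2), (2, 2)]`, matching the two choices `ab` and `bc` for SUS_2, and `occurrence_counts("abcbb")` is `[1, 2, 1, 2, 1]`, so N = 7 for this string.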

5. Experiments

We have implemented our proposal, named IKXSUS, in C++ (see footnote 2), using the libdivsufsort library (see footnote 3) for the suffix array construction and Kasai et al.'s method [8] to compute the lcp array. We have compared our work against Pei et al.'s RSUS [12] and Tsuruta et al.'s OSUS [14] implementations, on both one-SUS and all-SUS finding for every string location.

Notice that OSUS also computes the suffix array using the same libdivsufsort package and computes the lcp array using Kasai et al.'s method.

RSUS was originally prepared with an R interface. We stripped off that R interface and built a standalone C++ executable for the sake of fair benchmarking. OSUS was originally developed in C++. We ran OSUS both with and without the -l option to compute a single leftmost SUS and all SUSes for every string location. In all three implementations, we also commented out the sections that print the results onto the screen and/or the disk as output, in order to measure the algorithmic performance better.

We ran the tests on a machine that has an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz processor with 8192 KB cache size and 16 GB memory. The operating system was Linux Mint 14. We used the Pizza&Chili corpus in the experiments by taking the first 1, 5, 10, 20, 50, 100, and 200 MBs of the largest dblp.xml, dna, English, and protein files. The results are shown in Figs. 1, 2, 3, and 4.

Footnote 2: Source code can be downloaded at: http://penguin.ewu.edu/~bojianxu/publications.
Footnote 3: Available at: https://code.google.com/p/libdivsufsort.

Fig. 1. The processing speed of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.

Finding the leftmost SUS of every location, Figs. 1 and 2. It was not possible to run RSUS on longer strings, since RSUS requires more memory than our machine has; thus, only files of up to 20 MB were included in the RSUS benchmark. Compared to RSUS, we have observed that IKXSUS is on average more than 8 times faster and uses 20 times less memory. The experimental results also revealed that the difference between the processing speeds of OSUS and IKXSUS is negligible, but on average OSUS uses 4 times more memory than IKXSUS.

Finding all SUSes of every location, Figs. 3 and 4. In the experiments of all-SUS finding for every string location, RSUS was not included as it does not have this functionality. We have observed that OSUS uses less memory in the all-SUS finding than what it needs for one-SUS finding, while IKXSUS's memory cost does not change between the one-SUS and all-SUS finding. Overall, IKXSUS uses at least 2 times less memory space than OSUS and also marginally beats OSUS in terms of processing speed.

Although all three works have linear space complexity in both theory and experiments (note that the X axis in all figures uses log scale), IKXSUS and OSUS use significantly less memory space, due to the fact that these two works use simpler data structures rather than the suffix tree used by RSUS. On the other hand, although both IKXSUS and OSUS use the same set of data structures, such as the suffix array, rank array (inverse suffix array), and the lcp array, and computing these arrays is done via the same library (libdivsufsort for suffix array construction) and the same algorithm (Kasai et al.'s method [8] for lcp array construction), the peak memory usage by OSUS is much higher than that of IKXSUS. The difference stems from the different mechanisms these studies follow to compute the SUS. OSUS computes the SUS by using an additional array, which is named the meaningful minimal unique substring array. Thus, the space used for that additional data structure makes OSUS require more memory.

With respect to the processing speed, both IKXSUS and OSUS present stable running times on all dblp, dna, protein, and English texts and scale well on increasing sizes of the target data, conforming to their linear time complexity. On the other hand, RSUS exhibits its quadratic time complexity on all texts, and especially its running time on English text is much longer when compared to other text types. The speed-up of IKXSUS and OSUS against RSUS can be even more


Fig. 2. The peak memory consumptions of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.


Fig. 4. The peak memory consumptions of OSUS and our proposal in finding all SUSes of every location on several strings of different sizes.

6. Conclusion

We proposed IKXSUS, an optimal linear-time and linear-space algorithm for the shortest unique substring query. Our algorithm significantly improves on RSUS, the original work on the shortest unique substring query proposed recently [12], both theoretically and empirically, in both space and time costs. Our work was discovered independently, without knowledge of OSUS, another recent linear-time and linear-space solution [14] for SUS finding, and uses a different approach. In practice, IKXSUS uses significantly less memory than OSUS while maintaining nearly the same processing speed.

Acknowledgements

We acknowledge the authors of [12,14] for providing their source code.

References

[1] M. Crochemore, W. Rytter, Jewels of Stringology: Text Algorithms, World Scientific, 2003.

[2] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.

[3] B. Haubold, N. Pierstorff, F. Möller, T. Wiehe, Genome comparison without alignment using shortest unique substrings, BMC Bioinform. 6 (1) (2005) 123.

[4] X. Hu, J. Pei, Y. Tao, Shortest unique queries on strings, in: Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE), 2014, pp. 161–172.

[5] A.M. Ileri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, in: Proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 2014, pp. 172–181.

[6] A.M. İleri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, http://arxiv.org/abs/1312.2738.

[7] L. Ilie, W.F. Smyth, Minimum unique substrings and maximum repeats, Fund. Inform. 110 (1–4) (2011) 183–195.

[8] T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, Linear-time longest-common-prefix computation in suffix arrays and its applications, in: Symposium on Combinatorial Pattern Matching, 2001, pp. 181–192.

[9] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, J. Discrete Algorithms 3 (2–4) (2005) 143–156.

[10] M.O. Külekci, J.S. Vitter, B. Xu, Efficient maximal repeat finding using the Burrows–Wheeler transform and wavelet tree, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2) (2012) 421–429.

[11] S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich, Reputer: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res. 29 (22) (2001) 4633–4642.

[12] J. Pei, W.C.H. Wu, M.Y. Yeh, On shortest unique substring queries, in: Proceedings of IEEE International Conference on Data Engineering (ICDE), 2013, pp. 937–948.

[13] W.F. Smyth, Computing regularities in strings: a survey, European J. Combin. 34 (1) (2013) 3–14.

[14] K. Tsuruta, S. Inenaga, H. Bannai, M. Takeda, Shortest unique substrings queries in optimal time, in: Proceedings of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2014, pp. 503–513.
