A simple yet time-optimal and linear-space algorithm for shortest unique substring queries

Contents lists available at ScienceDirect

Theoretical Computer Science

www.elsevier.com/locate/tcs


Atalay Mert İleri (a), M. Oğuzhan Külekci (b), Bojian Xu (c,∗,1)

(a) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, MA 02139, USA

(b) Department of Biomedical Engineering, Istanbul Medipol University, Turkey

(c) Department of Computer Science, Eastern Washington University, WA 99004, USA

Article info

Article history: Received 30 March 2014. Accepted 7 November 2014. Available online 13 November 2014. Communicated by G. Ausiello.

Keywords: Unique substring; Shortest unique substring; Repetitiveness; Regularity

Abstract

We revisit the problem of finding shortest unique substrings (SUS) proposed recently by Pei et al. (2013) [12]. We propose an optimal $O(n)$ time and space algorithm that can find an SUS for every location of a string of size $n$ and thus significantly improve their $O(n^2)$ time complexity. Our method also supports finding all the SUSes covering every location, whereas theirs can find only one SUS for every location. Further, our solution is simpler and easier to implement and is more space efficient in practice, since we only use the inverse suffix array and the longest common prefix array of the string, while their algorithm uses the suffix tree of the string and other auxiliary data structures. Our theoretical results are validated by an empirical study with real-world data that shows our method is at least 8 times faster and uses at least 20 times less memory. The speedup gained by our method against Pei et al.'s can become even more significant when the string size increases due to their quadratic time complexity. We also have compared our method with the recent Tsuruta et al.'s (2014) [14] proposal, another independent $O(n)$ time and space algorithm for SUS finding. The empirical study shows that both methods have nearly the same processing speed. However, ours uses at least 4 times less memory for finding one SUS and at least 2 times less memory for finding all SUSes, both covering every string location.

1. Introduction

Repetitive structure and regularity finding [2,1,13] has received much attention in stringology due to its comprehensive applications in different fields, especially in computational biology and bioinformatics research [11,10]. Finding shortest unique substrings (SUS) can be an indirect way of finding repetitive structures of a string, because any proper substring of a shortest unique substring occurs multiple times in the string and thus is a repeat [7]. Shortest unique substrings have been previously used in comparing DNA sequences [3]. However, no efficient method for finding the shortest unique substring covering a given string location had been studied until one was recently proposed by Pei et al. [12]. As pointed out in [12], SUS finding also has its own important usage in search engines and bioinformatics. We refer readers to [12] for a detailed discussion of the applications of SUS finding.

Pei et al. proposed a solution that costs $O(n^2)$ time and $O(n)$ space to find an SUS for every location of a string of size $n$. In this paper, we propose an optimal $O(n)$ time and space algorithm for SUS finding. Our method uses simpler data structures that include the suffix array, the inverse suffix array, and the longest common prefix array of the given string, whereas the method in [12] is built upon the suffix tree data structure. Our algorithm also provides the functionality of finding all the SUSes covering every location, whereas the method of [12] searches for only one SUS for every location. Our method not only improves their results theoretically; the empirical study also shows that our method is more space saving by a factor of at least 20 and is faster by a factor of 4. The speedup gained by our method can become even more significant when the string becomes longer, due to the quadratic time cost of [12]. Due to the very high memory consumption of [12], we were not able to run their method with massive data on our machine.

✩ Author names are listed in alphabetical order. A preliminary version of this article appeared at [5]. Part of this work was done while all authors were with TÜBİTAK-BİLGEM-UEKAE of Turkey in Summer 2013.

∗ Corresponding author.

E-mail addresses: atalay@mit.edu (A.M. İleri), okulekci@medipol.edu.tr (M.O. Külekci), bojianxu@ewu.edu (B. Xu).

1 Supported in part by EWU's Faculty Grants for Research and Creative Works.

Independence of our work. After we posted an initial version of this proposal at arXiv [6], we were contacted via email by the coauthors of [14] and [4], both of which solved SUS finding using $O(n)$ time and space. By the time we communicated, article [14] had been accepted but had not been published, and [4] was still under review. We were also offered their paper drafts and the source code of [14]. The methods for SUS finding in both papers are based on the search for minimum unique substrings (MUS), as in [12]. Our algorithm takes a different approach and does not need to search for MUS. The problem studied by [4] is also more general, in that they want to find the SUS covering a given chunk of locations in the string, instead of the single location considered by [12,14] and our work. So, by all means, our work is independent and presents a different optimal algorithm for SUS finding. We also have included the performance comparison with the algorithm of [14] in the empirical study. It shows that both methods have nearly the same processing speed, but our method uses at least 4 times less memory for finding one SUS for every string location and uses at least 2 times less memory for finding all SUSes for every string location. The algorithm from [4] cannot be empirically studied, as the author preferred not to release the code until their paper is accepted.

2. Preliminary

We consider a string $S[1 \ldots n]$, where each character $S[i]$ is drawn from an alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$. A substring $S[i \ldots j]$ of $S$ represents $S[i]S[i+1] \ldots S[j]$ if $1 \le i \le j \le n$, and is an empty string if $i > j$. String $S[i' \ldots j']$ is a proper substring of another string $S[i \ldots j]$ if $i \le i' \le j' \le j$ and $j' - i' < j - i$. The length of a non-empty substring $S[i \ldots j]$, denoted as $|S[i \ldots j]|$, is $j - i + 1$. We define the length of an empty string as zero.

A prefix of $S$ is a substring $S[1 \ldots i]$ for some $i$, $1 \le i \le n$. A proper prefix $S[1 \ldots i]$ is a prefix of $S$ where $i < n$. A suffix of $S$ is a substring $S[i \ldots n]$ for some $i$, $1 \le i \le n$. A proper suffix $S[i \ldots n]$ is a suffix of $S$ where $i > 1$. We say the character $S[i]$ occupies the string location $i$. We say the substring $S[i \ldots j]$ covers the $k$th location of $S$, if $i \le k \le j$.

For two strings $A$ and $B$, we write $A = B$ (and say $A$ is equal to $B$), if $|A| = |B|$ and $A[i] = B[i]$ for $i = 1, 2, \ldots, |A|$. We say $A$ is lexicographically smaller than $B$, denoted as $A < B$, if (1) $A$ is a proper prefix of $B$, or (2) $A[1] < B[1]$, or (3) there exists an integer $k > 1$ such that $A[i] = B[i]$ for all $1 \le i \le k-1$ but $A[k] < B[k]$.

A substring $S[i \ldots j]$ of $S$ is unique, if there does not exist another substring $S[i' \ldots j']$ of $S$, such that $S[i \ldots j] = S[i' \ldots j']$ but $i \ne i'$. A substring is a repeat if it is not unique.

Definition 2.1. For a particular string location $k \in \{1, 2, \ldots, n\}$, the shortest unique substring (SUS) covering location $k$, denoted as $\mathrm{SUS}_k$, is a unique substring $S[i \ldots j]$, such that (1) $i \le k \le j$, and (2) there is no other unique substring $S[i' \ldots j']$ of $S$, such that $i' \le k \le j'$ and $j' - i' < j - i$.

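For any string location $k$, $\mathrm{SUS}_k$ must exist, because the string $S$ itself can be $\mathrm{SUS}_k$ if none of the proper substrings of $S$ is $\mathrm{SUS}_k$. Also, there might be multiple candidates for $\mathrm{SUS}_k$. For example, if $S = \texttt{abcbb}$, then $\mathrm{SUS}_2$ can be either $S[1,2] = \texttt{ab}$ or $S[2,3] = \texttt{bc}$.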
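As a sanity check on Definition 2.1, the following brute-force sketch enumerates candidate substrings by increasing length and returns every SUS covering a location. It is not the method proposed in this paper: the function names `is_unique` and `all_sus_bruteforce` are hypothetical, and the running time is far from linear.

```python
def is_unique(s, i, j):
    """True if S[i..j] (1-indexed, inclusive) occurs exactly once in s."""
    sub = s[i - 1:j]
    count = 0
    # Count (possibly overlapping) occurrences by trying every start position.
    for start in range(len(s) - len(sub) + 1):
        if s[start:start + len(sub)] == sub:
            count += 1
    return count == 1


def all_sus_bruteforce(s, k):
    """Return all SUSes covering location k as (i, j) pairs, leftmost first."""
    n = len(s)
    for length in range(1, n + 1):
        # Start positions i such that S[i..i+length-1] covers k and fits in S.
        lo = max(1, k - length + 1)
        hi = min(k, n - length + 1)
        hits = [(i, i + length - 1)
                for i in range(lo, hi + 1)
                if is_unique(s, i, i + length - 1)]
        if hits:  # the first length with a unique substring is the SUS length
            return hits
    return []


print(all_sus_bruteforce("abcbb", 2))  # [(1, 2), (2, 3)], i.e. "ab" and "bc"
```

Such a checker is only meant for validating faster implementations on short strings.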

For a particular string location $k \in \{1, 2, \ldots, n\}$, the left-bounded shortest unique substring (LSUS) starting at location $k$, denoted as $\mathrm{LSUS}_k$, is a unique substring $S[k \ldots j]$, such that either $k = j$ or any proper prefix of $S[k \ldots j]$ is not unique.

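Note that $\mathrm{LSUS}_1 = \mathrm{SUS}_1$ always exists, because at least the whole string $S$ is unique. However, for an arbitrary location $k \ge 2$, $\mathrm{LSUS}_k$ may not exist. For example, if $S = \texttt{abcabc}$, then none of $\mathrm{LSUS}_4$, $\mathrm{LSUS}_5$, and $\mathrm{LSUS}_6$ exists. An up-to-$j$ extension of $\mathrm{LSUS}_k$, denoted as $\mathrm{LSUS}_k^j$, is the substring $S[k \ldots j]$, where $k + |\mathrm{LSUS}_k| \le j \le n$.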

The suffix array $\mathrm{SA}[1 \ldots n]$ of the string $S$ is a permutation of $\{1, 2, \ldots, n\}$, such that for any $i$ and $j$, $1 \le i < j \le n$, we have $S[\mathrm{SA}[i] \ldots n] < S[\mathrm{SA}[j] \ldots n]$. That is, $\mathrm{SA}[i]$ is the starting location of the $i$th suffix in the sorted order of all the suffixes of $S$. The rank array $\mathrm{Rank}[1 \ldots n]$ is the inverse of the suffix array. That is, $\mathrm{Rank}[i] = j$ iff $\mathrm{SA}[j] = i$. The longest common prefix (lcp) array $\mathrm{LCP}[1 \ldots n+1]$ is an array of $n+1$ integers, such that for $i = 2, 3, \ldots, n$, $\mathrm{LCP}[i]$ is the length of the lcp of the two suffixes $S[\mathrm{SA}[i-1] \ldots n]$ and $S[\mathrm{SA}[i] \ldots n]$. We set $\mathrm{LCP}[1] = \mathrm{LCP}[n+1] = 0$. In the literature, the lcp array is often defined as an array of $n$ integers. We include an extra zero at $\mathrm{LCP}[n+1]$ just to simplify the description of our upcoming algorithms. Table 1 shows the suffix array and the lcp array of the example string mississippi.
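The definitions above can be made concrete with a small sketch that builds the three arrays for the example string. It sorts the suffixes naively, which is quadratic or worse and is used here only for illustration; the linear-time constructions cited later in the paper ([9] and [8]) are what the stated bounds rely on. The function name `build_arrays` is hypothetical, and index 0 of each list is unused padding so that indices match the 1-based notation of the text.

```python
def build_arrays(s):
    """Return 1-indexed SA, Rank and LCP lists (index 0 is unused padding)."""
    n = len(s)
    # Naive suffix sort: compare the suffix strings directly.
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    # Rank is the inverse permutation of SA.
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    # LCP[1] = LCP[n+1] = 0; LCP[j] compares the (j-1)th and jth sorted suffixes.
    lcp = [0] * (n + 2)
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[j] = h
    return sa, rank, lcp


sa, rank, lcp = build_arrays("mississippi")
print(sa[1:])   # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp[1:])  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]
```

The printed values agree with the SA[i] and LCP[i] columns of Table 1.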

Table 1
The suffix array and the lcp array of an example string S = mississippi.

     i    LCP[i]   SA[i]   suffixes
     1    0        11      i
     2    1        8       ippi
     3    1        5       issippi
     4    4        2       ississippi
     5    0        1       mississippi
     6    0        10      pi
     7    1        9       ppi
     8    0        7       sippi
     9    2        4       sissippi
    10    1        6       ssippi
    11    3        3       ssissippi
    12    0        –       –

The next Lemma 2.1 shows that, by using the rank array and the lcp array of the string $S$, it is easy to calculate any $\mathrm{LSUS}_i$ if it exists or to detect that it does not exist.

Lemma 2.1. For $i = 1, 2, \ldots, n$:

$$\mathrm{LSUS}_i = \begin{cases} S[i \ldots i + L_i], & \text{if } i + L_i \le n \\ \text{not existing}, & \text{otherwise} \end{cases}$$

where $L_i = \max\{\mathrm{LCP}[\mathrm{Rank}[i]], \mathrm{LCP}[\mathrm{Rank}[i]+1]\}$.

Proof. Note that by the definition of the lcp array, $L_i$ is the length of the longest common prefix between the suffix $S[i \ldots n]$ and any other suffix of $S$. The value of $L_i$ can be any number from the set $\{0, 1, \ldots, n-i+1\}$. If $i + L_i \le n$, i.e., $L_i < n - i + 1$, it means substring $S[i \ldots i + L_i]$ exists and is unique, while substring $S[i \ldots i + L_i - 1]$ is either empty or is a repeat. So, by the definition of LSUS, $S[i \ldots i + L_i]$ is $\mathrm{LSUS}_i$. On the other hand, if $i + L_i > n$, i.e., $L_i = n - i + 1$, it means $S[i \ldots i + L_i - 1]$ is indeed the suffix $S[i \ldots n]$ and is a repeat, so $\mathrm{LSUS}_i$ does not exist. □
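Lemma 2.1 translates almost directly into code. The sketch below evaluates it for every location, again using a naive suffix sort as a stand-in for the linear-time constructions; the helper names `rank_lcp` and `lsus` are hypothetical and indices follow the 1-based convention of the text.

```python
def rank_lcp(s):
    """Naively build 1-indexed Rank and LCP lists (index 0 is padding)."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lcp = [0] * (n + 2)  # LCP[1] = LCP[n+1] = 0
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        while lcp[j] < min(len(a), len(b)) and a[lcp[j]] == b[lcp[j]]:
            lcp[j] += 1
    return rank, lcp


def lsus(s):
    """Entry i-1 is (i, i + L_i) if LSUS_i = S[i..i+L_i] exists, else None."""
    n = len(s)
    rank, lcp = rank_lcp(s)
    out = []
    for i in range(1, n + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        out.append((i, i + L) if i + L <= n else None)
    return out


print(lsus("abcabc"))  # [(1, 4), (2, 4), (3, 4), None, None, None]
```

For `abcabc` the last three entries are `None`, matching the earlier remark that LSUS at locations 4, 5 and 6 does not exist.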

3. SUS finding for one location

In this section, we want to find the SUS covering a given location $k$ using $O(n)$ time and space. We start with finding the leftmost one if $k$ has multiple SUSes. In the end, we will show a trivial extension to find all the SUSes covering location $k$ with the same time and space complexities, if $k$ has multiple SUSes.

Lemma 3.1. Every SUS is either an LSUS or an extension of an LSUS.

Proof. Let's say we are looking at $\mathrm{SUS}_k$ for any $k \in \{1 \ldots n\}$. We know $\mathrm{SUS}_k$ exists for any $k$, so let's say $\mathrm{SUS}_k = S[i \ldots j]$, $1 \le i \le k \le j \le n$. If $S[i \ldots j]$ is neither $\mathrm{LSUS}_i$ nor an extension of $\mathrm{LSUS}_i$, it means $S[i \ldots j]$ is a proper prefix of $\mathrm{LSUS}_i$ and thus is a repeat, which contradicts the fact that $S[i \ldots j] = \mathrm{SUS}_k$ is unique. □

Example 1: If $S = \texttt{abcbca}$, then $\mathrm{SUS}_2 = S[1,2] = \texttt{ab}$, which is $\mathrm{LSUS}_1$. Example 2: If $S = \texttt{abcbc}$, then $\mathrm{SUS}_2 = S[1,2] = \texttt{ab}$, which is an extension of $\mathrm{LSUS}_1 = S[1]$ to location 2.

By Lemma 3.1, we know $\mathrm{SUS}_k$ is either an LSUS or an extension of an LSUS, and the starting location of that LSUS must be on or before location $k$. Then the algorithm for finding $\mathrm{SUS}_k$ for any given string location $k$ is simply to calculate $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$, if existing, using Lemma 2.1. During this calculation, if any LSUS does not cover the location $k$, we simply extend that LSUS up to location $k$. We will pick the shortest one among all the LSUSes or their up-to-$k$ extensions as $\mathrm{SUS}_k$. We resolve ties by picking the leftmost one. This procedure can stop early if it finds an LSUS that does not exist, because that indicates that none of the remaining LSUSes exist either. Algorithm 1 gives the pseudocode of this procedure, where we represent $\mathrm{SUS}_k$ by its two attributes: start and length, the starting location and the length of $\mathrm{SUS}_k$, respectively.

Lemma 3.2. Given a string location $k$ and the rank and the lcp array of the string $S$, Algorithm 1 can find $\mathrm{SUS}_k$ using $O(k)$ time. If there are multiple candidates for $\mathrm{SUS}_k$, the leftmost one is returned.

Proof. The procedure starts with the candidate $S[1 \ldots n]$, which is indeed unique (Line 1). Then the For loop calculates $\mathrm{LSUS}_i$ for $i = 1, 2, \ldots, k$ (Lemma 2.1). If $\mathrm{LSUS}_i$ exists (Line 4) and the length of $\mathrm{LSUS}_i$ or its up-to-$k$ extension is less than the length of the current best candidate (Line 5), then we will pick that $\mathrm{LSUS}_i$ or its up-to-$k$ extension as the new candidate for $\mathrm{SUS}_k$. This also resolves the possible ties by picking the leftmost candidate. In the end of the procedure, we will have the shortest one among $\mathrm{LSUS}_1 \ldots \mathrm{LSUS}_k$ or their up-to-$k$ extensions, and that is $\mathrm{SUS}_k$. Early stop is made at Line 7 if the LSUS being calculated does not exist, because that means all the remaining LSUSes to be calculated do not exist either. Each step in the For loop costs $O(1)$ time and the loop executes no more than $k$ steps, so the procedure takes a total of $O(k)$ time. □

Algorithm 1: Find SUS_k. Return the leftmost one if k has multiple SUSes.

    Input: The location index k, and the rank array and the lcp array of the string S
    Output: SUS_k. The leftmost one will be returned if k has multiple SUSes.

    1  start ← 1; length ← n;    // Start location and length of the best candidate for SUS_k.
    2  for i = 1, ..., k do
    3      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
    4      if i+L ≤ n then       // LSUS_i exists.
               /* Extend LSUS_i up to k if needed. Resolve the tie by picking the leftmost SUS. */
    5          if max{L+1, k−i+1} < length then
    6              start ← i; length ← max{L+1, k−i+1};
    7      else break;           // Early stop.
    8  Print SUS_k ← (start, length);
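A minimal Python sketch of Algorithm 1 follows. It assumes a naive suffix sort for building Rank and LCP (the helper `rank_lcp` is hypothetical and not linear time), so only the loop in `find_sus` reflects the $O(k)$ procedure described above. Results are returned as `(start, length)` with 1-based locations.

```python
def rank_lcp(s):
    """Naively build 1-indexed Rank and LCP lists (index 0 is padding)."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lcp = [0] * (n + 2)  # LCP[1] = LCP[n+1] = 0
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        while lcp[j] < min(len(a), len(b)) and a[lcp[j]] == b[lcp[j]]:
            lcp[j] += 1
    return rank, lcp


def find_sus(s, k):
    """Leftmost SUS covering location k, as (start, length)."""
    n = len(s)
    rank, lcp = rank_lcp(s)
    start, length = 1, n              # the whole string is always a candidate
    for i in range(1, k + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:                 # LSUS_i does not exist: early stop
            break
        cand = max(L + 1, k - i + 1)  # LSUS_i, extended up to k if needed
        if cand < length:             # strict '<' keeps the leftmost on ties
            start, length = i, cand
    return start, length


print(find_sus("abcbb", 2))         # (1, 2), i.e. "ab"
print(find_sus("mississippi", 11))  # (10, 2), i.e. "pi"
```

The strict comparison in the candidate update is what implements the leftmost tie-breaking rule.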

Algorithm 2: Find all the SUSes covering a given location k.

    Input: The location index k, and the rank array and the lcp array of the string S
    Output: All the SUSes covering location k.

     1  start ← 1; length ← n;    // Start location and length of the best candidate for SUS_k.
        /* Find the length of SUS_k. */
     2  for i = 1, ..., k do
     5          if max{L+1, k−i+1} < length then    // Extend LSUS_i to location k if necessary.
     6              start ← i; length ← max{L+1, k−i+1};
        /* Find all SUSes covering location k. */
     8  for i = 1, ..., k do
    11          if max{L+1, k−i+1} = length then    // Extend LSUS_i to location k if necessary.
    12              Print (i, max{L+1, k−i+1});
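The idea of Algorithm 2 can be sketched in Python as below. Note one deliberate simplification that is an assumption of this sketch, not the listing above: instead of two passes over $i = 1, \ldots, k$, it collects the candidate lengths in one pass and filters them afterwards, which costs $O(k)$ extra memory. The helper names are hypothetical and the suffix sort is naive.

```python
def rank_lcp(s):
    """Naively build 1-indexed Rank and LCP lists (index 0 is padding)."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lcp = [0] * (n + 2)  # LCP[1] = LCP[n+1] = 0
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        while lcp[j] < min(len(a), len(b)) and a[lcp[j]] == b[lcp[j]]:
            lcp[j] += 1
    return rank, lcp


def find_all_sus(s, k):
    """All SUSes covering location k, as (start, length) pairs, leftmost first."""
    n = len(s)
    rank, lcp = rank_lcp(s)
    cands = []
    for i in range(1, k + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:  # LSUS_i and all later LSUSes do not exist
            break
        cands.append((i, max(L + 1, k - i + 1)))
    # LSUS_1 always exists, so cands is never empty.
    best = min(length for _, length in cands)
    return [(i, length) for i, length in cands if length == best]


print(find_all_sus("abcbb", 2))  # [(1, 2), (2, 2)], i.e. "ab" and "bc"
```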

Theorem 3.1. For any location $k$ in the string $S$, we can find $\mathrm{SUS}_k$ using $O(n)$ time and space. If there are multiple candidates for $\mathrm{SUS}_k$, the leftmost one is returned.

Proof. The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (for example, [9]). After the suffix array is constructed, the rank array (the inverse suffix array) can be trivially created using another $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [8]. Combined with the time cost of Algorithm 1 (Lemma 3.2), the total time cost for finding $\mathrm{SUS}_k$ for any location $k$ in the string $S$ of size $n$ is $O(n)$, with a total of $O(n)$ space usage. If multiple candidates for $\mathrm{SUS}_k$ exist, the leftmost candidate will be returned, as is provided by Algorithm 1 (Lemma 3.2). □

3.1. Extension: finding all SUSes for one location

It is trivial to extend Algorithm 1 to find all the SUSes covering a particular location $k$ as follows. We can first use Algorithm 1 to find the leftmost $\mathrm{SUS}_k$. Then we start over again to re-calculate $\mathrm{LSUS}_1 \ldots \mathrm{LSUS}_k$ or their up-to-$k$ extensions, and return all of those whose length is equal to the length of $\mathrm{SUS}_k$. Algorithm 2 shows the pseudocode. This procedure clearly costs an extra $O(k)$ time. Combining the results from Theorem 3.1, we get the following theorem.

Theorem 3.2. For any location $k$ in the string $S$, we can find all the SUSes covering location $k$ using $O(n)$ time and space.

4. SUS finding for every location

In this section, we want to find $\mathrm{SUS}_k$ for every location $k = 1, 2, \ldots, n$. If $k$ has multiple SUSes, the leftmost one will be returned. In the end, we will show an extension to find all SUSes for every location.

A natural solution is to iteratively use Algorithm 1 as a subroutine to find every $\mathrm{SUS}_k$, for $k = 1, 2, \ldots, n$. However, the total time cost of this solution will be $O(n) + \sum_{k=1}^{n} O(k) = O(n^2)$, where $O(n)$ captures the time cost for the construction of the rank array and the lcp array and $\sum_{k=1}^{n} O(k)$ is the total time cost for the $n$ instances of Algorithm 1. We want to have a solution that costs a total of $O(n)$ time and space, which implies that the amortized cost for finding each SUS is $O(1)$.

By Lemma 3.1, we know that every SUS must be an LSUS or an extension of an LSUS. The next Lemma 4.1 further says that if $\mathrm{SUS}_k$ is an extension of an LSUS, it has some special properties and can be quickly obtained from $\mathrm{SUS}_{k-1}$.

Lemma 4.1. For any $k \in \{2, 3, \ldots, n\}$, if $\mathrm{SUS}_k$ is an extension of an LSUS, then (1) $\mathrm{SUS}_{k-1}$ must be a substring whose right boundary is the character $S[k-1]$, and (2) $\mathrm{SUS}_k$ is the substring $\mathrm{SUS}_{k-1}$ appended by the character $S[k]$.

Proof. Because $\mathrm{SUS}_k$ is an extension of an LSUS, we have $\mathrm{SUS}_k = S[i \ldots k]$ for some $i < k$ and $\mathrm{LSUS}_i = S[i \ldots j]$ for some $j < k$. We also know $S[i \ldots k-1]$ is unique, because the unique substring $S[i \ldots j]$ is a prefix of $S[i \ldots k-1]$. Note that any substring starting from a location before $i$ and covering location $k-1$ is longer than the unique substring $S[i \ldots k-1]$, so $\mathrm{SUS}_{k-1}$ must be starting from a location between $i$ and $k-1$, inclusive.

Next, we show $\mathrm{SUS}_{k-1}$ actually must start at location $i$. The fact $\mathrm{SUS}_k = S[i \ldots k]$ tells us that $|\mathrm{LSUS}_t| \ge |\mathrm{SUS}_k| = k - i + 1$ for every $t = i+1, i+2, \ldots, k$; otherwise, any $\mathrm{LSUS}_t$ that is shorter than $k - i + 1$ would be a better candidate than $S[i \ldots k]$ as $\mathrm{SUS}_k$. That means any unique substring starting from $t = i+1, i+2, \ldots, k-1$ has a length of at least $k - i + 1$. However, $|S[i \ldots k-1]| = k - i < k - i + 1$ and $S[i \ldots k-1]$ is unique already and covers location $k-1$ as well, so $S[i \ldots k-1]$ is the only candidate for $\mathrm{SUS}_{k-1}$. This also means $\mathrm{SUS}_k$ is indeed the substring $\mathrm{SUS}_{k-1}$ appended by $S[k]$. □

4.1. The overall strategy

We are ready to present the overall strategy for finding the SUS of every location, by using Lemmas 3.1 and 4.1. We will calculate all the SUSes in the order of $\mathrm{SUS}_1, \mathrm{SUS}_2, \ldots, \mathrm{SUS}_n$. That means when we want to calculate $\mathrm{SUS}_k$, $k \ge 2$, we have had $\mathrm{SUS}_{k-1}$ calculated already. Note that $\mathrm{SUS}_1 = \mathrm{LSUS}_1$, which is easy to calculate using Lemma 2.1. Now let's look at the calculation of a particular $\mathrm{SUS}_k$, $k \ge 2$. By Lemma 3.1, we know $\mathrm{SUS}_k$ is either an LSUS or an extension of an LSUS. By Lemma 4.1, we also know that if $\mathrm{SUS}_k$ is an extension of an LSUS, then the right boundary of $\mathrm{SUS}_{k-1}$ must be $S[k-1]$ and $\mathrm{SUS}_k$ is just $\mathrm{SUS}_{k-1}$ appended by the character $S[k]$. Suppose that when we want to calculate $\mathrm{SUS}_k$, we have already calculated the shortest LSUS covering location $k$ or have known the fact that no LSUS covers location $k$. Then, by using $\mathrm{SUS}_{k-1}$, which has been calculated by then, and the shortest LSUS covering location $k$, we will be able to calculate $\mathrm{SUS}_k$ as follows:

Case 1: If the right boundary of $\mathrm{SUS}_{k-1}$ is not $S[k-1]$, then we know $\mathrm{SUS}_k$ cannot be an extension of an LSUS (the contrapositive of Lemma 4.1). Thus, $\mathrm{SUS}_k$ is just the shortest LSUS covering location $k$, which must exist in this case.

Case 2: If the right boundary of $\mathrm{SUS}_{k-1}$ is $S[k-1]$, then $\mathrm{SUS}_k$ may or may not be an extension of an LSUS. We will consider two possibilities: (1) If the shortest LSUS covering location $k$ exists, we will compare its length with $|\mathrm{SUS}_{k-1}| + 1$, and pick the shorter one as $\mathrm{SUS}_k$. If both have the same length, we resolve the tie by picking the one whose starting location index is smaller. (2) If no LSUS covers location $k$, $\mathrm{SUS}_k$ will just be $\mathrm{SUS}_{k-1}$ appended by $S[k]$.

Therefore, the real challenge here, by the time we want to calculate $\mathrm{SUS}_k$, $k \ge 2$, is to ensure that we would have already calculated the shortest LSUS covering location $k$ or we would have already known the fact that no LSUS covers location $k$. If there exist multiple shortest LSUSes covering location $k$, we would like to know the leftmost one.

4.2. Preparation

We now focus on the calculation of the leftmost shortest LSUS covering every string location $k$, denoted by $\mathrm{SLS}_k$. Let $\mathrm{Candidate}^k_i$ denote the leftmost shortest one among those of $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ that exist and cover location $i$. For an arbitrary $k$, $1 \le k \le n$, $\mathrm{SLS}_k$ may not exist, because the location $k$ may not be covered by any LSUS at all. For example, if $S = \texttt{abcabc}$, then locations 5 and 6 are not covered by any LSUS, and thus $\mathrm{SLS}_5$ and $\mathrm{SLS}_6$ do not exist. However, if $\mathrm{SLS}_k$ exists, by the definition of SLS and Candidate, we have the following fact.

Fact 4.1. $\mathrm{SLS}_k = \mathrm{Candidate}^k_k = \mathrm{Candidate}^{k+1}_k = \cdots = \mathrm{Candidate}^n_k$, if $\mathrm{SLS}_k$ exists.

Our goal is to ensure $\mathrm{SLS}_k$ will have been known when we want to calculate $\mathrm{SUS}_k$, so we calculate every $\mathrm{SLS}_k$ following the same order $k = 1, 2, \ldots, n$ in which we calculate all SUSes. Because we need to know every $\mathrm{LSUS}_i$, $i \le k$, in order to calculate $\mathrm{SLS}_k$ (Fact 4.1), we will walk through the string locations $k = 1, 2, \ldots, n$: at each walk step $k$, we calculate $\mathrm{LSUS}_k$ and maintain $\mathrm{Candidate}^k_i$ for every string location $i$ that has been covered by at least one of $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$. Note that $\mathrm{Candidate}^k_i = \mathrm{SLS}_i$ for every $i \le k$ (Fact 4.1). Those $\mathrm{Candidate}^k_i$ with $i \le k$ would have already been used as $\mathrm{SLS}_i$ in the calculation of $\mathrm{SUS}_i$. So, after each walk step $k$, we will only need to maintain the candidates for locations after $k$.

Lemma 4.2. (1) $\mathrm{LSUS}_1$ always exists. (2) If $\mathrm{LSUS}_k$ exists, then $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$ all exist. (3) If $\mathrm{LSUS}_k$ does not exist, then none of $\mathrm{LSUS}_k, \mathrm{LSUS}_{k+1}, \ldots, \mathrm{LSUS}_n$ exist.

Proof. (1) $\mathrm{LSUS}_1$ must exist, because the string $S$ can be $\mathrm{LSUS}_1$ if every proper prefix of $S$ is a repeat. (2) If $\mathrm{LSUS}_k$ exists, say $\mathrm{LSUS}_k = S[k \ldots \gamma_k]$, then $\mathrm{LSUS}_i$ exists for every $i \le k$, because at least $S[i \ldots \gamma_k]$ is unique due to the fact that $S[k \ldots \gamma_k]$ is unique and also is a suffix of $S[i \ldots \gamma_k]$. (3) If $\mathrm{LSUS}_k$ does not exist, it means $S[k \ldots n]$ is a repeat, and thus every suffix $S[i \ldots n]$ of $S[k \ldots n]$ for $i \ge k$ is also a repeat, i.e., $\mathrm{LSUS}_i$ does not exist for every $i \ge k$. □

The next lemma shows that the right boundary of $\mathrm{LSUS}_i$ will be on or after the right boundary of $\mathrm{LSUS}_{i-1}$, if $\mathrm{LSUS}_i$ exists.

Lemma 4.3. For each $i = 2, 3, \ldots, n$: $|\mathrm{LSUS}_i| \ge |\mathrm{LSUS}_{i-1}| - 1$.

Proof. We prove the lemma by contradiction. Suppose $\mathrm{LSUS}_{i-1} = S[i-1 \ldots j]$ for some $j$, $i - 1 \le j \le n$. If $|\mathrm{LSUS}_i| < |\mathrm{LSUS}_{i-1}| - 1$, it means $\mathrm{LSUS}_i = S[i \ldots k]$, where $i \le k < j$. Because $S[i \ldots k]$ is unique, $S[i-1 \ldots k]$ is also unique, whose length however is shorter than $S[i-1 \ldots j]$. This is a contradiction because $S[i-1 \ldots j]$ is already $\mathrm{LSUS}_{i-1}$. Thus, the claim in the lemma is true. □
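Lemma 4.3 is equivalent to saying that the right boundaries $i + L_i$ of the existing LSUSes never decrease as $i$ grows. The short check below illustrates this on the running example; it is an empirical illustration under a naive suffix sort, not a proof, and the names `rank_lcp` and `right_boundaries` are hypothetical.

```python
def rank_lcp(s):
    """Naively build 1-indexed Rank and LCP lists (index 0 is padding)."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lcp = [0] * (n + 2)  # LCP[1] = LCP[n+1] = 0
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        while lcp[j] < min(len(a), len(b)) and a[lcp[j]] == b[lcp[j]]:
            lcp[j] += 1
    return rank, lcp


def right_boundaries(s):
    """Right boundary i + L_i of every existing LSUS_i, in order of i."""
    n = len(s)
    rank, lcp = rank_lcp(s)
    ends = []
    for i in range(1, n + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        if i + L > n:  # by Lemma 4.2, no later LSUS exists either
            break
        ends.append(i + L)
    return ends


ends = right_boundaries("mississippi")
print(ends)                  # [1, 6, 6, 6, 9, 9, 9, 9, 10, 11]
print(ends == sorted(ends))  # True: the boundaries are non-decreasing
```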

Now let's look at the situation at the end of the $k$th walk step. By then, we have calculated $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_k$. By Lemma 4.2, we know that there exists some $k'$, $1 \le k' \le k$, such that $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$ all exist, but $\mathrm{LSUS}_{k'+1} \ldots \mathrm{LSUS}_k$ do not exist. If $k' = k$, that means $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ all exist. Let $\gamma_k$ denote the right boundary of $\mathrm{LSUS}_{k'}$, i.e., $\mathrm{LSUS}_{k'} = S[k' \ldots \gamma_k]$. By Lemma 4.3, we know $\gamma_k$ is also the right boundary of the string locations covered by $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$. So, every location $1, 2, \ldots, \gamma_k$ is covered by at least one LSUS from $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k'}$. That is, at the end of the $k$th walk step: (1) every location $j = 1, \ldots, \gamma_k$ has its candidate $\mathrm{Candidate}^k_j$ calculated already. (2) If $\gamma_k < n$, every location $j = \gamma_k + 1, \ldots, n$ still does not have its candidate calculated, because every such location $j$ has not been covered by any LSUS from $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_k$ that we have calculated at the end of the $k$th walk step.

Lemma 4.4. At the end of the $k$th walk step, if $\gamma_k > k$, then for any $i$ and $j$, $k \le i < j \le \gamma_k$, $\mathrm{Candidate}^k_j$ also covers location $i$.

Proof. $\mathrm{Candidate}^k_j$ is a substring starting somewhere on or before $k$ and going through the location $j$. Because $k \le i < j$, it is obvious that $\mathrm{Candidate}^k_j$ goes through location $i$. □

Lemma 4.5. At the end of the $k$th walk step, if $\gamma_k > k$, then

$$|\mathrm{Candidate}^k_k| \le |\mathrm{Candidate}^k_{k+1}| \le \cdots \le |\mathrm{Candidate}^k_{\gamma_k}|$$

Proof. By Lemma 4.4, we know $\mathrm{Candidate}^k_j$ also covers location $i$, for any $i$ and $j$, $k \le i < j \le \gamma_k$. Thus, if $|\mathrm{Candidate}^k_j| < |\mathrm{Candidate}^k_i|$, location $i$'s current candidate should be replaced by location $j$'s candidate, because that gives location $i$ a shorter candidate. However, the current candidate for location $i$ is already the shortest candidate. It is a contradiction. So, $|\mathrm{Candidate}^k_i| \le |\mathrm{Candidate}^k_j|$, which proves the lemma. □

4.3. Finding SLS for every location

Invariant. We calculate $\mathrm{SLS}_k$ for $k = 1, 2, \ldots, n$ by maintaining the following invariant at the end of every walk step $k$:

(A) If $\gamma_k > k$, locations $\{k+1, k+2, \ldots, \gamma_k\}$ will be cut into chunks, such that:
(A.1) All locations in one chunk have the same candidate.
(A.2) Locations belonging to different chunks have different candidates.
(A.3) Each chunk will be represented by a linked list node of four fields: ChunkStart, ChunkEnd, start, length, respectively representing the start and end location of the chunk and the start and length of the candidate shared by all locations of the chunk.
(A.4) All nodes representing different chunks will be connected into a linked list, ordered by the string positions of the corresponding chunks. The linked list has a head and a tail, referring to the two nodes that represent the lowest positioned chunk and the highest positioned chunk.

(B) If $\gamma_k \le k$, the linked list is empty.

Maintenance of the invariant. We describe in an inductive manner the procedure that maintains the invariant. Algorithm 3 shows the pseudocode of the procedure. We start with an empty linked list.

Base step: $k = 1$. We are walking the first step. We first calculate $\mathrm{LSUS}_1$ using Lemma 2.1. We know $\mathrm{LSUS}_1$ must exist. Let's say $\mathrm{LSUS}_1 = S[1 \ldots \gamma_1]$ for some $\gamma_1 \le n$. Then, $\mathrm{Candidate}^1_i = \mathrm{LSUS}_1$ for every $i = 1, 2, \ldots, \gamma_1$. We record all these candidates by using a single node $(1, \gamma_1, 1, \gamma_1)$. This is the only node in the linked list and is pointed to by both head and tail. We know $\mathrm{SLS}_1 = \mathrm{Candidate}^1_1$ (Fact 4.1), so we return $\mathrm{SLS}_1$ by returning (head.start, head.length) $= (1, \gamma_1)$. We then change head.ChunkStart from 1 to 2. If it turns out that head.ChunkEnd $= \gamma_1 < 2$, meaning $\mathrm{LSUS}_1$ really covers location 1 only, we delete the head node from the linked list, which will then become empty.

Inductive step: $k \ge 2$. We are walking the $k$th step. We first calculate $\mathrm{LSUS}_k$ using Lemma 2.1.

Algorithm 3: The sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) returns SLS_1, SLS_2, ..., SLS_n, if the corresponding SLS exists; otherwise, null will be returned.

     1  Construct Rank[1...n] and LCP[1...n+1] of the string S;
     2  Initialize an empty List;    // Each node's 4 fields: ChunkStart, ChunkEnd, start, length.
     3  head ← 0; tail ← 0;          // Reference to the head and tail node of the List
     4  FindSLS(k)
            /* Process LSUS_k, if it exists. */
     5      L ← max{LCP[Rank[k]], LCP[Rank[k]+1]};
     6      if k+L ≤ n then          // LSUS_k exists.
                // Add a new list element at the tail, if necessary.
     7          if head = 0 then List[1] ← (k, k+L, k, L+1); head ← 1; tail ← 1;    // List was empty.
     8          else if k+L > List[tail].ChunkEnd then
     9              tail++; List[tail] ← (List[tail−1].ChunkEnd+1, k+L, k, L+1);
                /* Update candidates and merge the nodes whose candidates can be shorter.
                   Resolve the tie by picking the leftmost one. */
    10          j ← tail;
    11          while j ≥ head and List[j].length > L+1 do j−−;
    12          if j < tail then List[j+1] ← (List[j+1].ChunkStart, List[tail].ChunkEnd, k, L+1); tail ← j+1;
    13      if head ≠ 0 then SLS_k ← (List[head].start, List[head].length);    // The list is not empty.
    14      else SLS_k ← (null, null);    // SLS_k does not exist.
            /* Discard the information about location k from the List. */
    15      if head > 0 then         // List is not empty
    16          if List[head].ChunkEnd ≤ k then
    17              head++;          // Delete the current head node
    18              if head > tail then head ← 0; tail ← 0;    // List becomes empty
    19          else List[head].ChunkStart ← k+1;
    20      return SLS_k

• Case 1: $\mathrm{LSUS}_k$ does not exist. (1) If head does not exist, it means that location $k$ is covered neither by any of $\mathrm{LSUS}_1, \ldots, \mathrm{LSUS}_{k-1}$ nor by $\mathrm{LSUS}_k$, so $\mathrm{SLS}_k$ simply does not exist and we will return (null, null). (2) If head exists, we will return (head.start, head.length) as $\mathrm{SLS}_k$ and then remove the information about location $k$ from the head by setting head.ChunkStart $= k + 1$. If it turns out that head.ChunkEnd < head.ChunkStart, we will remove the head node.

• Case 2: $\mathrm{LSUS}_k$ exists and $\mathrm{LSUS}_k = S[k \ldots \gamma_k]$, $\gamma_k \le n$. By Lemma 4.2, we know $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$ all exist. Let $\gamma_{k-1}$ denote the right boundary of the locations covered by $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$. By Lemma 4.3, we know $\gamma_k \ge \gamma_{k-1}$ and $\gamma_{k-1}$ is also the right boundary of $\mathrm{LSUS}_{k-1}$, i.e., $\mathrm{LSUS}_{k-1} = S[k-1 \ldots \gamma_{k-1}]$. Note that both $\gamma_{k-1} < k$ and $\gamma_{k-1} \ge k$ are possible.

1. If head does not exist, it means $\gamma_{k-1} < k$ and none of the locations $\{k \ldots \gamma_k\}$ is covered by any of $\mathrm{LSUS}_1, \mathrm{LSUS}_2, \ldots, \mathrm{LSUS}_{k-1}$. We will insert a new node $(k, \gamma_k, k, \gamma_k - k + 1)$, which will be the only node in the linked list.

2. If head exists, it means $\gamma_{k-1} \ge k$. If $\gamma_k >$ tail.ChunkEnd $= \gamma_{k-1}$, we first insert at the tail side a new linked list node (tail.ChunkEnd $+ 1$, $\gamma_k$, $k$, $\gamma_k - k + 1$) to record the candidate information for locations in the chunk after $\gamma_{k-1}$ through $\gamma_k$.

Then, we will travel through the nodes in the linked list from the tail side toward the head. We stop when we meet a node whose candidate is shorter than or equal to $\mathrm{LSUS}_k$ or when we reach the head end of the linked list. We will merge all the nodes whose candidates are longer than $\mathrm{LSUS}_k$ into a new linked list node. The chunk covered by the new node is the union of the chunks covered by the merged nodes, and the candidate of the new node is $\mathrm{LSUS}_k$.

This travel and merge process is valid because of Lemma 4.5. This merge process ensures every location maintains its best (shortest) candidate by the end of every walk step. It also resolves the possible ties of multiple shortest LSUSes covering a particular location by picking the leftmost one as that location's candidate, because the merge process does not merge nodes whose candidates are of the same length.

We will return (head.start, head.length) as $\mathrm{SLS}_k$, since $\mathrm{Candidate}^k_k = \mathrm{SLS}_k$ (Fact 4.1). Finally, we will remove the information about location $k$ from the head by setting head.ChunkStart $= k + 1$. We will remove the head node if it turns out that head.ChunkEnd < head.ChunkStart.
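The walk described in this subsection can be sketched in Python with a double-ended queue standing in for the linked list. This is an illustrative sketch under stated assumptions rather than the authors' implementation: `rank_lcp` uses a naive suffix sort, `all_sls` is a hypothetical name, and each chunk is stored as a list `[ChunkStart, ChunkEnd, start, length]`. Merging pops nodes from the tail while their candidates are strictly longer than the new LSUS, so ties keep the leftmost candidate.

```python
from collections import deque


def rank_lcp(s):
    """Naively build 1-indexed Rank and LCP lists (index 0 is padding)."""
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for j in range(1, n + 1):
        rank[sa[j]] = j
    lcp = [0] * (n + 2)  # LCP[1] = LCP[n+1] = 0
    for j in range(2, n + 1):
        a, b = s[sa[j - 1] - 1:], s[sa[j] - 1:]
        while lcp[j] < min(len(a), len(b)) and a[lcp[j]] == b[lcp[j]]:
            lcp[j] += 1
    return rank, lcp


def all_sls(s):
    """Entry k-1 is SLS_k as (start, length), or None if no LSUS covers k."""
    n = len(s)
    rank, lcp = rank_lcp(s)
    chunks = deque()  # nodes: [ChunkStart, ChunkEnd, start, length]
    result = []
    for k in range(1, n + 1):
        L = max(lcp[rank[k]], lcp[rank[k] + 1])
        if k + L <= n:  # LSUS_k = S[k..k+L] exists
            end = k + L
            if not chunks:
                chunks.append([k, end, k, L + 1])
            elif end > chunks[-1][1]:
                # Newly covered locations get LSUS_k as their candidate.
                chunks.append([chunks[-1][1] + 1, end, k, L + 1])
            # Merge tail nodes whose candidates are strictly longer than LSUS_k.
            last_end = chunks[-1][1]
            merged_start = None
            while chunks and chunks[-1][3] > L + 1:
                merged_start = chunks.pop()[0]
            if merged_start is not None:
                chunks.append([merged_start, last_end, k, L + 1])
        # The head chunk, if any, starts at location k and holds SLS_k.
        result.append((chunks[0][2], chunks[0][3]) if chunks else None)
        # Discard the information about location k.
        if chunks:
            if chunks[0][1] <= k:
                chunks.popleft()
            else:
                chunks[0][0] = k + 1
    return result


print(all_sls("abcabc"))  # [(1, 4), (2, 3), (3, 2), (3, 2), None, None]
```

Each node is appended once and popped at most once, which is the counting argument behind the amortized constant cost per walk step.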

Lemma 4.6. Given the lcp array and the rank array of $S$, the sequence of FindSLS(1), FindSLS(2), ..., FindSLS(n) function calls will return $\mathrm{SLS}_1, \mathrm{SLS}_2, \ldots, \mathrm{SLS}_n$, if existing. The amortized time cost of one FindSLS() function call is $O(1)$.

Proof. The correctness of Algorithm 3 is already given in the description of the procedure that maintains the invariant. All operations in an instance of the FindSLS() function call clearly take $O(1)$ time, except the while loop at Line 11, which is to merge linked list nodes whose candidates can be shorter. Thus, the lemma will be proved, if we can prove that the amortized number of linked list nodes that will be merged via that while loop is also bounded by a constant. Note that no node in the linked list ever splits due to Lemma 4.3. In the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n), there are at most $n$ linked list nodes to be merged. We know the number of merge operations in merging $n$ nodes into one node (in the worst case) is no more than $O(n)$. Therefore, the amortized time cost on merging the linked list nodes in one FindSLS() function call over the sequence of $n$ function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) is $O(1)$. This finishes the proof of the lemma. □

Algorithm 4: Finding the leftmost SUS_k, k = 1, ..., n.

    1  for k ← 1 ... n do
    2      (start, length) ← FindSLS(k);    // SLS_k; it is (null, null) if SLS_k does not exist.
    3      if k = 1 then Print SUS_k ← (start, length);
    4      else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k−1 then Print SUS_k ← (start, length);
    5      else if (start, length) = (null, null) then Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length+1);
    6      else if length < SUS_{k−1}.length + 1 then Print SUS_k ← (start, length);
    7      else    // Resolve the tie by picking the leftmost one.
    8          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length+1)

4.4. Finding the leftmost SUS for every location

Once we are able to sequentially calculate every $SLS_k$ or detect that it does not exist, we are ready to calculate every $SUS_k$ by using the strategy described in Section 4.1. Algorithm 4 gives the pseudocode of the procedure. It calculates the SUSes in the order $SUS_1, SUS_2, \ldots, SUS_n$ (Line 1). For each location $k$, the function call at Line 2 calculates $SLS_k$ or finds that $SLS_k$ does not exist. Line 3 handles the special case where $SUS_1 = LSUS_1 = SLS_1$. The condition at Line 4 shows that $SUS_k$ cannot be an extension of an LSUS (Lemma 4.1), so $SUS_k = SLS_k$, which must exist in this case. Line 5 handles the case where $SLS_k$ does not exist, so $SUS_k$ must be $SUS_{k-1}$ appended by $S[k]$. Line 6 handles the case where $SLS_k$ is shorter than the one-character extension of $SUS_{k-1}$, so $SUS_k$ is $SLS_k$. Lines 7–8 handle the case where $SLS_k$ is at least as long as the one-character extension of $SUS_{k-1}$, so $SUS_k$ is $SUS_{k-1}$ appended by $S[k]$. This also resolves the tie by picking the leftmost one if $k$ is covered by multiple SUSes.
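The case analysis of Algorithm 4 can be tried out on small strings with the following minimal Python sketch. It is not the authors' implementation: the helper names (occurrences, lsus_table, find_sls, leftmost_sus) are hypothetical, and find_sls is a brute-force stand-in for FindSLS(k) that assumes $SLS_k$ is the leftmost shortest LSUS covering location $k$, so the sketch takes far more than linear time. Positions are 1-based, as in the text.

```python
def occurrences(s, sub):
    """Count (possibly overlapping) occurrences of sub in s."""
    return sum(s.startswith(sub, i) for i in range(len(s) - len(sub) + 1))


def lsus_table(s):
    """Brute-force LSUS_i for i = 1..n as (start, length), or None if absent."""
    n = len(s)
    table = [None] * (n + 1)  # index 0 unused so that positions are 1-based
    for i in range(1, n + 1):
        for length in range(1, n - i + 2):
            if occurrences(s, s[i - 1:i - 1 + length]) == 1:
                table[i] = (i, length)
                break
    return table


def find_sls(lsus, k):
    """Brute-force stand-in for FindSLS(k) (assumption, not Algorithm 3)."""
    best = None
    for i in range(1, k + 1):
        cand = lsus[i]
        if cand is not None and i + cand[1] - 1 >= k:
            if best is None or cand[1] < best[1]:  # strict '<' keeps the leftmost
                best = cand
    return best


def leftmost_sus(s):
    """Apply the case analysis of Algorithm 4; returns [SUS_1, ..., SUS_n]."""
    n = len(s)
    lsus = lsus_table(s)
    sus = [None] * (n + 1)
    for k in range(1, n + 1):
        sls = find_sls(lsus, k)
        if k == 1:                                 # Line 3
            sus[k] = sls
            continue
        prev_start, prev_len = sus[k - 1]
        extension = (prev_start, prev_len + 1)     # SUS_{k-1} appended by S[k]
        if prev_start + prev_len - 1 > k - 1:      # Line 4
            sus[k] = sls
        elif sls is None:                          # Line 5
            sus[k] = extension
        elif sls[1] < prev_len + 1:                # Line 6
            sus[k] = sls
        else:                                      # Lines 7-8: leftmost on ties
            sus[k] = extension
    return sus[1:]


print(leftmost_sus("abcbb"))
```

For $S$ = abcbb the sketch reports (1, 2) for location 2, that is, ab rather than bc, which matches the leftmost tie-breaking rule of Lines 7–8.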

Theorem 4.1. Algorithm 4 finds $SUS_1, SUS_2, \ldots, SUS_n$ of string $S$ using a total of $O(n)$ time and space. If any string location is covered by multiple SUSes, Algorithm 4 finds the leftmost one.

Proof. We can construct the suffix array of the string $S$ in a total of $O(n)$ time and space using existing algorithms (for example, [9]). The rank array is just the inverse suffix array and can be directly obtained from SA using $O(n)$ time and space. Then we can obtain the lcp array from the suffix array and rank array using another $O(n)$ time and space [8]. So the total time and space costs for preparing these auxiliary data structures are $O(n)$.
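The three auxiliary arrays can be illustrated with a short Python sketch. The assumptions are stated up front: the suffix array is built by naively sorting suffixes, which is not linear time and merely stands in for a linear-time construction such as [9]; the lcp array follows the idea of Kasai et al.'s method [8]; indices are 0-based here, unlike the 1-based text; and lcp[r] is taken to be the longest common prefix of the suffixes at ranks r − 1 and r, with lcp[0] = 0. All function names are hypothetical.

```python
def suffix_array(s):
    """Naive suffix sorting, for illustration only (not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    """Inverse suffix array: rank[i] is the position of suffix i in sa."""
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank


def lcp_kasai(s, sa, rank):
    """lcp[r] = length of the longest common prefix of suffixes sa[r-1] and sa[r]."""
    n = len(s)
    lcp = [0] * n
    h = 0  # matched length carried over from the previous text position
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0  # the smallest suffix has no predecessor in sa
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


sa = suffix_array("banana")
rank = rank_array(sa)
print(sa, rank, lcp_kasai("banana", sa, rank))
```

The rank array is obtained in a single pass over SA, and the lcp computation visits suffixes in text order so that the matched length h decreases by at most one per step, which is what yields the linear bound once SA itself is built in linear time.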

Time cost. The amortized time cost for each FindSLS function call at Line 2 in the sequence of function calls FindSLS(1), ..., FindSLS(n) is $O(1)$ (Lemma 4.6). The time cost for Lines 3–8 is also $O(1)$. There are a total of $n$ steps in the for loop, yielding a total of $O(n)$ time cost.

Space usage. The only space usage in our algorithm (in addition to the auxiliary data structures such as the suffix array, rank array, and lcp array, which cost a total of $O(n)$ space) is the dynamic linked list, which has no more than $n$ nodes at any time. Each node costs $O(1)$ space. Therefore, the linked list costs $O(n)$ space. Adding the space usage of the auxiliary data structures, we get that the total space usage of finding every SUS is $O(n)$.

Finding the leftmost SUS. For any particular location $k$, if one SUS covering location $k$ is an extension of an LSUS, we know by Lemma 4.1 that this SUS must be the substring $SUS_{k-1}$ appended by the letter $S[k]$. Clearly this SUS is the leftmost one among all the SUSes covering location $k$ and is guaranteed to be returned by Lines 7–8 in Algorithm 4. If all SUSes covering location $k$ are LSUSes, the leftmost one of those LSUSes is already guaranteed to be returned by Algorithm 3 (Lemma 4.6). □

4.5. Extension: finding all SUSes for every location

It is possible that a particular location can have multiple SUSes. For example, if $S$ = abcbb, then $SUS_2$ can be either $S[1,2]$ = ab or $S[2,3]$ = bc. Algorithm 4 only returns one of them and resolves the tie by picking the leftmost one. However, it is easy to modify Algorithm 4 to return all the SUSes of every location, without changing Algorithm 3.

Suppose a particular location $k$ is covered by multiple SUSes. We know that, at the end of the $k$th walk step but before the linked list update (at the end of Line 14 in Algorithm 3), the $SLS_k$ returned by Algorithm 3 is recorded by the head node and is the leftmost one among all the SUSes that are LSUSes and cover location $k$. Because every string location maintains its shortest candidate and due to Lemma 4.5, all the other SUSes that are LSUSes and cover location $k$ are recorded by the linked list nodes that immediately follow the head node. This is because, if those other SUSes were not being recorded, the location right after the head node's chunk would have a candidate longer than $SUS_k$ or would not have a candidate calculated yet, even though that location is indeed covered by an $SUS_k$ at the end of the $k$th walk step, which is a contradiction. The same argument can be made for the subsequent neighboring locations that are covered by an $SUS_k$.

Algorithm 5: Finding all choices of each SUS_k, for k = 1, ..., n.
1  for k ← 1 ... n do
2      flag ← 0; (start, length) ← FindSLS(k);  // SLS_k; (null, null) if SLS_k does not exist.
3      if k = 1 then
4          Print SUS_k ← (start, length);
5      else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k − 1 then
6          Print SUS_k ← (start, length); flag ← 1;
7      else if (start, length) = (null, null) then
8          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
9      else if length ≤ SUS_{k−1}.length + 1 then
10         Print SUS_k ← (start, length); flag ← 1;
11     else
12         Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
       /* Print out other SUSes that cover location k. */
13     if flag = 1 then
14         if SUS_{k−1}.length + 1 = SUS_k.length then
15             Print (SUS_{k−1}.start, SUS_{k−1}.length + 1);
16         j ← head;
17         while j > 0 and j ≤ tail do
               /* The List[j].start ≠ SUS_k.start condition is checked because the SUS from the head node may have been printed. */
18             if List[j].length = SUS_k.length and List[j].start ≠ SUS_k.start then
19                 Print (List[j].start, List[j].length); j ← j + 1;
20             else if List[j].start ≠ SUS_k.start then
21                 Break;

Therefore, finding all the SUSes covering location $k$ becomes easy: simply go through the linked list nodes from the head node toward the tail node and report all the candidates whose lengths are equal to the length of the $SUS_k$ that we have found. If the rightmost character of $SUS_{k-1}$ is $S[k-1]$ and the substring $SUS_{k-1}$ appended by $S[k]$ has the same length, that substring will be reported too. Algorithm 5 gives the pseudocode, where the flag is used to note in what cases it is possible to have multiple SUSes.

If flag is on, we will need to check the linked list nodes (Lines 17–21) as well as the one-letter extension of $SUS_{k-1}$ (Lines 14–15). The overall time and space cost of maintaining the linked list data structure (the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n)) is still $O(n)$. The time cost of reporting the SUSes covering a particular location becomes $O(occ)$, where $occ$ is the number of SUSes that cover that location. That gives us the following theorem.
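As a short worked step, the two costs above can be added up as follows. The symbol $T$ for the total running time is introduced here only for illustration, $occ_k$ denotes the number of SUSes covering location $k$, and the last equality uses the fact that every location is covered by at least one SUS:

```latex
T \;=\; \underbrace{O(n)}_{\text{FindSLS}(1),\ldots,\text{FindSLS}(n)}
  \;+\; \sum_{k=1}^{n} O(occ_k)
  \;=\; O(n + N) \;=\; O(N),
\qquad N = \sum_{k=1}^{n} occ_k \;\ge\; n .
```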

Theorem 4.2. Algorithm 5 finds all SUSes covering every location of a string of size $n$ using $O(n)$ space and $O(N)$ time, where $N = \sum_{k=1}^{n} occ_k$ and $occ_k \ge 1$ is the number of SUSes covering location $k$.
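To make the quantities $occ_k$ and $N$ concrete, the following Python sketch enumerates all SUSes covering every location by brute force. It only illustrates the expected output of all-SUS finding and is not Algorithm 5: it uses no linked list and runs in far more than $O(N)$ time. The function names is_unique and all_sus are hypothetical, and positions are 1-based.

```python
def is_unique(s, start, length):
    """True if the substring of s at 1-based start with this length occurs once."""
    sub = s[start - 1:start - 1 + length]
    return sum(s.startswith(sub, i) for i in range(len(s) - length + 1)) == 1


def all_sus(s):
    """For each location k = 1..n, list every SUS covering k as (start, length)."""
    n = len(s)
    result = []
    for k in range(1, n + 1):
        for length in range(1, n + 1):
            lo = max(1, k - length + 1)   # leftmost start that still covers k
            hi = min(k, n - length + 1)   # rightmost start that fits in s
            hits = [(i, length) for i in range(lo, hi + 1)
                    if is_unique(s, i, length)]
            if hits:                      # shortest length with a unique cover
                result.append(hits)
                break
    return result


per_location = all_sus("abcbb")
occ = [len(hits) for hits in per_location]
print(per_location)
print(occ, sum(occ))  # occ_k for each k, and N
```

For $S$ = abcbb, locations 2 and 4 are each covered by two SUSes (ab and bc, and cb and bb, respectively), so $N = 7$ while $n = 5$.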

5. Experiments

We have implemented our proposal, named IKXSUS, in C++,² using the libdivsufsort³ library for the suffix array construction and Kasai et al.'s method [8] to compute the lcp array. We have compared our work against Pei et al.'s RSUS [12] and Tsuruta et al.'s OSUS [14] implementations, on both one-SUS and all-SUS finding for every string location. Notice that OSUS also computes the suffix array using the same libdivsufsort package and computes the lcp array using Kasai et al.'s method.

RSUS was originally prepared with an R interface. We stripped off that R interface and built a standalone C++ executable for the sake of fair benchmarking. OSUS was originally developed in C++. We ran OSUS both with and without the -l option to compute a single leftmost SUS and all SUSes for every string location. In all three implementations, we also commented out the sections that print the results onto the screen and/or the disk as output, in order to better measure the algorithmic performance.

We ran the tests on a machine that has an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz processor with 8192 KB cache size and 16 GB memory. The operating system was Linux Mint 14. We used the Pizza&Chili corpus in the experiments by taking the first 1, 5, 10, 20, 50, 100, and 200 MBs of the largest dblp.xml, dna, English, and protein files. The results are shown in Figs. 1, 2, 3, and 4.

² Source code can be downloaded at: http://penguin.ewu.edu/~bojianxu/publications.
³ Available at: https://code.google.com/p/libdivsufsort.

Fig. 1. The processing speed of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.

Finding the leftmost SUS of every location (Figs. 1 and 2). It was not possible to run RSUS on longer strings, since RSUS requires more memory than our machine has, and thus only files of up to 20 MB were included in the RSUS benchmark. Compared to RSUS, we have observed that IKXSUS is on average more than 8 times faster and uses 20 times less memory. The experimental results also revealed that the difference between the processing speeds of OSUS and IKXSUS is negligible, but on average OSUS uses 4 times more memory than IKXSUS.

Finding all SUSes of every location (Figs. 3 and 4). In the experiments of all-SUS finding for every string location, RSUS was not included as it does not have this functionality. We have observed that OSUS uses less memory in the all-SUS finding than what it needs for one-SUS finding, while IKXSUS's memory cost does not change between the one-SUS and all-SUS finding. Overall, IKXSUS uses at least 2 times less memory space than OSUS and also marginally beats OSUS in terms of processing speed.

Although all three works have linear space complexity in both theory and experiments (note that the X axis in all figures uses log scale), IKXSUS and OSUS use significantly less memory space, due to the fact that these two works use simpler data structures rather than the suffix tree used by RSUS. On the other hand, although both IKXSUS and OSUS use the same set of data structures, such as the suffix array, rank array (inverse suffix array), and lcp array, and these arrays are computed via the same library (libdivsufsort for suffix array construction) and the same algorithm (Kasai et al.'s method [8] for lcp array construction), the peak memory usage of OSUS is much higher than that of IKXSUS. The difference stems from the different mechanisms these studies follow to compute the SUS. OSUS computes the SUS by using an additional array, which is named the meaningful minimal unique substring array. Thus, the space used for that additional data structure makes OSUS require more memory.

With respect to the processing speed, both IKXSUS and OSUS present stable running times on all dblp, dna, protein, and English texts and scale well on increasing sizes of the target data, conforming to their linear time complexity. On the other hand, RSUS exhibits its quadratic time complexity on all texts, and in particular its running time on English text is much longer than on the other text types. The speed-up of IKXSUS and OSUS against RSUS can be even more significant when the string size increases.

Fig. 2. The peak memory consumptions of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.

Fig. 4. The peak memory consumptions of OSUS and our proposal in finding all SUSes of every location on several strings of different sizes.

6. Conclusion

We proposed IKXSUS, an optimal linear-time and linear-space algorithm for shortest unique substring queries. Our algorithm significantly improves on RSUS, the original work on shortest unique substring queries proposed recently [12], both theoretically and empirically, in both space and time costs. Our work was developed independently, without knowledge of OSUS, another recent linear-time and linear-space solution [14] for SUS finding, and uses a different approach. In practice, IKXSUS uses significantly less memory than OSUS while maintaining nearly the same processing speed.

Acknowledgements

We acknowledge the authors of [12,14] for providing their source code.

References

[1] M. Crochemore, W. Rytter, Jewels of Stringology: Text Algorithms, World Scientific, 2003.

[2] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.

[3] B. Haubold, N. Pierstorff, F. Möller, T. Wiehe, Genome comparison without alignment using shortest unique substrings, BMC Bioinform. 6 (1) (2005) 123.

[4] X. Hu, J. Pei, Y. Tao, Shortest unique queries on strings, in: Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE), 2014, pp. 161–172.

[5] A.M. İleri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, in: Proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 2014, pp. 172–181.

[6] A.M. İleri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, http://arxiv.org/abs/1312.2738.

[7] L. Ilie, W.F. Smyth, Minimum unique substrings and maximum repeats, Fund. Inform. 110 (1–4) (2011) 183–195.

[8] T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, Linear-time longest-common-prefix computation in suffix arrays and its applications, in: Symposium on Combinatorial Pattern Matching, 2001, pp. 181–192.

[9] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, J. Discrete Algorithms 3 (2–4) (2005) 143–156.

[10] M.O. Külekci, J.S. Vitter, B. Xu, Efficient maximal repeat finding using the Burrows–Wheeler transform and wavelet tree, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2) (2012) 421–429.

[11] S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich, Reputer: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res. 29 (22) (2001) 4633–4642.

[12] J. Pei, W.C.H. Wu, M.Y. Yeh, On shortest unique substring queries, in: Proceedings of IEEE International Conference on Data Engineering (ICDE), 2013, pp. 937–948.

[13] W.F. Smyth, Computing regularities in strings: a survey, European J. Combin. 34 (1) (2013) 3–14.

[14] K. Tsuruta, S. Inenaga, H. Bannai, M. Takeda, Shortest unique substrings queries in optimal time, in: Proceedings of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2014, pp. 503–513.