Theoretical Computer Science
www.elsevier.com/locate/tcs
A simple yet time-optimal and linear-space algorithm for shortest unique substring queries ✩
Atalay Mert İleri a, M. Oğuzhan Külekci b, Bojian Xu c,∗,1

a Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, MA 02139, USA
b Department of Biomedical Engineering, Istanbul Medipol University, Turkey
c Department of Computer Science, Eastern Washington University, WA 99004, USA
Article info

Article history: Received 30 March 2014. Accepted 7 November 2014. Available online 13 November 2014. Communicated by G. Ausiello.

Keywords: Unique substring; Shortest unique substring; Repetitiveness; Regularity

Abstract
We revisit the problem of finding shortest unique substrings (SUS), proposed recently by Pei et al. (2013) [12]. We propose an optimal $O(n)$ time and space algorithm that can find an SUS for every location of a string of size $n$, and thus significantly improve on their $O(n^2)$ time complexity. Our method also supports finding all the SUSes covering every location, whereas theirs can find only one SUS for every location. Further, our solution is simpler and easier to implement and is more space efficient in practice, since we only use the inverse suffix array and the longest common prefix array of the string, while their algorithm uses the suffix tree of the string and other auxiliary data structures. Our theoretical results are validated by an empirical study with real-world data, which shows that our method is at least 8 times faster and uses at least 20 times less memory. The speedup gained by our method over that of Pei et al. can become even more significant as the string size increases, due to their quadratic time complexity. We have also compared our method with the recent proposal of Tsuruta et al. (2014) [14], another independent $O(n)$ time and space algorithm for SUS finding. The empirical study shows that both methods have nearly the same processing speed. However, ours uses at least 4 times less memory for finding one SUS and at least 2 times less memory for finding all SUSes, both covering every string location.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Repetitive structure and regularity finding [2,1,13] has received much attention in stringology due to its comprehensive applications in different fields, especially in computational biology and bioinformatics research [11,10]. Finding shortest unique substrings (SUS) can be an indirect way of finding repetitive structures of a string, because any proper substring of a shortest unique substring occurs multiple times in the string and thus is a repeat [7]. Shortest unique substrings have previously been used in comparing DNA sequences [3]. However, an efficient method for finding the shortest unique substring covering a given string location was not studied until it was recently proposed by Pei et al. [12]. As pointed out in [12], SUS finding also has its own important uses in search engines and bioinformatics. We refer readers to [12] for a detailed discussion of the applications of SUS finding.

Pei et al. proposed a solution that costs $O(n^2)$ time and $O(n)$ space to find an SUS for every location of a string of size $n$. In this paper, we propose an optimal $O(n)$ time and space algorithm for SUS finding. Our method uses simpler data structures, namely the suffix array, the inverse suffix array, and the longest common prefix array of the given string, whereas the method in [12] is built upon the suffix tree data structure. Our algorithm also provides the functionality of finding all the SUSes covering every location, whereas the method of [12] searches for only one SUS for every location. Our method not only improves their results theoretically; the empirical study also shows that our method is more space saving by a factor of at least 20 and is faster by a factor of 4. The speedup gained by our method can become even more significant when the string becomes longer, due to the quadratic time cost of [12]. Due to the very high memory consumption of [12], we were not able to run their method with massive data on our machine.

Independence of our work. After we posted an initial version of this proposal at arXiv [6], we were contacted via email by the coauthors of [14] and [4], both of which solve SUS finding using $O(n)$ time and space. At the time of that communication, article [14] had been accepted but not yet published, and [4] was still under review. We were also offered their paper drafts and the source code of [14]. The methods for SUS finding in both papers are based on the search for minimum unique substrings (MUS), as in [12]. Our algorithm takes a different approach and does not need to search for MUS. The problem studied by [4] is also more general, in that they want to find the SUS covering a given chunk of locations in the string, instead of the single location considered by [12,14] and our work. So, by all means, our work is independent and presents a different optimal algorithm for SUS finding. We have also included a performance comparison with the algorithm of [14] in the empirical study. It shows that both methods have nearly the same processing speed, but our method uses at least 4 times less memory for finding one SUS for every string location and at least 2 times less memory for finding all SUSes for every string location. The algorithm from [4] could not be empirically studied, as the authors preferred not to release the code until their paper is accepted.

✩ Author names are listed in alphabetical order. A preliminary version of this article appeared at [5]. Part of this work was done while all authors were with TÜBİTAK-BİLGEM-UEKAE of Turkey in Summer 2013.
∗ Corresponding author. E-mail addresses: atalay@mit.edu (A.M. İleri), okulekci@medipol.edu.tr (M.O. Külekci), bojianxu@ewu.edu (B. Xu).
1 Supported in part by EWU's Faculty Grants for Research and Creative Works.
http://dx.doi.org/10.1016/j.tcs.2014.11.004
0304-3975/© 2014 Elsevier B.V. All rights reserved.

2. Preliminary
We consider a string $S[1 \ldots n]$, where each character $S[i]$ is drawn from an alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$. A substring $S[i \ldots j]$ of $S$ represents $S[i]S[i+1] \ldots S[j]$ if $1 \le i \le j \le n$, and is an empty string if $i > j$. String $S[i' \ldots j']$ is a proper substring of another string $S[i \ldots j]$ if $i \le i' \le j' \le j$ and $j' - i' < j - i$. The length of a non-empty substring $S[i \ldots j]$, denoted as $|S[i \ldots j]|$, is $j - i + 1$. We define the length of an empty string as zero. A prefix of $S$ is a substring $S[1 \ldots i]$ for some $i$, $1 \le i \le n$. A proper prefix $S[1 \ldots i]$ is a prefix of $S$ where $i < n$. A suffix of $S$ is a substring $S[i \ldots n]$ for some $i$, $1 \le i \le n$. A proper suffix $S[i \ldots n]$ is a suffix of $S$ where $i > 1$. We say the character $S[i]$ occupies the string location $i$. We say the substring $S[i \ldots j]$ covers the $k$th location of $S$ if $i \le k \le j$.

For two strings $A$ and $B$, we write $A = B$ (and say $A$ is equal to $B$) if $|A| = |B|$ and $A[i] = B[i]$ for $i = 1, 2, \ldots, |A|$. We say $A$ is lexicographically smaller than $B$, denoted as $A < B$, if (1) $A$ is a proper prefix of $B$, or (2) $A[1] < B[1]$, or (3) there exists an integer $k > 1$ such that $A[i] = B[i]$ for all $1 \le i \le k - 1$ but $A[k] < B[k]$. A substring $S[i \ldots j]$ of $S$ is unique if there does not exist another substring $S[i' \ldots j']$ of $S$ such that $S[i \ldots j] = S[i' \ldots j']$ but $i \ne i'$. A substring is a repeat if it is not unique.

Definition 2.1. For a particular string location $k \in \{1, 2, \ldots, n\}$, the shortest unique substring (SUS) covering location $k$, denoted as $SUS_k$, is a unique substring $S[i \ldots j]$ such that (1) $i \le k \le j$, and (2) there is no other unique substring $S[i' \ldots j']$ of $S$ such that $i' \le k \le j'$ and $j' - i' < j - i$.

For any string location $k$, $SUS_k$ must exist, because the string $S$ itself can be $SUS_k$ if none of the proper substrings of $S$ is $SUS_k$. There might also be multiple candidates for $SUS_k$. For example, if $S$ = abcbb, then $SUS_2$ can be either $S[1,2]$ = ab or $S[2,3]$ = bc.

For a particular string location $k \in \{1, 2, \ldots, n\}$, the left-bounded shortest unique substring (LSUS) starting at location $k$, denoted as $LSUS_k$, is a unique substring $S[k \ldots j]$ such that either $k = j$ or any proper prefix of $S[k \ldots j]$ is not unique. Note that $LSUS_1 = SUS_1$ always exists, because at least the whole string $S$ is unique. However, for an arbitrary location $k \ge 2$, $LSUS_k$ may not exist. For example, if $S$ = abcabc, then none of $LSUS_4$, $LSUS_5$, and $LSUS_6$ exists. An up-to-$j$ extension of $LSUS_k$, denoted as $LSUS_k^j$, is the substring $S[k \ldots j]$, where $k + |LSUS_k| \le j \le n$.

The suffix array $SA[1 \ldots n]$ of the string $S$ is a permutation of $\{1, 2, \ldots, n\}$ such that for any $i$ and $j$, $1 \le i < j \le n$, we have $S[SA[i] \ldots n] < S[SA[j] \ldots n]$. That is, $SA[i]$ is the starting location of the $i$th suffix in the sorted order of all the suffixes of $S$. The rank array $Rank[1 \ldots n]$ is the inverse of the suffix array. That is, $Rank[i] = j$ iff $SA[j] = i$. The longest common prefix (lcp) array $LCP[1 \ldots n+1]$ is an array of $n + 1$ integers such that for $i = 2, 3, \ldots, n$, $LCP[i]$ is the length of the lcp of the two suffixes $S[SA[i-1] \ldots n]$ and $S[SA[i] \ldots n]$. We set $LCP[1] = LCP[n+1] = 0$. In the literature, the lcp array is often defined as an array of $n$ integers. We include an extra zero at $LCP[n+1]$ just to simplify the description of our upcoming algorithms. Table 1 shows the suffix array and the lcp array of the example string mississippi.
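The definitions of uniqueness, SUS, and LSUS can be checked directly on small strings. The sketch below is a brute-force reference only, not the algorithm proposed in this paper: it enumerates substrings by increasing length and runs in roughly cubic time. The function names (`is_unique`, `all_sus_brute_force`, `lsus_brute_force`) are illustrative assumptions; locations are 1-based as in the text.

```python
def is_unique(s, i, j):
    """Return True if S[i..j] (1-based, inclusive) occurs exactly once in s."""
    sub = s[i - 1:j]
    first = s.find(sub)
    return s.find(sub, first + 1) == -1


def all_sus_brute_force(s, k):
    """All shortest unique substrings covering location k, as (i, j) pairs."""
    n = len(s)
    for length in range(1, n + 1):
        lo = max(1, k - length + 1)       # leftmost start that still covers k
        hi = min(k, n - length + 1)       # rightmost start that fits in s
        found = [(i, i + length - 1) for i in range(lo, hi + 1)
                 if is_unique(s, i, i + length - 1)]
        if found:
            return found
    return []


def lsus_brute_force(s, k):
    """LSUS_k as (k, j), or None if no unique substring starts at k."""
    for j in range(k, len(s) + 1):
        if is_unique(s, k, j):
            return (k, j)
    return None
```

For the string abcbb and location 2, `all_sus_brute_force` reports the two candidates ab and bc mentioned after Definition 2.1, and `lsus_brute_force("abcabc", 4)` returns `None`, matching the LSUS example.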
The next Lemma 2.1 shows that, by using the rank array and the lcp array of the string $S$, it is easy to calculate any $LSUS_i$ if it exists or to detect that it does not exist.

Table 1
The suffix array and the lcp array of an example string S = mississippi.

  i   LCP[i]   SA[i]   suffix
  1   0        11      i
  2   1        8       ippi
  3   1        5       issippi
  4   4        2       ississippi
  5   0        1       mississippi
  6   0        10      pi
  7   1        9       ppi
  8   0        7       sippi
  9   2        4       sissippi
 10   1        6       ssippi
 11   3        3       ssissippi
 12   0        –       –

Lemma 2.1. For $i = 1, 2, \ldots, n$:

$$LSUS_i = \begin{cases} S[i \ldots i + L_i], & \text{if } i + L_i \le n \\ \text{not existing}, & \text{otherwise} \end{cases}$$

where $L_i = \max\{LCP[Rank[i]], LCP[Rank[i] + 1]\}$.

Proof. Note that by the definition of the lcp array, $L_i$ is the length of the longest common prefix between the suffix $S[i \ldots n]$ and any other suffix of $S$. The value of $L_i$ can be any number from the set $\{0, 1, \ldots, n - i + 1\}$. If $i + L_i \le n$, i.e., $L_i < n - i + 1$, it means the substring $S[i \ldots i + L_i]$ exists and is unique, while the substring $S[i \ldots i + L_i - 1]$ is either empty or a repeat. So, by the definition of LSUS, $S[i \ldots i + L_i]$ is $LSUS_i$. On the other hand, if $i + L_i > n$, i.e., $L_i = n - i + 1$, it means $S[i \ldots i + L_i - 1]$ is indeed the suffix $S[i \ldots n]$ and is a repeat, so $LSUS_i$ does not exist. □
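Lemma 2.1 translates almost literally into code. The following is a minimal sketch under stated assumptions: the suffix array is built by naive suffix sorting (not the linear-time constructions the paper relies on), arrays are padded at index 0 so that indices are 1-based as in the text, and the names `suffix_structures` and `lsus` are hypothetical.

```python
def suffix_structures(s):
    """Return 1-based SA, Rank, and LCP (index 0 is unused padding).

    Naive construction for illustration only; it is not linear time.
    """
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)                   # LCP[1] = LCP[n+1] = 0
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[r] = h
    return sa, rank, lcp


def lsus(s, i, rank, lcp):
    """LSUS_i as (start, end) by Lemma 2.1, or None if it does not exist."""
    L = max(lcp[rank[i]], lcp[rank[i] + 1])
    return (i, i + L) if i + L <= len(s) else None


s = "mississippi"
sa, rank, lcp = suffix_structures(s)
```

On mississippi the arrays agree with Table 1; for instance `lsus(s, 2, rank, lcp)` gives `(2, 6)`, i.e., the substring issis, while `lsus(s, 11, rank, lcp)` is `None` because the final i is a repeat.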
3. SUS finding for one location

In this section, we want to find the SUS covering a given location $k$ using $O(n)$ time and space. We start with finding the leftmost one if $k$ has multiple SUSes. In the end, we will show a trivial extension to find all the SUSes covering location $k$ with the same time and space complexities, if $k$ has multiple SUSes.

Lemma 3.1. Every SUS is either an LSUS or an extension of an LSUS.

Proof. Say we are looking at $SUS_k$ for any $k \in \{1, \ldots, n\}$. We know $SUS_k$ exists for any $k$, so let $SUS_k = S[i \ldots j]$, $1 \le i \le k \le j \le n$. If $S[i \ldots j]$ is neither $LSUS_i$ nor an extension of $LSUS_i$, it means either that $LSUS_i$ does not exist, in which case every substring starting at location $i$ is a repeat, or that $S[i \ldots j]$ is a proper prefix of $LSUS_i$ and thus is a repeat. Both cases contradict the fact that $S[i \ldots j] = SUS_k$ is unique. □

Example 1: $S$ = abcbca, then $SUS_2 = S[1,2]$ = ab, which is $LSUS_1$.
Example 2: $S$ = abcbc, then $SUS_2 = S[1,2]$ = ab, which is an extension of $LSUS_1 = S[1]$ to location 2.

By Lemma 3.1, we know $SUS_k$ is either an LSUS or an extension of an LSUS, and the starting location of that LSUS must be on or before location $k$. The algorithm for finding $SUS_k$ for any given string location $k$ is then simply to calculate $LSUS_1, LSUS_2, \ldots, LSUS_k$, if existing, using Lemma 2.1. During this calculation, if any LSUS does not cover location $k$, we simply extend that LSUS up to location $k$. We pick the shortest one among all the LSUSes or their up-to-$k$ extensions as $SUS_k$, and resolve ties by picking the leftmost one. The procedure can stop early if it finds an LSUS that does not exist, because that indicates that none of the remaining LSUSes exist either. Algorithm 1 gives the pseudocode of this procedure, where we represent $SUS_k$ by its two attributes, start and length, the starting location and the length of $SUS_k$, respectively.

Lemma 3.2. Given a string location $k$ and the rank array and the lcp array of the string $S$, Algorithm 1 can find $SUS_k$ using $O(k)$ time. If there are multiple candidates for $SUS_k$, the leftmost one is returned.

Proof. The procedure starts with the candidate $S[1 \ldots n]$, which is indeed unique (Line 1). Then the For loop calculates $LSUS_i$ for $i = 1, 2, \ldots, k$ (Lemma 2.1). If $LSUS_i$ exists (Line 4) and the length of $LSUS_i$ or its up-to-$k$ extension is less than the length of the current best candidate (Line 5), then we pick that $LSUS_i$ or its up-to-$k$ extension as the new candidate for $SUS_k$. Because the comparison is strict, this also resolves possible ties by picking the leftmost candidate. At the end of the procedure, we have the shortest one among $LSUS_1, \ldots, LSUS_k$ or their up-to-$k$ extensions, and that is $SUS_k$. An early stop is made at Line 7 if the LSUS being calculated does not exist, because that means all the remaining LSUSes to be calculated do not exist either. Each step in the For loop costs $O(1)$ time and the loop executes no more than $k$ steps, so the procedure takes a total of $O(k)$ time. □
Algorithm 1: Find SUS_k. Return the leftmost one if k has multiple SUSes.
Input: The location index k, and the rank array and the lcp array of the string S
Output: SUS_k. The leftmost one will be returned if k has multiple SUSes.
1  start ← 1; length ← n;    // Start location and length of the best candidate for SUS_k.
2  for i = 1, ..., k do
3      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
4      if i + L ≤ n then    // LSUS_i exists.
           /* Extend LSUS_i up to k if needed. Resolve the tie by picking the leftmost SUS. */
5          if max{L+1, k−i+1} < length then
6              start ← i; length ← max{L+1, k−i+1};
7      else break;    // Early stop.
8  Print SUS_k ← (start, length);
Algorithm 2: Find all the SUSes covering a given location k.
Input: The location index k, and the rank array and the lcp array of the string S
Output: All the SUSes covering location k.
1  start ← 1; length ← n;    // Start location and length of the best candidate for SUS_k.
   /* Find the length of SUS_k. */
2  for i = 1, ..., k do
3      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
4      if i + L ≤ n then    // LSUS_i exists.
5          if max{L+1, k−i+1} < length then    // Extend LSUS_i to location k if necessary.
6              start ← i; length ← max{L+1, k−i+1};
7      else break;    // Early stop.
   /* Find all SUSes covering location k. */
8  for i = 1, ..., k do
9      L ← max{LCP[Rank[i]], LCP[Rank[i]+1]};
10     if i + L ≤ n then    // LSUS_i exists.
11         if max{L+1, k−i+1} = length then    // Extend LSUS_i to location k if necessary.
12             Print (i, max{L+1, k−i+1});
13     else break;    // Early stop.
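Algorithms 1 and 2 can be sketched together in Python. This is an illustrative transcription rather than the authors' implementation: the helper names (`lsus_len`, `candidates`, `find_sus`, `find_all_sus`) are assumptions, and the rank and lcp arrays are taken as given, here hard-coded from Table 1 with a padding entry at index 0 so that indices are 1-based.

```python
def lsus_len(i, rank, lcp):
    """L_i of Lemma 2.1 (LSUS_i has length L_i + 1 when it exists)."""
    return max(lcp[rank[i]], lcp[rank[i] + 1])


def candidates(k, n, rank, lcp):
    """Yield (start, length) of LSUS_i or its up-to-k extension, i = 1..k."""
    for i in range(1, k + 1):
        L = lsus_len(i, rank, lcp)
        if i + L > n:                     # LSUS_i does not exist: early stop
            return
        yield i, max(L + 1, k - i + 1)


def find_sus(k, n, rank, lcp):
    """Algorithm 1: the leftmost SUS covering k, as (start, length)."""
    best = (1, n)
    for start, length in candidates(k, n, rank, lcp):
        if length < best[1]:              # strict '<' keeps the leftmost on ties
            best = (start, length)
    return best


def find_all_sus(k, n, rank, lcp):
    """Algorithm 2: every SUS covering k, as a list of (start, length)."""
    shortest = find_sus(k, n, rank, lcp)[1]
    return [c for c in candidates(k, n, rank, lcp) if c[1] == shortest]


# Table 1 data for S = mississippi, padded so that indices are 1-based.
N = 11
LCP = [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]
SA = [0, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
RANK = [0] * (N + 1)
for r in range(1, N + 1):
    RANK[SA[r]] = r
```

For location 9 of mississippi, `find_sus` returns `(8, 2)` (the substring ip), and `find_all_sus` additionally reports `(9, 2)` (the substring pp). Sharing one `candidates` generator between the two functions mirrors the fact that Algorithm 2 simply repeats the scan of Algorithm 1 with an equality test.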
Theorem 3.1. For any location $k$ in the string $S$, we can find $SUS_k$ using $O(n)$ time and space. If there are multiple candidates for $SUS_k$, the leftmost one is returned.

Proof. The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (for example, [9]). After the suffix array is constructed, the rank array (the inverse suffix array) can be trivially created using another $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [8]. Combining the time cost of Algorithm 1 (Lemma 3.2), the total time cost for finding $SUS_k$ for any location $k$ in the string $S$ of size $n$ is $O(n)$, with a total of $O(n)$ space usage. If multiple candidates for $SUS_k$ exist, the leftmost candidate will be returned, as provided by Algorithm 1 (Lemma 3.2). □
3.1. Extension: finding all SUSes for one location

It is trivial to extend Algorithm 1 to find all the SUSes covering a particular location $k$ as follows. We can first use Algorithm 1 to find the leftmost $SUS_k$. Then we start over again to re-calculate $LSUS_1, \ldots, LSUS_k$ or their up-to-$k$ extensions, and return all of those whose length is equal to the length of $SUS_k$. Algorithm 2 shows the pseudocode. This procedure clearly costs an extra $O(k)$ time. Combining the results from Theorem 3.1, we get the following theorem.

Theorem 3.2. For any location $k$ in the string $S$, we can find all the SUSes covering location $k$ using $O(n)$ time and space.

4. SUS finding for every location
In this section, we want to find $SUS_k$ for every location $k = 1, 2, \ldots, n$. If $k$ has multiple SUSes, the leftmost one will be returned. In the end, we will show an extension to find all SUSes for every location.

A natural solution is to iteratively use Algorithm 1 as a subroutine to find every $SUS_k$, for $k = 1, 2, \ldots, n$. However, the total time cost of this solution will be $O(n) + \sum_{k=1}^{n} O(k) = O(n^2)$, where $O(n)$ captures the time cost for the construction of the rank array and the lcp array and $\sum_{k=1}^{n} O(k)$ is the total time cost for the $n$ instances of Algorithm 1. We want to have a solution that costs a total of $O(n)$ time and space, which implies that the amortized cost for finding each SUS is $O(1)$. By Lemma 3.1, we know that every SUS must be an LSUS or an extension of an LSUS. The next Lemma 4.1 further says that if $SUS_k$ is an extension of an LSUS, it has some special properties and can be quickly obtained from $SUS_{k-1}$.
Lemma 4.1. For any $k \in \{2, 3, \ldots, n\}$, if $SUS_k$ is an extension of an LSUS, then (1) $SUS_{k-1}$ must be a substring whose right boundary is the character $S[k-1]$, and (2) $SUS_k$ is the substring $SUS_{k-1}$ appended by the character $S[k]$.

Proof. Because $SUS_k$ is an extension of an LSUS, we have $SUS_k = S[i \ldots k]$ for some $i < k$ and $LSUS_i = S[i \ldots j]$ for some $j < k$. We also know $S[i \ldots k-1]$ is unique, because the unique substring $S[i \ldots j]$ is a prefix of $S[i \ldots k-1]$. Note that any substring starting from a location before $i$ and covering location $k-1$ is longer than the unique substring $S[i \ldots k-1]$, so $SUS_{k-1}$ must start from a location between $i$ and $k-1$, inclusive.

Next, we show that $SUS_{k-1}$ must actually start at location $i$. The fact $SUS_k = S[i \ldots k]$ tells us that $|LSUS_t| \ge |SUS_k| = k - i + 1$ for every $t = i+1, i+2, \ldots, k$; otherwise, any $LSUS_t$ that is shorter than $k - i + 1$, extended up to location $k$ if necessary, would be a better candidate than $S[i \ldots k]$ as $SUS_k$. That means any unique substring starting from $t = i+1, i+2, \ldots, k-1$ has a length of at least $k - i + 1$. However, $|S[i \ldots k-1]| = k - i < k - i + 1$, and $S[i \ldots k-1]$ is already unique and covers location $k-1$ as well, so $S[i \ldots k-1]$ is the only candidate for $SUS_{k-1}$. This also means $SUS_k$ is indeed the substring $SUS_{k-1}$ appended by $S[k]$. □
4.1. The overall strategy

We are ready to present the overall strategy for finding the SUS of every location, by using Lemmas 3.1 and 4.1. We will calculate all the SUSes in the order of $SUS_1, SUS_2, \ldots, SUS_n$. That means when we want to calculate $SUS_k$, $k \ge 2$, we have already calculated $SUS_{k-1}$. Note that $SUS_1 = LSUS_1$, which is easy to calculate using Lemma 2.1.

Now let's look at the calculation of a particular $SUS_k$, $k \ge 2$. By Lemma 3.1, we know $SUS_k$ is either an LSUS or an extension of an LSUS. By Lemma 4.1, we also know that if $SUS_k$ is an extension of an LSUS, then the right boundary of $SUS_{k-1}$ must be $S[k-1]$ and $SUS_k$ is just $SUS_{k-1}$ appended by the character $S[k]$. Suppose that when we want to calculate $SUS_k$, we have already calculated the shortest LSUS covering location $k$ or have learned that no LSUS covers location $k$. Then, by using $SUS_{k-1}$, which has been calculated by then, and the shortest LSUS covering location $k$, we will be able to calculate $SUS_k$ as follows:

Case 1: If the right boundary of $SUS_{k-1}$ is not $S[k-1]$, then we know $SUS_k$ cannot be an extension of an LSUS (the contrapositive of Lemma 4.1). Thus, $SUS_k$ is just the shortest LSUS covering location $k$, which must exist in this case.

Case 2: If the right boundary of $SUS_{k-1}$ is $S[k-1]$, then $SUS_k$ may or may not be an extension of an LSUS. We consider two possibilities: (1) If the shortest LSUS covering location $k$ exists, we compare its length with $|SUS_{k-1}| + 1$ and pick the shorter one as $SUS_k$. If both have the same length, we resolve the tie by picking the one whose starting location index is smaller. (2) If no LSUS covers location $k$, $SUS_k$ will just be $SUS_{k-1}$ appended by $S[k]$.

Therefore, the real challenge here, by the time we want to calculate $SUS_k$, $k \ge 2$, is to ensure that we have already calculated the shortest LSUS covering location $k$ or already know that no LSUS covers location $k$. If there exist multiple shortest LSUSes covering location $k$, we would like to know the leftmost one.
4.2. Preparation
We now focus on the calculation of the leftmost shortest LSUS covering every string location $k$, denoted by $SLS_k$. Let $Candidate_i^k$ denote the leftmost shortest one among those of $LSUS_1, \ldots, LSUS_k$ that exist and cover location $i$. For an arbitrary $k$, $1 \le k \le n$, $SLS_k$ may not exist, because location $k$ may not be covered by any LSUS at all. For example, if $S$ = abcabc, then locations 5 and 6 are not covered by any LSUS, and thus $SLS_5$ and $SLS_6$ do not exist. However, if $SLS_k$ exists, by the definition of SLS and Candidate, we have the following fact.

Fact 4.1. $SLS_k = Candidate_k^k = Candidate_k^{k+1} = \cdots = Candidate_k^n$, if $SLS_k$ exists.

Our goal is to ensure that $SLS_k$ is known when we want to calculate $SUS_k$, so we calculate every $SLS_k$ following the same order $k = 1, 2, \ldots, n$ in which we calculate all SUSes. Because we need to know every $LSUS_i$, $i \le k$, in order to calculate $SLS_k$ (Fact 4.1), we walk through the string locations $k = 1, 2, \ldots, n$: at each walk step $k$, we calculate $LSUS_k$ and maintain $Candidate_i^k$ for every string location $i$ that has been covered by at least one of $LSUS_1, LSUS_2, \ldots, LSUS_k$. Note that $Candidate_i^k = SLS_i$ for every $i \le k$ (Fact 4.1). Those $Candidate_i^k$ with $i \le k$ would have already been used as $SLS_i$ in the calculation of $SUS_i$. So, after each walk step $k$, we only need to maintain the candidates for locations after $k$.
Lemma 4.2. (1) $LSUS_1$ always exists. (2) If $LSUS_k$ exists, then $LSUS_1, LSUS_2, \ldots, LSUS_k$ all exist. (3) If $LSUS_k$ does not exist, then none of $LSUS_k, LSUS_{k+1}, \ldots, LSUS_n$ exist.

Proof. (1) $LSUS_1$ must exist, because the string $S$ can be $LSUS_1$ if every proper prefix of $S$ is a repeat. (2) If $LSUS_k$ exists, say $LSUS_k = S[k \ldots \gamma_k]$, then $LSUS_i$ exists for every $i \le k$, because at least $S[i \ldots \gamma_k]$ is unique, due to the fact that $S[k \ldots \gamma_k]$ is unique and is also a suffix of $S[i \ldots \gamma_k]$. (3) If $LSUS_k$ does not exist, it means $S[k \ldots n]$ is a repeat, and thus every suffix $S[i \ldots n]$ of $S[k \ldots n]$, for $i \ge k$, is also a repeat, i.e., $LSUS_i$ does not exist for every $i \ge k$. □
The next lemma shows that the right boundary of $LSUS_i$ will be on or after the right boundary of $LSUS_{i-1}$, if $LSUS_i$ exists.

Lemma 4.3. For each $i = 2, 3, \ldots, n$: $|LSUS_i| \ge |LSUS_{i-1}| - 1$.

Proof. We prove the lemma by contradiction. Suppose $LSUS_{i-1} = S[i-1 \ldots j]$ for some $j$, $i - 1 \le j \le n$. If $|LSUS_i| < |LSUS_{i-1}| - 1$, it means $LSUS_i = S[i \ldots k]$, where $i \le k < j$. Because $S[i \ldots k]$ is unique, $S[i-1 \ldots k]$ is also unique, but its length is shorter than that of $S[i-1 \ldots j]$. This is a contradiction because $S[i-1 \ldots j]$ is already $LSUS_{i-1}$. Thus, the claim in the lemma is true. □
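The monotonicity stated by Lemma 4.3 can be observed on a concrete string. The snippet below is a brute-force illustration only, with hypothetical helper names (`lsus_end`, `right_boundaries`, `is_nondecreasing`); it lists the right boundary of every existing LSUS and checks that these boundaries never move to the left.

```python
def lsus_end(s, k):
    """Right boundary of LSUS_k (1-based), or None if LSUS_k does not exist."""
    for j in range(k, len(s) + 1):
        sub = s[k - 1:j]
        if s.find(sub, s.find(sub) + 1) == -1:   # sub occurs exactly once
            return j
    return None


def right_boundaries(s):
    """Right boundaries of LSUS_1, ..., LSUS_n (None where it does not exist)."""
    return [lsus_end(s, k) for k in range(1, len(s) + 1)]


def is_nondecreasing(ends):
    """Check Lemma 4.3 on the boundaries of the LSUSes that exist."""
    existing = [e for e in ends if e is not None]
    return all(a <= b for a, b in zip(existing, existing[1:]))
```

For mississippi the boundaries are 1, 6, 6, 6, 9, 9, 9, 9, 10, 11, with no LSUS starting at location 11. The plateaus (for example locations 2 to 4 all ending at 6) are exactly the situations in which a later, shorter LSUS can replace the candidates of several locations at once.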
Now let's look at the situation at the end of the $k$th walk step. By then, we have calculated $LSUS_1, LSUS_2, \ldots, LSUS_k$. By Lemma 4.2, we know that there exists some $k'$, $1 \le k' \le k$, such that $LSUS_1, \ldots, LSUS_{k'}$ all exist, but $LSUS_{k'+1}, \ldots, LSUS_k$ do not exist. If $k' = k$, that means $LSUS_1, \ldots, LSUS_k$ all exist. Let $\gamma_k$ denote the right boundary of $LSUS_{k'}$, i.e., $LSUS_{k'} = S[k' \ldots \gamma_k]$. By Lemma 4.3, we know $\gamma_k$ is also the right boundary of the string locations covered by $LSUS_1, \ldots, LSUS_{k'}$. So, every location $1, 2, \ldots, \gamma_k$ is covered by at least one LSUS from $LSUS_1, \ldots, LSUS_{k'}$. That is, at the end of the $k$th walk step: (1) every location $j = 1, \ldots, \gamma_k$ has its candidate $Candidate_j^k$ calculated already. (2) If $\gamma_k < n$, every location $j = \gamma_k + 1, \ldots, n$ still does not have its candidate calculated, because every such location $j$ has not been covered by any LSUS from $LSUS_1, \ldots, LSUS_k$ that we have calculated at the end of the $k$th walk step.

Lemma 4.4. At the end of the $k$th walk step, if $\gamma_k > k$, then for any $i$ and $j$, $k \le i < j \le \gamma_k$, $Candidate_j^k$ also covers location $i$.

Proof. $Candidate_j^k$ is a substring starting somewhere on or before $k$ and going through location $j$. Because $k \le i < j$, it is obvious that $Candidate_j^k$ goes through location $i$. □
Lemma 4.5. At the end of the $k$th walk step, if $\gamma_k > k$, then

$$|Candidate_k^k| \le |Candidate_{k+1}^k| \le \cdots \le |Candidate_{\gamma_k}^k|.$$

Proof. By Lemma 4.4, we know $Candidate_j^k$ also covers location $i$, for any $i$ and $j$, $k \le i < j \le \gamma_k$. Thus, if $|Candidate_j^k| < |Candidate_i^k|$, location $i$'s current candidate should be replaced by location $j$'s candidate, because that gives location $i$ a shorter candidate. However, the current candidate for location $i$ is already the shortest candidate. It is a contradiction. So, $|Candidate_i^k| \le |Candidate_j^k|$, which proves the lemma. □
4.3. Finding SLS for every location

Invariant. We calculate $SLS_k$ for $k = 1, 2, \ldots, n$ by maintaining the following invariant at the end of every walk step $k$:

(A) If $\gamma_k > k$, locations $\{k+1, k+2, \ldots, \gamma_k\}$ will be cut into chunks, such that:
(A.1) All locations in one chunk have the same candidate.
(A.2) Locations belonging to different chunks have different candidates.
(A.3) Each chunk will be represented by a linked list node of four fields: ChunkStart, ChunkEnd, start, length, respectively representing the start and end location of the chunk and the start and length of the candidate shared by all locations of the chunk.
(A.4) All nodes representing different chunks will be connected into a linked list, ordered by the string positions of the corresponding chunks. The linked list has a head and a tail, referring to the two nodes that represent the lowest positioned chunk and the highest positioned chunk.
(B) If $\gamma_k \le k$, the linked list is empty.

Maintenance of the invariant. We describe in an inductive manner the procedure that maintains the invariant. Algorithm 3 shows the pseudocode of the procedure. We start with an empty linked list.

Base step: $k = 1$. We are walking the first step. We first calculate $LSUS_1$ using Lemma 2.1. We know $LSUS_1$ must exist. Say $LSUS_1 = S[1 \ldots \gamma_1]$ for some $\gamma_1 \le n$. Then $Candidate_i^1 = LSUS_1$ for every $i = 1, 2, \ldots, \gamma_1$. We record all these candidates by using a single node $(1, \gamma_1, 1, \gamma_1)$. This is the only node in the linked list and is pointed to by both head and tail. We know $SLS_1 = Candidate_1^1$ (Fact 4.1), so we return $SLS_1$ by returning (head.start, head.length) = $(1, \gamma_1)$. We then change head.ChunkStart from 1 to 2. If it turns out that head.ChunkEnd = $\gamma_1 < 2$, meaning $LSUS_1$ really covers location 1 only, we delete the head node from the linked list, which will then become empty.

Inductive step: $k \ge 2$. We are walking the $k$th step. We first calculate $LSUS_k$ using Lemma 2.1.

• Case 1: $LSUS_k$ does not exist. (1) If head does not exist, it means that location $k$ is covered neither by any of $LSUS_1, \ldots, LSUS_{k-1}$ nor by $LSUS_k$, so $SLS_k$ simply does not exist and we return (null, null). (2) If head exists, we return (head.start, head.length) as $SLS_k$, and then remove the information about location $k$ from the head by setting head.ChunkStart = $k + 1$. If it turns out that head.ChunkEnd < head.ChunkStart, we remove the head node.

• Case 2: $LSUS_k$ exists and $LSUS_k = S[k \ldots \gamma_k]$, $\gamma_k \le n$. By Lemma 4.2, we know $LSUS_1, LSUS_2, \ldots, LSUS_{k-1}$ all exist. Let $\gamma_{k-1}$ denote the right boundary of $LSUS_1, LSUS_2, \ldots, LSUS_{k-1}$. By Lemma 4.3, we know $\gamma_k \ge \gamma_{k-1}$ and $\gamma_{k-1}$ is also the right boundary of $LSUS_{k-1}$, i.e., $LSUS_{k-1} = S[k-1 \ldots \gamma_{k-1}]$. Note that both $\gamma_{k-1} < k$ and $\gamma_{k-1} \ge k$ are possible.

1. If head does not exist, it means $\gamma_{k-1} < k$ and none of the locations $\{k, \ldots, \gamma_k\}$ is covered by any of $LSUS_1, LSUS_2, \ldots, LSUS_{k-1}$. We insert a new node $(k, \gamma_k, k, \gamma_k - k + 1)$, which will be the only node in the linked list.

2. If head exists, it means $\gamma_{k-1} \ge k$. If $\gamma_k >$ tail.ChunkEnd $= \gamma_{k-1}$, we first insert at the tail side a new linked list node (tail.ChunkEnd + 1, $\gamma_k$, $k$, $\gamma_k - k + 1$) to record the candidate information for locations in the chunk after $\gamma_{k-1}$ through $\gamma_k$. Then, starting from the node before any newly inserted one, we travel through the nodes in the linked list from the tail side toward the head. We stop when we meet a node whose candidate is shorter than or equal to $LSUS_k$ or when we reach the head end of the linked list. We merge all the nodes whose candidates are longer than $LSUS_k$, together with the newly inserted node if there is one, into a new linked list node. The chunk covered by the new node is the union of the chunks covered by the merged nodes, and the candidate of the new node is $LSUS_k$.

This travel and merge process is valid because of Lemma 4.5. The merge process ensures that every location maintains its best (shortest) candidate by the end of every walk step. It also resolves the possible ties of multiple shortest LSUSes covering a particular location by picking the leftmost one as that location's candidate, because the merge process does not merge nodes whose candidates are of the same length.

We return (head.start, head.length) as $SLS_k$, since $Candidate_k^k = SLS_k$ (Fact 4.1). Finally, we remove the information about location $k$ from the head by setting head.ChunkStart = $k + 1$. We remove the head node if it turns out that head.ChunkEnd < head.ChunkStart.

Algorithm 3: The sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) returns SLS_1, SLS_2, ..., SLS_n, if the corresponding SLS exists; otherwise, null will be returned.
1  Construct Rank[1...n] and LCP[1...n+1] of the string S;
2  Initialize an empty List;    // Each node's 4 fields: ChunkStart, ChunkEnd, start, length.
3  head ← 0; tail ← 0;    // Reference to the head and tail node of the List
4  FindSLS(k)
       /* Process LSUS_k, if it exists. */
5      L ← max{LCP[Rank[k]], LCP[Rank[k]+1]};
6      if k + L ≤ n then    // LSUS_k exists.
           // Add a new list element at the tail, if necessary.
7          if head = 0 then List[1] ← (k, k+L, k, L+1); head ← 1; tail ← 1;    // List was empty.
8          else if k + L > List[tail].ChunkEnd then
9              tail++; List[tail] ← (List[tail−1].ChunkEnd+1, k+L, k, L+1);
           /* Update candidates and merge the nodes whose candidates can be shorter.
              Resolve the tie by picking the leftmost one. */
10         j ← tail; if List[j].start = k then j−−;    // Skip the node just created for LSUS_k.
11         while j ≥ head and List[j].length > L+1 do j−−;
12         List[j+1] ← (List[j+1].ChunkStart, List[tail].ChunkEnd, k, L+1); tail ← j+1;
13     if head ≠ 0 then SLS_k ← (List[head].start, List[head].length);    // The list is not empty.
14     else SLS_k ← (null, null);    // SLS_k does not exist.
       /* Discard the information about location k from the List. */
15     if head > 0 then    // List is not empty
16         if List[head].ChunkEnd ≤ k then
17             head++;    // Delete the current head node
18             if head > tail then head ← 0; tail ← 0;    // List becomes empty
19         else List[head].ChunkStart ← k+1;
20     return SLS_k

Lemma 4.6. Given the lcp array and the rank array of $S$, the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) will return $SLS_1, SLS_2, \ldots, SLS_n$, if existing. The amortized time cost of one FindSLS() function call is $O(1)$.

Proof. The correctness of Algorithm 3 is already given in the description of the procedure that maintains the invariant. All operations in an instance of the FindSLS() function call clearly take $O(1)$ time, except the while loop at Line 11, which merges linked list nodes whose candidates can be shorter. Thus, the lemma will be proved if we can prove that the amortized number of linked list nodes merged via that while loop is also bounded by a constant. Note that no node in the linked list ever splits, due to Lemma 4.3. In the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n), there are at most $n$ linked list nodes to be merged. We know the number of merge operations in merging $n$ nodes into one node (in the worst case) is no more than $O(n)$. Therefore, the amortized time cost of merging the linked list nodes in one FindSLS() function call over the sequence of $n$ function calls FindSLS(1), FindSLS(2), ..., FindSLS(n) is $O(1)$. This finishes the proof of the lemma. □

Algorithm 4: Finding the leftmost SUS_k, k = 1, ..., n.
1  for k ← 1...n do
2      (start, length) ← FindSLS(k);    // SLS_k; it is (null, null) if SLS_k does not exist.
3      if k = 1 then Print SUS_k ← (start, length);
4      else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k − 1 then Print SUS_k ← (start, length);
5      else if (start, length) = (null, null) then Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
6      else if length < SUS_{k−1}.length + 1 then Print SUS_k ← (start, length);
7      else    // Resolve the tie by picking the leftmost one.
8          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1)
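The chunk list of Algorithm 3 behaves like a stack that is popped at the tail and consumed at the head. The following Python sketch illustrates that idea under stated assumptions and is not the authors' code: the class name `SLSFinder` is hypothetical, the list is a plain Python list with a moving `head` index, and, as a small variation on the pseudocode, nodes with strictly longer candidates are popped before the merged node for $LSUS_k$ is pushed. The rank and lcp arrays are hard-coded from Table 1 with a padding entry at index 0.

```python
class SLSFinder:
    """Answers FindSLS(1), FindSLS(2), ..., FindSLS(n), called in that order."""

    def __init__(self, n, rank, lcp):
        self.n, self.rank, self.lcp = n, rank, lcp
        self.nodes = []   # each node: [chunk_start, chunk_end, start, length]
        self.head = 0     # index of the first live node in self.nodes

    def find_sls(self, k):
        nodes = self.nodes
        L = max(self.lcp[self.rank[k]], self.lcp[self.rank[k] + 1])
        if k + L <= self.n:                       # LSUS_k = S[k..k+L] exists
            if self.head == len(nodes):           # list is empty
                chunk_start, chunk_end = k, k + L
            else:                                 # locations not yet covered
                chunk_start = nodes[-1][1] + 1
                chunk_end = max(nodes[-1][1], k + L)
            # Merge tail nodes whose candidates are strictly longer than LSUS_k;
            # equal lengths are kept, so ties stay with the leftmost LSUS.
            while len(nodes) > self.head and nodes[-1][3] > L + 1:
                chunk_start = nodes.pop()[0]
            if chunk_start <= chunk_end:
                nodes.append([chunk_start, chunk_end, k, L + 1])
        if self.head == len(nodes):
            return None                           # no LSUS covers location k
        node = nodes[self.head]
        result = (node[2], node[3])               # SLS_k as (start, length)
        if node[1] <= k:
            self.head += 1                        # chunk exhausted: drop head
        else:
            node[0] = k + 1                       # forget location k
        return result


# Table 1 data for S = mississippi, padded so that indices are 1-based.
N = 11
LCP = [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, 0]
SA = [0, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
RANK = [0] * (N + 1)
for r in range(1, N + 1):
    RANK[SA[r]] = r
```

Each node is pushed once and popped at most once, which is the amortized $O(1)$ argument of Lemma 4.6 in concrete form. On mississippi, location 9 is covered by both ip (start 8) and pp (start 9), each of length 2, and the strict comparison keeps the leftmost one.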
4.4. FindingtheleftmostSUS foreverylocation
OnceweareabletosequentiallycalculateeverySLSk ordetectitdoesnotexist,wearereadytocalculateeverySUSk by
usingthe strategydescribed inSection4.1.Algorithm 4givesthepseudocodeoftheprocedure.It calculatesSUSesinthe orderofSUS1
,
SUS2,
. . . ,
SUSn(Line1).Foreachlocationk,thefunctioncallatLine2istocalculateSLSkortofindSLSkdoesnot exist.Line 3handlesthe specialcasewhereSUS1
=
LSUS1=
SLS1.The condition atLine4 showsthat SUSi cannotbean extension ofan LSUS(Lemma 4.1), soSUSk
=
SLSk,whichmustbeexisting inthiscase. Line5handlesthecasewhere SLSk does not exist,so SUSk mustbe SUSk−1 appended by S[
k]
.Line 6 handles the casewhere SLSk isshorter than theone-character extension ofSUSk−1,so SUSk is SLSk.Lines 7–8 handlethe casewhereSLSk is longerthan or equalto the
one-character extension ofSUSk−1,soSUSk isSUSk−1 appendedby S
[
k]
.Thisalsoresolves thetie bypicking theleftmostoneifk iscoveredbymultipleSUSes.
Theorem 4.1. Algorithm 4 finds $SUS_1, SUS_2, \ldots, SUS_n$ of string $S$ using a total of $O(n)$ time and space. If any string location is covered by multiple SUSes, Algorithm 4 finds the leftmost one.

Proof. We can construct the suffix array of the string $S$ in a total of $O(n)$ time and space using existing algorithms (for example, [9]). The rank array is just the inverse suffix array and can be directly obtained from SA using $O(n)$ time and space. Then we can obtain the lcp array from the suffix array and rank array using another $O(n)$ time and space [8]. So the total time and space costs for preparing these auxiliary data structures are $O(n)$.

Time cost. The amortized time cost for each FindSLS function call at Line 2 in the sequence of function calls FindSLS(1), ..., FindSLS(n) is $O(1)$ (Lemma 4.6). The time cost for Lines 3–8 is also $O(1)$. There are a total of $n$ steps in the For loop, yielding a total of $O(n)$ time cost.

Space usage. The only space usage (in addition to the auxiliary data structures such as the suffix array, rank array, and the lcp array, which cost a total of $O(n)$ space) in our algorithm is the dynamic linked list, which however has no more than $n$ nodes at any time. Each node costs $O(1)$ space. Therefore, the linked list costs $O(n)$ space. Adding the space usage of the auxiliary data structures, we get that the total space usage of finding every SUS is $O(n)$.

Finding the leftmost SUS. For any particular location $k$, if one SUS covering location $k$ is an extension of an LSUS, we know by Lemma 4.1 that this SUS must be the substring $SUS_{k-1}$ appended by the letter $S[k]$. Clearly this SUS is the leftmost one among all the SUSes covering location $k$ and is guaranteed to be returned by Lines 7–8 in Algorithm 4. If all SUSes covering location $k$ are LSUSes, the leftmost one of those LSUSes is already guaranteed to be returned by Algorithm 3 (Lemma 4.6). □
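The auxiliary arrays named in the proof can be sketched in Python as follows. This is a hedged illustration rather than the code used in the experiments: the suffix array is built by naive sorting, which is not linear time and merely stands in for a linear-time construction such as [9]; the lcp loop follows the well-known scheme of Kasai et al. [8]; and `lsus_length` relies on the standard property that the shortest unique substring starting at position $i$ is one character longer than the longest prefix that the suffix at $i$ shares with its neighbours in suffix-array order. All function names are hypothetical and positions are 0-based.

```python
def suffix_array(s):
    """Naive construction by sorting suffixes (NOT linear time; a stand-in only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def inverse_suffix_array(sa):
    """rank[i] is the position of suffix i in the suffix array."""
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank


def lcp_array(s, sa, rank):
    """lcp[r] = length of the longest common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]          # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                   # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def lsus_length(s, rank, lcp, i):
    """Length of the shortest unique substring starting at i, or None if none exists."""
    r = rank[i]
    shared = lcp[r]
    if r + 1 < len(s):
        shared = max(shared, lcp[r + 1])
    # One extra character is needed beyond the longest shared prefix.
    return shared + 1 if i + shared < len(s) else None
```

Only `rank` and `lcp` are consulted by `lsus_length`, which is in line with the observation that the suffix array itself is needed only to derive the other two arrays.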
4.5. Extension: finding all SUSes for every location
It is possible that a particular location can have multiple SUSes. For example, if $S = \mathtt{abcbb}$, then $SUS_2$ can be either $S[1,2] = \mathtt{ab}$ or $S[2,3] = \mathtt{bc}$. Algorithm 4 returns only one of them and resolves the tie by picking the leftmost one. However, it is easy to modify Algorithm 4 to return all the SUSes of every location, without changing Algorithm 3.

Suppose a particular location $k$ is covered by multiple SUSes. We know that, at the end of the $k$th walk step but before the linked list update (at the end of Line 14 in Algorithm 3), $SLS_k$ returned by Algorithm 3 is recorded by the head node and is the leftmost one among all the SUSes that are LSUSes and cover location $k$. Because every string location maintains its shortest candidate, and due to Lemma 4.5, all the other SUSes that are LSUSes and cover location $k$ are recorded by the linked list nodes that immediately follow the head node. This is because, if those other SUSes were not recorded, the location right after the head node's chunk would either have a candidate longer than $SUS_k$ or not have a candidate calculated yet, even though that location is indeed covered by an $SUS_k$ at the end of the $k$th walk step, which is a contradiction. The same argument applies to the subsequent neighboring locations that are covered by $SUS_k$.

Algorithm 5: Finding all choices of each SUS_k, for k = 1, ..., n.
 1  for k ← 1 ... n do
 2      flag ← 0; (start, length) ← FindSLS(k);  // SLS_k; (null, null) if SLS_k does not exist.
 3      if k = 1 then
 4          Print SUS_k ← (start, length);
 5      else if SUS_{k−1}.start + SUS_{k−1}.length − 1 > k − 1 then
 6          Print SUS_k ← (start, length); flag ← 1;
 7      else if (start, length) = (null, null) then
 8          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
 9      else if length ≤ SUS_{k−1}.length + 1 then
10          Print SUS_k ← (start, length); flag ← 1;
11      else
12          Print SUS_k ← (SUS_{k−1}.start, SUS_{k−1}.length + 1);
        /* Print out other SUSes that cover location k. */
13      if flag = 1 then
14          if SUS_{k−1}.length + 1 = SUS_k.length then
15              Print (SUS_{k−1}.start, SUS_{k−1}.length + 1);
16          j ← head;
17          while j > 0 and j ≤ tail do
                /* The List[j].start ≠ SUS_k.start check is needed because the SUS from the head node may have been printed. */
18              if List[j].length = SUS_k.length and List[j].start ≠ SUS_k.start then
19                  Print (List[j].start, List[j].length); j ← j + 1;
20              else if List[j].start ≠ SUS_k.start then
21                  Break;
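As a reference point for what an all-SUS procedure is expected to report, the following Python sketch enumerates every SUS covering a location by brute force. It is not Algorithm 5 and makes no use of the linked list; it simply tries each length in increasing order and collects all unique substrings of the first length that succeeds. The function names are hypothetical and positions are 0-based.

```python
def occurrences(s, sub):
    """Count occurrences of sub in s, overlapping ones included."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))


def all_sus_covering(s, k):
    """All shortest unique substrings covering position k, as (start, length), left to right."""
    n = len(s)
    for length in range(1, n + 1):
        first = max(0, k - length + 1)     # leftmost start that still covers k
        last = min(k, n - length)          # rightmost start that fits in the string
        found = [(i, length) for i in range(first, last + 1)
                 if occurrences(s, s[i:i + length]) == 1]
        if found:
            return found
    return []


def total_reported(s):
    """Sum over all locations of the number of SUSes covering that location."""
    return sum(len(all_sus_covering(s, k)) for k in range(len(s)))
```

For $S = \mathtt{abcbb}$, `all_sus_covering("abcbb", 1)` yields `[(0, 2), (1, 2)]`, that is, $\mathtt{ab}$ and $\mathtt{bc}$ as in the example above, and `total_reported("abcbb")` is 7.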
Therefore, finding all the SUSes covering location $k$ becomes easy: simply go through the linked list nodes from the head node toward the tail node and report all the candidates whose lengths are equal to the length of the $SUS_k$ that we have found. If the rightmost character of $SUS_{k-1}$ is $S[k-1]$ and the substring $SUS_{k-1}$ appended by $S[k]$ has the same length, that substring will be reported too. Algorithm 5 gives the pseudocode, where the flag is used to note in what cases it is possible to have multiple SUSes. If flag is on, we need to check the linked list nodes (Lines 17–21) as well as the one-letter extension of $SUS_{k-1}$ (Lines 14–15). The overall time and space cost of maintaining the linked list data structure (the sequence of function calls FindSLS(1), FindSLS(2), ..., FindSLS(n)) is still $O(n)$. The time cost of reporting the SUSes covering a particular location becomes $O(occ)$, where $occ$ is the number of SUSes that cover that location. That gives us the following theorem.

Theorem 4.2. Algorithm 5 finds all SUSes covering every location of a string of size $n$ using $O(n)$ space and $O(N)$ time, where $N = \sum_{k=1}^{n} occ_k$ and $occ_k \geq 1$ is the number of SUSes covering location $k$.

5. Experiments
We have implemented our proposal, named IKXSUS, in C++,² using the libdivsufsort³ library for the suffix array construction and Kasai et al.'s method [8] to compute the lcp array. We have compared our work against Pei et al.'s RSUS [12] and Tsuruta et al.'s OSUS [14] implementations, on both one-SUS and all-SUS finding for every string location. Notice that OSUS also computes the suffix array using the same libdivsufsort package and computes the lcp array using Kasai et al.'s method.

RSUS was originally prepared with an R interface. We stripped off that R interface and built a standalone C++ executable for the sake of fair benchmarking. OSUS was originally developed in C++. We ran OSUS both with and without the -l option to compute a single leftmost SUS and all SUSes for every string location. In all three implementations, we also commented out the sections that print the results onto the screen and/or the disk as output, in order to measure the algorithmic performance better. We ran the tests on a machine with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz processor, 8192 KB cache size, and 16 GB memory. The operating system was Linux Mint 14. We used the Pizza & Chili corpus in the experiments by taking the first 1, 5, 10, 20, 50, 100, and 200 MBs of the largest dblp.xml, dna, English, and protein files. The results are shown in Figs. 1, 2, 3, and 4.

² Source code can be downloaded at: http://penguin.ewu.edu/~bojianxu/publications.
³ Available at: https://code.google.com/p/libdivsufsort.

Fig. 1. The processing speed of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.

Finding the leftmost SUS of every location, Figs. 1 and 2. It was not possible to run RSUS on longer strings, since RSUS requires more memory than our machine has, and thus only files of up to 20 MB were included in the RSUS benchmark. Compared to RSUS, we have observed that IKXSUS is on average more than 8 times faster and uses 20 times less memory. The experimental results also revealed that the difference between the processing speeds of OSUS and IKXSUS is negligible, but on average OSUS uses 4 times more memory than IKXSUS.
Finding all SUSes of every location, Figs. 3 and 4. In the experiments of all-SUS finding for every string location, RSUS was not included as it does not have this functionality. We have observed that OSUS uses less memory in all-SUS finding than it needs for one-SUS finding, while IKXSUS's memory cost does not change between one-SUS and all-SUS finding. Overall, IKXSUS uses at least 2 times less memory space than OSUS and also marginally beats OSUS in terms of processing speed.
Although all three works have linear space complexity in both theory and experiments (note that the X axis in all figures uses log scale), IKXSUS and OSUS use significantly less memory space, due to the fact that these two works use simpler data structures rather than the suffix tree used by RSUS. On the other hand, although both IKXSUS and OSUS use the same set of data structures, such as the suffix array, rank array (inverse suffix array), and the lcp array, and these arrays are computed via the same library (libdivsufsort for suffix array construction) and the same algorithm (Kasai et al.'s method [8] for lcp array construction), the peak memory usage of OSUS is much higher than that of IKXSUS. The difference stems from the different mechanisms these studies follow to compute the SUS. OSUS computes the SUS by using an additional array, named the meaningful minimal unique substring array. Thus, the space used for that additional data structure makes OSUS require more memory.
With respect to the processing speed, both IKXSUS and OSUS present stable running times on all dblp, dna, protein, and English texts and scale well on increasing sizes of the target data, conforming to their linear time complexity. On the other hand, RSUS exhibits its quadratic time complexity on all texts, and in particular its running time on English text is much longer when compared to other text types. The speed-up of IKXSUS and OSUS against RSUS can be even more significant as the string size increases.
Fig. 2. The peak memory consumptions of RSUS, OSUS, and our proposal in finding the leftmost SUS of every location on several strings of different sizes.
Fig. 4. The peak memory consumptions of OSUS and our proposal in finding all SUSes of every location on several strings of different sizes.
6. Conclusion
We proposed IKXSUS, an optimal linear-time and linear-space algorithm for shortest unique substring queries. Our algorithm significantly improves on RSUS, the original work on shortest unique substring queries proposed recently [12], both theoretically and empirically, in both space and time costs. Our work was developed independently, without knowledge of OSUS, another recent linear-time and linear-space solution [14] for SUS finding, and uses a different approach. In practice, IKXSUS uses significantly less memory than OSUS while maintaining nearly the same processing speed.
Acknowledgements
We acknowledge the authors of [12,14] for providing their source code.

References
[1] M. Crochemore, W. Rytter, Jewels of Stringology: Text Algorithms, World Scientific, 2003.
[2] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
[3] B. Haubold, N. Pierstorff, F. Möller, T. Wiehe, Genome comparison without alignment using shortest unique substrings, BMC Bioinform. 6 (1) (2005) 123.
[4] X. Hu, J. Pei, Y. Tao, Shortest unique queries on strings, in: Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE), 2014, pp. 161–172.
[5] A.M. İleri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, in: Proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 2014, pp. 172–181.
[6] A.M. İleri, M.O. Külekci, B. Xu, Shortest unique substring query revisited, http://arxiv.org/abs/1312.2738.
[7] L. Ilie, W.F. Smyth, Minimum unique substrings and maximum repeats, Fund. Inform. 110 (1–4) (2011) 183–195.
[8] T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, Linear-time longest-common-prefix computation in suffix arrays and its applications, in: Symposium on Combinatorial Pattern Matching, 2001, pp. 181–192.
[9] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, J. Discrete Algorithms 3 (2–4) (2005) 143–156.
[10] M.O. Külekci, J.S. Vitter, B. Xu, Efficient maximal repeat finding using the Burrows–Wheeler transform and wavelet tree, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2) (2012) 421–429.
[11] S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich, Reputer: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res. 29 (22) (2001) 4633–4642.
[12] J. Pei, W.C.H. Wu, M.Y. Yeh, On shortest unique substring queries, in: Proceedings of IEEE International Conference on Data Engineering (ICDE), 2013, pp. 937–948.
[13] W.F. Smyth, Computing regularities in strings: a survey, European J. Combin. 34 (1) (2013) 3–14.
[14] K. Tsuruta, S. Inenaga, H. Bannai, M. Takeda, Shortest unique substrings queries in optimal time, in: Proceedings of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2014, pp. 503–513.