by
ESRA VURAL
Submitted to the Graduate School of Engineering and Natural Sciences
in partial fulfillment of
the requirements for the degree of
Master of Science
Sabancı University
July 2003
APPROVED BY
Kemal OFLAZER ...
(Thesis Supervisor)
Hakan ERDOĞAN ...
Yücel SAYGIN ...
DATE OF APPROVAL: ...18/07/2003...
Sabancı University 2003
All Rights Reserved
Acknowledgments
I am thankful to my thesis supervisor, Kemal Oflazer, for introducing me to this
challenging subject. His knowledge, patience and insight were very helpful in
advancing the project. I am also grateful to Hakan Erdoğan for his motivation and
criticism during the preparation of the thesis. I would also like to thank Yücel
Saygın for serving on my thesis committee and for his revisions of the thesis.
I would like to thank my parents and sister for their love, support and
encouragement. Lastly, I am thankful to Aydın Akyol for his patience and support
in allowing us to record his voice.
Abstract
Naturalness is very important for a Text-to-Speech system to produce a
high-quality waveform. The naturalness of the waveform is highly correlated with
phonetic coverage and with prosodic features such as duration and F0 contour.
Duration determines the timing of each synthesized phoneme, whereas the F0
contour determines the fundamental frequency component of the waveform.
This thesis presents the development of a prosodic Text-to-Speech system for
the Turkish language using the Festival tool [31]. We describe a complete
realization of a new male voice, covering the allophones of Turkish and using
duration and F0 parameters. The duration of the allophones and word stress have
been studied extensively. Sentence stress and phrasal stress are also discussed,
but in less detail.
Carrier words are designed for approximately all allophone-allophone
combinations. 1680 carrier words are recorded in a sound-proof recording studio.
LPC (linear predictive coding) and RES (residual) parameters are computed. The
text normalisation module is implemented for abbreviations and numbers.
Durations for the allophones are entered. Sentence-level and word-level F0
generation modules are implemented. By increasing the number of phonemes and
adding prosody, we obtained a more natural-sounding Text-to-Speech system for
the Turkish language.
TÜRKÇE İÇİN VURGULU METİNDEN SES SENTEZLEYİCİSİ
(A Prosodic Text-to-Speech Synthesizer for Turkish)
Özet
Naturalness plays a very important role in Text-to-Speech systems in obtaining
a high-quality speech waveform. The naturalness of the waveform is related to
phonetic coverage and to the prosodic features, namely the pitch frequency
contour and duration information. Duration information determines the timing of
the synthesized phoneme, while the pitch frequency contour covers the
fundamental frequency characteristics of the waveform.
In this thesis, a prosodic Text-to-Speech synthesizer for Turkish has been
developed using the Festival speech synthesis system [31]. A new male voice has
been created, covering the allophones of Turkish and using fundamental frequency
and duration information. The duration of the allophones and word stress have
been studied extensively. Sentence stress and phrasal stress have been studied
in less detail.
Carrier words have been created for all allophone combinations. 1680 carrier
words have been recorded in a sound-insulated recording studio. LPC and RES
parameters have been computed. A module that normalizes text for abbreviations
and numbers has been developed. Duration information has been entered for the
allophones. F0 generation modules have been developed at the sentence and word
levels. By increasing the number of phonemes and creating prosody, a more
natural Text-to-Speech system for Turkish has been obtained.
Contents
Acknowledgments
Abstract
Özet
1 Introduction
1.1 Introduction To Speech Synthesis
1.2 Text-to-Speech Systems
1.3 Prosodic Turkish TTS Synthesizer
1.4 Fundamental Differences Between TTS Systems and Other Talking Machines
1.5 Application Areas Of TTS Systems
1.6 Review of Previous Work
1.6.1 Current Work on TTS Systems
1.6.2 Turkish TTS Systems
2 Text-to-Speech
2.1 Stages Of TTS Conversion
2.2 The Natural Language Processing Component
2.2.1 Text Analysis
2.2.2 Phonetic Analysis
2.3 Digital Signal Processing Component
2.3.1 Prosodic Analysis
2.3.2 Phonetics
2.3.3 Speech Synthesis
3 The Festival Speech Synthesis System
3.1 Introduction To Festival Speech Synthesis System
3.2 Festival Text To Speech
3.3 Utterance Structure
3.4 Relations
3.5 Modules
3.6 Utterance Building
3.7 Diphone Databases
3.8.1 Cluster Unit Selection
3.8.2 Diphones from general databases
3.9 Building prosodic models
3.9.1 Phrasing
3.9.2 Accent/Boundary Assignment
3.9.3 F0 Generation
3.9.4 F0 by rule
3.9.5 F0 by linear regression
3.9.6 Tilt Modelling
3.9.7 Duration
4 A Prosodic Turkish Text-To-Speech System
4.1 Turkish Phonetization
4.2 Stress in Turkish
4.2.1 Role of Stress In Turkish Words
4.2.2 Phonetic Correlates of Stress
4.2.3 Distinctions between different levels of stress
4.2.4 Word-accent
4.3 Sentence intonation
4.4 Designing and Recording of a Diphone Corpus
4.5 Text Normalization
4.6 Designing the Lexicon
4.7 Designing the Intonation
4.8 Designing the Duration
5 Conclusion and Further Research
5.1 Conclusion
5.2 Further Research
Bibliography
List of Figures
2.1 Simple TTS Synthesis Procedure
2.2 Basic System Architecture of a TTS system
2.3 Natural Language Processing module of a general TTS Conversion System
2.4 Digital signal processing component of a general TTS conversion system
2.5 Block Diagram of a prosody generation system
2.6 Enriched Prosody Representation
2.7 Different kinds of information provided by intonation (lines indicate pitch movements; solid lines indicate stress) [1]. a. Focus or given/new information; b. Relationships between words (saw-yesterday; I-yesterday; I-him); c. Finality (top) or continuation (bottom), as it appears on the last syllable
2.8 The human vocal organs
2.9 Block diagram of a synthesis-by-rule system
3.1 An example representation of an utterance structure. This example shows the word relation and the syntax relation. The syntax relation (shown on top) is a tree with links connecting the nodes, shown as black circles. The word relation (shown on the bottom) is a list. The items contain the actual linguistic information and are shown in the rounded boxes. The dotted lines show the connections between the nodes and items.
3.2 Close-up pitchmarks in waveform signal
3.3 ToBI Parameters
3.4 Tilt Parameters
List of Tables
2.1 Unit types in English assuming a phone set of 42 phonemes. Longer units produce higher quality at the expense of more storage.
4.1 Turkish Vowel Inventory
4.2 TURKISH PHONETIC ENCODING FOR VOWELS
4.3 TURKISH PHONETIC ENCODING FOR VOWELS CONTINUED
4.4 TURKISH PHONETIC ENCODING FOR CONSONANTS
4.5 ALLOPHONES USED IN TTS FOR VOWELS
4.6 ALLOPHONES USED IN TTS FOR CONSONANTS
4.7 Letter to Sound Conversion Table
4.8 Duration of the allophones in milliseconds [36].
4.9 Duration of the allophones in milliseconds [36].