
by

ESRA VURAL

Submitted to the Graduate School of Engineering and Natural Sciences

in partial fulfillment of

the requirements for the degree of

Master of Science

Sabancı University

July 2003

APPROVED BY

Kemal OFLAZER ...
(Thesis Supervisor)

Hakan ERDOĞAN ...

Yücel SAYGIN ...

DATE OF APPROVAL: 18/07/2003

Sabancı University 2003

All Rights Reserved

Acknowledgments

I am thankful to my thesis supervisor Kemal Oflazer for introducing me to this challenging subject. His knowledge, patience and insight were very helpful in advancing through the project. I am also grateful to Hakan Erdoğan for his motivation and criticism during the preparation of the thesis. I would also like to thank Yücel Saygın for serving on my thesis committee and for his revisions of the thesis.

I would like to thank my parents and sister for their love, support and encouragement. Lastly, I am thankful to Aydın Akyol for his patience and support, and for allowing us to record his voice.

Abstract

Naturalness in Text-to-Speech systems is very important in achieving a high-quality waveform. The naturalness of the waveform is highly correlated with phonetic coverage and with prosodic features such as duration and F0 contour. Duration determines the timing of each synthesized phoneme, whereas the F0 contour determines the fundamental frequency component of the waveform.

This thesis presents the development of a prosodic Text-to-Speech system for the Turkish language using the Festival tool [31]. We describe a complete realization of a new male voice, covering the allophones of Turkish, using duration and F0 parameters. The duration of the allophones and the word stress have been studied extensively. Sentence stress and phrasal stress are also discussed, but in less detail.

Carrier words are designed for approximately all allophone-allophone combinations. 1680 carrier words are recorded in a sound-proof recording studio. LPC (linear predictive coding) and RES (residual) parameters are computed. The text normalisation module is implemented for abbreviations and numbers. Durations for the allophones are entered. Sentence-level and word-level F0 generation modules are implemented. By increasing the number of phonemes and adding prosody, we obtained a more natural-sounding Text-to-Speech system for the Turkish language.

TÜRKÇE İÇİN VURGULU METİNDEN SES SENTEZLEYİCİSİ

Özet

Metinden ses sentezleyicisi sistemlerinde doğallık, kaliteli bir ses dalgası elde edilmesinde çok önemli bir rol oynar. Ses dalgasının doğallığı, fonetik kapsama ve vurgusal özellikler olan perde frekans eğrisi ve süre bilgileriyle ilişkilidir. Süre bilgisi sentezlenen fonemin zaman bilgisini belirler, perde frekans eğrisi ise ses dalgasının temel frekans özelliklerini kapsar.

Bu tezde, Festival ses sentezleme sistemi kullanılarak, Türkçe için vurgulu metinden ses sentezleyicisi geliştirilmiştir [31]. Yeni bir erkek sesi, Türkçedeki alofonları kapsayarak, temel frekans ve süre bilgileri kullanılarak oluşturulmuştur. Alofonların süresi ve kelime vurgusu geniş çapta çalışılmıştır. Cümle vurgusu ve kelime öbek vurgusu daha az detaylı olarak çalışılmıştır.

Tüm alofon kombinasyonları için taşıyıcı kelimeler oluşturulmuştur. 1680 tane taşıyıcı kelime ses yalıtımlı bir kayıt stüdyosunda kaydedilmiştir. LPC ve RES parametreleri hesaplanmıştır. Kısaltmalar ve sayılar için metni normalize eden bir modül geliştirilmiştir. Alofonlar için süre bilgisi girilmiştir. Cümle ve kelime seviyelerinde F0 üretim modülleri geliştirilmiştir. Fonem sayısını arttırarak ve vurgu yaratarak Türkçe için daha doğal bir metinden ses sentezleyici sistem elde edilmiştir.

Contents

Acknowledgments
Abstract
Özet

1 Introduction
  1.1 Introduction to Speech Synthesis
  1.2 Text-to-Speech Systems
  1.3 Prosodic Turkish TTS Synthesizer
  1.4 Fundamental Differences Between TTS Systems and Other Talking Machines
  1.5 Application Areas of TTS Systems
  1.6 Review of Previous Work
    1.6.1 Current Work on TTS Systems
    1.6.2 Turkish TTS Systems

2 Text-to-Speech
  2.1 Stages of TTS Conversion
  2.2 The Natural Language Processing Component
    2.2.1 Text Analysis
    2.2.2 Phonetic Analysis
  2.3 Digital Signal Processing Component
    2.3.1 Prosodic Analysis
    2.3.2 Phonetics
    2.3.3 Speech Synthesis

3 The Festival Speech Synthesis System
  3.1 Introduction to Festival Speech Synthesis System
  3.2 Festival Text to Speech
  3.3 Utterance Structure
  3.4 Relations
  3.5 Modules
  3.6 Utterance Building
  3.7 Diphone Databases
    3.8.1 Cluster Unit Selection
    3.8.2 Diphones from general databases
  3.9 Building prosodic models
    3.9.1 Phrasing
    3.9.2 Accent/Boundary Assignment
    3.9.3 F0 Generation
    3.9.4 F0 by rule
    3.9.5 F0 by linear regression
    3.9.6 Tilt Modelling
    3.9.7 Duration

4 A Prosodic Turkish Text-To-Speech System
  4.1 Turkish Phonetization
  4.2 Stress in Turkish
    4.2.1 Role of Stress in Turkish Words
    4.2.2 Phonetic Correlates of Stress
    4.2.3 Distinctions between different levels of stress
    4.2.4 Word-accent
  4.3 Sentence intonation
  4.4 Designing and Recording of a Diphone Corpus
  4.5 Text Normalization
  4.6 Designing the Lexicon
  4.7 Designing the Intonation
  4.8 Designing the Duration

5 Conclusion and Further Research
  5.1 Conclusion
  5.2 Further Research

Bibliography

List of Figures

2.1 Simple TTS Synthesis Procedure
2.2 Basic System Architecture of a TTS system
2.3 Natural Language Processing module of a general TTS Conversion System
2.4 Digital signal processing component of a general TTS conversion system
2.5 Block Diagram of a prosody generation system
2.6 Enriched Prosody Representation
2.7 Different kinds of information provided by intonation (lines indicate pitch movements; solid lines indicate stress) [1]. a. Focus or given/new information; b. Relationships between words (saw-yesterday; I-yesterday; I-him); c. Finality (top) or continuation (bottom), as it appears on the last syllable
2.8 The human vocal organs
2.9 Block diagram of a synthesis-by-rule system
3.1 An example representation of an utterance structure. This example shows the word relation and the syntax relation. The syntax relation (shown on top) is a tree with links connecting the nodes, shown as black circles. The word relation (shown on the bottom) is a list. The items contain the actual linguistic information and are shown in the rounded boxes. The dotted lines show the connections between the nodes and items.
3.2 Close-up pitchmarks in waveform signal
3.3 ToBI Parameters
3.4 Tilt Parameters

List of Tables

2.1 Unit types in English assuming a phone set of 42 phonemes. Longer units produce higher quality at the expense of more storage.
4.1 Turkish Vowel Inventory
4.2 TURKISH PHONETIC ENCODING FOR VOWELS
4.3 TURKISH PHONETIC ENCODING FOR VOWELS CONTINUED
4.4 TURKISH PHONETIC ENCODING FOR CONSONANTS
4.5 ALLOPHONES USED IN TTS FOR VOWELS
4.6 ALLOPHONES USED IN TTS FOR CONSONANTS
4.7 Letter to Sound Conversion Table
4.8 Duration of the allophones in milliseconds [36].
4.9 Duration of the allophones in milliseconds [36].
