A Machine Translation System for Turkish-Turkmen Chat


1 (1), 2007, 125-134

©BEYKENT UNIVERSITY


Cemal KOSE (1), Guychmyrat AMANMYRADOV (2), Ozcan OZYURT (3)

e-mail: {(1)ckose, (2)guychmyrat, (3)oozyurt}@ktu.edu.tr

Department of Computer Engineering, Faculty of Engineering, Karadeniz Technical University,

61080 Trabzon, TURKEY

ABSTRACT

This paper describes a Turkish-Turkmen machine translation system. Machine translation between closely related languages is easier than translation between languages that are not closely related. Turkish and Turkmen are closely related, although both have rather complex morphotactics. Less effort is needed to develop a translation system between closely related languages because much of their grammar and vocabulary is shared. Syntactic and semantic analysis may therefore be unnecessary, since the translation system relies on morphological analysis most of the time, and simple translation rules with context-dependent bilingual dictionaries may be sufficient. Turkish and Turkmen are agglutinative languages: words are formed by productive affixation of derivational and inflectional suffixes to root words, so different words can be constructed from roots and morphemes to describe objects and concepts. The translation system takes a Turkish or Turkmen sentence and analyzes each word morphologically. It then translates the roots and morphemes and generates the target Turkish or Turkmen sentence morphologically. Finally, a context-dependent decision mechanism is employed to select the correct output sentence.

Key Words: Natural Language Processing, Machine Translation, Closely Related Languages, Agglutinative Languages.

1. INTRODUCTION

Since the invention of computers, machine translation has attracted the interest of many researchers. Machine translation from one language to another is very difficult for various reasons, such as cultural, conceptual and grammatical differences [2, 3, 7, 8]. A source language may have certain structures or concepts that the target language lacks. Most early researchers in machine translation concentrated on western languages such as English, French and German. Later, many researchers considered other languages and concentrated on translation between English and those languages [7, 8, 12, 15, 16], but machine translation among closely related languages such as the Turkic languages was left largely untouched. In this study, a machine translation method for closely related languages is described in order to set up a Turkic languages domain. Turkish is chosen as the base language among the Turkic languages. This will facilitate translation from other languages into Turkic languages, because developing an independent machine translation system between each non-Turkic language and each Turkic language is much more difficult and costly than developing a single system for all Turkic languages.

Turkish and Turkmen have some syntactic, grammatical and morphological differences, though they are closely related. The main differences are usually in morphemes rather than in deeper levels of grammar. Consequently, different morphemes are used to express the same meaning. The word order, roots and words are usually similar, but sometimes a word has different meanings in the two languages.

In this paper we present a Turkish-Turkmen machine translation system. This system translates basic Turkish text to Turkmen and Turkmen text to Turkish. The rest of the paper is organized as follows. In Section 2, the translation system is described. A summary of Turkish and Turkmen is also given in the same section. The implementation and results are discussed in Section 3. The conclusion and future work are given in Section 4.

2 THE TRANSLATION SYSTEM

The system consists of two main modules: one translates from Turkish to Turkmen and the other from Turkmen to Turkish. The first module carries out morphological analysis of Turkish text, one-to-one translation of words and morphemes from Turkish to Turkmen, and morphological generation of Turkmen text [4, 5, 14]. The second module carries out morphological analysis of Turkmen text, application of context-dependent and grammatical translation rules, one-to-one translation of words and morphemes from Turkmen to Turkish, and morphological generation of Turkish text [12, 13].

The Turkish or Turkmen morphological analyzer first separates the input text into words. Then each input word is separated into its root and morphemes. Using the bilingual dictionary, the root and morphemes of each word are translated into the target language. Finally, the translated Turkish or Turkmen word is checked to verify that its morpheme sequence is valid. In Turkic languages, suffixes cannot be added to a root word in arbitrary order; they must follow a fixed order. The rules determining the order of suffixes are implemented in the system. Once the first word has been translated, the system takes the next word from the input, and the process continues until the end of the input. In the following example, the analyzer produces the outputs for two input words, one Turkish and one Turkmen.

kalemlerimizden > kalem+Noun+A3pl+P1pl+Abl (Turkish to Turkmen)
galamlarymyzdan > galam+Noun+A3pl+P1pl+Abl (Turkmen to Turkish)
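The analysis, transfer and generation steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the one-entry root dictionary, the suffix tables and all function names are assumptions made for this example, and only two-way (back/front) vowel harmony is modelled.

```python
# Toy data: every entry below is an illustrative assumption, not the system's database.
ROOTS_TR_TO_TK = {"kalem": "galam"}

# Suffix allomorphs per tag, stored as (back-vowel form, front-vowel form).
SUFFIXES = {
    "tr": {"A3pl": ("lar", "ler"), "P1pl": ("ımız", "imiz"), "Abl": ("dan", "den")},
    "tk": {"A3pl": ("lar", "ler"), "P1pl": ("ymyz", "imiz"), "Abl": ("dan", "den")},
}
VOWELS = {"tr": "aeıioöuü", "tk": "aeäyioöuü"}
BACK_VOWELS = {"tr": "aıou", "tk": "ayou"}


def parse_lexical(lexical):
    """Split 'kalem+Noun+A3pl+P1pl+Abl' into root, part of speech and suffix tags."""
    root, pos, *tags = lexical.split("+")
    return root, pos, tags


def last_vowel_is_back(word, lang):
    """Return True if the last vowel of the word is a back vowel in that language."""
    for ch in reversed(word):
        if ch in VOWELS[lang]:
            return ch in BACK_VOWELS[lang]
    return True  # no vowel found: fall back to the back-vowel form


def generate(root, tags, lang):
    """Attach one suffix per tag, choosing the allomorph by back/front harmony."""
    surface = root
    for tag in tags:
        back, front = SUFFIXES[lang][tag]
        surface += back if last_vowel_is_back(surface, lang) else front
    return surface


def translate_tr_to_tk(lexical):
    """Translate the root with the dictionary, keep the tags, regenerate in Turkmen."""
    root, _pos, tags = parse_lexical(lexical)
    target_root = ROOTS_TR_TO_TK.get(root, root)  # unknown roots pass through unchanged
    return generate(target_root, tags, "tk")


print(translate_tr_to_tk("kalem+Noun+A3pl+P1pl+Abl"))
```

With these toy tables the Turkish lexical form of "kalemlerimizden" is regenerated as the Turkmen word "galamlarymyzdan", matching the example pair above. Each word is processed once, which is consistent with the linear behaviour the paper reports.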

If the content information is the same, not obligatory or not given, the source text is directly mapped to the target text for any content. In this case, the translation rules for words fall into seven categories: trivial (no change), root change, morpheme change, root and morpheme change, verbs that affect their objects, more than one word mapping to one word, and one word mapping to more than one word. In the first category, the rules make no change to the roots or morphemes, but the translation rules are still applied from Turkish to Turkmen and vice versa. In the second category, the morphemes are preserved and only the root changes; a bilingual dictionary is used to map the root word to the target word. In the third category, some of the morphemes are changed but the root of the word is not. In the fourth category, both the root and some morphemes of the source word are changed, so the root and morphemes of the source text are mapped to the target root and morphemes. In the fifth category, the same verb is used with different meanings of its object in the two languages. In the sixth category, one word in the input text must be expressed with more than one word in the target language, or vice versa. In the seventh and last category, the compound tenses are written separately in the target language.

2.1 Morphotactic rules for Turkish and Turkmen

A morphological module is employed to analyze and synthesize morphological structures in Turkish and Turkmen [1, 11]. It works on two levels, surface and lexical. The surface level is the input as written in the original language. The lexical level is the input decomposed into morphemes. The sequence of morphemes appearing in a word is determined by the morphotactics of the language. The morphotactic rules of Turkish and Turkmen mostly agree: in both languages, the order of the morphemes in a word and the meanings they carry are usually the same. On the other hand, the total number of roots in Turkish and Turkmen differs when words borrowed from non-Turkic languages are taken into consideration. All Turkic and non-Turkic words and suffixes are included in the system's database. The word lists are grouped as nouns, verbs, adjectives, simple numbers, pronouns and connectives.

The morphological analysis in Turkish and Turkmen is realized in three steps: determining the root of an input word, morphological tests, and determination of the morphemes of the input word. The system tries to locate the root and the possible following morphemes by checking possible constructions.

For the input word, the program looks up the suitable roots in the dictionary. For example, if the input word is "Türkmencede", the program finds the roots "tür-", "Türk-", "Türkmen-" and "Türkmence-", and then takes the longest one. Then, the system sequentially adds morphemes to the root and compares the result with the input until it finds the right morpheme. If there is no suitable morpheme for the root, the root is translated to the target language directly. The system may return no result when it encounters an invalid suffix or detects a situation that is invalid according to the rules.
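The root lookup described above can be sketched as a longest-prefix search. This is a minimal sketch under stated assumptions: the small lexicon, the case-insensitive comparison and the function names are hypothetical and only illustrate the "collect candidate roots, then take the longest" step.

```python
def candidate_roots(word, lexicon):
    """Return every lexicon root that is a prefix of the word (case-insensitive)."""
    lowered = word.lower()
    return [root for root in lexicon if lowered.startswith(root.lower())]


def longest_root(word, lexicon):
    """Pick the longest candidate root, or None if no root matches."""
    candidates = candidate_roots(word, lexicon)
    return max(candidates, key=len) if candidates else None


# Hypothetical lexicon containing the four roots named in the text plus one distractor.
LEXICON = ["tür", "Türk", "Türkmen", "Türkmence", "kalem"]

root = longest_root("Türkmencede", LEXICON)
remainder = "Türkmencede"[len(root):]
print(root, remainder)
```

The remainder left after the chosen root ("de" in this case) is what the suffix-matching stage then has to account for.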

In general, Turkish or Turkmen words can be expressed in the following morphological structures.

Noun: root + plural suffix + possessive suffix + case suffix+ conjunction suffix (ki).

Verb: root + voice suffix + negative suffix + necessity suffix + simple tense suffix + question suffix + compound tense suffix + personal endings.

For example, the nouns "arabalardakiler" (people in the cars) and "okuldakiler" (people in the school) are parsed into morphemes as araba-lar-da-ki-ler (maşyn+lar+da+ky+lar) and okul-da-ki-ler (okuw+da+ky+lar), respectively. Similarly, the verbs "silmiyordum" (I was not erasing) and "topluyordun" (you were collecting) are parsed into morphemes as sil-mi-yor-du-m (poz+ma+yar+dy+m) and toplu-yor-du-n (topla+yar+dy+n), respectively. A typical morphological analyzer for Turkish and Turkmen nouns, illustrated in Fig. 1, recognizes most nouns. A corresponding morphological analyzer for verbs is illustrated in Fig. 2.

Fig. 1. A typical finite state model of the morphological parser for nouns.

Here, {} indicates that the letters in the braces may be omitted in some situations, [] indicates that the letters in the brackets can be exchanged with others, mainly according to sound harmony, e indicates that the suffix may sometimes be omitted, and s indicates that a word may end at that state.
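The noun template given earlier (root + plural + possessive + case + ki) can be read as a slot-order check over suffix tags. The sketch below is an assumption-laden illustration, not the finite state model of Fig. 1: the tags Loc, Ki, P1sg and P3sg, the slot numbers and the rule that the template may start over after ki (suggested by araba-lar-da-ki-ler) are all choices made for this example.

```python
# Slot index of each suffix tag in the noun template (hypothetical tag inventory).
NOUN_SLOT = {
    "A3pl": 0,                        # plural
    "P1sg": 1, "P1pl": 1, "P3sg": 1,  # possessive
    "Loc": 2, "Abl": 2,               # case
    "Ki": 3,                          # relative suffix ki
}
KI_SLOT = 3


def valid_noun_sequence(tags):
    """Accept a tag sequence only if the slots appear in template order.

    After ki the template is allowed to start over, so plural or case
    suffixes may follow it; two ki suffixes in a row are rejected.
    """
    last = -1
    for tag in tags:
        slot = NOUN_SLOT.get(tag)
        if slot is None:
            return False  # unknown suffix tag
        if slot <= last and (last != KI_SLOT or slot == KI_SLOT):
            return False  # suffix out of order
        last = slot
    return True


print(valid_noun_sequence(["A3pl", "P1pl", "Abl"]))       # kalem-ler-imiz-den
print(valid_noun_sequence(["A3pl", "Loc", "Ki", "A3pl"]))  # araba-lar-da-ki-ler
print(valid_noun_sequence(["Abl", "A3pl"]))                # case before plural
```

A full analyzer would also constrain what may precede ki; this check only enforces relative order.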

The internal representation of an input word like "kedisi" (his/her cat) is "kedi + sH", which is created with the help of the vowel and consonant harmony rules. The translation system first checks the roots for "kedi" and, if it finds the root there, generates the lexical form "kedi + Noun" as output. Then it goes to the next state. At this state the possible morpheme is +lAr. If the input does not match this morpheme, the system goes on to the next morpheme. In this way the system searches sequentially for the matching suffix "sH". If there is no matching suffix, the system applies the spelling checker to correct the input word. The system continues until it reaches the final state or a state that does not accept the input.
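The sequential suffix search for "kedisi" can be sketched by realizing each suffix template against the stem and comparing it with the rest of the word. This is a minimal sketch for Turkish only: it assumes that A stands for a/e and H for ı/i/u/ü chosen by the preceding vowel, it ignores the {} and [] alternations, and the function names are hypothetical.

```python
TR_VOWELS = "aeıioöuü"
BACK = "aıou"
ROUNDED = "oöuü"


def last_vowel(stem):
    """Return the last vowel of the stem; harmony is decided by this vowel."""
    for ch in reversed(stem):
        if ch in TR_VOWELS:
            return ch
    raise ValueError(f"no vowel in stem {stem!r}")


def realize(stem, template):
    """Turn a template such as 'sH' or 'lAr' into its surface form after the stem."""
    prev = last_vowel(stem)
    out = []
    for ch in template:
        if ch == "A":    # low vowel: a after back vowels, e after front vowels
            ch = "a" if prev in BACK else "e"
        elif ch == "H":  # high vowel: agrees in backness and rounding
            if prev in BACK:
                ch = "u" if prev in ROUNDED else "ı"
            else:
                ch = "ü" if prev in ROUNDED else "i"
        if ch in TR_VOWELS:
            prev = ch    # later vowels harmonize with the one just produced
        out.append(ch)
    return "".join(out)


def match_suffix(stem, rest, templates):
    """Try templates in order; return the first whose surface form starts the rest."""
    for template in templates:
        surface = realize(stem, template)
        if rest.startswith(surface):
            return template, surface
    return None


# "kedisi": +lAr is tried first and fails, then +sH matches "si".
print(match_suffix("kedi", "si", ["lAr", "sH"]))
```

The order of the template list plays the role of the state-by-state search in the text; when `match_suffix` returns None, the described system would hand the word to its spelling checker.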

Some words and suffixes differ between the two languages, or exist in only one of them. For example, the Turkish word "gidebilmek" (to be able to go) is expressed as "gidip bilmek" in Turkmen. Some morphemes, such as +sAl and +(A)dHr, are not present in Turkmen, whereas some extra morphemes, such as +Anok and +gHn (present progressive negative), have to be added.

2.2 Turkish-Turkmen translation

Turkish and Turkmen have very similar morphological structures, so the translation systems in the two directions have similar computational complexity. Both run in linear time, because translation time grows linearly with the number of words. A typical model of the translation system is illustrated in Fig. 3 to explain the time complexity more clearly. Most of the time our system does not need a syntax parser, because the Turkish and Turkmen grammars are quite similar except in a few cases. In general, morphemes are also added to a word in the same sequence. As the following tables show, most of the basic tenses exist in both Turkish and Turkmen and are used similarly. In these corresponding examples, Turkish and Turkmen forms are given side by side; each entry lists the lexical form, the corresponding surface morphemes and any necessary explanation. The structures and rules are almost the same in both languages for the past tense, but other tenses have some differences. Short explanations and comparisons of the simple present tense, compound tenses and the passive voice in Turkish and Turkmen are given in Tables 1, 2 and 3, respectively.


Fig. 3. A flowchart of the essential steps in Turkish-Turkmen translation system

In Turkmen there are three types of present continuous tense. The first is constructed with the "yar/yär" suffixes. The second is constructed with the auxiliary verbs dur, otur, yat and yor. Here, the main verb before the auxiliary verb takes an adverbial suffix, as in yazyp dur and geplap otyr. This type of the tense has no negative form. The third type has only a negative form. In this type the verb is constructed by adding a possessive suffix and the "ok" suffix, derived from the word "yok", e.g. yazamok, okamok, baramok.

Turkish:
  Suffix: -r, -Ar, -Hr (surface forms -r/ar/er/ır/ir/ur/ür)
  Vowel harmony rules apply.
  gel + Ar = gelir (s/he comes)
  kullan + Hr = kullanır (s/he uses)
  söyle + r = söyler (s/he says)
  gizle + r = gizler (s/he hides)

Turkmen:
  Suffix: -Ar (surface forms -ar/er)
  Vowel harmony rules apply.
  Can correspond to the English simple present and future tenses.
  gel + Ar = geler
  ulan + Ar = ulanar
  sözle + Ar = sözlär
  gizle + Ar = gizlär

Table 1: Simple present tense

The second tense of a compound tense in Turkic languages is mostly past or narrative. In Turkmen, the second tense of a compound tense may be written separately with the verb "emek", in which case it comes after the root as "eken". Here, the possessive suffixes are added to the second tense, as explained in Table 2.

Turkish:
  Suffix: -di
  Used for the past tense as second tense.
  It is joined to the root.
  yazmıştın (you had written)

Turkmen:
  Suffix: -eken
  Used for the past tense as second tense.
  yazan ekenin

Table 2: Compound tenses

Although the structure of the passive sentences may be different in Turkish and Turkmen, passive sentences are generated by adding a suffix to the root word in both languages. These rules are explained in Table 3 for both languages.

Turkish:
  Suffix: -Hl, -Hn (surface forms -ıl/il/ul/ül, -ın/in/un/ün)
  ver + Hl + di = verildi (was given)
  yaz + Hl + dı = yazıldı (was written)
  al + Hn + dı = alındı (was taken)
  bil + Hn + di = bilindi (was known)

Turkmen:
  Suffix: -Hl, -Hn (surface forms -yl/il/ul/ül, -yn/in/un/ün)
  -Hn is added to the root if the root ends in the consonant l.
  Vowel and consonant harmony rules apply.
  ber + Hl + di = berildi
  yaz + Hl + dy = yazyldy
  al + Hn + dy = alyndy

Table 3: Passive voice

3 RESULTS

Even though Turkish and Turkmen have similar structures and rules, preparing the rules for Turkish-Turkmen machine translation is not easy. The grammars of these languages are usually similar, and both morphological and semantic ambiguities are usually preserved. Therefore, a separate semantic analyzer for each Turkic language may not be necessary, because an analyzer developed for one of these languages can be applied to the other.

Words                              Share (%)
Same words                         25
Very similar words                 11
Less similar words                 9
Falsely same words                 1
Other words                        54

Suffixes                           Share (%)
Same suffixes                      53
Suffixes with some differences     23
Other suffixes                     24

Table 4: Similarity and usage of Turkish and Turkmen words and suffixes

Table 4 summarizes how similar Turkish and Turkmen words and suffixes are. How well a Turkish speaker understands a Turkmen sentence, or vice versa, varies with the number of differences in the sentence: understandability decreases as the number of morphological differences in the sentence increases.

In the realization of this system, some restrictions and limitations were applied because of difficulties encountered in the implementation. These fall into two categories. Firstly, some tags, such as thematic role tags, are left out. Secondly, since only a few people worked on this project, some limitations were necessary. For example, a limited bilingual dictionary is used.

                                    Turkish to Turkmen   Turkmen to Turkish
Correct translations                60.8%                56.6%
Slightly differing translations     25.4%                26.2%
Hardly understandable translations  9.6%                 11.5%
Wrong translations                  2.2%                 3.7%
Overall translation performance     97.8%                96.3%

Table 5: Translation performance of the system

Twenty medium-level articles (ten Turkish and ten Turkmen) consisting of more than a thousand words were chosen to test the translation system. The performance of the system is evaluated in four categories: completely correct translations, slightly differing translations, hardly understandable translations and wrong translations. The performance of the translation system on these texts is summarized in Table 5. The system mostly produces correct translations. Sometimes it translates a sentence in a slightly different way than expected, but the translation still gives the same meaning. For example, some words may not express exactly the same meaning in the target language.


Hardly understandable translations are encountered rarely. For example, words that cannot be morphologically analyzed are sent directly to the output without any change. The overall performance of the system for Turkish-Turkmen translation is above 96% in both directions, as presented in the last line of Table 5.
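The overall figures in the last line of Table 5 coincide with 100% minus the share of wrong translations. The short check below reproduces that arithmetic from the table values. Reading the overall figure this way is an interpretation, not something the text states, and the dictionary layout and function names are assumptions made for this example. Values are stored in tenths of a percent to avoid floating-point rounding.

```python
# Table 5 values in tenths of a percent (608 means 60.8%).
TABLE5 = {
    "turkish_to_turkmen": {
        "correct": 608,
        "slightly_differing": 254,
        "hardly_understandable": 96,
        "wrong": 22,
    },
    "turkmen_to_turkish": {
        "correct": 566,
        "slightly_differing": 262,
        "hardly_understandable": 115,
        "wrong": 37,
    },
}


def overall_performance(row):
    """Overall performance read as 100% minus the wrong-translation share."""
    return (1000 - row["wrong"]) / 10


def listed_total(row):
    """Sum of the four listed categories, in percent."""
    return sum(row.values()) / 10


for direction, row in TABLE5.items():
    print(direction, overall_performance(row), listed_total(row))
```

Note that the four listed categories add up to 98.0% in each direction rather than 100%, so the overall figure cannot be recovered by summing the first three rows; the sketch therefore derives it from the wrong-translation row only.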

4 CONCLUSION AND FUTURE WORK

This paper has presented the first full-scale implementation of a machine translation system for Turkish-Turkmen chat. These languages have quite similar grammatical, semantic and morphological structures, but they also have some dissimilarities. To achieve acceptable translations, we have tried to cover the largest possible number of rules in this simple Turkish-Turkmen text translation system. Morphological analysis of an input text, from either the Turkish or the Turkmen user interface, is done from left to right by the system. Although this morphological analyzer is quite successful, it can still be improved by adding a right-to-left analyzer. The methods developed for the system may also be applicable to the other Turkic languages, because Azeri, Kazakh, Uzbek and Kyrgyz, as members of the Turkic language family, have similar syntactic, morphological and grammatical structures. In this respect, Turkish may be chosen as a base language among the Turkic languages, and translation between them may then easily be done via the base language.

REFERENCES

[1] Altıntaş Kemal and Çiçekli İlyas, A Morphological Analyzer for Crimean Tatar, in: Proceedings of Turkish Artificial Intelligence and Neural Network Conference (TAINN 2001), North Cyprus, 2001.

[2] Altıntaş Kemal and Çiçekli İlyas, A Machine Translation System Between a Pair of Closely Related Languages, in: Proceedings of the 17th International Symposium on Computer and Information Sciences (ISCIS 2002), Orlando, Florida, CRC Press, pp. 192-196, 2002.

[3] Allen James, Natural Language Processing (second edition), The Benjamin/Cummings Publishing Company, Inc., 1995.

[4] Hakkani-Tür Dilek, Oflazer Kemal and Tür Gökhan, Statistical morphological disambiguation for agglutinative languages, Computers and the Humanities, 36(4), 2002.

[5] Hakkani-Tür D. Z., Oflazer K. and Tür G., Statistical morphological disambiguation for agglutinative languages, in: Proceedings of COLING 2000 - ICCL, 2000.

[6] Kuboň Vladislav, Hajič Jan and Hric Jan, Machine Translation of Very Close Languages, in: ANLP-NAACL 2000, Washington, January 2000.

[7] Li H, Japkowicz N, Barriere C, English to Chinese translation of prepositions, Lecture Notes in Computer Science 3501: 412-416, 2005.

[8] Mahsut M., Ogawa Y., Sugino K., Toyama K., Inagaki Y., An experiment on Japanese-Uighur machine translation and its evaluation, Machine Translation: From user to research, Proceedings, Lecture Notes in Computer Science Vol. 3265, pp. 208-216, 2004.

[9] Nabiyev Vasif V., Yapay Zeka: problemler-Yöntemler-Algoritmalar, 2. Baskı, Seçkin Yayınevi, Ankara, 2005.

[10] Nabiyev Vasif V., Yazıcı R. and Ulutaş M., Akraba diller için bilgisayar destekli çeviri sistemi, 9. Türk zeka ve sinir ağları sempozyumu, pp.393-397, 2000.

[11] Oflazer Kemal, Two-level Description of Turkish Morphology, Literary and Linguistic Computing, vol. 9, No:2, 1994.

[12] Oh JH, Choi KS, Machine learning based English-to-Korean transliteration using grapheme and phoneme information, IEICE Transaction on Information Systems E88D (7): 1737-1748, July 2005.

[13] Say Bilge, Zeyrek Deniz, Oflazer Kemal, and Özge Umut, Development of a corpus and a treebank for present-day written Turkish, 11th International Conference on Turkish Linguistics, 2002.

[14] The Zemberek project (Morphological parsing of Turkish words). zemberek.dev.java.net.

[15] Udupa R and Faruquie T. A., An English-Hindi statistical machine translation system, Lecture Notes in Computer Science 3248: 254-262, 2005.

[16] Wang XJ, Ren FJ, Chinese-Japanese clause alignment, Lecture Notes in Computer Science 3406: 400-412, 2005.