Effects of diacritics in Turkish information retrieval


DOKUZ EYLÜL UNIVERSITY GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES

EFFECTS OF DIACRITICS IN TURKISH INFORMATION RETRIEVAL

by Nefise Meltem CEYLAN

July, 2010 İZMİR

EFFECTS OF DIACRITICS IN TURKISH INFORMATION RETRIEVAL

A Thesis Submitted to the Graduate School of Natural and Applied Sciences of Dokuz Eylül University In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering, Computer Engineering Program

by Nefise Meltem CEYLAN

July, 2010 İZMİR

M.Sc THESIS EXAMINATION RESULT FORM

We have read the thesis entitled “EFFECTS OF DIACRITICS IN TURKISH INFORMATION RETRIEVAL” completed by NEFİSE MELTEM CEYLAN under supervision of ASST. PROF. DR. ADİL ALPKOÇAK and we certify that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

Asst. Prof. Dr. Adil Alpkoçak
Supervisor

(Jury Member)
(Jury Member)

Prof. Dr. Mustafa SABUNCU
Director
Graduate School of Natural and Applied Sciences

ACKNOWLEDGMENTS

I would like to thank my thesis advisor, Asst. Prof. Dr. Adil Alpkoçak, for his help, suggestions, and patient and systematic guidance throughout all phases of this thesis. Furthermore, I would like to thank my friend Esra Aycan for her motivation and help, both with my thesis and in life; she has always been more than a classmate to me. Lastly, my special thanks go to my family, the most valuable asset of my life, for all the support, patience and happiness they have given me throughout my life.

Nefise Meltem CEYLAN

EFFECTS OF DIACRITICS IN TURKISH INFORMATION RETRIEVAL

ABSTRACT

In this study, we investigated the loss of retrieval performance caused by writing Turkish text with the English alphabet. As a starting point, we first calculated the loss rate when either the documents or the queries do not include any Turkish diacritics. We then applied document and query expansion techniques to recover the initial performance. In addition to document indexing and query processing techniques, we also investigated the effects of stemming, weighting methods and query lengths on Turkish information retrieval. Our test results show that the document expansion technique and equivalence classes produced good results in recovering the initial performance.

Key Words: Information Retrieval, Document Expansion, Query Expansion, Turkish Information Retrieval, Diacritics.

FONETİK İŞARETLERİN TÜRKÇE BİLGİ ERİŞİMİNDEKİ ETKİLERİ (EFFECTS OF DIACRITICS IN TURKISH INFORMATION RETRIEVAL)

ÖZ

In this study, we investigated the loss in information retrieval performance that occurs when Turkish texts are written with the English alphabet. As a starting point, we first determined how performance is affected when Turkish characters are absent from either the documents or the queries. We then tried to recover the initial performance by applying document and query expansion methods. Besides the document indexing and query processing techniques applied, we investigated the effects of stemming, weighting methods and queries of different lengths on Turkish information retrieval. Our test results showed that document expansion and the application of equivalence classes produced good results in reaching the initial performance.

Keywords: Information Retrieval, Document Expansion, Query Expansion, Turkish Information Retrieval, Diacritics.

CONTENTS

M.Sc THESIS EXAMINATION RESULT FORM
ACKNOWLEDGMENTS
ABSTRACT
ÖZ

CHAPTER ONE - INTRODUCTION

CHAPTER TWO - PRELIMINARIES AND DEFINITIONS

CHAPTER THREE - EXPANSION METHODS for TURKISH TEXT with DIACRITICS
3.1 Token Normalization by Equivalence Classes
3.2 Document Expansion
3.3 Query Expansion

CHAPTER FOUR - EXPERIMENTATIONS
4.1 Terrier
4.2 Weighting Models
4.3 Retrieval Procedure
4.4 Evaluation Measures
4.4.1 Precision & Recall
4.4.2 Document Level Averages
4.4.3 MAP
4.4.4 Bpref
4.4.5 GMAP
4.4.6 Mean Reciprocal Rank
4.5 Experimentation Results
4.5.1 Performance for Expansion and Equivalent Classes
4.5.2 Performance for Different Stemming Methods
4.5.3 Performance for Different Weighting Models
4.5.4 Performance for Different Query Lengths

CHAPTER FIVE - DISCUSSION AND CONCLUSION

REFERENCES
APPENDIX A
APPENDIX B

CHAPTER ONE
INTRODUCTION

The online data on which information retrieval operates grows larger every day without any rule or control. According to the Netcraft Web Survey, the number of host names on the internet reached approximately 200,000,000 by April 2010 (Netcraft, 2010). Along with this growth, information retrieval has to deal with large-scale document collections created for different purposes, in many different languages, by numerous web users.

Information retrieval is concerned with classifying, indexing and searching this huge amount of data. Various approaches have therefore been applied to indexing, retrieval and ranking, some of which are kept secret. Stemming, query expansion and matching functions are only some of these approaches. In addition, more specific, language-dependent methods are required to improve results. For this purpose, the major points in which a language differs from others must be determined. For Turkish, these are the differences of the Turkish alphabet and the grammatical structure of suffixes.

Starting from the differences of the Turkish alphabet, we investigated what is lost when either the documents or the queries are not typed in the Turkish character set. Many computer users generate new data on the web for their own needs. These data include commercial or official information, but also uncontrolled personal data from e-mails, forums and similar sources. Such uncontrolled data leads to misspellings and typing errors; in Turkish, these errors occur especially on the Turkish characters of the alphabet. Users often type the words of queries or documents without Turkish characters, whether for speed, out of laziness, to avoid encoding errors, or from habits formed in the days when many computer systems did not support Turkish characters. Missing Turkish characters may cause a word to lose its real meaning. As human readers, we generally recover this loss of meaning automatically from the context. Information retrieval systems, however, are not intelligent enough to close this gap in meaning. As a result, the recall of the retrieval process decreases, because the retrieval system cannot match the exact terms in the index.

Information retrieval performance has become a popular issue with the rise of search engines. Grefenstette & Nioche (2000) estimated the growth of a number of European languages on the web. As the share of non-English web pages grew, Bar-Ilan & Gutman (2003) studied how search engines handle non-English queries. Soraka (2000) examined the performance of search engines for Polish, and Kazimierz Choroś (2005) tested the effectiveness of retrieval for queries using Polish words with diacritics. Similarly, Haidar Moukdad (2006) studied the same subject for Arabic. In addition to these Arabic studies, Amjad M. Daoud dealt with morphological analysis and the compression of diacritical Arabic text (Daoud, 2010).

For Turkish text, Bitirim et al. (2002) inspected the information retrieval effectiveness of Turkish search engines with 17 specific queries, two of which concern the mistyping of Turkish characters in queries. According to their study, the retrieval effectiveness of a query changes if its Turkish characters are written with the English alphabet. The number of relevant documents decreases when Turkish characters are used in a query. The claim they support with their test results is that using Turkish characters such as ‘i’, ‘ş’, ‘ç’ in queries leads to problems for Turkish search engines. To the best of our knowledge, although many different information retrieval techniques have been proposed for Turkish documents, the performance degradation caused by not using Turkish characters has not been studied.
Given this issue, this thesis aims to reveal the effects of mistyped Turkish characters in documents and queries on information retrieval effectiveness. We first clarify the problem of the performance loss caused by typing the diacritics of Turkish words with the English alphabet. Then we present three different approaches to this problem. Firstly, during the indexing process, equivalence classes of terms are obtained by normalization, and the terms are indexed according to their equivalence classes. In addition to this technique, we apply query expansion and document expansion methods during indexing and retrieval. From the results of these tests we show what is lost by writing Turkish with the English alphabet and how this problem can be dealt with. As the result of this study, we quantify what is lost by writing Turkish characters in the English alphabet and to what extent the initial level of performance can be recovered.

The rest of this thesis is organized as follows. Section 2 gives brief information on the Turkish language and on the methods applied, under the headings of definition of the Turkish language, normalization and equivalence classes, query expansion and document expansion. Section 3 summarizes the major points of our method. Section 4 shows the experimental results obtained with the given techniques. Finally, Section 5 discusses the results and looks at future studies on this subject.

CHAPTER TWO
PRELIMINARIES AND DEFINITIONS

The Turkish language belongs to the Altaic branch of the Ural-Altaic language family. There are many different dialects of Turkish; this thesis and its experiments concern the dialect spoken in the Republic of Turkey and Cyprus. The Turkish alphabet consists of Latin characters and includes 8 vowels and 21 consonants. The consonants are “b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z” and the vowels are “a, e, ı, i, o, ö, u, ü”. Besides the consonant-vowel classification, Turkish characters can carry five diacritical marks: breve (ğ), circumflex (â), umlaut (ü, ö), cedilla (ç, ş) and dot/tittle (i, İ).

• Breve: In Turkish, Ğ/ğ lengthens the preceding vowel. It is thus placed between two vowels and is silent in standard Turkish, but may be pronounced [ɡ] in some regional dialects or varieties closer to Ottoman Turkish.

• Circumflex: used to soften the letter “A”, as in “Âdet” (tradition) and “Kâğıt” (paper).

• Umlaut: a pair of dots (¨) placed over a vowel to mark the affected vowel sound, distinguishing Ö from O and Ü from U.

• Cedilla: a hook (¸) added under the consonants c and s as a diacritical mark to modify their pronunciation. The character “ç” represents the voiceless postalveolar affricate (as in English “church”), as in Çorum. The character “ş” represents the voiceless postalveolar fricative (as in “show”), as in şeker.

• Dot/tittle: a small distinguishing mark, such as the dot on an i or j. The Turkish alphabet includes two distinct versions of the letter I, one dotted (İ, i) and the other dotless (I, ı).

These five diacritical marks make up the main characteristics of the Turkish language. Based on these characteristics, this study deals with the use of diacritics, except the circumflex, in documents or queries. According to the definitions above, writing these characters without their umlaut, dot, cedilla or breve turns them into different characters of the Latin alphabet. These characters and their standard Latin forms are given in Table 2.1.

Table 2.1 Turkish characters and their standard Latin character equivalents.

Turkish character          ç,Ç   ğ,Ğ   ı,I   i,İ   ö,Ö   ş,Ş   ü,Ü
Standard Latin character   c,C   g,G   i,I   i,I   o,O   s,S   u,U

From this perspective, we focus on what is lost by the mistyping of Turkish diacritics in documents and queries. Moreover, this thesis also addresses which techniques can be applied during indexing and retrieval to overcome this problem.

Although English appears to be the most popular language for internet searches, non-English languages account for a large share of internet use. The growing ratio of non-English languages leads to new information retrieval problems that depend on the characteristics of these languages. Language-dependent problems are not limited to the encodings of systems; they also concern the compatibility and utility of these systems. With the development of encoding systems, all the characteristics of a language have become available in the digital world. Different language characteristics call for more intelligent and customized search engines to overcome language-specific problems. Turkish diacritics are among the most characteristic morphological features of the Turkish language, and this study focuses on the use of diacritics in internet searches.
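The mapping in Table 2.1 can be sketched in a few lines of Python. This is only an illustration of the character replacement, not the code used in the thesis; the names `FOLD_TABLE` and `fold_diacritics` are hypothetical.

```python
# Mapping taken from Table 2.1: each Turkish character and its
# standard Latin equivalent (lower and upper case).
TURKISH_CHARS = "çÇğĞıİöÖşŞüÜ"
LATIN_CHARS = "cCgGiIoOsSuU"

# Hypothetical name; str.maketrans builds a one-to-one character table.
FOLD_TABLE = str.maketrans(TURKISH_CHARS, LATIN_CHARS)


def fold_diacritics(text: str) -> str:
    """Replace every Turkish character with its Latin equivalent."""
    return text.translate(FOLD_TABLE)


if __name__ == "__main__":
    for word in ["çiçek", "Kıbrıs", "ödülleri", "beyaz"]:
        print(word, "->", fold_diacritics(word))
```

Words that contain no Turkish characters, such as “beyaz”, pass through unchanged, which is why such terms do not produce a second form later on.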

The characteristics of Turkish characters resemble the use of diacritics in other languages. In English, diacritics have little effect on the meaning of a word: cliché and cliche, or naïve and naive, have the same meaning and have to be matched by the retrieval process. German has the umlauts “ä, ö, ü”, similar to the Turkish umlauts “ö, ü”. Unlike Turkish umlauts, German umlauts can be represented by other characters, “ae, oe, ue”, in documents or queries, whereas Turkish umlauts cannot be represented by any other character or combination of characters. For English and German, removing diacritics during token normalization therefore seems quite reasonable (Braschler, Ripplinger & Schauble, 2002). In Polish, however, a word without its diacritics may even become another word in a different language and lose its real meaning, as stated by Kazimierz Choroś (2005). Similarly, in Spanish the meaning of a term changes with the use of diacritics; for instance, peña is ‘a cliff’ while pena is ‘sorrow’. In languages like Spanish, Polish and Turkish, diacritics are a regular part of the writing system and distinguish different sounds, as stated by Manning, Raghavan & Schütze (chap. 2, 2008). However, the main decision point for processing diacritics is not the linguistic properties of a language: the use of diacritics depends on user habits and the limitations of computer systems. For this reason, Manning et al. argue for normalizing tokens by removing diacritics (Manning et al., chap. 2, 2008). Among search engines, Google has since 2006 determined for each query term whether a form with diacritics exists and searched for both forms; the search can be forced to retrieve the exact form by adding a plus sign (+) to the term (Google Web Master Central Blog, 2006).

CHAPTER THREE
EXPANSION METHODS for TURKISH TEXT with DIACRITICS

In this project we aim to determine the degradation that mistyped Turkish characters cause in information retrieval systems, and we test it for different stemming techniques and query lengths. Three different approaches, document expansion, query expansion and equivalence classes, are tested for improving the retrieval performance that decreases when Turkish diacritics are mistyped. Applying equivalence classes to either documents or queries quantifies the degradation, and the combination of all these methods shows how the performance loss can be regained even when diacritics are mistyped in both documents and queries. In addition, we compare our results for two document weighting models, TF-IDF and BM25.

3.1 Token Normalization by Equivalence Classes

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set (Manning et al., chap. 2, 2008). The token normalization process is based on three notions: token, type and term. A token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. Lastly, a term is a (normalized) type that is included in the system dictionary (Nenkova, n.d.). When documents and queries are broken up into tokens, some decisions have to be made to obtain the terms to be indexed, because tokens in a collection that have the same meaning do not in general consist of exactly the same character sequence. The reason is that a token can occur in different forms or be typed in different ways.
In the retrieval process, tokens that have different character sequences but the same logical meaning have to be matched. For instance, if the term USA is searched, documents containing U.S.A. also have to be matched. In this study, equivalence classes are used to map terms that contain Turkish characters to the forms typed with the Latin equivalents given in Table 2.1. For instance, if a search runs for the term “Kıbrıs”, equivalence classes make it possible to match documents in which the term is typed as “Kibris”, “Kıbris” or “Kibrıs”.

3.2 Document Expansion

Document expansion is a technique that aims to increase information retrieval effectiveness by adding different forms of tokens to the lexicon during indexing. Document expansion requires more space to store postings. At first sight this seems a disadvantage, but with the development of modern technology the cost of space has lost its importance and distinct posting lists have become a preferable option (Manning et al., chap. 2, 2008). Document expansion indexes different forms of a term. That is to say, if a document contains the term “Internet”, a common example of miswriting in Turkish, the term is indexed as both “ınternet” and “internet” in the lexicon, so a search for “ınternet” also matches the document. As a result, a query term can match the relevant documents whether or not it is written with Turkish characters.

3.3 Query Expansion

Query expansion is based on the idea of adding to the query different terms that preserve its meaning or morphological structure. Query expansion increases retrieval time, because the number of query terms grows. For both document expansion and query expansion, the approach applied when indexing documents or processing queries is to extend indexing and retrieval with new synthetic forms of a type for a token.
For example, in the English language, run, runs, ran and running are forms of the same type, conventionally written as RUN. In this study, new synthetic forms of a type are generated by applying equivalence classes to terms that contain Turkish characters. Similar approaches are used for query expansion and document expansion; the queries are also expanded with the help of equivalence classes. Both query expansion and document expansion are performed in order to compare the effectiveness of the two methods with each other. Additionally, the effect on retrieval performance of applying document and query expansion together, during indexing and retrieval in the same test, is observed.

The starting point of this project is to reveal the degradation in performance when users type queries without diacritics. Thus, all the Turkish characters of the terms in the queries are changed to their non-Turkish forms during token normalization. A similar technique is applied for the case where the documents are not typed with Turkish characters but the queries are. During token normalization for indexing, all Turkish characters are replaced by their non-Turkish forms, so that a lexicon without any Turkish characters is obtained. The initial lexicon, which stores the original form of each term in the test collection, is called “A”; as this lexicon is the basis of all other synthetic lexicons, it stores the terms exactly as they appear in the documents. The lexicon generated by equivalence classes, which does not contain any Turkish characters, is named “X”.
For instance, if a user submits the query “Sanat ödülleri”, tests that use lexicon “A” can match documents that include the terms “Sanat” or “ödülleri”, because lexicon “A” holds the terms in their original forms as in the documents. The result of this search is not the same for lexicon “X”, because “X” does not contain the term “ödülleri”; this term has turned into the new form “odulleri”. The remarkable point, however, is that lexicon “A” cannot match documents in which the term “ödülleri” is written without diacritics. After the ratio of the loss is calculated, document expansion and query expansion methods are combined and the retrieval performance is measured at each step. During document and query expansion all tokens are normalized according to the parameters of the test; if the test includes an expansion technique for documents or queries, the non-Turkish form of each token is generated and added to the index or query, provided that the new form differs from the initial form.

For the tests based on document expansion, another synthetic lexicon, consisting of the terms written with Turkish characters and their equivalent forms, is generated and named “AX”. For instance, suppose the word “çiçek” (“flower” in English) occurs in a document and in lexicon “A”. If document expansion is applied in the given test, the non-Turkish form of the word is generated as “cicek” and added to the index “AX”. If the non-Turkish form of a word is the same as its initial form, as with “beyaz” (“white” in English), it is not added to the index “AX” a second time. Unlike the lexicons “A” and “AX”, the lexicon “X” does not contain any term that includes one of the Turkish characters “ı, İ, I, ö, Ö, ü, Ü, ç, Ç, ğ, Ğ, ş, Ş”. According to this definition, for the given terms “çiçek” and “beyaz” the “X” lexicon stores only the terms “cicek” and “beyaz”.

By combining the different approaches, stemming, document expansion and equivalence classes, 6 different lexicons are obtained. For both no stemming (NS) and F5 these are:

• Terms in their original form with Turkish diacritics (A).

• New forms of words obtained by applying equivalence classes; this lexicon does not include any Turkish characters (X).

• The lexicon generated by document expansion, in which each token appears in its original form and, if it contains Turkish characters, in its equivalent form without them (AX).

Some example tokens and their occurrences in each lexicon are given in Table 3.1.
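The construction of the three lexicons and the expansion rule described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the thesis implementation: the function names (`fold`, `expand`, `build_lexicons`) are hypothetical, tokens are assumed to be already split and lower-cased, and stemming is left out.

```python
# Character mapping from Table 2.1.
FOLD_TABLE = str.maketrans("çÇğĞıİöÖşŞüÜ", "cCgGiIoOsSuU")


def fold(term: str) -> str:
    """Equivalence-class form of a term: no Turkish characters."""
    return term.translate(FOLD_TABLE)


def expand(tokens):
    """Document/query expansion: keep each token and add its
    non-Turkish form only when it differs from the original."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        folded = fold(token)
        if folded != token:
            expanded.append(folded)
    return expanded


def build_lexicons(tokens):
    """Return the lexicons A, AX and X for a list of tokens."""
    lexicon_a = set(tokens)                 # original forms
    lexicon_x = {fold(t) for t in tokens}   # equivalence classes only
    lexicon_ax = set(expand(tokens))        # original + folded forms
    return lexicon_a, lexicon_ax, lexicon_x


if __name__ == "__main__":
    a, ax, x = build_lexicons(["çiçek", "beyaz"])
    print("A :", sorted(a))
    print("AX:", sorted(ax))
    print("X :", sorted(x))
    print("expanded query:", expand(["sanat", "ödülleri"]))
```

With the tokens “çiçek” and “beyaz”, lexicon AX holds three terms while A and X hold two each, which matches the example in the text: “beyaz” is not added a second time.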

Table 3.1 Instances of tokens and occurrences of the given token for each lexicon.

TOKEN          ENGLISH MEANING              STEMMING   DOCUMENT INDEXING TECHNIQUE
                                            METHOD     A               AX                             X
anlaşmazlığım  my disagreement              NS         anlaşmazlığım   anlaşmazlığım, anlasmazligim   anlasmazligim
                                            F5         anlaş           anlaş, anlas                   anlas
kelebekler     butterflies                  NS         kelebekler      kelebekler                     kelebekler
                                            F5         keleb           keleb                          keleb
yök            Higher Education Committee   NS         yök             yök, yok                       yok
                                            F5         yök             yök, yok                       yok
yok            not to exist                 NS         yok             yok                            yok
                                            F5         yok             yok                            yok
yaşamak        to live                      NS         yaşamak         yaşamak, yasamak               yasamak
                                            F5         yaşam           yaşam, yasam                   yasam
yasamak        to legislate                 NS         yasamak         yasamak                        yasamak
                                            F5         yasam           yasam                          yasam
satranç        chess                        NS         satranç         satranç, satranc               satranc
                                            F5         satra           satra                          satra
Internet       internet                     NS         ınternet        ınternet, internet             internet
                                            F5         ınter           ınter, inter                   inter

When lexicon “A” is taken as the basic lexicon, the distinct term count in the second lexicon (“X”) decreases, because for some Turkish terms replacing diacritics with their English forms makes different terms collapse into the same form. For instance, the tokens “açı” (“angle” in English) and “acı” (“pain” in English) both turn into the new synthetic term “aci”, which has no meaning, after token normalization by equivalence classes.

The disadvantage of equivalence classes appears in this manner. Generating the equivalence class 'Turkiye' from the term 'Türkiye' seems quite reasonable, but when we generate the equivalence class 'YOK' for the term 'YÖK' (the Higher Education Committee), the term loses its meaning and turns into a high-frequency Turkish term that means 'not to exist'. A similar situation exists in English when abbreviations are normalized by equivalence classes: mapping U.S.A. to USA gives a clear-cut result, whereas a similar conflict occurs when 'C.A.T' is mapped to 'CAT'.
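The conflicts described here can be made visible by grouping the terms of a lexicon by their equivalence-class form and keeping only the groups with more than one member. The following Python sketch assumes lower-cased terms and uses hypothetical names (`fold`, `find_collisions`); it illustrates the effect and is not part of the thesis code.

```python
from collections import defaultdict

# Lower-case part of the mapping in Table 2.1 (terms are assumed lower-cased).
FOLD_TABLE = str.maketrans("çğıöşü", "cgiosu")


def fold(term: str) -> str:
    """Equivalence-class form of a lower-cased term."""
    return term.translate(FOLD_TABLE)


def find_collisions(lexicon):
    """Map each folded form to the distinct original terms that share it,
    keeping only the forms shared by two or more terms."""
    groups = defaultdict(set)
    for term in lexicon:
        groups[fold(term)].add(term)
    return {form: terms for form, terms in groups.items() if len(terms) > 1}


if __name__ == "__main__":
    lexicon_a = {"yök", "yok", "açı", "acı", "türkiye"}
    for form, terms in sorted(find_collisions(lexicon_a).items()):
        print(form, "<-", sorted(terms))
```

Every group reported this way corresponds to a reduction in the distinct term count of lexicon “X” relative to lexicon “A”, and to terms whose meanings can no longer be told apart after normalization.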

CHAPTER FOUR
EXPERIMENTATIONS

In this study we used a test collection from the Bilkent Information Retrieval Group, Computer Engineering Department of Bilkent University. The test collection contains news articles and columns of 5 years (2001-2005) from the Turkish newspaper Milliyet (www.milliyet.com.tr); its size is about 800 MB and it contains 408,305 documents (Can, Kocberber, Balcik, Kaynak, Ocalan, Vursavas, 2008).

During token normalization for indexing and query processing, character casing is ignored and all characters are converted to lower case. Stop word elimination is not applied. The files of the test collection are UTF-8 encoded. In token normalization, every punctuation mark is accepted as a term terminator, even in the middle of a word. For example, the word “Ali'nin” is processed as two distinct words, “Ali” and “nin”. The same methodology is applied to numeric tokens; i.e. “1.5” is indexed as “1” and “5”. This approach minimizes lexicon length. If only the apostrophe “'” in the middle of a term were accepted as a term terminator, and all other marks inside a term, as in “1,073,210” or “7.editör”, were left as part of the token, the lexicon would grow in number of distinct terms.

Two stemming modalities are applied in this study. The first is no stemming (NS): tokens in the collection are indexed in their original form, but as an effect of token normalization, suffixes separated from the word by an apostrophe are not taken as part of the term and are therefore split from it. The second method, F5, generates a new form of a token by taking its first five characters if the token is longer than five characters. Can et al. (2008) point out that although F5 is a simple word truncation approach, it provides effectiveness similar to that of a word truncation approach that uses corpus statistics and of an elaborate lemmatizer-based stemmer. Based on this evidence, stemming methods are limited to F5.

The tests run with the no stemming (NS) and F5 methods, without any expansion or equivalence technique, constitute the baseline retrieval results. The performance of all other tests, obtained by combining the document and query processing parameters, is compared with the baseline results. For the first token normalization method the lexicon length is about 888,885, but with the second approach the lexicon length grows to 1,242,188 when no stemming is applied. The second approach is not used in the tests; it is given only to illustrate lexicon lengths for different methods.

4.1 Terrier

For indexing and retrieval, the Terrier (Terabyte Retriever) 2.1 platform, written in Java, is used. Terrier is described as the first serious development on IR in Europe (Ounis, Amati, Plachouras, He, Macdonald, Lioma, 2006); it is a flexible, efficient, open source search engine that can deal with large-scale collections of documents and can search TREC and CLEF collections. Terrier is developed at the Department of Computing Science, University of Glasgow (Terrier, 2010).

Terrier indexing involves four components: collection, document, term pipeline and indexer. At each stage of the indexing process, components, including ones written by the developer according to need, can be added or omitted. At the document stage a term has three major properties: the textual form of the term, the position of the term in the document, and the tag in which the term occurs. During the term pipeline stage, the token is transformed by the configured methods. Terrier includes some predefined classes such as the Porter stemmer, stop word elimination etc.
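The idea of a term pipeline, applied to the token normalization and F5 stemming rules described earlier in this chapter, can be sketched as a chain of small functions. The sketch below is written in Python for brevity and does not use or reproduce Terrier's Java API; all names are hypothetical. The Turkish-aware lower-casing (I to ı, İ to i) is an assumption inferred from Table 3.1, where “Internet” is indexed as “ınternet”.

```python
import re

# Lower-case part of the mapping in Table 2.1.
FOLD_TABLE = str.maketrans("çğıöşü", "cgiosu")


def turkish_lower(text: str) -> str:
    """Lower-case with Turkish casing rules (assumed: I -> ı, İ -> i)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text: str):
    """Split on every non-alphanumeric character, so that apostrophes,
    dots and commas all act as term terminators."""
    return [t for t in re.split(r"[\W_]+", turkish_lower(text)) if t]


def fold(term: str) -> str:
    """Equivalence-class stage: remove Turkish diacritics."""
    return term.translate(FOLD_TABLE)


def f5(term: str) -> str:
    """F5 stemming stage: keep at most the first five characters."""
    return term[:5]


def run_pipeline(text: str, stages):
    """Tokenize the text, then pass every token through each stage in order."""
    terms = tokenize(text)
    for stage in stages:
        terms = [stage(term) for term in terms]
    return terms


if __name__ == "__main__":
    print(run_pipeline("Ali'nin 1.5 Kelebekler", [f5]))
    print(run_pipeline("Internet", [fold, f5]))
```

Leaving a stage out of the list corresponds to a different test configuration: an empty list gives the NS baseline, `[f5]` the F5 baseline, and adding `fold` produces the equivalence-class variants.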
In the code written for this project, this functionality of Terrier was extended to generate equivalence classes of terms for Turkish information retrieval.

Terrier indexing is based on four data structures: lexicon, inverted index, document index and direct index. The lexicon stores the term, its term id (a unique number for each term), and the term frequency and document frequency of the term; it additionally holds the offset of the term's postings list in the inverted index. The inverted index stores the posting list of each term, consisting of the document ids of the matching documents and the frequency of the term in each of them. If the index is also built with positional information, it stores block ids, which make phrase and proximity search possible. The posting list in Terrier is highly compressed: document ids are encoded with gamma encoding and term frequencies with unary encoding. The document index is composed of the external unique identifier of the document, the length of the document (number of tokens) and the offset of the document in the direct index. Lastly, the direct index holds the terms that occur in each document and their term frequencies.

Terrier can compute document weights with 9 different weighting models, which are coded under the matching models package; besides these, developer-defined weighting models can be added to the class structure. The TF-IDF and BM25 weighting models that are already defined in Terrier are used in this project. Besides these properties, Terrier can also process different forms of a query in different runs. The query file is composed of tags that classify the given queries as title, description and narrative. Retrieval parameters control which query tags are processed and which are skipped in each run.

4.2 Weighting Models

TF-IDF weighting is a statistical measure that evaluates the importance of a word to a document in a collection. The more often a word occurs in a document, the greater the chance that the document is retrieved for a query containing that term; conversely, the more frequent the term is across the collection, the lower its weight. The term frequency factor, called tf, is a measure of the importance of the term t within the particular document d.

The inverse document frequency is a measure of the general importance of the term and is formalized as

\[ idf_t = \log \frac{N}{df_t} \tag{1} \]

where N is the number of documents in the collection and df_t is the number of documents that contain the term t. According to these definitions, the TF-IDF value of a term t in a given document d is

\[ \text{tf-idf}_{t,d} = tf_{t,d} \times idf_t \tag{2} \]

A term has a high weight in a document if it has a high term frequency in that document and a low frequency within the collection. As a result, common terms are largely ignored during retrieval (Manning et al., chap.6, 2008).

The TF-IDF weighting of Terrier is based on Robertson's tf and the standard Sparck Jones idf (Robertson, Spark Jones, 1976). The tf-idf formula of Terrier is given as

\[ \mathit{Robertson\_tf} = \frac{k_1 \cdot tf}{tf + k_1 \cdot \left(1 - b + b \cdot \frac{\mathit{docLength}}{\mathit{averageDocumentLength}}\right)} \tag{3} \]

\[ idf = \log \left( \frac{\mathit{numberOfDocuments}}{\mathit{documentFrequency}} + 1 \right) \tag{4} \]

\[ \text{tf-idf} = \mathit{keyFrequency} \cdot \mathit{Robertson\_tf} \cdot idf \tag{5} \]

In these formulas, tf is the frequency of the term in the document, docLength is the length of the document and averageDocumentLength is the average document length in the collection. numberOfDocuments is the number of documents in the collection, and documentFrequency is the document frequency of the term. As a query factor, keyFrequency is the frequency of the term in the query. Combining these parameters gives the tf-idf weight of a term t in a given document d. In addition to these dynamic parameters, k_1 = 1.2 and b = 0.75 are constants.
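The following Python sketch illustrates how a Terrier-style TF-IDF weight of the kind described above might be computed. It is not Terrier's own code: the function names are hypothetical, the idf form log(numberOfDocuments / documentFrequency + 1) is the reading assumed here, and the choice of a base-2 logarithm is also an assumption. Only the constants 1.2 and 0.75 and the average document length 266.54 come from the text.

```python
import math

K_1 = 1.2  # constant k_1 quoted in the text
B = 0.75   # constant b quoted in the text


def robertson_tf(tf, doc_length, avg_doc_length, k_1=K_1, b=B):
    """Term frequency normalized by document length (Robertson's tf)."""
    return k_1 * tf / (tf + k_1 * (1 - b + b * doc_length / avg_doc_length))


def idf(num_docs, doc_freq):
    """Assumed idf form: log(num_docs / doc_freq + 1); base 2 is an assumption."""
    return math.log2(num_docs / doc_freq + 1)


def tf_idf_score(tf, doc_length, avg_doc_length, num_docs, doc_freq,
                 key_frequency=1.0):
    """Weight of one query term in one document."""
    if tf == 0:
        return 0.0  # a term that does not occur contributes nothing
    return (key_frequency
            * robertson_tf(tf, doc_length, avg_doc_length)
            * idf(num_docs, doc_freq))


if __name__ == "__main__":
    # Hypothetical numbers: a term occurring 3 times in an average-length
    # document, found in 50 of 10,000 documents, and once in the query.
    print(tf_idf_score(tf=3, doc_length=266.54, avg_doc_length=266.54,
                       num_docs=10_000, doc_freq=50))
```

With these definitions a rarer term (smaller document frequency) receives a larger weight than a common one at the same term frequency, which is the behaviour the section describes.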

Okapi BM25 is another way of ranking documents according to their relevance to a given search query (Robertson, Walker, Jones, Hancock-Beaulieu, Gatford, 1996).

\[ K = k_1 \cdot \left( (1 - b) + b \cdot \frac{\mathit{docLength}}{\mathit{averageDocumentLength}} \right) + tf \tag{6} \]

\[ BM25 = \frac{tf \cdot (k_3 + 1) \cdot \mathit{keyFrequency}}{(k_3 + \mathit{keyFrequency}) \cdot K} \cdot \log \frac{\mathit{numberOfDocuments} - \mathit{documentFrequency} + 0.5}{\mathit{documentFrequency} + 0.5} \tag{7} \]

Here numberOfDocuments is the number of documents in the collection and averageDocumentLength is the average document length; the remaining symbols are as in the TF-IDF formula. k_1 = 1.2, k_3 = 8 and b = 0.75 are constants.

4.3 Retrieval Procedure

For each retrieval model, after the weight of each query term has been calculated with the formulas of that model, Terrier ignores low-idf terms such as stop words during retrieval. In addition, when it computes the similarity of each document to the query, it ranks the retrieved documents and keeps the top 1000; documents outside the first 1000 are discarded at the end of the evaluation.

Terrier's indexing and retrieval procedures can easily be controlled through user-defined parameters. One of the parameters used most often in this project is the term pipeline. The term pipeline applies token normalization methods, which a developer can add to Terrier's open source code as Java classes that implement the TermPipeline interface. When a new token is read from a document, it is normalized by the configured sequence of normalization functions. For token normalization, two new classes implementing the TermPipeline interface were written. The first one implements the stemming strategy: it returns the first five characters of a token if the token is longer than five characters (this method is called

“F5” in the rest of this thesis) and otherwise returns the token unchanged. The other class replaces Turkish characters with their English forms, generating the equivalence-class form of each token according to the defined rules: the “ReplaceTurkishTerms” class replaces all Turkish characters of a token with their equivalent forms according to Table 2.1.

In addition to token normalization, query and document expansion methods are applied. During indexing, if the test uses the document expansion technique, each token read from the document is checked to see whether it takes a new form once the diacritics of its Turkish characters are removed. When this filter produces a new word form, that form is added to the dictionary to be indexed, whether or not it is a meaningful word.

The lexicons generated by different stemming and document indexing methods differ in their number of terms. The stemming method is the main factor determining the term counts. Because Turkish is an agglutinative language, most terms in a sentence appear with a suffix, and the suffixes generate many different tokens of one type in the lexicon. As a result, the F5 stemming method maps many different tokens to a single type. The lexicon sizes, in number of distinct terms, are given in Table 4.1.

Table 4.1 Distinct term counts in each set.

        A          AX           X
NS      888,885    1,366,720    867,433
F5      166,595    187,293      147,671

The shortest lexicon is the one generated with F5 stemming and with Turkish characters replaced by their English forms. Conversely, the longest one is not normalized by any stemming method and is enlarged by the document expansion technique.

Based on the values in Table 4.1, Table 4.2 shows the ratios of distinct terms in the lexicons. For both NS and F5, the lexicons generated by

document expansion (“AX”) grow in number of distinct terms by 53.75% and 12.24%, respectively. In contrast, the lexicons generated with equivalence classes (“X”) shrink relative to the initial lexicon “A”, by 2.41% for No Stemming and 11.35% for F5.

Table 4.2 The ratio of distinct terms in lexicons.

        AX/A       X/A
NS      53.75%     -2.41%
F5      12.24%     -11.35%

Table 4.1 and Table 4.2 also show that the stemming method is the main parameter affecting the number of distinct terms in a lexicon. In addition, the average document length (averageDocumentLength), a parameter that affects term weights, differs only for the lexicon “AX”. The lexicons “A” and “X” have an average document length of 266.54, whereas the lexicon “AX” has an average document length of 368.073.

All the test results in this study depend on the combination of the values of 5 parameters. Two of these parameters concern the indexing process. The main indexing parameter is the stemming method; in addition, indexes are generated in three different ways: with no change in token normalization, by generating new types with equivalence classes, and by expanding the lexicon with the new synthetic types obtained from the equivalence-class function. The second group of parameters concerns the queries. First, as with document indexing, queries are processed in three ways: with no change, with all query tokens mapped to their equivalence classes, and with queries expanded by the new synthetic types of the equivalence classes. Besides the processing strategy, the effect of different query lengths is tested, classified as title, description and narrative as in the earlier study of the Bilkent Information Retrieval Group (Can et al., 2008). In addition to the short, medium and long query lengths, two combinations of query lengths are tested. One of them is called “SM” in the rest of this study. The “SM” strategy includes all the tokens of the short and medium queries. If a token exists in both the short and the medium query, the new synthetic query includes the token twice; as a result

the key frequency factor of this token in the synthetic query increases. Similarly, the queries called “ALL” combine the three query lengths: short, medium and long. Lastly, all combinations of these methods are tested with two term weighting models, TF-IDF and BM25, as implemented in Terrier. The parameters, their choices and the abbreviation of each choice are listed below.

• Stemming Method
  NS (No Stemming)
  F5 (First Five Characters of Each Token)

• Document Indexing Method
  A (No Change)
  AX (Document Expansion by Synthetic Types)
  X (Applying Equivalence Classes for Turkish Characters)

• Query Processing Method
  A (No Change)
  AX (Query Expansion by Synthetic Types)
  X (Applying Equivalence Classes for Turkish Characters)

• Weighting Model
  TF (TF-IDF)
  BM (BM25)

• Query Length
  S (Titles as Short Queries)
  M (Descriptions as Medium Queries)
  L (Narratives as Long Queries)
  SM (Titles + Descriptions)
  ALL (Titles + Descriptions + Narratives)

The combination of these parameters and their values generates 180 tests (2 x 3 x 3 x 2 x 5); the details of the test results are given under the related topics in the rest of this thesis. When explaining the test results, we name each test by the combination of its parameter values. The values are separated from each other by an underscore and appear in a fixed order: stemming method, document indexing method, query processing method, weighting model and query length. The abbreviation of each value is as given in the list above.

For instance, the test called F5_A_AX_BM_M uses the F5 stemming algorithm, the documents are in their original form, the queries are expanded by new synthetic types, the evaluation values are obtained with the BM25 weighting model, and the queries are of medium length, that is, they use only the description part of the queries. As a further example, the test name NS_X_AX_TF_ALL means that no stemming is applied during token normalization, the documents are mapped to their equivalence classes, the queries are expanded by new synthetic types, the TF-IDF weighting approach is used, and the queries consist of titles, descriptions and narratives.

In comparing the evaluation results of all test parameters, the results for No Stemming and F5 with medium length queries (Qm) and TF-IDF are chosen as the basic results for our test collection. To compare the retrieval performance of our basic results, the related values from the referenced paper of the Bilkent Information Retrieval Group, “Information Retrieval in Turkish Texts”, are included in Table 4.3 (Can et al., 2008).
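The naming scheme described above can be sketched in a few lines of Python. The function and field names below are hypothetical and chosen only for illustration; the abbreviations and their order are the ones listed in the text.

```python
from itertools import product

# Parameter values in the order used by the test names.
PARAMETERS = [
    ("stemming", ["NS", "F5"]),
    ("document_indexing", ["A", "AX", "X"]),
    ("query_processing", ["A", "AX", "X"]),
    ("weighting_model", ["TF", "BM"]),
    ("query_length", ["S", "M", "L", "SM", "ALL"]),
]


def all_test_names():
    """Enumerate every combination of parameter values as a test name."""
    value_lists = [values for _, values in PARAMETERS]
    return ["_".join(combo) for combo in product(*value_lists)]


def parse_test_name(name):
    """Split a test name such as 'F5_A_AX_BM_M' into its five parameters."""
    parts = name.split("_")
    if len(parts) != len(PARAMETERS):
        raise ValueError(f"expected {len(PARAMETERS)} fields in {name!r}")
    parsed = {}
    for (field, allowed), value in zip(PARAMETERS, parts):
        if value not in allowed:
            raise ValueError(f"{value!r} is not a valid {field}")
        parsed[field] = value
    return parsed


if __name__ == "__main__":
    print(len(all_test_names()))            # 180
    print(parse_test_name("F5_A_AX_BM_M"))
```

Enumerating the product of the value lists gives 2 x 3 x 3 x 2 x 5 = 180 names, matching the number of tests stated in the text.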

Table 4.3 Comparison of baseline retrieval results in bpref values for Qm by TF-IDF model.

        Bilkent (MF8)    Our Study
NS      0.3255           0.3579
F5      0.4322           0.4471

The results taken from the referenced paper were obtained with a matching function approach. In the referenced paper, different matching functions are applied during the retrieval process. These functions are generated by combining three term weighting components, the term frequency component (TFC), the collection frequency component (CFC) and the normalization component (NC), for both the document and the query. The components are combined by multiplying their respective weights. Multiplying these factors for a document and for a query gives the weight of term k in the document (w_dk) or in the query (w_qk). The similarity of a document and a query is formalized as follows (Salton & Buckley, 1988).

\[ sim(d, q) = \sum_{k=1}^{\mathit{numberOfTerms}} w_{dk} \cdot w_{qk} \tag{8} \]

Each of these weighting components has several possible forms, and the combination of these forms distinguishes one matching function from another. In addition, the referenced paper applies another matching strategy, called MF8. The results given in Table 4.3 were obtained with matching function 8, MF8 (Long & Suel, 2003). MF8 calculates the matching value of document d_j for the search query Q as follows.

\[ MF8 = \sum_{t \in Q} \frac{1 + \ln f_{dt}}{\sqrt{D}} \cdot \left( f_{qt} \cdot \ln \left( 1 + \frac{N}{f_t} \right) \right) \tag{9} \]

In the formula for MF8, f_dt is the frequency of term t in document d_j and D is the total number of term occurrences in d_j; these parameters form the document factor of the formula. The query factor consists of f_qt, the frequency of term t in the query Q; N, the total number of documents; and f_t, the frequency of term t in the entire collection. The referenced paper indicates that MF8 is especially suitable for dynamic environments, because of the effect of query term weights in the second component. In the referenced paper MF8 gives the best results for NS and F5; for this reason we use the MF8 results as the basis for comparing our evaluation results.

4.4 Evaluation Measures

4.4.1 Precision & Recall

Precision measures the quality of a system in terms of the retrieved items only, that is, the fraction of retrieved documents that are relevant; recall measures retrieval quality with respect to all relevant documents, that is, the fraction of relevant documents that are retrieved. Precision and recall are formalized as

\[ Precision = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{10} \]

\[ Recall = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{11} \]

Precision and recall are set-based measures because they only assess the quality of unordered sets. To evaluate ranked lists, precision can be plotted against recall after each retrieved document. To compute average performance over a set of topics, the precision values of each topic are interpolated to a set of standard recall levels from 0 to 1 in increments of 0.1. From the interpolated precision values at the defined recall levels (λ), called Pλ, precision at the 11 standard recall levels can be calculated with the formula below, in which the parameter NUM denotes the number of topics.

\[ \bar{P}_{\lambda} = \frac{\sum_{i=1}^{NUM} P_{\lambda, i}}{NUM}, \qquad \lambda \in \{0.0, 0.1, 0.2, 0.3, \ldots, 1.0\} \tag{12} \]

Here P_{λ,i} is the interpolated precision of topic i at recall level λ.

4.4.2 Document Level Averages

Two evaluation measures fall under this topic. The first is “precision at 9 document cutoff values”, called P@ values; these values reflect the actual measured system performance as a user might see it. Users of information retrieval applications, and most web users in particular, are habitually interested in only the first 10 or 20 results, so this measure is important for retrieval evaluation. The value for a given cutoff is computed by summing the precisions at that document cutoff over all topics and dividing by the number of topics.

R-precision is another method for calculating document level averages. R-precision is the precision after R documents have been retrieved, where R is the number of relevant documents for the topic. This measure reduces the impact of the exact ranking of the retrieved relevant documents; it is useful for TREC because of the large numbers of relevant documents.

4.4.3 MAP

What distinguishes MAP (Mean Average Precision) from other evaluation measures is that it provides a single-figure measure of quality across recall levels; in addition, MAP has especially good discrimination and stability. Average precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved. If the set of relevant documents for an information need q_j ∈ Q is {d_1, …, d_{m_j}} and R_jk is the set of ranked retrieval results from the top result until document d_k is reached, then

\[ MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk}) \tag{13} \]

MAP is roughly the average area under the precision-recall curve for a set of queries (Manning et al., chap 8, 2008).

4.4.4 Bpref

As stated in the TREC documentation (Text Retrieval Conference [TREC], 2010), bpref is designed for situations where relevance judgments are known to be far from complete. Bpref computes a preference relation of whether judged relevant documents are retrieved ahead of judged irrelevant documents. As this definition shows, it deals only with the relative ranks of judged documents. The bpref measure can be formalized as

\[ bpref = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{\min(R, N)} \right) \tag{14} \]

R is the number of judged relevant documents, N is the number of judged irrelevant documents, r is a relevant retrieved document, and n is a member of the first R irrelevant retrieved documents. In addition to this formula, after a bug fix the definition of R was changed to the number of judged irrelevant documents, as stated in the “bpref_bug” file of the “trec_eval” distribution (Buckley, 2005). The rankings produced by bpref and MAP are close to each other when judgments are complete; however, MAP no longer correlates with the exact rankings when the judgments are incomplete. The advantage of bpref is that its rankings remain correlated under incomplete judgments.

4.4.5 GMAP

The aim of the GMAP measure is to highlight improvements for low-performing topics. GMAP is the geometric mean of per-topic average precision. The geometric mean is defined as the n-th root of the product of n values:

\[ GMAP = \sqrt[n]{\prod_{i=1}^{n} AP_i} \tag{15} \]
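As a rough illustration of the ranked-list measures described above, the Python sketch below computes average precision, MAP, bpref and GMAP for small in-memory rankings. It is not trec_eval: the function names are hypothetical, the bpref function follows the formula as given in this section (counting at most the first R judged irrelevant documents), and the small floor used in the geometric mean to avoid taking the logarithm of zero is an assumption.

```python
import math


def average_precision(ranked, relevant):
    """Sum of the precision at each retrieved relevant document,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def bpref(ranked, relevant, nonrelevant):
    """bpref over judged documents only; unjudged documents are skipped."""
    r_count, n_count = len(relevant), len(nonrelevant)
    if r_count == 0:
        return 0.0
    denom = min(r_count, n_count)
    nonrel_seen, total = 0, 0.0
    for doc in ranked:
        if doc in relevant:
            penalty = min(nonrel_seen, denom) / denom if denom else 0.0
            total += 1.0 - penalty
        elif doc in nonrelevant:
            nonrel_seen += 1
    return total / r_count


def gmap(ap_values, floor=1e-5):
    """Geometric mean of per-topic AP; `floor` is an assumed guard for AP = 0."""
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    ranked = ["d1", "n1", "d2", "u1", "n2", "d3"]   # u1 is unjudged
    relevant, nonrelevant = {"d1", "d2", "d3"}, {"n1", "n2"}
    print(average_precision(ranked, relevant))
    print(bpref(ranked, relevant, nonrelevant))
```

In the small example, the relevant documents appear at ranks 1, 3 and 6, so average precision is (1/1 + 2/3 + 3/6) / 3, while bpref penalizes each relevant document only by the judged irrelevant documents ranked above it.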

4.4.6 Mean Reciprocal Rank

The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The measure is most useful for tasks in which there is only one relevant document, or in which the user wants only one relevant document. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q.

\[ MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i} \tag{16} \]

The test results obtained to determine information retrieval performance for the parameterized tests in this study depend on the evaluation measures detailed above. “Trec_eval 8.1” is used to calculate the effectiveness measures of the test results; most of the tables that compare tests report bpref values because of its advantages over MAP. Some test results were regenerated with different trec_eval parameters. These are the test groups that cannot retrieve 1000 documents for every query. They were rerun with the ‘-c’ parameter, which averages over the complete set of queries in the relevance judgments instead of the queries in the intersection of relevance judgments and results, as indicated in the trec_eval manual. Missing queries then contribute a value of 0 to all evaluation measures (which may or may not be reasonable for a particular evaluation measure, but is reasonable for standard TREC measures).

4.5 Experimentation Results

4.5.1 Performance for Expansion and Equivalent Classes

The starting point of this study is to determine the loss of performance caused by using English alphabet characters instead of Turkish characters when typing documents or queries. Table 4.4 shows the evaluation results for the tests that aim to reveal the performance degradation caused by mistyping Turkish characters.

Table 4.4 Bpref values of NS and F5 for Qm by TF-IDF model.

            Doc A               Doc AX              Doc X
Query       NS       F5         NS       F5         NS       F5
A           0.3579   0.4471     0.3571   0.4445     0.2462   0.3269
AX          0.3528   0.4247     0.3255   0.4179     0.3505   0.4226
X           0.2466   0.3254     0.3542   0.4414     0.3552   0.4435

As shown in Table 4.4, the best value for the NS approach is obtained when both documents and queries keep their proper Turkish diacritics. Taking the test NS_A_A_TF_M (no stemming, documents and queries in their original form), we get a bpref of 0.3579 as the basis. The ratio between this initial point and the results of NS_X_A_TF_M, which simulates the absence of Turkish characters in documents, or of NS_A_X_TF_M, which simulates their absence in queries, gives exactly the information this study aims for. The bpref value of about 0.3579 at the initial point decreases to 0.2462 if the documents do not contain any Turkish characters but the queries are typed with Turkish characters; a similar value, 0.2466, occurs when Turkish characters are absent from the queries but present in the documents.

Table 4.5 The loss of the performance in bpref values of retrieval results according to the initial point for NS using Query Form Qm by TF-IDF weighting model.

            Document
Query       A         AX        X
A           -         0.22%     32.2%
AX          1.42%     9.05%     2.06%
X           31.09%    1.03%     0.75%

Table 4.6 The loss of the performance in bpref values of retrieval results according to the initial point for F5 using Query Form Qm by TF-IDF weighting model.

            Document
Query       A         AX        X
A           -         0.58%     26.88%
AX          5.01%     6.53%     5.47%
X           27.21%    1.27%     0.80%

The values in Table 4.5 and Table 4.6 support our research hypothesis: the absence of Turkish characters in documents or in queries degrades performance, by 32.2% and 31.09% for the no stemming approach and by 26.88% and 27.21% for F5. It is hard to say whether the absence of Turkish characters in the documents or in the queries always gives the worse result, because the ratios change with the stemming approach.

Table 4.7 Bpref values of NS and F5 for Qm by BM25 model.

            Doc A               Doc AX              Doc X
Query       NS       F5         NS       F5         NS       F5
A           0.3589   0.4559     0.3575   0.4544     0.248    0.3325
AX          0.3555   0.4345     0.3263   0.4274     0.3528   0.4336
X           0.247    0.3299     0.3548   0.4514     0.3562   0.4536

Table 4.8 The loss of the performance in bpref values of retrieval results according to the initial point for NS using Query Form Qm by BM25 weighting model.

            Document
Query       A         AX        X
A           -         0.39%     30.89%
AX          0.94%     9.08%     1.69%
X           31.17%    1.42%     0.75%

Table 4.9 The loss of the performance in bpref values of retrieval results according to the initial point for F5 using Query Form Qm by BM25 weighting model.

            Document
Query       A         AX        X
A           -         0.32%     27.06%
AX          4.69%     6.25%     4.89%
X           27.63%    0.98%     0.50%

Like Table 4.5 and Table 4.6, Table 4.8 and Table 4.9 show that the absence of Turkish characters in documents or queries decreases retrieval performance, this time for a different weighting model. The comparison of Table 4.5 and Table 4.8 shows that, for the No Stemming approach (NS), the BM25 weighting model gives worse results than the TF-IDF model for document expansion (AX) and for queries normalized by equivalence classes (X). Table 4.5, Table 4.6, Table 4.8 and Table 4.9 together show that, although the ratios change with the stemming method and the weighting model, mistyping Turkish diacritics causes a loss of retrieval performance regardless of both.

The query expansion technique gives better results for the lexicons “A” and “X”. Applying query expansion and document expansion at the same time increases the loss: for the No Stemming approach with TF-IDF weighting the rates are 1.42% and 2.06% for lexicons “A” and “X” but 9.05% for lexicon “AX”. Similarly, F5 with TF-IDF degrades performance by 5.01% and 5.47% for “A” and “X” but by 6.53% for “AX”. In summary, the negative impact of query expansion varies less across lexicons for F5 (5.01% and 5.47% for “A” and “X”, 6.53% for “AX”), whereas for NS the difference between lexicons is more pronounced (1.42% and 2.06% for “A” and “X”, 9.05% for “AX”).

As seen from Table 4.5, Table 4.6, Table 4.8 and Table 4.9, document expansion leads to a smaller loss in performance and gives more stable results for each lexicon, regardless of the stemming method and weighting model. The only

important point of document expansion is not to apply document and query expansion at the same time. Document expansion always gives better results for the queries “A” and “X”. In numbers, for the NS approach document expansion loses performance by 0.22% and 1.03% for queries “A” and “X” and by 9.05% for queries “AX” with the TF-IDF model, and by 0.39% and 1.42% for queries “A” and “X” and by 9.08% for queries “AX” with BM25. Similarly, the loss ratios for F5 are 0.58% and 1.27% for queries “A” and “X” and 6.53% for queries “AX” with TF-IDF, and 0.32% and 0.98% for queries “A” and “X” and 6.25% for queries “AX” with BM25.

In addition to document and query expansion, another approach to the problem of mistyped Turkish characters is to apply token normalization to obtain synthetic types for both queries and documents. This technique gives better results than all other techniques applied in this thesis, except when the documents are expanded (“AX”) and the queries are typed with Turkish characters (“A”). This holds for all stemming approaches and weighting methods applied in this study. The difference between the results of “AX_A” and “X_X” is barely noticeable: the differences of the loss ratios of these techniques are 0.53%, 0.22%, 0.36% and 0.18% for NS_TF, F5_TF, NS_BM and F5_BM.
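The loss percentages discussed above can be recomputed from the bpref values with a few lines of Python. The sketch below assumes that the loss is the relative drop with respect to the “A_A” baseline (documents and queries in their original form); the dictionary holds the NS values reported for Qm with the TF-IDF model, and the names used are hypothetical.

```python
# bpref values for NS, Qm, TF-IDF, keyed by (document method, query method).
BPREF_NS_TFIDF = {
    ("A", "A"): 0.3579, ("AX", "A"): 0.3571, ("X", "A"): 0.2462,
    ("A", "AX"): 0.3528, ("AX", "AX"): 0.3255, ("X", "AX"): 0.3505,
    ("A", "X"): 0.2466, ("AX", "X"): 0.3542, ("X", "X"): 0.3552,
}


def loss_percent(baseline, value):
    """Relative loss of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


def loss_table(bpref_values, baseline_key=("A", "A")):
    """Loss of every configuration relative to the baseline configuration."""
    baseline = bpref_values[baseline_key]
    return {key: loss_percent(baseline, value)
            for key, value in bpref_values.items()}


if __name__ == "__main__":
    for (doc, query), loss in sorted(loss_table(BPREF_NS_TFIDF).items()):
        print(f"doc={doc:<2} query={query:<2} loss={loss:.2f}%")
```

The same function applied to the F5 or BM25 values gives the corresponding loss tables; the difference between two configurations, such as “AX_A” and “X_X”, is then a simple subtraction of their loss percentages.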

Table 4.10 Retrieval performance in different evaluation measures for NS, Qm and the given test parameters.

Doc  Query  WM  #ret   #rel_ret  map     gm_ap   R-prec  bpref   recip_rank
A    A      BM  72000  4760      0.2793  0.167   0.3195  0.3589  0.7164
A    A      TF  72000  4747      0.2763  0.1644  0.319   0.3579  0.7152
A    AX     BM  72000  4731      0.2694  0.1548  0.3134  0.3555  0.6435
A    AX     TF  72000  4686      0.2647  0.1504  0.3085  0.3528  0.6381
A    X      BM  70292  3045      0.16    0.0314  0.1949  0.247   0.4963
A    X      TF  71208  3014      0.1573  0.0316  0.1942  0.2466  0.4906
AX   A      BM  72000  4762      0.2785  0.1663  0.3207  0.3575  0.7311
AX   A      TF  72000  4739      0.2756  0.1633  0.3179  0.3571  0.7268
AX   AX     BM  72000  4393      0.2335  0.1269  0.2788  0.3263  0.6851
AX   AX     TF  72000  4385      0.231   0.1256  0.2768  0.3255  0.6874
AX   X      BM  72000  4696      0.2744  0.1649  0.3192  0.3548  0.7263
AX   X      TF  72000  4685      0.2719  0.1623  0.3166  0.3542  0.724
X    A      BM  70137  3060      0.1662  0.0339  0.1997  0.248   0.5384
X    A      TF  71118  3024      0.1637  0.0337  0.1986  0.2462  0.5362
X    AX     BM  72000  4671      0.27    0.1609  0.3151  0.3528  0.6918
X    AX     TF  72000  4641      0.2653  0.1564  0.3111  0.3505  0.6783
X    X      BM  72000  4692      0.2745  0.1652  0.3181  0.3562  0.7215
X    X      TF  72000  4682      0.2717  0.1626  0.316   0.3552  0.7143

First of all, the number of retrieved documents in Table 4.10 shows that when only the documents or only the queries lack Turkish characters, the retrieval procedure cannot match many valid documents. Although all other techniques retrieve 1000 documents for every query, the tests “A_X” and “X_A” retrieve fewer than 1000 documents for some queries for NS. Most of the terms in the queries do not match because of their new synthetic forms. Table 4.11 shows that, although retrieval effectiveness is still worse for the tests “A_X” and “X_A”, with F5 stemming they at least retrieve 1000 documents for each query, and effectiveness improves as the number of relevant retrieved documents increases.
This improvement can be seen for each evaluation measure, for both the TF-IDF and BM25 weighting models.
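The mismatch behind the “A_X” and “X_A” results can be illustrated with a small Python sketch of the three processing methods and the F5 stemmer. This is not the thesis code, which is written as Terrier term pipeline classes in Java; the function names are hypothetical, and the character mapping below is an assumption standing in for Table 2.1, which is not reproduced in this section.

```python
# Assumed equivalence classes for Turkish characters (stand-in for Table 2.1).
EQUIVALENCE = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def f5(token):
    """F5 stemming: keep the first five characters of longer tokens."""
    return token[:5] if len(token) > 5 else token


def to_equivalence_class(token):
    """Replace Turkish characters with their assumed English forms."""
    return token.translate(EQUIVALENCE)


def expand(token):
    """Expansion: keep the token and add its synthetic form if it differs."""
    plain = to_equivalence_class(token)
    return [token] if plain == token else [token, plain]


def process(tokens, method):
    """Apply one of the methods A (no change), X (equivalence), AX (expansion)."""
    out = []
    for token in tokens:
        if method == "A":
            out.append(token)
        elif method == "X":
            out.append(to_equivalence_class(token))
        elif method == "AX":
            out.extend(expand(token))
        else:
            raise ValueError(f"unknown method {method!r}")
    return out


if __name__ == "__main__":
    doc, query = ["göz", "kitaplar"], ["göz"]
    for doc_method, query_method in [("A", "X"), ("AX", "X"), ("X", "X")]:
        indexed = {f5(t) for t in process(doc, doc_method)}
        asked = {f5(t) for t in process(query, query_method)}
        print(doc_method, query_method, sorted(indexed & asked))
```

In this toy example the original-form index (“A”) contains only “göz”, so a query reduced to “goz” matches nothing, whereas an expanded (“AX”) or equivalence-class (“X”) index contains “goz” and the query term matches.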

Table 4.11 Retrieval performance in different evaluation measures for F5, Qm and the given test parameters.

Doc  Query  WM  #ret   #rel_ret  map     gm_ap   R-prec  bpref   recip_rank
A    A      BM  72000  5793      0.3433  0.2497  0.3734  0.4559  0.7928
A    A      TF  72000  5662      0.3288  0.2387  0.3614  0.4471  0.7862
A    AX     BM  72000  5460      0.3065  0.211   0.3517  0.4345  0.7455
A    AX     TF  72000  5324      0.2938  0.1988  0.3403  0.4247  0.7158
A    X      BM  72000  4136      0.1991  0.0565  0.2379  0.3299  0.503
A    X      TF  72000  4033      0.1927  0.0571  0.2336  0.3254  0.495
AX   A      BM  72000  5711      0.3374  0.2428  0.3699  0.4544  0.7743
AX   A      TF  72000  5594      0.3235  0.2334  0.3586  0.4445  0.7968
AX   AX     BM  72000  5299      0.3024  0.1956  0.3449  0.4274  0.7709
AX   AX     TF  72000  5173      0.2888  0.1853  0.3305  0.4179  0.7706
AX   X      BM  72000  5670      0.3342  0.2403  0.3659  0.4514  0.7883
AX   X      TF  72000  5558      0.32    0.2308  0.3554  0.4414  0.7898
X    A      BM  72000  4199      0.2207  0.062   0.254   0.3325  0.5597
X    A      TF  72000  4082      0.2122  0.0611  0.2483  0.3269  0.5668
X    AX     BM  72000  5455      0.3144  0.2175  0.355   0.4336  0.8034
X    AX     TF  72000  5316      0.2983  0.2022  0.3418  0.4226  0.7645
X    X      BM  72000  5744      0.3398  0.2463  0.3684  0.4536  0.8099
X    X      TF  72000  5628      0.3239  0.2352  0.3592  0.4435  0.7937

Table 4.12 P@ values for NS, Qm and the given test parameters.

Doc  Query  WM  P5      P20     P100    P1000
A    A      BM  0.5444  0.4778  0.3061  0.0661
A    A      TF  0.5444  0.4701  0.3024  0.0659
A    AX     BM  0.5028  0.4549  0.3019  0.0657
A    AX     TF  0.5056  0.4444  0.2986  0.0651
A    X      BM  0.3389  0.2861  0.1787  0.0423
A    X      TF  0.3278  0.2854  0.1786  0.0419
AX   A      BM  0.5472  0.4736  0.3056  0.0661
AX   A      TF  0.5444  0.4667  0.3042  0.0658
AX   AX     BM  0.5222  0.4187  0.2697  0.061
AX   AX     TF  0.525   0.416   0.269   0.0609
AX   X      BM  0.5472  0.4701  0.3022  0.0652
AX   X      TF  0.5444  0.4646  0.3008  0.0651
X    A      BM  0.3694  0.3125  0.1814  0.0425
X    A      TF  0.3611  0.3097  0.1824  0.042
X    AX     BM  0.5333  0.4646  0.3017  0.0649
X    AX     TF  0.5278  0.4556  0.2974  0.0645
X    X      BM  0.55    0.4715  0.304   0.0652
X    X      TF  0.5417  0.4646  0.3004  0.065

Table 4.12 shows that, for NS, the P@ values do not differ much between tests, except for the tests “A_X” and “X_A”. All other tests give similar P@ values; for instance, their P@5 values lie between 0.5028 and 0.55. As usual, the values decrease at each cutoff. Although there is no significant change between the values, the P@ results are not directly proportional to the bpref or MAP values. For example, the tests “AX_A” and “AX_X” with BM25 reach a P@5 of 0.5472, above the 0.5444 of the baseline “A_A”, although their bpref and MAP values are lower than those of the baseline. Unlike NS, for F5 the P@ values change more clearly across the document indexing and query processing methods, as the values in Table 4.13 show. Table 4.12 and Table 4.13 also show that the weighting model does not change the results of a test significantly.

Table 4.13 P@ values for F5, Qm and the given test parameters.

Doc  Query  WM  P5      P20     P100   P1000
A    A      BM  0.6722  0.5486  0.36   0.0805
A    A      TF  0.6611  0.5472  0.351  0.0786
A    AX     BM  0.5778  0.5035  0.334  0.0758
A    AX     TF  0.5889  0.5007  0.325  0.0739
A    X      BM  0.3611  0.3167  0.229  0.0574
A    X      TF  0.3556  0.3153  0.224  0.056
AX   A      BM  0.6667  0.5458  0.357  0.0793
AX   A      TF  0.6667  0.5375  0.346  0.0777
AX   AX     BM  0.5972  0.5097  0.325  0.0736
AX   AX     TF  0.575   0.5062  0.316  0.0718
AX   X      BM  0.6611  0.5431  0.354  0.0787
AX   X      TF  0.6667  0.5347  0.344  0.0772
X    A      BM  0.4306  0.3653  0.246  0.0583
X    A      TF  0.4139  0.3549  0.237  0.0567
X    AX     BM  0.6472  0.5236  0.337  0.0758
X    AX     TF  0.6333  0.5194  0.328  0.0738
X    X      BM  0.6611  0.5417  0.357  0.0798
X    X      TF  0.6556  0.5444  0.347  0.0782

Lastly, the changes in precision at different recall values are given in Figure 4.1 and Figure 4.2 for the stemming methods NS and F5 with TF_IDF for medium length queries. What distinguishes the curves in each precision-recall graph is the absence of diacritics in either the documents or the queries. The precision-recall values of the tests “A_X” and “X_A” support this observation.

[Figure 4.1: precision (vertical axis, 0.0 to 0.9) plotted at recall levels 0.0 to 1.0 for the tests A_A, A_AX, A_X, AX_A, AX_AX, AX_X, X_A, X_AX and X_X.]

Figure 4.1 Precision / Recall graph of test results of NS in TF_IDF for Qm.

[Figure 4.2: precision (vertical axis, 0.0 to 0.9) plotted at recall levels 0.0 to 1.0 for the tests A_A, A_AX, A_X, AX_A, AX_AX, AX_X, X_A, X_AX and X_X.]

Figure 4.2 Precision / Recall graph of test results of F5 in TF_IDF for Qm.

In addition to these test results, Table 4.14 gives various effectiveness measures for Qm in order to compare our study with the referenced paper. The evaluation results taken from Bilkent’s referenced paper were obtained using MF8.
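The recall axis of Figure 4.1 and Figure 4.2 runs from 0.0 to 1.0 in steps of 0.1, which suggests the usual 11-point interpolated precision curve. The study does not state this explicitly, so the following Python sketch is an illustration under that assumption only; the function name and document identifiers are hypothetical.

```python
def interpolated_precision(ranked_doc_ids, relevant_doc_ids, levels=None):
    """Interpolated precision of one query at the given recall levels.

    The interpolated precision at recall level r is the highest
    precision observed at any rank whose recall is at least r.
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    relevant = set(relevant_doc_ids)
    if not relevant:
        return [0.0 for _ in levels]

    # (recall, precision) measured at every rank holding a relevant document.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    curve = []
    for level in levels:
        # Small tolerance guards against floating-point noise in the levels.
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Hypothetical example: relevant documents found at ranks 1 and 3.
print(interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

A curve such as those in the figures is then obtained by averaging these per-query values over all queries at each recall level.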

Table 4.14 Various effectiveness measures for F5 and Qm by TF_IDF.

Measure  Bilkent  Our Study
bpref    0.4322   0.4471
MAP      0.4092   0.3288
P@10     0.5917   0.6181
P@20     0.5653   0.5472

4.5.2 Performance for Different Stemming Methods

Table 4.15 The increase rate (F5/NS) of the performance of F5 over NS in bpref values of retrieval results using query form Qm with the TF-IDF weighting model. The values of this table are calculated from the values of Table 4.4.

        Document
Query   A       AX      X
A       24.92%  24.47%  32.77%
AX      20.37%  28.38%  20.57%
X       31.95%  24.61%  24.85%

Table 4.16 The increase rate (F5/NS) of the performance of F5 over NS in bpref values of retrieval results using query form Qm with the BM25 weighting model. The values of this table are calculated from the values of Table 4.7.

        Document
Query   A       AX      X
A       27.02%  27.10%  34.07%
AX      22.22%  30.98%  22.90%
X       33.56%  27.22%  27.34%

Table 4.15 and Table 4.16 show that applying the F5 stemming method always increases performance. The increase rates are roughly 2 to 3 percentage points higher for the BM25 weighting model. The increase rates are also higher for the tests that perform worse under both NS and F5. For instance, F5 has a stronger effect on the tests “A_X” and “X_A”, with improvement rates of 33.56% and 34.07%; these tests perform worse than tests such as “A_A” because the Turkish characters are mistyped. The test results demonstrate that F5 always gives better results than NS under different indexing, retrieval and weighting techniques, even when these parameters affect retrieval performance negatively. In our study, applying the F5 method to medium-length queries with the TF-IDF weighting model increases performance by 24.92%, whereas in the referenced paper this ratio is 32.78% for MF8 and 28.93% on average.

4.5.3 Performance for Different Weighting Models

Table 4.17 Bpref values of NS for the query form Qm for the weighting models TF-IDF (TF) and BM25.

        Document A       Document AX      Document X
Query   TF      BM25     TF      BM25     TF      BM25
A       0.3579  0.3589   0.3571  0.3575   0.2462  0.248
AX      0.3528  0.3555   0.3255  0.3263   0.3505  0.3528
X       0.2466  0.247    0.3542  0.3548   0.3552  0.3562

Table 4.18 The increase rate (BM25/TF-IDF) of the performance of the BM25 weighting model over TF-IDF in bpref values of retrieval results using query form Qm with the NS stemming method. The values of this table are calculated from the values of Table 4.17.

        Document
Query   A      AX     X
A       0.27%  0.11%  0.73%
AX      0.76%  0.24%  0.65%
X       0.16%  0.16%  0.28%
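The increase rates in Tables 4.15, 4.16 and 4.18 are relative changes between two bpref values. The short Python sketch below recomputes a few of them as a worked example; the function name is hypothetical, and the bpref values are the ones reported for Qm in Table 4.17 (NS) and Table 4.19 (F5), with test names written as Document_Query.

```python
def relative_increase(new_value, base_value):
    """Percentage by which new_value exceeds base_value."""
    if base_value == 0:
        raise ValueError("base_value must be non-zero")
    return (new_value / base_value - 1.0) * 100.0


# bpref values for Qm with TF-IDF, keyed by Document_Query.
ns_tfidf = {"A_A": 0.3579, "A_X": 0.2466}   # from Table 4.17
f5_tfidf = {"A_A": 0.4471, "A_X": 0.3254}   # from Table 4.19

for test in ns_tfidf:
    gain = relative_increase(f5_tfidf[test], ns_tfidf[test])
    # A_A comes out near 24.92 and A_X near 31.95, as in Table 4.15.
    print(f"{test}: F5/NS = {gain:.2f}%")

# The same function gives the BM25/TF-IDF rates of Table 4.18,
# e.g. Document X, Query A with NS: 0.248 against 0.2462.
print(f"X_A: BM25/TF-IDF = {relative_increase(0.248, 0.2462):.2f}%")
```

Recomputed values can differ from the tabulated ones in the last printed digit, depending on whether the percentage is rounded or truncated.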

Table 4.19 Bpref values of F5 for the query form Qm for the weighting models TF-IDF (TF) and BM25.

        Document A       Document AX      Document X
Query   TF      BM25     TF      BM25     TF      BM25
A       0.4471  0.4559   0.4445  0.4544   0.3269  0.3325
AX      0.4247  0.4345   0.4179  0.4274   0.4226  0.4336
X       0.3254  0.3299   0.4414  0.4514   0.4435  0.4536

Table 4.20 The increase rate (BM25/TF-IDF) of the performance of the BM25 weighting model over TF-IDF in bpref values of retrieval results using query form Qm with the F5 stemming method. The values of this table are calculated from the values of Table 4.19.

        Document
Query   A      AX     X
A       1.96%  2.22%  1.71%
AX      2.03%  2.27%  2.60%
X       1.38%  2.26%  2.27%

There is no significant change in bpref values between the weighting models, as Table 4.18 and Table 4.20 show. The change ratio varies between 0.11% and 0.76% for NS and between 1.38% and 2.60% for F5. Although the results are close to each other, the change is more noticeable for the values obtained with F5. The results for TF-IDF and BM25 are thought to be close to each other because of the Robertson TF factor in the TF-IDF implementation; with this factor, the implementation uses a formula that differs from the standard TF-IDF definition.

4.5.4 Performance for Different Query Lengths

The tables in this section illustrate the effect of query length on retrieval results, in combination with the other parameters such as stemming and weighting models. These results depend on the new synthetic forms of a query. Table A1 in Appendix A gives an example of the original form of a query for the different query lengths. The new synthetic forms of the query, generated from the queries in Table A1 by the query expansion and equivalence classes techniques, are given in Table A2.

Table 4.21 Bpref values of NS by TF_IDF for different query lengths.

                      Document
Query  Query length   A       AX      X
A      S              0.3877  0.3841  0.2616
A      M              0.3579  0.3571  0.2462
A      L              0.3665  0.3657  0.2765
A      ALL            0.4562  0.4531  0.3205
A      SM             0.4366  0.434   0.2938
AX     S              0.3859  0.358   0.3781
AX     M              0.3528  0.3255  0.3505
AX     L              0.3636  0.3407  0.3633
AX     ALL            0.4534  0.4389  0.4548
AX     SM             0.4334  0.4056  0.4344
X      S              0.2466  0.3767  0.3794
X      M              0.2466  0.3542  0.3552
X      L              0.2724  0.3651  0.365
X      ALL            0.3167  0.4538  0.4567
X      SM             0.2936  0.4339  0.4369

According to the results for NS with the TF-IDF weighting model in Table 4.21, the best-performing runs in every test are those that combine all query lengths, called “ALL”. This appreciable gain for the query length “ALL” is thought to occur because combining the query forms increases the key frequency, within the query, of terms that have a low term frequency in the collection. The behavior of the query lengths “S”, “M” and “L” differs between document indexing and query processing methods. Short queries give better results than medium and long queries in most cases, with exceptions such as the test “A_X”.
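The key-frequency argument above can be illustrated with a small Python sketch. It assumes that the “ALL” form is built by concatenating the short, medium and long forms of a topic, which is an assumption made here for illustration; the function name and the three query strings are hypothetical and are not taken from Table A1.

```python
from collections import Counter


def key_frequencies(*query_forms):
    """Count how often each term occurs across the given query forms."""
    counts = Counter()
    for form in query_forms:
        # Plain whitespace tokenisation; no stemming or case folding.
        counts.update(form.split())
    return counts


# Hypothetical short, medium and long forms of one topic.
short_form = "earthquake damage"
medium_form = "earthquake damage in buildings"
long_form = "reports on building damage after the earthquake"

only_short = key_frequencies(short_form)
all_forms = key_frequencies(short_form, medium_form, long_form)

for term in ("earthquake", "damage", "buildings"):
    print(term, only_short[term], all_forms[term])
```

A term that appears in every form is repeated in the combined query, so weighting models that multiply a term's score by its key frequency give it more influence than terms that appear only once.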

Table 4.22 Bpref values of F5 by TF-IDF for different query lengths.

                      Document
Query  Query length   A       AX      X
A      S              0.4637  0.46    0.3323
A      M              0.4471  0.4445  0.3269
A      L              0.4308  0.427   0.353
A      ALL            0.5259  0.5222  0.3985
A      SM             0.5242  0.5204  0.38
AX     S              0.4407  0.4429  0.4377
AX     M              0.4247  0.4179  0.4226
AX     L              0.4086  0.3947  0.4106
AX     ALL            0.506   0.5029  0.5051
AX     SM             0.5     0.4947  0.4983
X      S              0.3208  0.4573  0.4591
X      M              0.3254  0.4414  0.4435
X      L              0.3451  0.4251  0.4291
X      ALL            0.3934  0.5199  0.5229
X      SM             0.3784  0.5166  0.5201

Table 4.23 Bpref values of NS by BM25 for different query lengths.

                      Document
Query  Query length   A       AX      X
A      S              0.3872  0.3838  0.2618
A      M              0.3589  0.3575  0.248
A      L              0.3683  0.3671  0.2773
A      ALL            0.4569  0.4544  0.3202
A      SM             0.436   0.4336  0.2934
AX     S              0.3859  0.3568  0.3784
AX     M              0.3555  0.3263  0.3528
AX     L              0.3657  0.3412  0.3657
AX     ALL            0.4552  0.4379  0.4564
AX     SM             0.4338  0.4041  0.4342
X      S              0.2462  0.3766  0.3792
X      M              0.247   0.3548  0.3562
X      L              0.2741  0.366   0.3674
X      ALL            0.3166  0.4555  0.4578
X      SM             0.2927  0.4335  0.4359

Table 4.24 Bpref values of F5 by BM25 for different query lengths.

                      Document
Query  Query length   A       AX      X
A      S              0.4633  0.4597  0.3311
A      M              0.4559  0.4544  0.3325
A      L              0.4411  0.4353  0.3562
A      ALL            0.5282  0.5256  0.3985
A      SM             0.5256  0.5227  0.3824
AX     S              0.4409  0.4442  0.4389
AX     M              0.4345  0.4274  0.4336
AX     L              0.4178  0.4052  0.4192
AX     ALL            0.5109  0.5076  0.5102
AX     SM             0.5039  0.4975  0.5036
X      S              0.3192  0.4572  0.4588
X      M              0.3299  0.4514  0.4536
X      L              0.3492  0.4338  0.4397
X      ALL            0.3932  0.5235  0.5259
X      SM             0.3796  0.5193  0.5221

Table 4.23 and Table 4.24 show that, although BM25 gives better results for most tests, the results do not change significantly when the weighting model changes. One clear point from Table 4.21, Table 4.22, Table 4.23 and Table 4.24 is that F5 always gives better results than NS, regardless of query length.

Some of the tests run with the document and query parameters “A_X” and “X_A” and short queries do not return any relevant documents for some queries. The main reason is the absence of Turkish characters in the documents or the queries. As a result, some of the topics and the relevant documents of these queries are not included when trec_eval calculates the retrieval performance. trec_eval is therefore run once more for these tests with the ‘-c’ parameter. The related tests and their results with and without the ‘-c’ parameter are given in Table 4.25.
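The effect of the ‘-c’ parameter can be sketched in a few lines of Python. By default trec_eval averages a measure over the topics that appear in the result file; with ‘-c’ it averages over every topic in the relevance judgements, and a topic missing from the results contributes 0. The code below only mimics that averaging step; the function names, topic identifiers and scores are hypothetical and are not results of this study.

```python
def mean_over_returned(per_topic_scores):
    """Average over the topics present in the results (default behaviour)."""
    if not per_topic_scores:
        return 0.0
    return sum(per_topic_scores.values()) / len(per_topic_scores)


def mean_over_all_judged(per_topic_scores, judged_topics):
    """Average over every judged topic; missing topics count as 0 (like -c)."""
    judged = list(judged_topics)
    if not judged:
        return 0.0
    return sum(per_topic_scores.get(topic, 0.0) for topic in judged) / len(judged)


# Hypothetical run: four judged topics, but only two produced any results.
scores = {"T1": 0.5, "T2": 0.25}
judged = ["T1", "T2", "T3", "T4"]

print(mean_over_returned(scores))            # averaged over 2 topics
print(mean_over_all_judged(scores, judged))  # averaged over 4 topics
```

The second average is lower whenever topics are missing from the results, which is why the runs with and without ‘-c’ need to be reported side by side for the “A_X” and “X_A” tests.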