
A Large Vocabulary Online Handwriting Recognition System for Turkish

by

Esma Fatıma Bilgin Taşdemir

Submitted to the Graduate School of Engineering and Natural Sciences
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Sabancı University

June 2018


© Esma F. Bilgin Taşdemir 2018. All Rights Reserved.


Acknowledgments

Foremost, I would like to thank my advisor Prof. Berrin Yanıkoğlu for her continuous support during my research and for providing me with the opportunity to complete my PhD thesis at Sabancı University.

I would like to express my sincere gratitude to my friends Atia Shafique and Damla Arifoğlu for giving me support and good advice and for making my graduate years more joyful. I owe special thanks to my family: to my dad, who always encourages his children in their academic journeys; to my mom, who provides me with invaluable support whenever I need it; to my sisters Rümeysa, Zehra and Betül, whose substantive, enthusiastic conversations on both academic topics and everything else are always a source of inspiration; and to my brother Ahmed, who helps me take a break by changing the subject to arts and crafts.

I wish to thank my husband, Benna, who has been a constant source of support and encouragement during the challenges of graduate school. Without his enduring help, this thesis would not have been possible. Finally, I thank my son İbrahim, an endless source of love, for patiently waiting for his mommy to finish reading and writing so that he could spend happy moments with her.

This thesis was financially supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) through the 2211 Graduate Scholarship Programme (2211-Yurt İçi Doktora Burs Programı). Part of this work was also supported by TÜBİTAK under project number 113E062. I would like to express my sincere gratitude for the support provided.


A Large Vocabulary Online Handwriting Recognition System for Turkish

Esma Fatıma Bilgin Taşdemir
Computer Science and Engineering

Ph.D. Thesis, 2018

Thesis Supervisor: A. Berrin Yanıkoğlu

Keywords: Hidden Markov Models, Online Handwriting Recognition, Delayed Strokes, Turkish Handwriting Recognition

Abstract

Handwriting recognition in general, and online handwriting recognition in particular, has been an active research area for several decades. Most of the research has focused on English and, more recently, on other scripts such as Arabic and Chinese. There is a lack of research on the recognition of Turkish text, and this work primarily fills that gap by presenting a state-of-the-art recognizer for Turkish for the first time. It contains the design and implementation details of a complete system for the recognition of isolated Turkish words. Based on Hidden Markov Models, the system comprises preprocessing, feature extraction, optical modeling and language modeling modules. It first addresses the recognition of unconstrained handwriting with a limited vocabulary size and then evolves into a large vocabulary system.

Turkish script has many similarities with other Latin-based scripts, such as English, which makes it possible to adapt strategies that work for them. However, some issues particular to Turkish must be considered separately. Two challenging issues in the recognition of Turkish text are delayed strokes, which introduce an extra source of variation in the sequence order of the handwritten input, and the high Out-of-Vocabulary (OOV) rate of Turkish when words are used as vocabulary units in the decoding process. This work examines these problems and alternative solutions in depth and proposes solutions suitable for Turkish script in particular.

For delayed stroke handling, a clear definition of delayed strokes is first developed, and then, using that definition, several alternative handling methods are evaluated extensively on the UNIPEN and Turkish datasets. The best results are obtained by removing all delayed strokes, with recognition accuracy increases of up to 2.13 and 2.03 percentage points over the respective English and Turkish baselines. The overall system performance is assessed as 86.1% with a 1,000-word lexicon and 83.0% with a 3,500-word lexicon on the UNIPEN dataset, and 91.7% on the Turkish dataset.

Alternative decoding vocabularies are designed with grammatical sub-lexical units in order to solve the problem of the high OOV rate. Additionally, statistical bi-gram and tri-gram language models are applied during the decoding process. The best performance, 67.9%, is obtained on the Turkish dataset with the large stem-ending vocabulary expanded with a bi-gram model. This result is superior to the accuracy of the word-based vocabulary (63.8%) with the same coverage of 95% on the BOUN Web Corpus.


Türkçe İçin Geniş Dağarcıklı Çevrimiçi El Yazısı Tanıma Sistemi

Esma Fatıma Bilgin Taşdemir
Bilgisayar Bilimi ve Mühendisliği

Doktora Tezi, 2018

Tez Danışmanı: A. Berrin Yanıkoğlu

Anahtar Sözcükler: Saklı Markov Modelleri, Çevrimiçi El Yazısı Tanıma, Gecikmiş Vuruş, Türkçe El Yazısı Tanıma

Özet

El yazısı tanıma alanında yapılan pek çok çalışma İngilizce, Arapça ve Çince gibi dillerin yazılarını konu almaktadır. Türkçe için yapılmış sınırlı çalışmaların arasında çevrimiçi tanıma konusunda eksiklik vardır. Bu tez çalışmasıyla ilk kez olarak, en gelişmiş teknolojiyi içeren bir yalıtık ve kısıtsız şekilde yazılmış Türkçe kelime tanıma sistemi gerçekleştirilmiştir. Saklı Markov Modelleri kullanılan sistem önişleme, öznitelik çıkarma, optik modelleme ve dil modelleme birimlerinden oluşmaktadır. Sistem, orta ölçekli bir dağarcıkla tasarlanıp daha sonra büyük dağarcıkla çalışır hale getirilmiştir.

Türkçe yazının Latin alfabesi kullanan diğer yazı sistemleri ile olan benzerlikleri, literatürde kullanılan pek çok tekniği Türkçe için de kullanılabilir kılar. Ancak Türkçe'ye has bazı özellikler tanıma işlemini güçleştirmektedir. Bunlardan ikisi gecikmiş vuruşlar ve çok fazla sayıda olan dağarcık dışı kelimelerdir. Bu tezde her iki problem de ayrıntılı şekilde ele alınmış ve bazı çözümler üretilmiştir.

Gecikmiş vuruşlar için net bir tanım oluşturulmuş ve bu tanım kullanılarak bir dizi önişleme yöntemi arasından Türkçe'ye en uygunu bulunmuştur. İngilizce UNIPEN veri kümesi ve Türkçe verilerden oluşan diğer bir küme üzerinde yapılan testlerde en iyi sonuç, bu vuruşların silinmesi yöntemi ile elde edilmiştir. Bu şekilde yapılan önişleme ile İngilizce'de 1,000 kelimelik dağarcık için %2.23 artışla %86.1 tanıma başarısı gözlenirken Türkçe'de %2.03 artışla %91.7 tanıma oranı yakalanmıştır.

Tanıma sisteminin çözümleme aşamasında kelime-altı birimler kullanılarak dağarcık dışı kelimelerin tanıma başarısına olan olumsuz etkisinin giderilmesi sağlanmıştır. Ayrıca, N-gram istatistiksel dil modelleri de kullanılmıştır. Geniş dağarcıklı tanıma için gövde-ekler şeklinde kelime-altı birimlerin kullanılması ile elde edilen %67.9 tanıma başarısı, kelimelerin kullanılması ile elde edilen başarıdan (%63.8) daha fazla olarak ölçülmüştür.


Contents

Acknowledgments
Abstract
Özet

1 Introduction to Online Handwriting Recognition
  1.1 Issues in Online Handwriting Recognition
  1.2 Online Handwriting Recognition Systems Overview
  1.3 Literature Review
    1.3.1 Hidden Markov Models Based Systems
    1.3.2 HMM-NN Hybrid Recognizers
    1.3.3 HMM-MSTDNN Hybrid Recognizers
    1.3.4 BLSTM-based Recognizers
    1.3.5 Handwriting Recognition for Turkish
  1.4 Thesis Overview

2 Hidden Markov Models for Online Handwriting Recognition
  2.1 Hidden Markov Models
    2.1.1 Definition
    2.1.2 Three Basic Problems of HMMs
    2.1.3 HMMs for Online Handwriting Recognition

3 Language Modeling
  3.1 N-gram Language Modeling
    3.1.1 Estimation of N-gram Model Parameters
    3.1.2 Interpolation and Backoff
    3.1.3 Smoothing
  3.2 Perplexity
  3.3 Integration of Language Models to the Decoding Process
    3.3.1 Lattice Expansion
    3.3.2 Lattice Rescoring

4 Challenges in Turkish Online Handwriting Recognition
  4.1 Delayed Strokes
  4.2 OOV and Vocabulary Explosion
    4.2.1 Turkish Morphology and Its Effects on the Recognition Vocabulary

5 Proposed Solutions to Turkish Online Handwriting Recognition Problems
  5.1 A Solution for the Delayed Stroke Problem
    5.1.1 Existing Definitions
    5.1.2 Proposed Definition for Delayed Strokes in English
    5.1.3 Delayed Stroke Handling Alternatives
    5.1.4 Proposed Handling Method for Delayed Strokes
  5.2 A Solution to the High OOV Rate in the Recognition Dictionary
    5.2.1 Text Corpus Preparation
    5.2.2 Stems and Endings as Recognition Units
    5.2.3 Stems and Morphemes as Recognition Units
    5.2.4 Vocabulary Size Determination

6 Software and Resources
  6.1 Datasets
    6.1.1 Datasets for Optical Model Training
    6.1.2 Language Modeling Corpus
  6.2 Software
    6.2.1 TR-Morph
    6.2.2 SRILM
    6.2.3 HTK

7 Turkish Online Handwriting Recognition System
  7.1 Optical Modeling
    7.1.1 System Architecture
    7.1.2 Preprocessing
    7.1.3 Features
    7.1.4 Training
  7.2 Language Modeling
    7.2.1 Decoding
  7.3 Post-processing

8 Results
  8.1 Evaluation of Delayed Stroke Handling Methods
    8.1.1 Methodology
    8.1.2 Results on UNIPEN
    8.1.3 Results on ElementaryTurkish Dataset
    8.1.4 Discussion
    8.1.5 State-of-the-art
  8.2 Evaluation of Vocabulary Design Alternatives
    8.2.1 Methodology
    8.2.2 Results

9 Conclusion
  9.1 Summary
  9.2 Future Perspectives

A Out-of-Vocabulary Words

List of Figures

1.1 The recognition process as a pipeline
2.1 An HMM with three states and a 4-symbol observation alphabet. Observation probability notation is simplified to b_ij as the probability of emitting the jth symbol at the ith state
2.2 Word HMM as concatenation of character HMMs
2.3 HMM topologies: 1) Linear 2) Bakis-type 3) Ergodic
3.1 A simple lattice (network) with no language model scores
3.2 The simple lattice expanded with a bi-gram model
3.3 The simple lattice expanded with a tri-gram model
4.1 Delayed strokes in a sample of the word 'Quantitative'. Red colored strokes are delayed. Pen-up positions are marked with green points. Dashed blue lines are the trace of the pen.
5.1 Samples of characters with potential delayed strokes: (a) 'i' with dot, (b) 't' with cross, (c) 'ç' and 'ş' with cedilla, (d) 'ü' and 'ö' with umlaut and (e) 'ğ' with breve.
5.2 Detected delayed strokes (shown in red) are embedded as the arrows indicate.
5.3 Hat-feature value is 1 for points lying below the delayed strokes (in the pink rectangular area) and 0 for the rest.
5.4 The first derivation for the input x + x × x
5.5 The second derivation for the input x + x × x
6.1 Sample handwritten words from the UNIPEN dataset. Strokes that are written separately from the character body are shown in red.
6.2 Sample handwritten words from the ElementaryTurkish dataset. Strokes that are written separately from the character body are shown in red.
7.1 A simple grammar, and its corresponding word network and SLF representation
7.2 A simple stem-morpheme grammar and its network
7.3 An HTK network for a generic stem-ending grammar
8.1 Examples for sources of error with resulting incorrect embeddings: (a) and (b) unaligned delayed strokes; (c) unusual writing style; (d) incorrect type detection.
8.2 Distribution of misrecognized parts with the total number of errors for

List of Tables

1.1 Previous results on word recognition accuracies obtained on public databases
5.1 Accuracies of definitions
5.2 Grammatical units obtained from the BOUN corpus
5.3 Coverage rates of word-based vocabularies
5.4 Coverage rates for stem-ending-based vocabularies
6.1 UNIPEN word contribution per writer distribution
6.2 UNIPEN data set split
6.3 BOUN Corpus details
7.1 HTK decoding network sizes of alternative vocabulary types in terms of number of nodes and links
7.2 Examples of modifications on the raw recognition results in the post-processing stage
8.1 Results for the 3,500-word task on UNIPEN
8.2 Results for the 1,000-word task on UNIPEN
8.3 Results on the ElementaryTurkish dataset
8.4 Alternative vocabulary designs using the BOUN Web Corpus
8.5 Language model perplexities according to vocabulary unit type
8.6 Word-based recognition results
8.7 Stem-endings-based recognition results
8.8 Stem-morphemes-based recognition results
8.9 Examples of recognition errors with the large-lexicon stem-ending vocabulary
8.10 Examples of recognition errors with the large-lexicon stem-morpheme vocabulary

List of Algorithms

1 Proposed definition for detecting delayed strokes (see above for definitions)
2 DetectAll: Definition for detecting all (delayed or not) dots and crosses

Chapter 1

Introduction to Online Handwriting Recognition

Online handwriting recognition (OHWR) is the task of interpreting handwritten input at the character, word, or line level. The handwriting is captured by a digitizer, such as a tablet or a pen-enabled smartphone, and represented as a time series of captured pen-tip coordinates. Other types of data such as pen pressure, pen up/down status, velocity, azimuth and altitude are collected by many modern input devices as well. The main component of an OHWR system is a recognizer, which receives the input signal, runs an algorithm according to the technique it is built on, and outputs digital text together with a probability. A language model can be employed in order to improve the results of the recognition module. Typically, a lexicon containing the vocabulary for a given task is used to restrict each word hypothesis to one of the allowed alternatives.

The size of the recognition vocabulary has a direct effect on recognition performance. Recognition speed and accuracy drop as the number of alternative words gets larger. On the other hand, a recognizer can only recognize words contained in the vocabulary, so words that are not included translate into errors. Vocabularies with more than 5,000 entries are usually classified as 'large', while a typical modern large vocabulary recognition task has a lexicon of 20,000 or more words.

With the advancement of technology, OHWR appears more and more in everyday life. However, despite ongoing research over several decades and the significant improvements recently brought about by deep learning techniques, recognition systems are still far from perfect for general-purpose, unconstrained handwriting recognition with a large vocabulary. For this reason, systems sometimes constrain the writing style to hand-print only (e.g. as in forms) in order to simplify the task of the recognizer. In addition to writing style limitations, some applications may require input written in a limited area like a box or over a baseline.


1.1 Issues in Online Handwriting Recognition

Handwriting recognition requires accurate and robust models to accommodate the variability in time and feature space. The main sources of variation in handwriting are natural inter-writer differences (how different people write different characters and words) and intra-writer differences (how one person's handwriting varies from time to time), as well as the technical specifications (e.g. sampling rate) of the digitizers that record the input signal.

Natural Variations

Handwriting is a complex activity that requires fine motor skills. Each individual develops a style of their own, and even then their writing shows variations from one sample to the next. The shapes of the characters, the slant and skew of the writing, conformity to a baseline, writing speed, pen pressure and the order of strokes are all sources of variation affecting the style.

Writer-dependent systems are trained and tested with the writing of the same writer(s). If a system is designed to be writer-dependent, its parameters are tuned to recognize the handwriting of a single writer or a few writers, and hence the amount of variation drops dramatically. A writer-independent system is capable of recognizing handwriting from users whose writing is not seen by the system during training. Much larger training data, effective normalization before recognition and a more complex recognizer architecture are needed to learn invariant and generalized characteristics of handwriting in writer-independent systems.

Writing styles can be grouped as hand-printed, cursive and a mixture of the two. Recognition of the fully cursive style is the most difficult because of missing pen-up points, which mark the boundaries of characters in the hand-printed style.

In addition to the variations common to all writing systems, some scripts may be more challenging because of the number of characters, the similarity between characters, allographs, ligatures and the writing order of the strokes within a character. A simple example of the latter is the difference in the writing order of the dots of 'i' letters and the horizontal bars of 't' letters in Latin-based alphabets, which is referred to as 'delayed strokes'. In addition to 'i' and 't', Turkish script has more such characters, which introduces more variation in writing order.

Vocabulary Size

The recognition vocabulary is another factor that determines the performance of an OHWR system. Most of the time, the words that can be recognized are limited to a lexicon of some fixed size. When the system assumes that the input will be one of a closed set of words (e.g. a list of words that contains all possible numbers on a bank check), it is called a closed-vocabulary system. An open-vocabulary system, on the other hand, is capable of recognizing words outside of the lexicon (e.g. character-based systems are examples of open-vocabulary systems). Vocabulary size has a direct impact on the system design. A closed-vocabulary task with a small vocabulary size can model the words directly in the recognition phase. However, as


the vocabulary size increases, modeling of letters instead of words is preferred because of the computational complexity of modeling individual words.

Although open-vocabulary tasks are deemed harder than small-to-medium sized closed-vocabulary tasks, the closed-vocabulary approach can be equally challenging with large vocabulary sizes. In this regard, the Turkish language introduces additional difficulty due to its level of productivity in word formation.

Turkish is an agglutinative language in which new words are formed by adding suffixes to the end of root words. There are grammatical rules governing which suffixes may follow which others, and in what order, but the number of possible words that may be generated by adding suffixes is practically infinite. As such, a finite-size vocabulary for Turkish would miss a significant percentage of Turkish words, causing a high Out-of-Vocabulary (OOV) rate. This makes vocabulary-based text recognition approaches unsuitable for Turkish and other agglutinative languages. As a solution to the OOV problem, sub-word-based vocabularies have been suggested recently.

Imperfections in Input Signal

Imperfections of a digitizer in capturing the strokes of the writer may lead to missing data points, which in turn cause gaps within strokes. On the other hand, the digitizer may record extra points, which makes the signal noisy. Both the missing-data and noisy-data problems should be handled before proceeding further in the recognition process. Inexperienced writers may have difficulties in using the stylus and the interface of the application. Some writers may even fail to properly record their writing once they have created it. Mislabeled data due to user error constitutes a serious problem for handwriting datasets.

Evaluation Metrics

There are several evaluation metrics for OHWR. Word accuracy is the percentage of words recognized correctly, i.e. output with the same label as the reference string. Word Error Rate (WER) is a popular metric that was initially used in the speech recognition domain. WER makes it possible to compare sequences of different lengths. It is based on the Levenshtein distance [1] and can be calculated efficiently with a dynamic programming algorithm. WER is the minimum number of word substitutions, insertions and deletions needed to turn the reference string into the output, normalized by the number of reference words:

\[ \mathrm{WER} = \frac{\#\text{Substitutions} + \#\text{Insertions} + \#\text{Deletions}}{\#\text{Reference words}} \tag{1.1} \]

Similarly, a Character Error Rate (CER) is calculated by using characters instead of words when counting the number of edits.
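As a concrete illustration of Eq. (1.1), the following short Python sketch (not part of the original thesis; the function name and the toy strings are my own) computes WER with the standard edit-distance dynamic program; CER is obtained the same way by iterating over characters instead of words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via the Levenshtein edit distance of Eq. (1.1)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j output words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion and one substitution over four reference words -> WER = 0.5.
print(word_error_rate("bu bir deneme metni", "bu deneme metin"))
```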


1.2 Online Handwriting Recognition Systems Overview

Handwriting recognition systems are typically designed as a pipeline in which the output of one module is the input to the next.

In the pre-processing step, the raw data from the digitizer is cleaned of noisy and spurious points and missing data points are recovered. If the handwriting spans multiple lines, line segmentation is applied beforehand. Next, several normalization procedures are applied to reduce variations in writing order, size, slant, skew and writing speed. Finally, relevant and discriminative features are extracted before the recognition phase. Most of the time, features are heuristic and handcrafted, designed to capture the important characteristics of handwriting based on human knowledge. With recent deep learning techniques, features can be learned from the training data itself, but it is usual practice to use handcrafted features along with self-learned features in the online handwriting domain [2].

Once the handwriting is represented as feature vectors, it can be used for training a recognizer. The data used for training (i.e. the training set) is used to tune the parameters of the system according to the technique the recognizer is based on. Next, the recognizer is tested on held-out data to evaluate its performance on unseen samples. A language model can be employed in this phase to integrate language knowledge and obtain better recognition results. Sometimes a final post-processing step is used for further improvement of the output. The whole system can be seen in Figure 1.1.


1.3 Literature Review

There have been many studies on the online handwriting recognition problem since the early 1990s (see [3] for a survey). While much of this research has focused on the recognition of Latin-based alphabets, especially English, handwriting recognition in other scripts has also been gaining attention in recent years [4, 5, 6]. Initially limited to the recognition of isolated characters and digits, state-of-the-art research now targets unconstrained word and sentence recognition.

Different techniques and approaches are used in recognition systems. Among the oldest and still popular techniques in handwriting recognition are Hidden Markov Models (HMM) [7, 8, 9, 10] and Artificial Neural Networks (ANN or NN) [11, 12, 13, 14]. More recently, deep learning techniques using recurrent neural networks have also been applied with very good results [15, 16, 17]. As each approach has its own strengths, hybrid systems combining deep learning or other neural network approaches with hidden Markov models have been proposed to combine the benefits of both.

In the rest of this section, a comprehensive online handwriting recognition literature survey covering articles published after the year 2000 is presented. Recognition systems are reviewed in subsections based on the recognition technique. Studies on Turkish handwriting recognition are covered in a separate subsection.

1.3.1 Hidden Markov Models Based Systems

Hidden Markov Models (HMM) form one of the earliest and most widely used approaches to the online handwriting recognition problem. HMMs can be used for modeling strokes, characters or words, and can be used by themselves or in conjunction with other methods in hybrid systems [16, 18].

In many systems the parameters of an HMM are optimized by the Maximum Likelihood (ML) approach. However, this approach has some shortcomings, such as its sensitivity to the form of the model and the fact that it is only indirectly linked to the error rate of the system. To alleviate these shortcomings, Biem proposes to use the Minimum Classification Error (MCE) criterion along with an allograph-based HMM [8]. MCE-based training is a discriminative training method whose aim is to find a set of parameters that minimize the empirical recognition error rate. The recognition system is trained with 52K samples from more than 100 writers for a 5K-word lexicon without any pruning and a 10K-word lexicon with pruning. Compared to an ML-based baseline, relative word error rate reductions of 9.75% and 6.19% are obtained for the 5K and 10K lexicons, respectively.

Most online handwriting recognition systems obtain the input data via a digitizer such as a Personal Digital Assistant (PDA) or Tablet PC, which captures information about the pen tip position, velocity, or acceleration as the user writes on the input surface. To bring handwriting recognition to the classroom, Liwicki and Bunke use online handwritten text acquired by capturing the trajectory of a pen writing on a whiteboard [9]. Although the data is in online format, they transform it to offline form in order to remove writing order variations. An HMM offline handwriting recognizer designed for


the offline IAM database is then used for recognition, achieving a word recognition rate of 64.3% on a small dataset with 1,000 words. A bigram language model derived from the LOB corpus is employed in the recognizer. In more recent work, the authors achieve better results by increasing the training set size [10]. Again using a bigram language model, the recognition rate is raised to 68.6% on the IAM-OnDB database.

1.3.2 HMM-NN Hybrid Recognizers

HMM systems are capable of modeling dynamic time sequences of variable lengths, which makes them appropriate for the handwriting recognition task. However, they cannot make use of context such as inter-symbol dependencies (e.g. how the letter 'e' appears after the letter 't'). In contrast, artificial neural networks in their most common form (multi-layer perceptrons) are able to capture contextual information while lacking the ability to handle time-varying sequences and their statistical variations. Hybrid approaches combine the strengths of both methods: in HMM-NN-based systems, NNs are used for frame/character classification, while HMMs are used for modeling the whole sequence.

Garcia-Salicetti et al. take an original approach to HMM-NN combination and propose a hybrid handwritten word recognition system based on HMMs with integrated NNs related to the letter HMMs [14]. The HMMs and NNs are initialized and trained simultaneously. By embedding NNs into the letter HMMs, their system remains within the HMM framework while extending the HMM with contextual information. They use the Maximum Mutual Information (MMI) criterion for training the NNs. The system works in a writer-dependent framework with vocabulary sizes of 1,000, 10,600 and 20,200. Word recognition rates for these dictionaries are reported as 96.93%, 91.13% and 88.67%, respectively.

In the HMM-based system proposed by Schenk and Rigoll, an NN is used as part of the feature extraction stage of an online handwriting recognition system [13]. After the feature extraction, a standard HMM is applied. On a 1,500-word database, they achieve a 95.9% recognition rate using multi-layer perceptron networks and a 95.2% recognition rate using a recurrent neural network architecture. In both cases a lexicon of 2,000 words is used for final word recognition.

In [12], Marukatat et al. use NNs to predict the emission probabilities in a hybrid system. They train the system with 30K words written by 256 writers from the UNIPEN dataset. With a 2,000-word lexicon, they report 80.1% and 77.9% word recognition rates for the multi-writer and writer-independent (omni-writer) recognition tasks, respectively.

Another work using this hybrid approach is [19]. Gauthier et al. propose classifier combination with an online HMM-NN and an offline HMM system. The offline recognizer uses the offline version of the online text. Using 40K words written by 256 writers from the UNIPEN dataset for training, the combined classifier is tested on parts of a 1K sample set written by the same training-set writers. They report an 87% recognition rate with a 1,500-word lexicon, which decreases to 79% when the lexicon size is increased to 10,000 words.


1.3.3 HMM-MSTDNN Hybrid Recognizers

Another type of hybrid system used for the online handwriting recognition task is the combination of a multi-state time delay NN (MSTDNN) with an HMM. Time delay NNs (TDNN) are time-invariant NNs which can recognize a pattern regardless of its position in time. The MSTDNN is an extension of the TDNN with the dynamic time warping algorithm, aimed at integrating recognition and segmentation into a single architecture.

Jaeger et al. use MSTDNNs in conjunction with HMMs [20]. A tree representation of the dictionary employs HMMs that represent individual characters. Using an efficient search algorithm over the tree representation of the dictionary, they achieve a 96% word recognition rate with a 5,000-word dictionary. The recognition rates for 20,000- and 50,000-word dictionaries are 93.4% and 91.2%, respectively, using a combination of the handwriting databases of Carnegie Mellon University, the University of Karlsruhe and MIT.

Caillaut and Gaudin use the MSTDNN for the computation of character likelihoods and observation probabilities in the HMM recognizer [18]. Searching for the best training scheme for the hybrid system, they propose a new generic criterion combining the Maximum Likelihood (ML) and Maximum Mutual Information (MMI) criteria with a global optimization approach defined at the word level. They achieve a 92.78% word recognition rate on the IRONOFF word database.

1.3.4 BLSTM-based Recognizers

The Bidirectional Long Short-Term Memory (BLSTM) architecture is another approach used for online handwriting recognition. BLSTMs combine recurrent neural networks (RNNs) running forward and backward over the input to capture long-term dependencies in both past and future context during recognition.

The earliest work proposing an RNN with the BLSTM architecture for unconstrained handwritten word recognition is Liwicki et al. [15]. They use the Connectionist Temporal Classification (CTC) objective function in the recognizer in order to correct errors made at the character level and recognize the word. They achieve 74% word accuracy on IAM-OnDB, which contains forms of handwritten English text acquired on a whiteboard. The dictionary size is 20,000, consisting of the most frequent words from the LOB, Brown and Wellington corpora.

In another study, Liwicki and Bunke apply feature selection to improve a BLSTM recognizer's performance [22]. The best combination of features is searched for via sequential forward and sequential backward searches over a set of 25 features. They report a 75.2% word recognition rate with a BLSTM recognizer on the IAM-OnDB dataset using 17 selected features. Comparatively, an HMM classifier achieves a 73.8% recognition rate with 16 selected features.

Graves et al. improve the performance of a similar BLSTM system by integrating a bi-gram language model at the connectionist temporal classification layer to achieve a recognition rate of 79.7% for a 20,000-word open lexicon [17]. A better result of 85.3% accuracy is achieved with a limited closed dictionary of size 5,597, without a language model. The BLSTM recognizer is compared to a state-of-the-art HMM recognizer and shown to outperform it under all test conditions. The relative error reduction is reported to be as much as 40% in some cases.


Table 1.1: Previous results on word recognition accuracies obtained on public databases.

Author | Method | DB | Test DB Size | Lexicon Size | Writer dependency | Acc. (%)
Jaeger et al. [20] | HMM-MSTDNN | CMU-UKA-MIT | 4,105 | 20,000 | I | 93.4
Jaeger et al. [20] | HMM-MSTDNN | CMU-UKA-MIT | 4,105 | 50,000 | I | 91.2
Caillaut et al. [18] | HMM-MSTDNN | IRONOFF | 10,448 | - | I | 92.7
Gauthier et al. [19] | HMM-NN | UNIPEN | 1,000 | 1,500 | I | 87.0
This work | HMM | UNIPEN | 7,000 | 1,000 | I | 86.1
This work | HMM | UNIPEN | 7,000 | 3,500 | I | 83.0
Marukatat et al. [12] | HMM-NN | | ? | 5,000 | I | 77.0
Marukatat et al. [12] | HMM-NN | | ? | 5,000 | D | 75.0
Liwicki et al. [21] | HMM | IAM-ON | 6,204 | 2,337 | I | 70.8
Liwicki et al. [15] | BLSTM | | ? | 20,000 | I | 74.0
Liwicki et al. [10] | HMM (offline features) | | 1,240 | 11,050 | I | 68.6
Liwicki et al. [22] | BLSTM | | ? | 20,000 | I | 75.2
Graves et al. [16] | HMM and BLSTM | | 3,859 lines | 20,000 | I | 79.6
Graves et al. [17] | BLSTM | | 3,859 lines | 20,000 | I | 79.7
Salicetti et al. [14] | HMM-NN | Proprietary | 1,500 | 1,000 | D | 96.9
Salicetti et al. [14] | HMM-NN | Proprietary | 1,500 | 10,600 | D | 91.1
Salicetti et al. [14] | HMM-NN | Proprietary | 1,500 | 20,200 | D | 88.6
Vural et al. [23] | HMM (Turkish) | Proprietary | 200 | 1,000 | I | 91.4
This work | HMM (Turkish) | Proprietary | 804 | 1,956 | I | 91.7


A summary of some of the most recent results reported on public databases, together with those on Turkish, is given in Table 1.1. Unfortunately, most of the results discussed in this section are not directly comparable, since different researchers have used different datasets, different subsets of a dataset, or different settings (multi-writer or omni-writer). Nonetheless, the table is intended to give an idea of the progress in the field.

1.3.5 Handwriting Recognition for Turkish

There is little research on the recognition of handwritten Turkish text. A number of studies cover offline Turkish character recognition under some constraints on style or case [24, 25, 26].


In offline handwritten Turkish text recognition, Yanikoglu and Kholmatov use the HMM letter models previously developed for English by mapping the Turkish characters to the closest English characters (an input of the word güneş is recognized as gunes). They report a 56% top-10 word recognition rate using a 17,000-word lexicon obtained from a newspaper corpus. Resembling the work in [27], the authors of [28] use a character-based word recognition method for offline lowercase mixed-style handwritten Turkish words and achieve an 84% recognition rate using a dictionary of size 2,500.

In online handwriting recognition, Vural et al. obtain a 94% word recognition rate using a 1,000-word lexicon and report that about 35% of the errors are due to delayed strokes [23].

1.4 Thesis Overview

With the ever increasing use of computers and digital appliances like tablet PCs and smartphones, input modalities are evolving towards more natural interactions like touch, speech, sketching and handwriting. Handwriting recognition in general, and online handwriting recognition in particular, is an active research area that has a direct impact on everyday technology. Products employing handwriting recognition functionality are widely in use. Educational applications, which have been employing handwriting recognition for some time, are likely to benefit from any improvement in this research area. Nevertheless, there is a lack of research on the recognition of online handwritten Turkish text, and this work primarily aims to fill that gap with a state-of-the-art recognizer for the first time.

This thesis focuses on building an online handwriting recognizer with Hidden Markov Models for the recognition of isolated Turkish words. It starts with the development of a recognition engine with a comparatively small vocabulary, intended to be part of educational software for tablet PCs as described in [29]. The system then evolves into a large vocabulary recognition system through the integration of state-of-the-art language modeling techniques. The thesis provides solutions to problems particular to the recognition of the Turkish language and script.

Chapter 2 presents the theoretical background for Hidden Markov Models and their use for handwriting recognition. In Chapter 3, the details of language modeling for handwriting recognition are explained. The two main problems in the recognition of Turkish script, delayed strokes and the high OOV rate, are discussed in Chapter 4, and solutions are proposed in Chapter 5. Datasets, text corpora and software resources are presented in Chapter 6. Chapter 7 describes the proposed recognition system as a baseline, while Chapter 8 presents experimental results for each proposed solution and the overall system performance. Finally, Chapter 9 draws conclusions and suggests directions for future work.


Chapter 2

Hidden Markov Models for Online Handwriting Recognition

Hidden Markov Models (HMMs), which were initially applied to the automatic speech recognition (ASR) problem [30], have been a popular method for handwriting recognition as well. The capability of HMMs for time alignment, together with the maximum likelihood formulation of parameter estimation, makes them a suitable technique for modeling sequences in recognition problems in domains like speech, computational biology and online handwriting recognition. Starting from the early 1990s ([31, 32]), HMMs became a preferred approach for script recognition, historically first for the offline modality and then for the online form ([33, 3]). In this section, a brief introduction to HMM theory is given first. Next, the use of HMMs for the online handwriting recognition problem is explained.

2.1 Hidden Markov Models

HMMs can be classified as discrete or continuous based on their observation densities. If the observations come from a categorical distribution, the HMM is said to be discrete. With continuous HMMs, observations are generated from a Gaussian distribution. For the sake of simplicity, the theory will be explained mostly with discrete HMMs. Later, the application of the technique to online handwriting will cover the continuous type of HMMs.

2.1.1 Definition

A Hidden Markov Model (HMM) is a statistical model in which a system is modeled with a set of unobservable (hidden) states and probabilistic transitions between them. In an HMM, the sequence of states that the system goes through is unknown, but each state emits symbols from a predefined alphabet according to a probability model, and these emitted symbols are observable.

The discrete HMM formalism is a discrete state Markov process with the additional property of probabilistic outputs. A discrete state Markov process can be in one of N distinct discrete states, {S_1, S_2, ..., S_N}, at any given time. It satisfies the first order Markov condition: its future behavior depends only on its present state, which gives rise to the state-independence assumption.

According to the first order Markov assumption, the probability of being in a state at time n depends only on the state at time n − 1:

\[ P(Q_n = S_i \mid Q_{n-1} = S_j, Q_{n-2} = S_a, \ldots, Q_0 = S_b) = P(Q_n = S_i \mid Q_{n-1} = S_j), \quad \forall i, j, a, b, n. \tag{2.1} \]

The process can initially, at time t_0, be in any one of its discrete states. The initial state probabilities of a discrete state Markov process with N states are denoted Π = {π_i}, where

\[ \pi_i = P(Q_0 = S_i), \quad 1 \le i \le N. \tag{2.2} \]

The process can make a transition from state i to state j with probability a_ij at each discrete time step. The state transition probabilities are denoted A = {a_ij}, where

\[ a_{ij} = P(Q_n = S_j \mid Q_{n-1} = S_i), \quad 1 \le i \le N, \quad \text{and} \quad \sum_{j} a_{ij} = 1, \ \forall i. \tag{2.3} \]

Π and A together define a discrete state Markov process. An extension with probabilities of observation symbol emission at the visited states is required for the definition of an HMM.

In the HMM formalism, each state can emit an observation with some probability. Since observations are probabilistic, an observation sequence does not have a corresponding deterministic state sequence. In general, there are many possible state sequences that generate a given observation sequence, which makes the actual state sequence "hidden".

If a model has N distinct states and M distinct observation symbols in each state, O_n is the observation at time n, and v_k is the event that the observation symbol is k, then the state observation probabilities B = {b_i(v_k)} are

\[ b_i(v_k) = P(O_n = v_k \mid Q_n = S_i), \quad 1 \le i \le N \ \text{and} \ 1 \le k \le M. \tag{2.4} \]

It is assumed that the output observation at a given time depends only on the current state; it is independent of previous observations and states, which can be stated as

\[ P(O_n = v_k \mid O_{n-1} = v_a, O_{n-2} = v_b, \ldots, O_0 = v_c, Q_n = S_k) = P(O_n = v_k \mid Q_n = S_k), \quad \forall k, a, b, c, n. \tag{2.5} \]

With the extension of state observation probabilities over the discrete state Markov process, an HMM λ is formally defined by the following elements:


• A set of states Q = {S_1, ..., S_N}.

• A transition model, defined by the probabilities P(Q_t = S_i | Q_{t-1} = S_j) for all S_i, S_j ∈ Q, where P(Q_t = S_i) denotes the probability of the process being in state S_i at time t.

• A probability distribution over initial states, P(Q_1 = S), ∀S ∈ Q.

• An emission model, defined by the probabilities P(X | S), where S ∈ Q and X is an observation from the alphabet. In the case of discrete HMMs the alphabet is a finite set; in continuous HMMs it is the set of real vectors R^D, where D is the dimension of the observations.

Figure 2.1: An HMM with three states and a 4-symbol observation alphabet. The observation probability notation is simplified to b_ij, the probability of emitting the jth symbol in the ith state.
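For illustration only (the probability values below are invented and not taken from the thesis), a three-state, four-symbol model like the one in Figure 2.1 can be written down directly as the parameter triple λ = (A, B, Π):

```python
import numpy as np

# Discrete HMM λ = (A, B, Π) with N = 3 states and M = 4 observation symbols.
Pi = np.array([0.6, 0.3, 0.1])           # initial state probabilities π_i
A = np.array([[0.7, 0.2, 0.1],           # a_ij = P(Q_n = S_j | Q_{n-1} = S_i)
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B = np.array([[0.5, 0.3, 0.1, 0.1],      # b_i(v_k) = P(O_n = v_k | Q_n = S_i)
              [0.1, 0.4, 0.4, 0.1],
              [0.2, 0.1, 0.2, 0.5]])

# Each row of A and B is a probability distribution and must sum to one.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```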

2.1.2 Three Basic Problems of HMMs

HMMs are characterized by three fundamental problems:

1. The Evaluation Problem: Given an HMM λ = (A, B, Π) and an observation sequence O = {O_1 O_2 ... O_T}, what is P(O | λ), the probability of the observation sequence O?

2. The Decoding Problem: Given an observation sequence O = {O_1 O_2 ... O_T} and an HMM λ = (A, B, Π), what is the optimal state sequence Q = {Q_1 Q_2 ... Q_T} that best explains the observation sequence?

3. The Training Problem: Given an observation sequence O = {O_1 O_2 ... O_T}, what is the optimal model λ which maximizes P(O | λ)?

The Forward Algorithm: A Solution to the Evaluation Problem

In HMM systems, each hidden state produces only a single observation, but the actual state sequence Q = {Q_1 Q_2 ... Q_T} is unknown for a given observation sequence O = {O_1 O_2 ... O_T}. In order to compute the probability of a particular observation sequence, P(O | λ), given the model parameters λ, the probability of observing O is summed over all possible Q:

\[ P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda). \tag{2.6} \]

P(Q | λ) can be calculated from the probability of being in each state of Q by using the state independence assumption (2.1):

\[ P(Q \mid \lambda) = \pi_{Q_1}\, a_{Q_1 Q_2}\, a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T}. \tag{2.7} \]

Similarly, by the output independence assumption (2.5), P(O | Q, λ) is calculated as:

\[ P(O \mid Q, \lambda) = b_{Q_1}(O_1)\, b_{Q_2}(O_2) \cdots b_{Q_T}(O_T). \tag{2.8} \]

Rewriting (2.6) using (2.7) and (2.8), the probability of an observation sequence given a model is obtained by summing over all state sequences:

\[ P(O \mid \lambda) = \sum_{\text{all } Q} \pi_{Q_1}\, b_{Q_1}(O_1)\, a_{Q_1 Q_2}\, b_{Q_2}(O_2) \cdots a_{Q_{T-1} Q_T}\, b_{Q_T}(O_T). \tag{2.9} \]

It is infeasible to sum over all possible state sequences, especially when the number of possible states N or the length of the observation sequence T increases. Instead of the naive method, which has complexity O(2T · N^T), a dynamic programming approach can be employed with much greater efficiency. The forward algorithm provides an efficient solution by the recursive computation of so-called forward variables, with a complexity of O(N^2 T). A forward variable, α_t(j), holds the probability of being in state j after seeing the first t observations, at time t, given the model λ:

\[ \alpha_t(j) = P(O_1 O_2 \ldots O_t, Q_t = S_j \mid \lambda). \tag{2.10} \]

With the dynamic programming approach, each forward variable is computed from the forward variable values of the previous time step. The recursive definition for the forward variables is:


\[ \alpha_t(j) = \begin{cases} \pi_j\, b_j(o_1), & t = 1,\ 1 \le j \le N \\ \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t), & 2 \le t \le T,\ 1 \le j \le N \end{cases} \tag{2.11} \]

After the forward variables for all states at time T have been calculated, the probability of a sequence of length T is computed by summing the α_T(j) values:

\[ P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j). \tag{2.12} \]

The Viterbi Algorithm: A Solution to the Decoding Problem

Given an observation sequence O = {O_1 O_2 ... O_T} and a model λ, the most probable sequence of states that generated O, Q = {Q_1 Q_2 ... Q_T}, can be computed naively by evaluating all possible hidden state sequences. However, this is again infeasible for many real tasks due to the exponentially large number of state sequences. The Viterbi algorithm, which is based on the dynamic programming paradigm, provides an efficient solution to the decoding problem.

Much like the forward variables of the forward algorithm, the Viterbi algorithm keeps calculations at each time step as intermediate values and builds on them to reach a final result. An intermediate value v_t(j) represents the maximum probability of being in state j at time t after seeing the first t observations and passing through the most probable state sequence Q_1 Q_2 ... Q_{t-1}, given the model λ:

\[ v_t(j) = \max_{Q_1 Q_2 \ldots Q_{t-1}} P(Q_1, Q_2, \ldots, Q_{t-1}, O_1, O_2, \ldots, O_t, Q_t = j \mid \lambda). \tag{2.13} \]

The recursive definition of the v_t(j) values makes use of the calculations from the previous time steps:

\[ v_t(j) = \begin{cases} \pi_j\, b_j(o_1), & t = 1,\ 1 \le j \le N \\ \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), & 2 \le t \le T,\ 1 \le j \le N \end{cases} \tag{2.14} \]

At the end of the process, the value max_{1≤j≤N} v_T(j) gives the maximum probability P* of seeing the given observation sequence by passing through the optimal state sequence Q*. Keeping track of the hidden state that led to each state by means of back-pointers allows the optimal state sequence to be recovered by backtracking from the end state to the beginning.
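A corresponding sketch of the Viterbi recursion (2.14) with back-pointer bookkeeping is given below; this is an illustrative implementation of mine, not the HTK-based decoder used later in the thesis, and the toy model values are invented.

```python
import numpy as np

def viterbi(Pi, A, B, observations):
    """Return (P*, Q*): the probability of the best state path (Eq. 2.14)
    and the path itself, recovered by backtracking over back-pointers."""
    N, T = len(Pi), len(observations)
    v = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    v[0] = Pi * B[:, observations[0]]
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j]             # v_{t-1}(i) * a_ij for all i
            backptr[t, j] = np.argmax(scores)
            v[t, j] = scores[backptr[t, j]] * B[j, observations[t]]
    path = [int(np.argmax(v[T - 1]))]               # most probable final state
    for t in range(T - 1, 0, -1):                   # backtracking
        path.append(int(backptr[t, path[-1]]))
    return v[T - 1].max(), path[::-1]

Pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(Pi, A, B, [0, 1, 1]))
```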

The Baum-Welch Algorithm: A Solution to the Training Problem

Given an observation sequence O = O_1 O_2 ... O_T, the training problem of HMMs is the computation of the initial, transition and observation probabilities that maximize the probability of the observation sequence, P(O | λ).


The Baum-Welch algorithm [34, 35], also known as the forward-backward algorithm, is the standard method of training HMMs. The algorithm itself is a version of the Expectation-Maximization (EM) algorithm: starting from initial estimates of the transition and observation probabilities, it iteratively computes expected state occupancy counts and expected state transition counts and then re-estimates the probabilities from these values.

The maximum likelihood estimate of the probability a_ij of a particular transition between states i and j can be calculated by counting the number of times the transition takes place, divided by the total number of transitions out of state i:

\[ a_{ij} = \frac{\text{number of transitions from } Q_i \text{ to } Q_j}{\sum_{Q \in \mathcal{Q}} \text{number of transitions from } Q_i \text{ to } Q} \tag{2.15} \]

These counts cannot be computed directly in an HMM because the states are hidden and the sequence of states that generated a given input is unknown. The forward variable and a similar backward variable are used to estimate these counts in the Baum-Welch algorithm.

A backward variable, β_t(i), represents the probability of seeing the partial observation sequence O_{t+1} O_{t+2} ... O_T given that the model is in state i at time t, for a given HMM λ:

\[ \beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid Q_t = S_i, \lambda). \tag{2.16} \]

The computation of the backward variables is done recursively:

\[ \beta_t(i) = \begin{cases} 1, & t = T,\ 1 \le i \le N \\ \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), & 1 \le t \le T-1,\ 1 \le i \le N \end{cases} \tag{2.17} \]

The probability of a given observation sequence, P(O | λ), can be expressed in terms of both the forward (2.11) and backward variables as:

\[ P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i). \tag{2.18} \]

An estimate for a_ij can be expressed as:

\[ \hat{a}_{ij} = \frac{\text{expected number of transitions from } Q_i \text{ to } Q_j}{\sum_{Q \in \mathcal{Q}} \text{expected number of transitions from } Q_i \text{ to } Q} \tag{2.19} \]

The expected number of transitions from Q_i to Q_j can be expressed as a sum over all time steps of the observation sequence of the quantity ξ_t(i, j), which is defined as the joint probability of being in state i at time t and in state j at time t + 1, given an observation sequence O and a model λ:

\[ \xi_t(i, j) = P(Q_t = S_i, Q_{t+1} = S_j \mid O, \lambda). \tag{2.20} \]

\[ \xi_t(i, j) = \frac{P(Q_t = S_i, Q_{t+1} = S_j, O \mid \lambda)}{P(O \mid \lambda)} \tag{2.21} \]

The numerator of (2.21) can be expressed with the forward and backward variables and the transition and observation probabilities of the corresponding states:

\[ P(Q_t = S_i, Q_{t+1} = S_j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j). \tag{2.22} \]

Finally, ξ_t(i, j) is defined as:

\[ \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}. \tag{2.23} \]

Using the definition in (2.23), the estimated transition probability \hat{a}_{ij} in (2.19) can be expressed as:

\[ \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i, j)}. \tag{2.24} \]

The estimate of the observation probability of symbol v_k at state j, \hat{b}_j(v_k), can be expressed as:

\[ \hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} \tag{2.25} \]

The probability of being in state j at time t, γ_t(j), for a given observation sequence O and a model λ is defined as:

\[ \gamma_t(j) = P(Q_t = S_j \mid O, \lambda) = \frac{P(Q_t = S_j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}. \tag{2.26} \]

The expected number of times the model is in state j while symbol v_k is observed is calculated by summing γ_t(j) over all time steps t at which the observed symbol is v_k. Equation (2.25) can then be rewritten in γ_t(j) notation as:

\[ \hat{b}_j(v_k) = \frac{\sum_{t=1,\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}. \tag{2.27} \]

The forward-backward algorithm iteratively computes ξ and γ from the current transition and observation probabilities A and B of the model λ, and then uses the resulting values to form the estimates \hat{a} and \hat{b} that replace A and B.
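To show how Eqs. (2.23)-(2.27) fit together, the following is a compact, illustrative single-iteration sketch for one observation sequence and a discrete HMM (my own code; it omits the scaling, smoothing and multi-sequence accumulation that a practical trainer such as HTK performs):

```python
import numpy as np

def baum_welch_step(Pi, A, B, obs):
    """One forward-backward (EM) re-estimation step for a discrete HMM.
    obs is a list of observation symbol indices."""
    N, T = len(Pi), len(obs)
    # Forward variables α_t(j), Eq. (2.11).
    alpha = np.zeros((T, N))
    alpha[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Backward variables β_t(i), Eq. (2.17).
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[T - 1].sum()                       # P(O | λ)
    gamma = alpha * beta / likelihood                     # Eq. (2.26)
    # ξ_t(i, j), Eq. (2.23): expected transition counts per time step.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / likelihood
    # Re-estimation, Eqs. (2.24) and (2.27).
    new_Pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    symbols = np.array(obs)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[symbols == k].sum(axis=0) / gamma.sum(axis=0)
    return new_Pi, new_A, new_B
```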


2.1.3 HMMs for Online Handwriting Recognition

Online handwriting is represented with states in an HMM system. Depending on the chosen recognition unit, a set of states and the transitions between them can represent words, characters or sub-characters. Characters are the natural unit for many scripts.

When the recognition unit is chosen to be characters, each training word or sentence is represented by concatenating a sequence of character models to form one large model. In Figure 2.2, the words {ONE, TWO, THREE ... TEN} are represented as concatenations of character HMMs, and each character is modeled by three states.

The tasks of the handwriting recognition process can be expressed as special cases of the three problems of HMMs. For example, given a set of character models λ and an observation sequence O, character recognition consists of evaluating each model λ for O to find the model with the maximum probability P(O | λ). During the training phase, for each training sample, a composite model is effectively synthesized by concatenating the character models given by the transcription of that sample. The parameters of the character models are then re-estimated using the Baum-Welch algorithm.

On the other hand, word recognition with character-based HMMs is the decoding problem: given a set of character models λ, the optimal sequence of states Q* corresponding to the observation sequence is found using the Viterbi algorithm. Using a lexicon of possible words in the form of a word network, the most probable match is searched for a given test sample. Again, the words in the network are represented with concatenated character models as shown in Figure 2.2. Sentence recognition is similar to word recognition with the addition of a special model for the space character.
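The concatenation of character models into a word model can be pictured as block-stacking their parameters. The sketch below is only an illustration of that idea for discrete emissions (in the thesis the composite models are built and decoded with HTK, and the emissions are continuous); the assumption that each character HMM is left-to-right and leaves some exit probability mass in its last row is mine.

```python
import numpy as np

def concat_character_hmms(char_models, word):
    """Build a word HMM by concatenating character HMMs end-to-end.
    char_models maps a character to (A, B); the leftover probability mass in
    the last row of each A is redirected to the first state of the next model."""
    As = [char_models[c][0] for c in word]
    Bs = [char_models[c][1] for c in word]
    sizes = [a.shape[0] for a in As]
    N = sum(sizes)
    A = np.zeros((N, N))
    offset = 0
    for idx, (a, n) in enumerate(zip(As, sizes)):
        A[offset:offset + n, offset:offset + n] = a   # copy the character block
        if idx + 1 < len(sizes):                      # link to the next character
            A[offset + n - 1, offset + n] = 1.0 - a[n - 1].sum()
        offset += n
    B = np.vstack(Bs)                                 # stacked emission tables
    Pi = np.zeros(N)
    Pi[0] = 1.0                                       # start in the first character
    return Pi, A, B
```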


HMM Configuration

As a graphical model, the design of the topology is an important part of HMM configuration. The term topology defines the valid transitions between states within the context of HMMs. Figure 2.3 shows some examples of possible topology designs. An ergodic or fully-connected topology, where transitions between any two states are allowed, is generally not suitable for handwriting recognition since handwriting data is chronologically organized. Models in which transitions are limited to the potentially relevant ones, with irrelevant ones suppressed by a hard-coded transition probability of 0, describe handwriting better. Linear and Bakis-type topologies are the most common HMM topologies in handwriting recognition. A transition can represent 1) progress in time from Q_i to Q_j where j > i, 2) a self-transition to the same state to match the variable duration of a segment, or 3) a skip over one or more (usually at most two) states forward to match optional or missing parts.

Figure 2.3: HMM topologies: 1) Linear 2) Bakis-type 3) Ergodic
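In code, choosing a topology simply means zeroing out the disallowed entries of the transition matrix before training; the helper below is an illustrative sketch (the uniform initial values are my own choice and would be re-estimated by Baum-Welch).

```python
import numpy as np

def left_to_right_transitions(num_states, max_skip=0):
    """Transition matrix for a linear (max_skip=0) or Bakis-type (max_skip=1)
    topology: self-loop, next state and up to max_skip states ahead are
    allowed; all other transitions are hard-coded to zero."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        allowed = range(i, min(i + max_skip + 2, num_states))
        for j in allowed:
            A[i, j] = 1.0 / len(allowed)
    return A

print(left_to_right_transitions(4, max_skip=1))   # Bakis-type, 4 states
```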

The number of states per model can be the same for all HMMs or can be chosen according to the complexity of the models. For example, in the case of character models, the number of states can be adjusted to account for the different lengths of characters. Word models are formed by concatenating character models end-to-end.

For most pattern recognition applications, the outputs are real-valued observations coming from distributions over ℝ that do not have parametric descriptions. Mixtures of densities are used for approximating the output probabilities in such systems. Gaussian mixture models, which are very common in HMMs, can be defined as:

\[ p(x \mid s) = \sum_{m=1}^{M} c_{sm}\, \mathcal{N}(x; \mu_{sm}, \Sigma_{sm}) \]

where \mathcal{N}(\cdot; \mu, \Sigma) is a multivariate Gaussian distribution with mean μ and covariance matrix Σ. The subscript notation "sm" indicates the m-th mixture component in the GMM of state s. The number of Gaussian mixture components is a crucial part of the HMM configuration. It is usually decided using an iterative approach, incrementing the number of components by splitting during the training of the models ([36]).
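For reference, a state output density of this form can be evaluated as follows; the sketch assumes diagonal covariance matrices (a common simplification in HMM systems, not a claim about the thesis configuration), and all names and numbers are illustrative.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """log p(x | s) for a diagonal-covariance Gaussian mixture.
    weights: (M,), means and variances: (M, D), x: (D,)."""
    diff = x - means
    # Per-component log N(x; mu_m, diag(var_m)).
    log_comp = -0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=1)
    # Weighted log-sum-exp over the M components.
    return np.logaddexp.reduce(np.log(weights) + log_comp)

# Toy 2-component mixture over 3-dimensional feature vectors.
w = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_log_density(np.array([0.5, 0.2, 0.1]), w, mu, var))
```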


Although it is very common to choose HMM configurations based on experiments and heuristics, there are other methods that adapt the number of states, the topology and the number of mixture components to the model structure automatically [37, 38, 39, 36]. The effects of individual configuration components are also reported in some studies [40, 41]. While the number of states per model is an important part of the HMM configuration, the topology is found to have a stronger influence on the modeling capability of HMMs.


Chapter 3

Language Modeling

Language modeling is an essential part of all modern recognition systems. It is used to improve the recognition results by imposing constraints on the decoding procedure and the output of the system.

During the HMM decoding process, the most probable path is searched through a probabilistically scored time/state lattice (see Section 2.1.2). A list of words comprising the lexicon is used to limit the search for probable paths to the valid words designated for that system. The size of the lexicon is typically thousands of words for large vocabulary recognition systems. A task related to language modeling is to generate an appropriate recognition lexicon for the particular task, which is covered in Chapter 4.

It is not sufficient to constrain the output; the order of the elements in the output should be correct as well. For example, not every sequence of valid words makes a grammatical sentence. It is necessary to have a means of selecting the correct ordering of lexicon items. Another main motivation of language modeling is therefore to choose the most likely word sequence by likelihood estimation.

There are different methods of modeling sequence probabilities in a language. Statistical language modeling is a well-founded and popular approach in speech and handwriting recognition [42]. Statistical language models based on N-gram statistics have been the dominant approach in language modeling because of their simplicity and low computational complexity. Other methods include Maximum Entropy language models (ME LM) [43] and neural network based language models [44]. In this work, N-gram models are utilized for language modeling.

3.1 N-gram Language Modeling

An N-gram is an N-token sequence of lexicon elements such as letters, syllables or words. An N-gram model gives the conditional probability of observing a word given the history of the previous N − 1 words [45]. The chain rule of probability provides a way to calculate the joint distribution of random variables x_1 ... x_n:


\[ P(x_1 \ldots x_n) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1 x_2) \cdots P(x_n \mid x_1 \ldots x_{n-1}) = \prod_{k=1}^{n} P(x_k \mid x_1 \ldots x_{k-1}). \tag{3.1} \]

Using the chain rule, a relationship can be established between the joint probability of a sequence of words and the conditional probabilities of each word given the previous ones. The probability of a word sequence (w_1, w_2, ..., w_n) can be estimated by multiplying the conditional probabilities of the words given their histories of previous words:

\[ P(w_1 \ldots w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1}). \tag{3.2} \]

Computation of the exact probability of a word given a long sequence of preceding words, P(w_n | w_1 ... w_{n-1}), is neither feasible nor always possible for long histories. Instead of computing the probability of a word given its entire history, an approximation is possible by limiting the history to a few words. For example, a bi-gram model (2-gram) uses a history of one word, while a tri-gram model (3-gram) covers the previous two words. In general, an N-gram model approximates the conditional probability of the next word by using the previous N − 1 words. As the value of N increases, N-gram models are more successful in modeling the training data since they use a longer context, at the cost of increased complexity.

With bi-gram models, the probability of a word given all the words preceding it, P(w_n | w_1 ... w_{n-1}), is approximated by the conditional probability given only the previous word, P(w_n | w_{n-1}):

P(w_n \mid w_1 \dots w_{n-1}) \approx P(w_n \mid w_{n-1})   (3.3)

The probability of a complete word sequence is then:

P(w_1 \dots w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})   (3.4)
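A minimal Python sketch of this factorization, using hypothetical bi-gram probabilities and sentence boundary markers, might look as follows:

import math

# Hypothetical bi-gram probabilities P(w_k | w_{k-1}); not estimated from any corpus.
bigram_p = {
    ("<s>", "bu"): 0.20, ("bu", "kitap"): 0.05,
    ("kitap", "güzel"): 0.10, ("güzel", "</s>"): 0.30,
}

def sentence_logprob(words):
    """log P(w_1..w_n) under the bi-gram approximation of Eq. 3.4."""
    seq = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_p[(prev, cur)])
               for prev, cur in zip(seq, seq[1:]))

print(sentence_logprob(["bu", "kitap", "güzel"]))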

The general equation of N-gram approximation for the conditional probability of the next word of a given sequence is:

P(w_n \mid w_1 \dots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \dots w_{n-1})   (3.5)

3.1.1

Estimation of N-gram model parameters

Parameters of an N-gram model can be estimated with Maximum Likelihood Estimation (MLE). Using bi-gram modeling, the probability estimate for a given sequence of words (w_{n-1} w_n) is calculated as the count of the sequence normalized by the sum of counts of all word sequences starting with w_{n-1}:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}   (3.6)

which can be further simplified to:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}   (3.7)

In general, based on the normalization of observation frequencies, i.e. relative frequencies, the MLE estimate of an N-gram probability can be written as:

P(w_n \mid w_{n-N+1} \dots w_{n-1}) = \frac{C(w_{n-N+1} \dots w_{n-1} w_n)}{C(w_{n-N+1} \dots w_{n-1})}   (3.8)

Parameters estimated by MLE, as the name suggests, maximize the likelihood of the training set T given the model M, P(T | M).
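The counting behind Equations 3.6–3.8 can be sketched in a few lines of Python; the toy corpus below is purely illustrative, whereas in the actual system the counts come from the text corpus used to train the language model.

from collections import Counter

corpus = [["<s>", "a", "b", "a", "</s>"],
          ["<s>", "a", "b", "b", "</s>"]]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_mle(w, prev):
    """P(w | prev) = C(prev w) / C(prev), as in Eq. 3.7."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("b", "a"))   # C(a b) / C(a) = 2 / 3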

3.1.2

Interpolation and Backoff

Interpolation and backoff are two mechanisms for estimating higher-order N-gram probabilities, which suffer from data sparsity, from lower-order probabilities. A backoff model uses the lower-order counts only if the higher-order count is zero. Interpolation, on the other hand, estimates the probability from all of the N-gram orders, mixing their counts in a weighted combination. The weights are learned from a held-out dataset.
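A minimal numerical sketch of linear interpolation, with assumed component estimates and weights, is given below; in practice the weights are tuned on held-out data and must sum to 1.

# Hypothetical MLE estimates for one prediction (the tri-gram count was zero).
p_tri, p_bi, p_uni = 0.0, 0.20, 0.05
lambda3, lambda2, lambda1 = 0.6, 0.3, 0.1   # assumed weights learned on held-out data

p_interp = lambda3 * p_tri + lambda2 * p_bi + lambda1 * p_uni
print(p_interp)   # non-zero even though the tri-gram was unseen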

3.1.3

Smoothing

Since the maximum likelihood estimation of the model parameters is based on assigning probabilities from counts in a limited training set, N-gram sequences that are missing from the training set are set to zero probability. However, any valid sequence should have a non-zero probability.

The smoothing process modifies the MLE approach so that a non-zero probability is assigned to any N-gram, even if it is not observed in the training set. There are a number of smoothing algorithms such as Laplace smoothing, Good-Turing discounting, Witten-Bell discounting and Kneser-Ney smoothing. In this work, the Kneser-Ney method is used for the bi-gram and tri-gram models.
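As a simple point of comparison, add-one (Laplace) smoothing of a bi-gram estimate can be sketched as follows; the counts and vocabulary size are hypothetical.

def p_laplace(c_bigram, c_history, V):
    """P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V), with V the vocabulary size."""
    return (c_bigram + 1) / (c_history + V)

print(p_laplace(0, 3, 10000))   # an unseen bi-gram still gets a small probability
print(p_laplace(2, 3, 10000))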


Kneser-Ney Smoothing

The Kneser-Ney method [46] is a backoff model based on absolute discounting. Absolute discounting is an interpolation method, but it subtracts a fixed discount δ ∈ [0, 1] from the non-zero counts instead of multiplying them by a weight.

With the Kneser-Ney method, instead of the unigram MLE count, the backoff distribution uses a heuristic that more accurately estimates how likely a word w is to be seen in a new, unseen context. That heuristic is the number of different contexts the word w appears in: the more different contexts a word appears in, the more likely it is to appear in a new, unseen context. The backoff probability of the Kneser-Ney method, called the "continuation probability", is given as:

P_{CONT}(w_i) = \frac{|\{w_{i-1} : C(w_{i-1} w_i) > 0\}|}{\sum_{w_i} |\{w_{i-1} : C(w_{i-1} w_i) > 0\}|}   (3.9)

and the Kneser-Ney probability is:

P_{KN}(w_i \mid w_{i-1}) =
\begin{cases}
  \frac{C(w_{i-1} w_i) - \delta}{C(w_{i-1})}, & \text{if } C(w_{i-1} w_i) > 0 \\
  \alpha(w_{i-1}) \, P_{CONT}(w_i), & \text{otherwise}
\end{cases}

with a suitable discount value δ and a backoff coefficient α(w_{i-1}) chosen so that all the probabilities sum to 1.
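The following Python sketch illustrates bi-gram Kneser-Ney estimation on toy counts; the discount value and the constant backoff weight are assumptions made only to keep the example short, whereas in a real model α(w_{i-1}) is computed so that the distribution normalizes.

from collections import Counter, defaultdict

# Toy bi-gram counts; a real model is estimated from a large corpus.
bigram_c = Counter({("san", "francisco"): 5, ("the", "cat"): 3,
                    ("a", "cat"): 2, ("the", "dog"): 1})
delta = 0.75   # fixed absolute discount

history_c = Counter()
continuations = defaultdict(set)      # distinct left contexts of each word
for (w_prev, w), c in bigram_c.items():
    if c > 0:
        history_c[w_prev] += c
        continuations[w].add(w_prev)

total_bigram_types = sum(len(s) for s in continuations.values())

def p_continuation(w):
    """How many distinct contexts w completes, normalized (Eq. 3.9)."""
    return len(continuations[w]) / total_bigram_types

def p_kneser_ney(w, prev):
    c = bigram_c[(prev, w)]
    if c > 0:
        return (c - delta) / history_c[prev]
    # Back off to the continuation probability; a constant alpha is used here
    # only for brevity, the proper alpha(prev) redistributes the discounted mass.
    alpha = 0.4
    return alpha * p_continuation(w)

print(p_kneser_ney("cat", "the"))        # seen bi-gram, discounted MLE
print(p_kneser_ney("francisco", "the"))  # unseen bi-gram; "francisco" completes few contexts, so low backoff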

3.2

Perplexity

Statistical language models are usually evaluated with an intrinsic measure called perplexity. Perplexity measures how well a probability model (i.e. a language model) predicts the test data. A more intuitive definition is the average number of different, equally probable words that can follow any given word. For a test set with N words w_1, w_2, ..., w_N, the perplexity of the model on the test set is defined as the probability of the test set, normalized by the number of words N:

PP = P(w_1, w_2, ..., w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}}   (3.10)

Using the chain rule to expand the word sequence in Equation 3.10:

PP = P(w_1, w_2, ..., w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, ..., w_{i-1})}}   (3.11)


If the test set is taken as one whole sequence of words, the sentence boundaries are also taken into consideration in the probability computation through the beginning and ending markers <s> and </s>. The ending marker </s> is included in the total count of word tokens N as well.

As can be seen from Equation 3.11, minimizing the perplexity means maximizing the probability of the test set according to the language model. The model that assigns the highest probability to the test data predicts it most accurately, so lower perplexity generally indicates a better model. However, perplexity is not a definitive way of determining the usefulness of a language model. A model with lower perplexity may be better at predicting the next word on the test set, but its performance on real-life data may differ if the test set is not a good sample of that data. In general, perplexity is a useful metric for comparing language models on the same test data; moreover, two models can be compared by perplexity only if they use the same vocabulary.
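A small Python sketch of this computation, done in log space and counting </s> but not <s> in N as described above, could be:

import math

def perplexity(test_sentences, bigram_prob):
    """PP = P(w_1..w_N)^(-1/N), computed in log space for numerical stability."""
    log_prob, n_tokens = 0.0, 0
    for sent in test_sentences:
        seq = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(seq, seq[1:]):
            log_prob += math.log(bigram_prob(cur, prev))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical smoothed bi-gram model that assigns the same probability to every
# word of a 1000-word vocabulary (a uniform baseline).
uniform = lambda w, prev: 1.0 / 1000
print(perplexity([["bu", "kitap", "güzel"]], uniform))   # = 1000, the branching factor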

Perplexity is derived from the cross-entropy concept of information theory [47]. Entropy is a measure of information which can be adapted in language processing to quantify how well a given grammar matches a given language, how much information a particular grammar contains, and how predictive a given N-gram grammar is about the next word [45].

The entropy of a sequence of variables, as in the case of a word sequence in a grammar, can be calculated through a random variable that ranges over these sequences.

Per-word entropy or entropy rate is the entropy of the sequence divided by the number of words in the sequence:

\frac{1}{n} H(w_1, \dots, w_n) = -\frac{1}{n} \sum_{(w_1, \dots, w_n) \in L} p(w_1, \dots, w_n) \log p(w_1, \dots, w_n)   (3.12)

For the calculation of the true entropy of a language, which can be regarded as a stochastic process, sequences of infinite length should be considered. The entropy rate of a language L is defined as:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, \dots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{(w_1, \dots, w_n) \in L} p(w_1, \dots, w_n) \log p(w_1, \dots, w_n)   (3.13)

If the language is ergodic and stationary¹, the summation over all possible word sequences can be discarded. In this case, a single, long-enough sequence of words can be used for estimating the entropy and the entropy rate of the stochastic process:

H(L) = -\lim_{n \to \infty} \frac{1}{n} \log p(w_1 \dots w_n)   (3.14)

¹ A language is stationary if the probability distribution of the words does not change with time, and a language is ergodic if its statistical properties can be deduced from a single, sufficiently long sequence of words.


When the actual probability distribution p that generated the data is not known, an approximation to the entropy is calculated as the cross-entropy. With the cross-entropy, a model m that approximates p is used for the calculation of sequence probabilities. The cross-entropy of m on p is defined as:

H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{(w_1, \dots, w_n) \in L} p(w_1, \dots, w_n) \log m(w_1, \dots, w_n)   (3.15)

According to Equation 3.15, sequences come from the probability distribution p, but the log probabilities are calculated according to the model m. With the assumption that the process is stationary and ergodic, as in Equation 3.14, the cross-entropy can be approximated by taking a single, sufficiently long sequence instead of a summation over all possible sequences:

H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \log m(w_1, \dots, w_n)   (3.16)

Since the probabilities are calculated according to the model m, the cross-entropy is an upper bound for the true entropy, H(p) ≤ H(p, m). Using that relation, two models can be compared through their cross-entropy values; the more accurate the model m, the lower the cross-entropy and the closer it is to the true entropy. Again, an approximation of the cross-entropy of a model M = P(w_i | w_1 ... w_{i-1}) is possible by using a sufficiently long sequence W with a fixed length N, instead of an infinite one:

H(W) = -\frac{1}{N} \log P(w_1, \dots, w_N)   (3.17)

The perplexity (PP) of a model P on a sequence of words W is then defined as 2 raised to this cross-entropy:

PP(W) = 2^{H(W)} = P(w_1, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, ..., w_{i-1})}}   (3.18)
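Numerically, the relation between Equations 3.17 and 3.18 can be sketched as follows, with hypothetical per-word log probabilities in base 2:

# Hypothetical log2 P(w_i | history) values assigned by a model to N = 4 test words.
log2_probs = [-8.2, -5.1, -9.7, -6.4]
N = len(log2_probs)

H = -sum(log2_probs) / N   # cross-entropy in bits per word (Eq. 3.17)
PP = 2 ** H                # perplexity (Eq. 3.18)
print(H, PP)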

Perplexity can be seen as the weighted average branching factor of a language, expressing the average number of equally probable words that can follow a given word. It provides a measure of the prediction search space of the language, just as entropy provides a measure of the size of a search space.


3.3

Integration of Language Models into the Decoding Process

The Viterbi algorithm, as explained in Section 2.1.2, uses a simplifying assumption: if the ultimate best path for the entire observation sequence goes through a state q_i, then the best path up to and including q_i must be a part of that overall best path. This assumption limits the algorithm to a bi-gram language model, since a tri-gram model violates it by conditioning the probability of a word on the previous two words [45]. A best tri-gram path may not lie on the global best path of the entire word sequence, even though one of its components does.

One solution to this problem is the lattice expansion method, which changes the lattice so that all of the N-grams are represented on it. Another method of language model integration is to generate multiple hypotheses in the decoding process and then re-rank them using a higher-order language model, which is called lattice rescoring.

3.3.1

Lattice Expansion

Figure 3.1: A simple lattice (network) with no language model scores

The lattice expansion method is based on adding new nodes and edges as duplications of existing ones, so that all possible N-grams are represented. The transitions between nodes are then used to reflect the language model. Lattice expansion can be used in single-pass decoding. In order to accommodate all N-grams in a given network, a unique (N − 1)-word context should be created for each transition [48].

In Figure 3.1, a simple grammar network is drawn as a digraph. It has 14 nodes, 5 of which are NULL nodes used to simplify the network. In the initial setting no language model is applied to the network, hence the language model scores on the transitions are empty. If the network lattice is expanded with a bi-gram language model, no new nodes are required for bi-gram score integration, but the structure of the lattice and its edges is fundamentally changed to represent each bi-gram uniquely, as shown in Figure 3.2. Finally, in Figure 3.3, by integrating tri-gram scores into the original lattice, new nodes are added and many edges are introduced so that each tri-gram transition has a unique bi-gram context.


Figure 3.2: The simple lattice expanded with a bi-gram model

As can be seen from this example, lattice expansion with higher-order language models increases the complexity of the decoding lattice in terms of nodes and the transitions between them.
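A toy sketch of attaching bi-gram scores to lattice edges is given below; the node names, words and probabilities are invented, and the lattice is kept so small that every node already has a unique one-word history, which is exactly the property that expansion has to create in the general case.

import math

# Toy word lattice: each edge carries the word it hypothesises.
edges = [("start", "n1", "<s>"), ("n1", "n2", "ev"), ("n1", "n3", "el"),
         ("n2", "end", "</s>"), ("n3", "end", "</s>")]

# Hypothetical bi-gram log-probabilities used as transition scores.
lm = {("<s>", "ev"): math.log(0.02), ("<s>", "el"): math.log(0.01),
      ("ev", "</s>"): math.log(0.30), ("el", "</s>"): math.log(0.25)}

# In this toy lattice each node has a single incoming word, so it serves as
# the one-word context needed by the bi-gram model.
incoming_word = {"n1": "<s>", "n2": "ev", "n3": "el"}

scored = [(u, v, w, lm.get((incoming_word.get(u), w), 0.0))
          for u, v, w in edges]
for edge in scored:
    print(edge)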

3.3.2

Lattice Rescoring

As an alternative to single-pass decoding, transcriptions can be generated through gradual integration of the language model. With the multi-pass approach [49], a lattice is generated with lower-order knowledge sources and is then rescored with higher-order ones. Multi-pass decoding has been the dominant approach in the automatic speech recognition domain for some time [50, 51]. It aims to lower the computational complexity of building the optimized search network in the first pass, while still achieving a reasonable accuracy through rescoring.

Within the context of lattice rescoring, a lattice is a weighted directed acyclic graph in which each path from the start state to a final state represents an alternative decoding hypothesis, weighted by its recognition score for a given handwriting input. A lattice is created for each test sample by running the decoder with some decoding network. Each hypothesis in the lattice keeps the optical model likelihood and the language model probability separately. Either of these two scoring parts can be replaced with a more sophisticated model to achieve better accuracy [45].

The motivation for using higher-order knowledge on a lattice is that better scoring will direct the decoding process to the path that results in the correct transcription. However, the correct transcription has to be included in the lattice in the first place. The lattice error rate is a measure of the quality of a lattice: it is the word error rate that would be obtained if a perfect knowledge source led the decoder to the path with the lowest error rate.
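A sketch of this idea on an n-best list extracted from a first-pass lattice is shown below; the scores, the grammar scale factor and the stand-in tri-gram scoring function are all hypothetical.

# Each first-pass hypothesis keeps its optical log score and its first-pass LM
# log score separately, so the LM part can be replaced by a stronger model
# (e.g. a tri-gram) and the list re-ranked.
hypotheses = [
    {"words": ["ev", "de"], "optical": -120.4, "lm_bigram": -9.1},
    {"words": ["evde"],     "optical": -121.0, "lm_bigram": -7.8},
]

def trigram_lm_score(words):
    """Stand-in for a real higher-order LM; returns a log probability."""
    return -6.0 if words == ["evde"] else -8.5

lm_weight = 10.0   # grammar scale factor balancing the two knowledge sources
for h in hypotheses:
    h["rescored"] = h["optical"] + lm_weight * trigram_lm_score(h["words"])

best = max(hypotheses, key=lambda h: h["rescored"])
print(best["words"])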
