Karakter Tanımada Karar Kaynaşımı ve Öznitelik Birleşimi (Decision Fusion and Feature Combination in Character Recognition)


1. INTRODUCTION

Pattern recognition systems are designed to achieve the best classification performance for the problem they deal with. In order to achieve the best performance, many different algorithms have been developed in many application areas. The recognition system with the highest performance in the experiments is then selected for the corresponding application. When the results of these experiments are analyzed, it is observed that although the sets of patterns misclassified by different classifiers have some similarities, they are not identical [1]. Therefore, by complementing the different decisions it is possible to improve system performance. Combining classifiers is an important and extensively researched topic in pattern recognition, because the combined system becomes more reliable or accurate by fusing the decisions of different systems. Each recognition application has different methods, depending on various feature extraction methods and classifiers, for getting the best possible result. However, a single system, i.e., one feature extraction method with one classifier, cannot yield optimally low error rates because of the lack of an appropriate pattern representation. In particular, in the case of handwritten characters, although some highly reliable systems have been proposed, it is not possible to get the best performance with a single system [2]. In handwritten recognition applications, various studies use several feature extraction methods with complex classifier structures, such as multistage classification and parallel classification, because of the large variety of feature extraction methods in that area [2].

In order to achieve higher accuracy, many combination methods have been proposed. In addition, the powerful benefits of combining classifiers have been proven by strong theoretical foundations [3, 4]. There are mainly two groups of combination methods: classifier selection and classifier fusion [5]. In classifier selection, the main idea is to select classifiers as local experts according to their success in some local area of the feature space. The second group advocates fusion of classifiers according to the information they provide, such as confidences, belief values, decisions, or posterior probabilities. Some methods, such as majority voting, use only the crisp labels of the classifiers [6]. Decision rules such as the maximum, minimum, average and product rules combine decisions according to the posterior probabilities of the classifiers [6]. These rules can be referred to as “unsupervised fusion” methods since they do not need any information about the classifiers. The Bayes and Dempster-Shafer fusion methods depend on the confusion matrix information of the classifiers [6]. Another method, behavioral knowledge space, constructs a look-up table from the decisions of the classifiers [7]. Decision templates build a template for each class by using the average decision vector [8]. Although these methods achieve good results, they do not use all available knowledge. In this study, two new classifier fusion methods are proposed in order to achieve higher performance by using all available knowledge without requiring any additional complexity.

In this study, we work on improving the performance of Turkish handwritten character recognition and license plate character recognition by proposing new fusion algorithms. As stated before, in handwritten character recognition a single system cannot achieve the best performance by itself. Thus, we need to complement the information of different systems. However, selecting the appropriate systems, i.e. the appropriate feature extraction methods and classifiers for fusion, is an important problem to be resolved. There is no detailed work on selecting the appropriate features for different fusion algorithms. We use the Karhunen-Loève transform (KLT), the Angular Radial Transform (ART), Zernike moments and structural moments as feature extraction methods because of their good representation of character patterns. KLT is a well-known technique in handwritten character recognition, while ART and Zernike moments are orthogonal moments that efficiently represent pattern properties. In addition, we use k-nn, Bayes, Parzen, Size Dependent Negative Log-Likelihood (SDNLL) and Multi-Layer Perceptron classifiers. Detailed information on the feature extraction methods and classifiers is given in the next section.

We apply fusion to character recognition in two ways. The first approach is to combine features in order to complement the appropriate properties of different extraction methods. In this approach, we combine KLT with geometric features, since KLT does not consider the structural and geometric properties of patterns when it is used on eigen-images of characters, even though it has a good recognition rate. The second approach is combining classifier decisions. In this approach, unsupervised and supervised methods are applied. Two new methods, Weighted Decision Templates and Partial Fusion of classifiers, are also proposed.

This thesis is organized as follows. Section 2 gives brief information on pattern recognition and provides background for character recognition. Preprocessing steps for a general character recognition system are also presented and the roles of feature extraction and classification methods in recognition are discussed. In addition, the feature extraction methods KLT, ART, Zernike moments, and structural features are analyzed. Moreover, information is given on the classifiers used in this study such as k-nn, Bayes, Parzen, SDNLL, and Multi Layer Perceptrons (MLP).

In Section 3, we present research on combination methods depending on different features for handwritten character recognition systems. In addition, we propose a new approach for combining features in handwritten character recognition. This approach is based on appending geometric features to KLT features in order to add the geometric properties of patterns to the KLT features.

Decision fusion methods are presented in Section 4. We divide decision fusion methods into three groups and give detailed information on each method. The first group is composed of unsupervised combining methods, such as majority voting, Borda count, and the minimum, maximum, average and product rules. The second group is the supervised fusion methods such as Bayes, Dempster-Shafer, Behavioral Knowledge Space, and Decision Templates. In the third group we propose two new fusion methods, Weighted Decision Templates and Partial Fusion. Weighted Decision Templates achieves higher performance by using all available knowledge of the classifiers.

In Section 5, the experimental results of the fusion methods are analyzed using two different character databases. The first database is composed of 29 Turkish handwritten characters and was formed by the Multimedia Center at Istanbul Technical University (ITU). The second database is the license plate character database of the ITU Multimedia Center, formed by collecting plate images at Turkish customs and at the ITU entrance. This database has 34 classes. We observe that the new methods proposed in this thesis achieve the best performance.


2. CHARACTER RECOGNITION

2.1 An Overview on Pattern Recognition

The term pattern recognition covers all stages of an investigation, from problem formulation and data collection to classification, assessment of results and interpretation. It may be characterized as an information reduction, information labeling, or information mapping of real-world data. The applications include a wide range of information processing areas such as speech processing, computer vision, character recognition, face recognition, seismic analysis, radar signal classification, fingerprint identification, and medical imaging.

Pattern recognition developed significantly in the 1960s, and a lot of different research has been done since then. Pattern recognition is mostly an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology, physiology and others. The goal of the research is to achieve a ‘brain-like’ recognition performance by using statistical approaches or machine learning algorithms [9]. Significant progress has been made in many application areas.

A pattern is a structural description of an object or anything that is of interest. In addition, the collection of patterns, having some common properties, is referred to as a pattern class. Pattern recognition is the process of representing the real data with a corresponding pattern and assigning that pattern to its corresponding pattern class. A pattern recognition system may consist of several stages. Webb [9] has enumerated these stages as shown below:

1. Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning remaining steps.

2. Data collection: making measurements on appropriate variables and recording details of the data collection procedure

3. Initial examination of the data

4. Feature selection and extraction

5. Unsupervised pattern classification or clustering: exploratory data analysis

6. Apply discrimination or regression models as appropriate

7. Assessment of results

In this enumeration, some stages may not be present or some of them may be combined together, but it is a fair representation of the stages of a general pattern recognition system. However, some additional stages may be required for particular applications. A general processing flow is shown in Figure 2.1. Stages 1 and 2 are represented in the first block. We represent stage 3 as pre-processing, stage 4 as feature extraction, and stages 5 and 6 as classification.

Real-world data → Pre-processing → Feature extraction → Classification → Pattern class ID

Figure 2.1: Pattern recognition system

The main stages of pattern recognition are preprocessing of the data, feature extraction, and classification. In general, raw data need to be preprocessed to decrease environmental defects such as noise, background, light, etc. Feature extraction is required for representing the data in lower dimensions without losing the information in the data. Classification is the stage of finding the decision boundaries between the different classes of the data. Brief information on these stages, as they relate to our work on character recognition, is given in the following sections.

2.2 Background Research on Character Recognition

Character recognition is one of the main problems in the pattern recognition field. Character recognition has been attracting much attention because of its applications in many areas such as office automation, bank check processing, recognition of postal addresses and ZIP Codes, signature verification, license plate recognition and document and text recognition.

The character recognition systems for document and text recognition are called optical character recognition (OCR) systems. Early research on OCR is as old as the history of pattern recognition; the first systems were developed in the 1950s. The first system was an invention, GISMO, by M. Sheppard in 1951. GISMO was a robot reader-writer. Later, in 1954, J. Rainbow developed a prototype machine that read uppercase typewritten letters at very slow speeds. After 1967, companies such as IBM began marketing OCR systems. Since then, thanks to developments in algorithms and technology, OCR systems have become less expensive and can recognize more fonts than ever before. On printed text their performance is as good as human performance. However, they do not reach the same performance in recognizing handwritten characters. In general, systems for handwritten characters are called intelligent character recognition (ICR) systems.

ICR systems can be divided into two groups according to their input. The first group is on-line recognizer systems, which deal with a data stream originating from a transducer as the user writes on a special device. The second group is off-line handwriting recognition, which deals with a data set obtained from a scanned handprint document.

After reliable performance was reached in printed document and text recognition, research on characters focused on handwritten numerals and characters. Although the performance of OCR systems could not be achieved in ICR, because of the variability of handwriting between different people, and even within the same person, the results are promising for the future. In order to obtain good results, Suen argues that the key factor is feature extraction. In handwritten character recognition, various kinds of feature extraction methods have been analyzed. Geometric moment invariants (Hu, Zernike, Angular Radial Transform, Fourier, Fourier-Mellin), Fourier descriptors, Gabor features, deformable templates, unitary transforms (Karhunen-Loève, Hadamard), and structural features such as end-points and background regions are some of the features that have been used [14-19]. Although most ICR systems deal with the character itself, i.e. characters segmented from words, there are some methods that consider the word as a whole for recognition.

Many classification methods have been proposed for handwritten character recognition. More than 40 different handwritten character recognition systems were tested on the same database at the First Consensus Optical Character Recognition System Conference in 1992 [10]. The top ten systems among them used either a multi-layer feed-forward neural network or a nearest-neighbor classifier. Since then, many new methods have been proposed. Table 2.1 is adopted from [11] in order to show the recognition rates of some systems. In this table, both handwritten and printed characters are considered.

On the other hand, there has been little research focused on the character recognition stage of license plate recognition. This is mainly because license plate characters show less variation in character type; the difficulties lie mostly in pre-processing.

Table 2.1: Character recognition research

Patterns       Authors                    Recognition Rate (%)
Numeral        Denker et al.              86
               Bottou et al.              91.9
               Srihari                    89-93
               Lee                        99.5
Character      Koutsougeras and Jameel    65-98
               Liou and Yang              88-95
               Srihari                    85-93
               Shustorovich               89.4-96.4
Cursive Word   Edelman et al.             50
               Lecolinet et al.           53
               Leroux et al.              62
               Chen et al.                64.9-72.3
               Bozinovic et al.           54-72
               Senior and Fallside        78.6
               Simon                      86
               Guillevic and Suen         76-98.5
               Bunke et al.               98.27

In addition to single classifier methods, feature combination and decision fusion methods have been introduced to ICR systems. In feature combination methods the main idea is to select local or global experts operating on different feature extraction methods according to pattern properties [12, 13]. Some researchers used decision trees in which a different kind of feature is used at every level. Xu applied known classifier fusion algorithms to classify numerals and reduced the error rate [6]. Brief information on fusion is given in the Feature Combination and Decision Fusion sections.


2.3 Preprocessing for Character Recognition

In general, raw data is not suitable to use directly in recognition. It has to be preprocessed to decrease environmental defects such as noise, background, light, etc. Since our topic is character recognition, we will discuss only the preprocessing for characters. The general preprocessing steps for character recognition are shown in Figure 2.2.

Real-world data → Digitization → Binarization → Segmentation → Normalization

Figure 2.2: Basic steps of preprocessing

Since buffers have finite size, digitization is the step required to store data efficiently in computers. For handwritten character recognition systems, scanners are the most flexible digitization devices. Among the variety of image formats they are capable of producing, the bitmap format (BMP) is the most commonly used. BMP files consist of a header part and a data part. The header provides characteristics of the image such as width, depth, and number of bits per pixel. A 24-bit data part follows the header and has three different 8-bit components for color: red, green, and blue (RGB). Although most character recognition applications work with binary images or, at most, gray level images, the scanning process is done in 24-bit color in order to keep the color information of the characters for future methods.

Binarization is the process of converting a color or gray level image to a binary image. By binarization, the complexity of the data can be reduced significantly: instead of using 24 bits for one pixel, the required size is only one bit per pixel. In order to compute an average gray value, the following formula (the equation of luminance) is used:

$$\mathrm{Gray}(i,j) = 0.299\,r(i,j) + 0.587\,g(i,j) + 0.114\,b(i,j) \qquad (2.1)$$

where i is the corresponding row, j is the corresponding column, and r, g and b are the red, green and blue values of the pixel (i, j). Gray(i, j) is the gray value of pixel (i, j) after converting the color image to gray level. The green component of the image is the most important color in the transformation to gray level, since human beings are most sensitive to green. After transforming to gray level, thresholding is done. A single threshold is the most common approach, although different thresholds can be applied to different regions of the image. Gray levels below the specified threshold are classified as black (0), and those above as white (1). The binarization problem thus becomes selecting an appropriate value for the threshold T, which can be chosen either by the user or algorithmically.
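
As a concrete illustration, the following is a minimal sketch of the gray level conversion of Eq. (2.1) followed by global thresholding; the NumPy formulation and the default threshold value of 128 are assumptions of this example, not values prescribed in this study.

```python
import numpy as np

def binarize(rgb, threshold=128):
    """Convert a 24-bit RGB image to a binary image via Eq. (2.1)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b      # luminance, Eq. (2.1)
    return (gray >= threshold).astype(np.uint8)   # 0 = black, 1 = white
```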

In general, character recognition systems are designed to classify images containing only one character. Therefore, images containing more than one character should be segmented into small images such that each new image contains only one character. Segmentation can be combined with the removal of noise and extraneous non-textual strokes. We can represent the segmentation process as a combination of several sub-processes.

First, each connected component, called a blob, should be found. This process is called blob coloring. A blob is defined to be a group of pixels, all continuously neighboring or connected to each other. The blob coloring process initiates a search that begins at a specific starting point in the input image and scans the input image in column-major (top-to-bottom and left-to-right) order for a black pixel. Once found, the black pixel is marked with a specific label and grown into a complete blob region, labeling every pixel with the same label number. The next blob is processed when all connected pixels of the current blob have been found and painted with a label number. At the end of blob coloring, every blob is colored with a different label (or color).

The second step of the segmentation process is eliminating unnecessary blobs. A blob is said to be unnecessary (a noise blob) if the following two conditions are met. First, the number of pixels in the blob should be smaller than a specific value; this value can be determined by looking at the average size of the blobs. Second, the small blob should be over or under a big blob. In other words, if two lines are drawn crossing the left and right ends of each relatively big blob, a small blob is said to be necessary if it is between these two lines and is not completely covered by the big blob.

The final step of segmentation is designating the letters one by one and storing them. A letter can be constructed from one blob or more than one blob. In this step, the number of letters and the assignment of blobs to letters are determined. Also, the bounding box lines of each letter are found, so the left, right, top and bottom ends of the letter are available for further steps. Indeed, segmentation of handwritten characters can present many problems, such as connected characters or discontinuity of character bodies. A sketch of the blob coloring step is given below.
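
A minimal sketch of the blob coloring pass described above, using stack-based region growing from each seed pixel; 4-connectivity and the convention that character pixels have value 0 (black) are assumptions of this example.

```python
import numpy as np

def blob_coloring(binary):
    """Label connected regions of 0-valued (black) pixels in a binary image."""
    rows, cols = binary.shape
    labels = np.zeros((rows, cols), dtype=int)
    next_label = 0
    for x in range(cols):                  # left-to-right over columns,
        for y in range(rows):              # top-to-bottom within a column
            if binary[y, x] == 0 and labels[y, x] == 0:
                next_label += 1            # new seed pixel: start a new blob
                stack = [(y, x)]
                while stack:               # grow the blob around the seed
                    i, j = stack.pop()
                    if (0 <= i < rows and 0 <= j < cols
                            and binary[i, j] == 0 and labels[i, j] == 0):
                        labels[i, j] = next_label
                        stack += [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return labels, next_label
```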

Even if we work with images of the same size, the size of the character within each image differs: we can have differently sized, differently oriented, and differently scaled characters. In addition, most feature extraction methods, such as KLT, require uniformly sized samples. Thus, to avoid these effects and to improve the recognition rate, a size normalization process should be applied to the individual characters. However, there are also rotation-invariant and scale-invariant feature extraction methods.

In addition to the above four steps, skeletonization can be required for some feature extraction methods such as structural features that are related to the end-points. Character thinning will ideally reduce the character representation to a single pixel width, while preserving all other relevant features. A good thinning algorithm does not remove end points, does not break connectedness, and does not cause excessive erosion of the region.

Although the preprocessing steps mentioned above are the main steps in general, some recognition systems do not need all of them. Some systems require gray level images, so they do not use binarization. Also, some character recognition systems do not need uniformly sized characters, so they do not require normalization.

2.4 Feature Extraction for Character Recognition

Feature extraction methods play an important role in pattern recognition. In general, all classifiers can classify raw data directly. However, the dimensionality of raw data is huge, e.g. 1024 even for a 32×32 image. Analysis of this whole space takes a lot of time and decreases performance, since it is easily affected by noise. In order to represent the same sample information with fewer dimensions, we need to extract appropriate features from it. The main purpose is to extract the information that is most relevant for classification, minimizing the within-class pattern variability while enhancing the between-class pattern variability. During the feature extraction process the dimensionality of the data is reduced. This is almost always necessary, due to the technical limits of memory and computation time. A good feature extraction scheme should maintain and enhance those features of the input data which separate distinct pattern classes from each other. At the same time, the system should be immune to variations produced both by the humans using it and by the technical devices used in the data acquisition stage. We note that a good extraction method for characters may not work as well in other applications, since the type of data differs.

There are many feature extraction methods for character recognition [14]. For the purposes of this thesis, these methods are divided into two groups: one group consists of extraction algorithms depending on the statistical properties of patterns, such as principal component analysis based on the Karhunen-Loève transform; the second group is based on structural features and transformations, such as the Angular Radial Transform, Zernike moments, and structural features. Since feature improvement is not our focus, only brief background on the methods used in this study will be given.

2.4.1 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) derives a transformation such that the feature space represents the total variation of the data as well as possible [15]. PCA is a well-known technique in which second order statistics are used for evaluating the importance of a given variable. In addition, by providing a quantitative measure of the error introduced by omitting any given components of the transformed vector, the number and identity of such components can be determined, and the dimensionality of the input vector may be reduced with minimum loss.

Consider an n-dimensional pattern vector X that is to be mapped onto an m-dimensional feature vector Y, where m < n. PCA finds a linear transformation T such that

$$Y = T \cdot X \qquad (2.2)$$

In 1901, Pearson introduced this transformation in a biological context to recast linear regression analysis into a new form. Hotelling [1933] used it in the context of psychometrics to transform discrete variables into uncorrelated coefficients. In 1947, it appeared independently in the setting of probability theory [Karhunen, 1947] for transforming continuous data, and Loève [1948, 1963] subsequently generalized it [14]. Hence, it is also known as the Karhunen-Loève transform (KLT). Koschman [1954] showed that the KLT minimizes the mean square truncation error.

As stated above, principal components are linear combinations of random variables having special properties with respect to variance. The first principal component is the normalized linear combination with maximum variance; the second has the next largest variance, and so on. The principal components are thus ranked from most important to least important.

The Karhunen-Loève transformation is an orthonormal transformation of an n-dimensional vector Z to an m-dimensional vector Y achieving this property, and it can be viewed as applying the discrete K-L expansion to feature selection. If we take

$$\Phi = [\,\Phi_1\ \Phi_2\ \cdots\ \Phi_m\,], \qquad m < n \qquad (2.3)$$

as the transformation matrix, then the principal components are the coefficients of the K-L expansion; that is, for any pattern $x_i$ of class $L_i$,

$$pc_i = \Phi^T x_i \qquad (2.4)$$

Since $\Phi^T$ is an m×n matrix and $x_i$ is an n-dimensional vector, we see that if m < n, the $pc_i$ are principal component vectors of lower dimensionality.

It can be shown that the optimum properties of the K-L Transform are satisfied if the columns of the transformation matrix are chosen as the m normalized eigenvectors, corresponding to the largest eigenvalues of the correlation matrix.

An important issue is how many principal components are sufficient for a given situation. In some cases, a threshold value is used and the principal components with associated eigenvalues less than the threshold are dropped. Sometimes the number of components is fixed a priori, as in situations requiring visualization of the feature space, where a two- or three-dimensional limit is imposed.
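
A minimal sketch of extracting KLT features as described above, with the transformation matrix built from the eigenvectors of the training-set covariance matrix; treating each character image as a flattened vector is an assumption of this example.

```python
import numpy as np

def fit_klt(train, m):
    """train: (N, n) matrix of flattened images; keep the m largest components."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)          # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]             # m largest eigenvalues
    return mean, eigvecs[:, order]                    # mean and Phi (n x m)

def klt_features(x, mean, phi):
    """Principal components pc = Phi^T (x - mean), as in Eq. (2.4)."""
    return phi.T @ (x - mean)
```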

2.4.2 Zernike Moments

Orthogonal moments are based on the theory of orthogonal polynomials and were introduced to overcome the main problem of regular moments, namely that recovering the image from these moments is quite difficult and computationally expensive. Zernike moments are one kind of orthogonal moment. The advantage of Zernike moments over others is that they are rotationally invariant and easy to reconstruct from, so the error between the original image and the reconstructed image can be found easily. This property is useful for deciding the feature dimension: the image representation ability of the moments of each order can be evaluated by comparing the reconstructed image with the original one. These moments are only rotationally invariant, so the image should be scaled and normalized before the moments are calculated. In addition, Teh and Chin [16] showed that orthogonal moments, including Zernike moments, are better than other types of moments in terms of information redundancy and image representation. A brief description of the calculation of Zernike moments follows.

In [17], Zernike introduced a set of complex polynomials which form a complete orthogonal set over the interior of the unit circle, x² + y² ≤ 1. Let the set of these polynomials be denoted {V_nm(x, y)}. The form of these polynomials is:

$$V_{nm}(x,y) = V_{nm}(\rho,\theta) = R_{nm}(\rho)\,\exp(jm\theta) \qquad (2.5)$$

where

n: positive integer or zero;

m: positive or negative integer, subject to the constraints that n − |m| is even and |m| ≤ n;

ρ: length of the vector from the origin to the pixel (x, y);

θ: angle between that vector and the x axis in the counterclockwise direction;

R_nm: radial polynomial, defined as

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s\, \frac{(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s} \qquad (2.7)$$

These polynomials are orthogonal and satisfy

$$\iint_{x^2+y^2\le 1} V_{nm}^{*}(x,y)\,V_{pq}(x,y)\,dx\,dy = \frac{\pi}{n+1}\,\delta_{np}\,\delta_{mq}, \qquad \delta_{ab} = \begin{cases} 1 & a = b \\ 0 & \text{otherwise} \end{cases} \qquad (2.8)$$

Zernike moments are the projection of the image function onto these orthogonal basis functions. The Zernike moment of order n with repetition m for a continuous image function f (x, y) that vanishes outside the unit circle is


$$A_{nm} = \frac{n+1}{\pi} \iint_{x^2+y^2\le 1} f(x,y)\, V_{nm}^{*}(x,y)\, dx\, dy \qquad (2.9)$$

The integrals are replaced by summations for a digital image, i.e.

$$A_{nm} = \frac{n+1}{\pi} \sum_{x}\sum_{y} f(x,y)\, V_{nm}^{*}(x,y) \qquad (2.10)$$

In computing the moments, the center of the image is taken as the origin and the pixels are projected onto a unit circle with this center. Pixels outside the circle are not used in the computations. An important issue is the decision on the number of features to be used. In order to find the required number of features, we reconstruct the image for different orders n and choose the n for which the error between the reconstructed image and the original one is minimum. The reconstruction equation is as follows:

$$f(x,y) \approx \sum_{n}\left[\frac{C_{n0}}{2}\,R_{n0}(\rho) + \sum_{m>0}\bigl(C_{nm}\cos m\theta + S_{nm}\sin m\theta\bigr)\,R_{nm}(\rho)\right] \qquad (2.11)$$

where

$$C_{nm} = 2\,\mathrm{Re}\{A_{nm}\} = \frac{2(n+1)}{\pi}\sum_{x}\sum_{y} f(x,y)\,R_{nm}(\rho)\cos m\theta \qquad (2.11a)$$

$$S_{nm} = -2\,\mathrm{Im}\{A_{nm}\} = -\frac{2(n+1)}{\pi}\sum_{x}\sum_{y} f(x,y)\,R_{nm}(\rho)\sin m\theta \qquad (2.11b)$$

We will not show the proof of this equation but it can be found in [18].
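
The following is a direct, unoptimized sketch of Eqs. (2.7) and (2.10) for a square image mapped onto the unit disk; the particular pixel-to-disk mapping used here is one common choice and is an assumption of the example.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho) of Eq. (2.7); needs n - |m| even."""
    m = abs(m)
    return sum((-1) ** s * factorial(n - s)
               / (factorial(s) * factorial((n + m) // 2 - s)
                  * factorial((n - m) // 2 - s)) * rho ** (n - 2 * s)
               for s in range((n - m) // 2 + 1))

def zernike_moment(img, n, m):
    """A_nm of Eq. (2.10); pixels outside the unit circle are ignored."""
    size = img.shape[0]
    ys, xs = np.mgrid[0:size, 0:size]
    x = (2 * xs - size + 1) / (size - 1)      # map the pixel grid onto [-1, 1]
    y = (2 * ys - size + 1) / (size - 1)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = rho <= 1.0
    v_conj = radial_poly(n, m, rho) * np.exp(-1j * m * theta)   # V_nm*
    return (n + 1) / np.pi * np.sum(img[inside] * v_conj[inside])
```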

2.4.3 Angular Radial Transform

Zernike basis functions do not describe the radial and angular directional complexities equally; they tend to emphasize the radial complexity, because the Zernike basis functions are defined only when |m| ≤ n. This is an undesirable property of Zernike moments as a shape descriptor, since angular directional complexity plays a more important role in human perception than radial directional complexity. A new shape descriptor was therefore derived, motivated by making a new basis function that preserves two desirable properties of the Zernike basis functions, their orthogonality and the rotation invariance of their magnitude, with an additional shape analysis capability that takes into account the complexities in both the radial and angular directions [19]. A set of new basis functions was derived in polar coordinates by combining two separable orthonormal basis functions in the radial and angular directions, so that their magnitude would be invariant to any rotation. This new transform is called the Angular Radial Transform (ART). ART expresses the pixel distribution within a 2-D object region, and it can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes. The Angular Radial Transform is a 2-D complex transform defined on the unit disk in polar coordinates,

$$F_{nm} = \bigl\langle V_{nm}(\rho,\theta),\, f(\rho,\theta)\bigr\rangle = \int_{0}^{2\pi}\!\!\int_{0}^{1} V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho\, d\rho\, d\theta \qquad (2.12)$$

Here, $F_{nm}$ is an ART coefficient of order n and m, $f(\rho,\theta)$ is the image function in polar coordinates, and $V_{nm}(\rho,\theta)$ is the ART basis function [16]. The ART basis functions are separable along the angular and radial directions,

$$V_{nm}(\rho,\theta) = A_{m}(\theta)\, R_{n}(\rho) \qquad (2.13)$$

The angular and radial basis functions are defined as follows:

$$A_{m}(\theta) = \frac{1}{2\pi}\exp(jm\theta) \qquad (2.13a)$$

$$R_{n}(\rho) = \begin{cases} 1 & n = 0 \\ 2\cos(\pi n \rho) & n \ne 0 \end{cases} \qquad (2.13b)$$

The imaginary parts have shapes similar to the corresponding real parts but with different phases.
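
A minimal sketch of computing one ART coefficient per Eqs. (2.12)-(2.13b), approximating the polar integral by a sum over the Cartesian pixel grid (since dx dy = ρ dρ dθ); the grid-to-disk mapping is the same assumption as in the Zernike sketch above.

```python
import numpy as np

def art_coefficient(img, n, m):
    """ART coefficient F_nm of Eq. (2.12) for a square image."""
    size = img.shape[0]
    ys, xs = np.mgrid[0:size, 0:size]
    x = (2 * xs - size + 1) / (size - 1)
    y = (2 * ys - size + 1) / (size - 1)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = rho <= 1.0
    radial = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
    angular = np.exp(1j * m * theta) / (2 * np.pi)      # Eq. (2.13a)
    basis = angular * radial                            # Eq. (2.13)
    pixel_area = (2.0 / (size - 1)) ** 2                # dx dy element
    return np.sum(np.conj(basis[inside]) * img[inside]) * pixel_area
```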

2.4.4 Structural Features

In statistically based feature extraction methods such as KLT, the features are extracted by using the relations between different samples of the same class; they do not consider the structural properties of the pattern. Structural features, on the other hand, do consider these properties, but they do not consider within-class relationships. Geometric features include geometric moments, such as Hu moments, Zernike moments and Fourier moments, as well as features related to end-points, features based on background regions, etc.

Moment-based feature extraction has received considerable attention in pattern recognition, and especially in handwritten character recognition, because of its invariance to some transformations. Moment-based feature extraction involves determining the integral, over the image of a character, of the product of the binary image and each of a set of polynomials defined over a 2-D region. After Hu proposed using linearly independent moments of images, moment invariants became commonly used features for characters. Later, orthogonal moments such as Zernike, pseudo-Zernike, Fourier, Fourier-Mellin, and Legendre moments were proposed for extraction in order to overcome the defects of regular moments.

In addition to moment-based features, end-points can be used as structural features. End-points can be found after skeletonization of the character image. The number of end-points, and the region where these end-points are found, are useful features for character recognition.

Moreover, background-region information is a good resource for the structural properties of the character. Analyzing the size, shape, and position of the background regions of the character gives important information about it. A good example is searching for holes in the character; for example, ‘B’ has two holes, while ‘O’ has one in the middle. Openings of the character, the number of openings, and the directions of the openings can also be used as features. For example, ‘Z’ has two openings, one at the west and the other at the east, whereas ‘A’ has a hole and an opening at the south. Moreover, the number of blobs in a character may be used for Turkish character recognition, since there are characters in the Turkish alphabet with more than one blob, such as ‘Ü’, ‘İ’, ‘Ö’ [11]. A sketch of a hole-counting feature is given below.
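
As an illustration of background-region analysis, the following is a minimal hole-counting sketch. It reuses the blob_coloring routine sketched in the preprocessing section, applied to the inverted image so that background regions are labeled; the 0 = black ink convention is again an assumption.

```python
import numpy as np

def count_holes(binary):
    """Count enclosed background regions, e.g. 'B' -> 2, 'O' -> 1, 'Z' -> 0."""
    labels, n_regions = blob_coloring(1 - binary)   # label background regions
    border = np.concatenate([labels[0, :], labels[-1, :],
                             labels[:, 0], labels[:, -1]])
    outer = {v for v in border if v > 0}            # regions touching the border
    return n_regions - len(outer)                   # the rest are holes
```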

2.5 Classification for Character Recognition

After preprocessing the raw data and extracting features from the preprocessed data, the next step is assigning an identity to the corresponding data. Classification is the process of assigning an input pattern P, with extracted feature X, to one of the specified classes. In some applications the classes are not pre-specified, so an unsupervised classification, called clustering, is used, but this is outside the scope of our study. We focus on supervised classification, and all methods used in this study are supervised methods. In this type of classification algorithm, a training process is required in order to give the classifier specific information about the data. This information represents the properties of each class, and the classifier is therefore trained to predict the identity of the pattern to be recognized.

In supervised classification, the number and identity of the classes are known. Let $x \in R^{n}$ be a feature vector and {1, 2, ..., c} be the label set of c classes. A classifier C assigns a class label $\mu(x)$ to the pattern x, where $\mu(x) = \{\mu_{1}(x), \mu_{2}(x), \dots, \mu_{c}(x)\} \in [0,1]^{c}$ is a c-dimensional vector. The classifier is thus a mapping from the pattern space to the space of class labels:

$$C: R^{n} \rightarrow [0,1]^{c} \qquad (2.14)$$

The components of the class label, $\mu_{1}(x), \mu_{2}(x), \dots, \mu_{c}(x)$, can be posterior probabilities of the classes, belief values, typicalness, possibility, or certainty. Three different types of classifiers are defined according to the class label by Bezdek et al. [20]:

1. Crisp classifier: $\mu_{i}(x) \in \{0, 1\}$, $\sum_{i=1}^{c}\mu_{i}(x) = 1$ for all x. Namely, only one component of the class label, which is the classification result, is 1, and the others are zero. K-nearest neighbor is an example of this kind.

2. Fuzzy classifier: $\mu_{i}(x) \in [0, 1]$, $\sum_{i=1}^{c}\mu_{i}(x) = 1$ for all x. The components take values in the range [0, 1], and their sum is 1. The classification result is the class with the largest component of the class label. Probabilistic classifiers are in this group.

3. Possibilistic classifier: $\mu_{i}(x) \in [0, 1]$, $\sum_{i=1}^{c}\mu_{i}(x) > 0$ for all x. In this group the components do not have probabilistic values, so their sum does not need to be 1. The classification result is again the class with the largest component of the class label. Neural networks are of this kind.

The second and third groups can easily be converted to the first by assigning

$$\mu_{i}(x) = 1 \quad\text{where}\quad \mu_{i}(x) = \max\{\mu_{1}(x), \mu_{2}(x), \dots, \mu_{c}(x)\} \qquad (2.15a)$$

$$\mu_{j}(x) = 0 \quad\text{for}\quad j \in \{1, 2, \dots, i-1, i+1, \dots, c\} \qquad (2.15b)$$

Another grouping of pattern classifiers can be done according to their algorithms. They can be divided into four main groups: template matching, statistical pattern recognition, syntactic or structural matching, and Artificial Neural Networks (ANN) [21].

Template matching is the easiest way to classify a pattern. No additional information except for the feature itself is necessary. Features are called templates or prototypes of the patterns to be recognized. The pattern to be recognized is matched against the stored patterns and is recognized by a similarity measure. The most similar class is predicted for this pattern. K-nearest neighbor is a type of template matching but differs by the fact that after finding the k similar patterns, it identifies the pattern as the class having the biggest vote.

Statistical pattern recognition techniques use the results of statistical detection and estimation theory to obtain a mapping from the representation space to the interpretation space. They rely on the determination of an appropriate combination of feature values that provides measures for discrimination of classes [9]. Each pattern is represented in terms of d-features, and is viewed as a point in a d-dimensional space. The goal is to choose those features that allow pattern vectors belonging to different categories to occupy compact regions in the d-dimensional feature space. The effectiveness of the feature set is determined by how well patterns of different classes can be separated. Bayes quadratic classifier, nearest mean classifier, and Parzen classifiers are of this type.

In syntactic pattern recognition, a pattern is viewed as being composed of simple sub-patterns, which are themselves built from simpler sub-patterns. The simplest sub-patterns to be recognized are called primitives and the given complex pattern is represented in terms of relationship between these primitives.

Artificial Neural Networks are another important class of pattern classifiers. An artificial neural network is an information-processing paradigm inspired by the densely interconnected, parallel structure with which the human brain processes information. Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of the ANN paradigm is the novel structure of the information processing system. The following three characteristics of an ANN have played an important role in a wide variety of applications (not only pattern recognition systems): adaptiveness, nonlinear processing and parallel processing [22].

Now, information on the classifiers used in this thesis will be given.

2.5.1 K-Nearest Neighbor

K-nearest neighbor is a simple classifier based on finding the k sample patterns in the training database that have the minimum distances to the pattern to be recognized. The decision is then made by choosing the class with the most votes among the k samples. K-NN has a low error rate despite its simplicity. The distance metric can be selected as any of the known metrics that suits the distribution of the data. The training set should be big enough to cover all possible samples. On the other hand, increasing the number of training samples makes the classifier inefficient, since it takes more time to compute all the distances. There has been research on selecting a representative sample space from the available database by eliminating unnecessary samples [23], or by elimination in some other sense [24]. In character recognition applications, it has been found that the error rate is minimum when k = 1 [25].
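
A minimal sketch of the k-NN decision rule just described; the Euclidean distance is used here as the example metric.

```python
import numpy as np

def knn_classify(x, train_x, train_y, k=1):
    """Assign x to the class with the most votes among its k nearest samples."""
    dists = np.linalg.norm(train_x - x, axis=1)    # distances to all samples
    nearest = train_y[np.argsort(dists)[:k]]       # labels of the k nearest
    return np.bincount(nearest).argmax()           # majority vote
```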

2.5.2 Nearest Mean Classifier

Nearest mean classifier depends only on the first order statistics of data which is the mean of each class. The goal of this algorithm is to find the class which has the nearest mean to the pattern to be classified. The mean of each class is calculated by using the samples in the training space. First order statistics do not represent pattern space explicitly; therefore the recognition rates are not high.

2.5.3 Bayes Quadratic Classifier

The Bayes decision rule is a discrimination rule based on knowledge of the probability density function of each class; the objective is to minimize the classification error. We have a measurement vector x, and our aim is to assign x to one of the known classes. A decision rule based on probabilities is to assign x to class $w_j$ if the probability of class $w_j$ given the vector x, i.e. $p(w_j|x)$, is greatest among all classes $w_1, \dots, w_c$. Namely, assign x to class $w_j$ if

$$p(w_j|x) = \max\{\,p(w_1|x),\, p(w_2|x),\, \dots,\, p(w_c|x)\,\} \qquad (2.16)$$


The measurement space is divided into c regions by this decision rule. In addition, the a posteriori probabilities $p(w_j|x)$ can be expressed in terms of the a priori probabilities $p(w_j)$ and the class-conditional density functions $p(x|w_j)$ using Bayes theorem as

$$p(w_j|x) = \frac{p(x|w_j)\,p(w_j)}{p(x)} \qquad (2.17)$$

and by using this equation, the decision rule may be written as

$$p(x|w_j)\,p(w_j) = \max\{\,p(x|w_1)\,p(w_1),\, \dots,\, p(x|w_c)\,p(w_c)\,\} \qquad (2.18)$$

and it minimizes the classification error [26]. If we assume equal a priori probabilities, equation 2.18 reduces to

$$p(x|w_j) = \max\{\,p(x|w_1),\, \dots,\, p(x|w_c)\,\} \qquad (2.19)$$

It is proved in [26] that we get the minimum error if we identify the distributions of the classes accurately. The Bayes decision rule is applied in character recognition by assuming normal distributions for all classes; the objective is then to find the appropriate mean vector and covariance matrix of each class. They are estimated from the training samples $x_i$ of each class, where each $x_i$ is an n-dimensional vector, as follows:

$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_i \qquad (2.20)$$

where $N_j$ is the number of samples $x_i$ belonging to class $w_j$, and

$$\Gamma_j = \mathrm{Cov}_j = \frac{1}{N_j - 1}\sum_{i=1}^{N_j}(x_i - \mu_j)(x_i - \mu_j)^T \qquad (2.21)$$

Then, the class-conditional probability is

$$p(x|w_j) = \frac{1}{(2\pi)^{n/2}\,|\Gamma_j|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_j)^T\,\Gamma_j^{-1}\,(x-\mu_j)\right) \qquad (2.22)$$


In practice, the logarithm of this probability is used, since it is easier to compute. The discriminant function is then given in equation 2.23:

$$g_j(x) = -\frac{1}{2}(x-\mu_j)^T\,\Gamma_j^{-1}\,(x-\mu_j) - \frac{1}{2}\ln|\Gamma_j| + \ln(P_j) \qquad (2.23)$$

where $\mu_j$ is the mean of the samples in class j and $\Gamma_j$ is the covariance matrix of the samples in class j. $P_j$ is the a priori probability of class j, which is assumed to be the same for all classes, so it can be ignored.

The problem in Bayes classification is that it is not easy to estimate the density functions; the normal distribution is the one used in general. In order to find $p(x|w_j)$, we need to estimate the mean vector and covariance matrix of each class. The number of samples required for estimating the mean vector depends on the dimension of the measurement vector and on the number of classes; estimating the covariance matrices is much harder. As the number of samples increases, we obtain better parameter estimates and a higher recognition rate [26].
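
A minimal sketch of the Bayes quadratic discriminant of Eq. (2.23) under the equal-priors assumption made above (so the ln P_j term is dropped); the explicit inverse and determinant are the naive formulation.

```python
import numpy as np

def bayes_quadratic(x, means, covs):
    """Return the class j maximizing g_j(x) of Eq. (2.23), equal priors."""
    scores = []
    for mu, gamma in zip(means, covs):
        diff = x - mu
        g = (-0.5 * diff @ np.linalg.inv(gamma) @ diff
             - 0.5 * np.log(np.linalg.det(gamma)))   # ln P_j dropped
        scores.append(g)
    return int(np.argmax(scores))
```

In practice, np.linalg.slogdet and a Cholesky solve would be numerically safer than the determinant and explicit inverse used in this sketch.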

2.5.4 Parzen Classifier

The Parzen window is a non-parametric method for estimating the density f(x) and the probability P(x) from a training set. Consider the estimation of a density function f(x) from a set of training data. Let Ω be a region in the feature space with volume V, and let P be the probability that x falls in this region:

$$P(x \in \Omega) \approx f(x)\,V \qquad (2.24)$$

where $x \in \Omega$. Let k be the number of samples that fall into the region Ω. Then an estimate of P is $\hat{P} = k/n$, where n is the number of samples. Combining the two equations above, we get the estimate

$$\hat{f}(x) = \frac{k}{nV} \qquad (2.25)$$

As n goes to infinity, $\hat{P}$ converges (in probability) to P, and $\hat{f}(x)$ converges to the average value of f(x) over the region Ω. In order to get an estimate at a given value of x, we must let the volume of the region Ω go to zero in such a way that x is always contained in Ω.


Let $\varphi(u)$ be a probability function, i.e. $\varphi(u) \ge 0$ and it integrates to 1. Define $V_n = h_n^{d}$, where $h_n$ is the parameter representing the width of the window and d is the dimension. Then the Parzen window is

$$\delta_n(x) = \frac{1}{V_n}\,\varphi\!\left(\frac{x}{h_n}\right) \qquad (2.26)$$

and the estimated probability density of x is given as

$$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\delta_n(x - x_i) = \frac{1}{nV_n}\sum_{i=1}^{n}\varphi\!\left(\frac{x - x_i}{h_n}\right) \qquad (2.27)$$

The parameter $h_n$ affects both the magnitude and the width of $\delta_n$. If $h_n$ is large, $\delta_n$ is broad and has small amplitude; if $h_n$ is small, $\delta_n$ is narrow and approaches a Dirac delta function.

In order to design the Parzen classifier, we use the method of Hummels and Fukunaga [27]. This method uses the following Gaussian kernel:

$$\varphi(x) = \frac{1}{(2\pi)^{n/2}\, h^{n}\, |\Gamma|^{1/2}}\; e^{-\frac{d^{2}(x)}{2h^{2}}} \qquad (2.28)$$

where d is the Mahalanobis distance. The class-conditional probability density at a point is estimated by

$$p(x|w_i) = \frac{1}{n}\sum_{k=1}^{n}\varphi(x - x_k) \qquad (2.29)$$

Hummels and Fukunaga use a threshold, t, to correct for the biases. They proposed four different options for these biases, based on assumptions about the distribution type. The one used in this thesis assumes the distributions to be nearly Gaussian. The bias for this case is

$$t_i = \frac{1}{2}\,\frac{h^{2}}{1+h^{2}}\,\ln\bigl|(1+h^{2})\,\Gamma\bigr| + \ln P_i \qquad (2.30)$$

The second term of the bias can be dropped, since we assume equal probabilities for the classes. The discriminant function becomes

$$g_i(x) = -\ln p(x|w_i) - t \qquad (2.31)$$


The class with the minimum discriminant value, i.e. the maximum likelihood, is the classification result.
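
A minimal sketch of the Parzen classifier of Eqs. (2.28)-(2.31) with equal priors (so the bias t is dropped, as above); a shared covariance Γ and window width h are passed in as parameters of the example.

```python
import numpy as np

def parzen_neg_log_likelihood(x, samples, h, gamma):
    """-ln p(x|w_i) from the Gaussian kernel of Eqs. (2.28)-(2.29)."""
    n = x.shape[0]
    inv = np.linalg.inv(gamma)
    norm = (2 * np.pi) ** (n / 2) * h ** n * np.sqrt(np.linalg.det(gamma))
    diff = samples - x
    d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis
    return -np.log(np.mean(np.exp(-d2 / (2 * h ** 2))) / norm)

def parzen_classify(x, class_samples, h, gamma):
    """Pick the class with the smallest -ln p(x|w_i)."""
    return int(np.argmin([parzen_neg_log_likelihood(x, s, h, gamma)
                          for s in class_samples]))
```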

2.5.5 Size Dependent Negative Log-Likelihood

As stated for the Bayes quadratic classifier, although the Bayes discriminant function yields the minimum error rate, it is hard to estimate the covariance matrix accurately. It is assumed that a sufficient number of samples for an accurate estimation is present, but this is not the case in practice. Thus we need a metric that uses the available information efficiently. In order to handle this problem, Hwang introduced three belongingness terms, namely likelihood, Mahalanobis distance, and Euclidean distance, together in one metric [28]. The Mahalanobis distance requires fewer samples than the class covariance matrices, since it uses the average of the covariance matrices of the classes. The Euclidean distance, in turn, uses only the variance of the classes; it thus depends on a single parameter and requires fewer samples than the Mahalanobis distance. If we negate the discriminant function given in Eq. (2.23), we get the negative log-likelihood,

$$G_j(x) = -g_j(x) = \frac{1}{2}(x-\mu_j)^T\,\Gamma_j^{-1}\,(x-\mu_j) + \frac{1}{2}\ln|\Gamma_j| \qquad (2.32)$$

In this equation, if $\Gamma_j$ is replaced with the within-class scatter matrix $S_w$, this is the Mahalanobis metric; if $\Gamma_j$ is replaced with a diagonal matrix of variances, $\sigma^{2}I$, this is the Euclidean metric. If $\Gamma_j$ is the class covariance matrix, it is Gaussian, or can be called Bayesian.

The Euclidean NLL treats all dimensions in the same way, although some dimensions are more important than others. The Mahalanobis NLL, on the other hand, rotates the basis of the sample subspace so that the correlation among the dimensions vanishes. These two NLLs have linear decision boundaries. The Gaussian NLL treats each class individually and has a quadratic decision boundary.

Hwang used the combined metric in the nodes of a decision tree, where each node contains q clusters [28]. However, this metric can be used in general pattern recognition applications, because in practice we do not have enough samples; this is especially true in character recognition, which has many classes. With regard to applications, it can be seen that using this metric increases the recognition rate of the Bayes quadratic classifier, and it is also the best among the classifiers used in this study (see Results and Discussion).


The aim is to combine these three metrics into one. But what should the contribution of each metric be? Clearly, the coefficients of the metrics should depend on the number of samples of each class, and as the number of samples increases, more weight should be put on the covariance metric. The weight of the Euclidean metric should be found first, since the other two depend on it. Let n be the number of samples and q the number of clusters (or classes). $\sigma^{2}I$ has n − 1 independent elements, since the variance cannot be estimated from one sample; the weight of the Euclidean NLL is then

$$b_e = \min\{\,n-1,\; n_s\,\} \qquad (2.33)$$

where $n_s$ represents the switch point for the next metric. We should decide how large $n_s$ will be. Consider a series of independent random variables from a distribution with variance σ²; the sample mean of n such variables has variance σ²/(n−1). We can choose a switch confidence value ρ with 1/(n−1) = ρ; in this case, the estimate has a 50 percent weight. Therefore, n = 1 + (1/ρ). In particular, let ρ = 0.1, which is to say that trust in the estimation is 50 percent when the expected variance of the estimate has been reduced to 10 percent of that of a single variable. This gives 11 for the value of $n_s$.

In calculating the within-class scatter matrix, we have n − q independent vectors, since every class uses one for estimating its own mean; thus the number of independent elements is (n−q)(q−1). There are q(q−1)/2 estimated elements in $S_w$, so the weight of the Mahalanobis NLL is

$$b_m = \min\left\{\max\!\left(\frac{(n-q)(q-1)}{q(q-1)/2} - n_s,\; 0\right),\; n_s\right\} = \min\left\{\max\!\left(\frac{2(n-q)}{q} - n_s,\; 0\right),\; n_s\right\} \qquad (2.34)$$

For the weight of the Gaussian NLL, we should consider the number of samples in the class that has the minimum number of samples,

$$b_g = \min_{1\le i\le q}\left\{\frac{2(n_i-1)}{q}\right\} \qquad (2.35)$$

Instead of this coefficient, the average over the classes may be used:

$$b_g = \frac{1}{q}\sum_{i=1}^{q}\frac{2(n_i-1)}{q} = \frac{2(n-q)}{q^{2}} \qquad (2.36)$$


Thus the combined metric is formed using the normalized values of these three weights, i.e. dividing each weight by the sum of the three:

$$W_i = w_e\,\sigma^{2}I + w_m\,S_w + w_g\,\Gamma_i \qquad (2.37)$$

and by using this metric, the negative log-likelihood is

$$G_j(x) = \frac{1}{2}(x-\mu_j)^T\,W_j^{-1}\,(x-\mu_j) + \frac{1}{2}\ln|W_j| \qquad (2.38)$$
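
A minimal sketch of assembling the size-dependent matrix $W_i$ of Eq. (2.37); the weight formulas follow the reconstruction of Eqs. (2.33)-(2.36) above and should be treated as an assumption of this example.

```python
import numpy as np

def sdnll_matrix(class_samples, i, n_s=11):
    """W_i of Eq. (2.37) for class i; class_samples is a list of (N_j, d) arrays."""
    q = len(class_samples)
    counts = [len(s) for s in class_samples]
    n = sum(counts)
    all_x = np.vstack(class_samples)
    d = all_x.shape[1]
    sigma2 = all_x.var()                                  # scalar variance
    s_w = sum((c - 1) * np.cov(s, rowvar=False)           # pooled within-class
              for c, s in zip(counts, class_samples)) / (n - q)
    gamma_i = np.cov(class_samples[i], rowvar=False)      # class covariance
    b_e = min(n - 1, n_s)                                 # Eq. (2.33)
    b_m = min(max(2 * (n - q) / q - n_s, 0), n_s)         # Eq. (2.34)
    b_g = min(2 * (c - 1) / q for c in counts)            # Eq. (2.35)
    w = np.array([b_e, b_m, b_g], dtype=float)
    w_e, w_m, w_g = w / w.sum()                           # normalize weights
    return w_e * sigma2 * np.eye(d) + w_m * s_w + w_g * gamma_i

# The class decision then minimizes G_j(x) of Eq. (2.38), computed with W_j
# in place of the covariance matrix.
```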

2.5.6 Artificial Neural Networks (Multi Layer Perceptrons)

An artificial neural network (ANN) is an information-processing system inspired by the way signals are processed by the densely interconnected, parallel structure of the human brain. Artificial neural networks are collections of mathematical models that emulate some observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The first known neural network model was proposed by McCulloch and Pitts [1943], and many improvements have been made since. Although developments in technology have enabled computers that perform mathematical operations much faster than humans, they are still not good enough to analyze complicated real-world data. Therefore, there is a great need for research on more neuron-like systems, and research in this area is quite popular.

An ANN is composed of a large number of highly interconnected processing elements that are analogous to neurons and are tied together with weighted connections that are analogous to synapses. The weights of the connections are calculated by a learning process. The important issues in designing an ANN are the number of processing elements to be used, the type of network architecture to be selected, and the learning process, i.e. the learning paradigm and learning rule, to be chosen. The first step in designing an ANN is to choose the learning paradigm. Different types of architectures are multi-layer perceptrons, recurrent networks, feed-forward networks, competitive networks, ART (adaptive resonance theory) networks, Hopfield networks, and Kohonen self-organizing maps. We will not go into detail on the different architectures; more information about these methods can be found in [29]. In this study, we use a feed-forward multi-layer perceptron with two layers. In a feed-forward network, no element has a connection from the next layer or within the same layer; each element takes as input the outputs of the previous layer and passes its output on to the next layer (Figure 2.3).

Input layer → Hidden layer (Layer 1) → Output layer (Layer 2)

Figure 2.3: Feed-forward 2-layer perceptron

The learning process can be supervised, unsupervised or hybrid. In the supervised paradigm, the network has information about the classes; in the unsupervised paradigm, it has none; in the hybrid case, it is partially supervised and partially unsupervised. The gradient-descent algorithm, which is based on error reduction, is the one commonly used in pattern recognition applications. Besides this learning algorithm, other learning algorithms are Boltzmann learning, learning vector quantization, linear discriminant analysis, Kohonen learning, radial basis functions, principal component analysis, ADALINE and MADALINE [30].

In this study we use the error back-propagation algorithm, which is based on gradient descent, in the learning process. Let N be the number of training patterns, let $d^{(i)}$ represent the desired output vector, which is the class identity of training sample i in pattern recognition, and let $y^{(i)}$ be the output vector of the network. The following algorithm minimizes the squared-error function, given in equation 2.39, over the training samples:

$$E = \frac{1}{2}\sum_{i=1}^{N}\left\| y^{(i)} - d^{(i)} \right\|^{2} \qquad (2.39)$$

The process starts by assigning random numbers to the weights. Then a randomly chosen sample pattern is propagated through the network. At the output layer, the delta function δ, the derivative of the error function, is calculated in order to update the weights. Let L be the number of layers in the network; the delta function for element i at the output layer, $\delta_i^{L}$, is

$$\delta_i^{L} = g'\!\left(\sum_{j} w_{ji}^{L}(n)\, y_j^{L-1}(n)\right)\left[\,d_i - y_i^{L}\,\right] \qquad (2.40)$$

with the sigmoid activation function

$$g(a) = \frac{1}{1+\exp(-a)} \qquad (2.41)$$

Then, in order to find the delta functions of the previous stages, this delta function is propagated back. For the preceding layers, the deltas are

$$\delta_i^{l} = g'\!\left(\sum_{j} w_{ji}^{l}(n)\, y_j^{l-1}(n)\right)\sum_{j} w_{ij}^{l+1}(n)\,\delta_j^{l+1}(n) \qquad (2.42)$$

The deltas are calculated for all layers. To update the weights, a learning coefficient η is needed, which defines the learning rate of the network. The new weights are computed as follows:

$$w_{ji}^{l} \leftarrow w_{ji}^{l} + \eta\,\delta_i^{l}\, y_j^{l-1} \qquad (2.43)$$

After updating the weights, the algorithm is run again over all training samples and the new error function is calculated. If the new error is less than a pre-specified threshold, these are the final values of the weights; otherwise the algorithm is repeated until the error is below the threshold level.
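
A minimal sketch of one back-propagation update for the two-layer perceptron, following Eqs. (2.40)-(2.43); the derivative of the sigmoid is written as g'(a) = g(a)(1 − g(a)), and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))               # Eq. (2.41)

def train_step(x, d, w1, w2, eta=0.1):
    """One weight update for a 2-layer perceptron on sample (x, d)."""
    y1 = sigmoid(w1 @ x)                          # hidden layer output
    y2 = sigmoid(w2 @ y1)                         # network output
    delta2 = y2 * (1 - y2) * (d - y2)             # output deltas, Eq. (2.40)
    delta1 = y1 * (1 - y1) * (w2.T @ delta2)      # hidden deltas, Eq. (2.42)
    w2 = w2 + eta * np.outer(delta2, y1)          # weight updates, Eq. (2.43)
    w1 = w1 + eta * np.outer(delta1, x)
    return w1, w2
```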


3. FEATURE COMBINATION

Even though pattern recognition systems have reached reliable performance levels, it is hard to obtain very low error rates with a single feature extraction method and a single classifier, especially for handwritten characters [2]. In addition, the sets of patterns misclassified by different independent classifiers do not fully overlap [1]. In order to achieve higher performance, researchers have focused on combining this complementary information from different classifiers.

There are mainly two approaches to using the complementary information of classifiers. The first is using hybrid and hierarchical classifier systems. The second is combining classifier decisions by using the relevant properties of the classifiers; we discuss this approach in detail in the Decision Fusion section. In this section, we discuss the background research for the first approach and propose a new method for character recognition based on the combination of different feature types.

3.1 Background Research

Hybrid and hierarchical classifier systems include hybrids of statistical and syntactical or structural methods, multilevel hierarchical classifiers, decision tree classifiers, and feature selection. Ross discussed two combination approaches for biometrics, combining feature extraction algorithms and combining classifier decisions [31].

Due to the complexity of handwritten characters, a single system cannot achieve low error rates [32]. Ping proposed a hybrid two-level classification system in which a three-layer MLP is used in the first stage and a decision tree in the second [33]. Krishna presented decision tree extraction from trained neural networks [34]. Heutte introduced the formation of features based on the structural and statistical properties of characters [2]. Cai integrated the information of structural and statistical features for handwritten numerals [35]. Alpaydin and Jacobs introduced the use of local experts with a decision tree [12, 13]. Duerr introduced a hybrid system using statistical and structural classifiers [32]. Suen introduced the concept of using experts in recognition [36]. Huang extended the concept to using multiple experts in handwritten character recognition [7].

In handwritten character recognition, much of the research effort has gone into selecting features. Feature selection is important for choosing experts, assigning local experts, constructing a decision tree, and combining structural and syntactic classifiers. Baird presented feature identification for hybrid systems [38], while Cao proposed multi-stage classifiers with multiple features [39, 40]. Chen used a hidden Markov model for recognition [41]. Heutte introduced a combined structural-statistical feature using seven different families of features [2]. Capar introduced appending structural features to the well-known feature extraction method KLT [11]. In this study, we propose a feature combination method based on combining KLT features with geometric features.

3.2 New Proposed Method

Feature extraction plays as important a role as the classification method in achieving good recognition performance. If the extracted features cannot represent the pattern accurately, it is impossible to design a classifier with very low error rates. In handwritten character recognition systems, a single type of feature extraction method cannot reach the best performance. Therefore, at least two different types of features, selected from structural, statistical, and syntactic features, should be used.

KLT is the most well-known feature extraction method based on the statistical properties of the pattern; detailed information on KLT is given in the previous section. Although KLT is a good feature extraction method and achieves good performance, especially with statistical classifiers, it does not exactly reflect the geometric and structural properties of the characters. On the other hand, Zernike moments and the Angular Radial Transform are two orthogonal moment generators that accurately represent the geometric properties of a pattern. In addition, structural features related to end-points, background regions, and moments can represent the structural properties of characters. Capar proposed combining KLT with structural features and obtained better performance for Turkish handwritten characters [11]. Instead of using structural features, we propose combining KLT features with geometric features, such as Zernike moments and ART, in order to obtain better performance.

The feature combination appends ART or Zernike moments to KLT after some processing, namely normalization and weighting. For example, a fused feature of KLT and ART will have a dimension of 100 (64 + 36), which is the sum of the dimensions of KLT and ART. Normalization is required especially for ART and Zernike features, since their values are widely spread. On the other hand, normalizing the KLT features is not strictly necessary, since they have small values.

We propose three different approaches for appending features. The first approach, A1, appends the geometric features after normalizing them to unit norm while keeping the KLT features unchanged. Let X_K be the KLT feature vector of a pattern and X_A its ART feature vector. X_A is normalized by the following formula,

x_{ANi} = \frac{x_{Ai}}{\sqrt{\sum_{i=1}^{d_A} x_{Ai}^{2}}} \qquad (3.44)

where x_{Ai} and x_{ANi} are the ith components of the ART and normalized ART feature vectors, respectively, and d_A is the dimension of ART. The fused feature is then obtained by appending X_K and X_{AN}: X_F = [X_K, X_{AN}].

The second approach, A2, appends the features after normalizing both the geometric and the KLT features to unit norm. With X_K and X_A as above, X_K is normalized by the following formula,

x_{KNi} = \frac{x_{Ki}}{\sqrt{\sum_{i=1}^{d_K} x_{Ki}^{2}}} \qquad (3.45)

where x_{Ki} and x_{KNi} are the ith components of the KLT and normalized KLT feature vectors, respectively, and d_K is the dimension of KLT. The fused feature is then obtained by appending X_{KN} and X_{AN}: X_F = [X_{KN}, X_{AN}].

Since the dimensions of the different features extracted with the different methods used in this study are not the same, a third approach, A3, is proposed: normalizing each feature vector to its own dimension. In this case, the features are normalized as follows:

x_{KNi} = d_K \cdot \frac{x_{Ki}}{\sqrt{\sum_{i=1}^{d_K} x_{Ki}^{2}}}, \qquad x_{ANi} = d_A \cdot \frac{x_{Ai}}{\sqrt{\sum_{i=1}^{d_A} x_{Ai}^{2}}} \qquad (3.46)
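A minimal sketch of the three appending approaches is given below. It assumes NumPy and illustrative 64-dimensional KLT and 36-dimensional ART vectors; the function and variable names are hypothetical, and the sketch only mirrors equations 3.44-3.46 rather than reproducing the experimental code.

```python
import numpy as np

def unit_norm(x):
    # Normalize a feature vector to unit Euclidean norm (Eqs. 3.44 / 3.45)
    return x / np.sqrt(np.sum(x ** 2))

def fuse_features(x_klt, x_art, approach="A1"):
    """Append geometric (ART) features to KLT features (sketch of A1-A3)."""
    if approach == "A1":
        # A1: normalize only the geometric features, keep KLT as-is
        return np.concatenate([x_klt, unit_norm(x_art)])
    if approach == "A2":
        # A2: normalize both feature vectors to unit norm
        return np.concatenate([unit_norm(x_klt), unit_norm(x_art)])
    if approach == "A3":
        # A3: normalize each feature vector to its own dimension (Eq. 3.46)
        return np.concatenate([len(x_klt) * unit_norm(x_klt),
                               len(x_art) * unit_norm(x_art)])
    raise ValueError("unknown approach")

# Example: a fused KLT+ART feature has dimension 64 + 36 = 100
x_f = fuse_features(np.random.rand(64), np.random.rand(36), "A1")
assert x_f.shape == (100,)
```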

The performance for Turkish handwritten characters is improved by appending geometric features to KLT. The best improvement in recognition performance is 2%, achieved by the first approach, i.e., appending the normalized geometric features to the unnormalized KLT features. Detailed results, together with the database used for the experiments, are given in the Results and Discussion section.


4. DECISION FUSION

Combining the decisions of classifiers in order to achieve higher performance and accuracy is an important topic in pattern recognition. The aim of fusion is to complement the misclassified patterns of the different recognition systems. In order to obtain good results from fusion, most fusion algorithms assume the systems to be independent. Classification systems with different training sets, different feature extraction methods, or different classifiers are assumed to be independent. Decision fusion combines the systems after each system has produced its own decision.

Combining classifiers consists of two main topics: classifier selection and classifier fusion. For classifier selection, methods such as local or global experts [12, 13], divide-and-conquer classifiers [42], and dynamic classifier selection [5] have been used in the literature. Ho analyzed the combination of multiple classifiers [44], and since then a great deal of research on fusion has been proposed for different applications in pattern recognition. Van Breukelen discussed combination schemes for handwritten digits [45]. In this thesis, classifier selection methods are not investigated.

Fusion of classifier decisions can be grouped into unsupervised and supervised fusion methods. In unsupervised fusion, no training step is required, whereas in the supervised case a training step is a must. Minimum, maximum, average, product, voting, and Borda count are unsupervised methods, while behavioral-knowledge space (BKS), Bayes fusion, Dempster-Shafer (DS), decision templates (DT), and weighted decision templates are supervised methods; a sketch of the simple unsupervised rules is given below.
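To make the unsupervised rules concrete, here is a minimal sketch. It assumes the soft classifier outputs are stacked into a NumPy array of shape (number of classifiers, number of classes); this convention and the helper name fuse are illustrative, not the thesis's notation.

```python
import numpy as np

def fuse(mu, rule="average"):
    """Unsupervised fusion of soft labels.

    mu: array of shape (n_classifiers, n_classes), one label vector per row.
    Returns the index of the winning class.
    """
    if rule == "voting":
        # Majority voting on the crisp decision of each classifier
        votes = np.bincount(mu.argmax(axis=1), minlength=mu.shape[1])
        return int(votes.argmax())
    scores = {
        "minimum": mu.min(axis=0),
        "maximum": mu.max(axis=0),
        "average": mu.mean(axis=0),
        "product": mu.prod(axis=0),
    }[rule]
    return int(scores.argmax())

# Example: two classifiers, three classes
mu = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3]])
print(fuse(mu, "product"))  # product scores [0.12, 0.15, 0.03] -> prints 1
```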

Fusion methods depend on the type of output the classifiers produce. Thus, one should consider the classifier outputs before combining. As stated in the previous section, a classifier is a mapping from the feature vector space to the class labels. The class labels can be viewed as estimates of posterior probabilities, belief values, certainty, possibility, typicalness, etc. Let \mu(x) = \{\mu_1(x), \mu_2(x), \ldots, \mu_c(x)\}, with \mu_i(x) \in [0,1], denote these labels for the c classes. Classifiers can be grouped into three types according to these labels. Some classifiers, such as k-nn and nearest mean, produce crisp values, where \mu_i(x) = 1 if i is the decision and \mu_i(x) = 0 for the other classes. Some classifiers, such as Bayes and Parzen, produce posterior probabilities for each class, where \mu_i(x) \in [0,1] and \sum_{i=1}^{c} \mu_i(x) = 1. In the third type, the labels are possibility values, as in neural networks, where \mu_i(x) \in [0,1] and \sum_{i=1}^{c} \mu_i(x) > 0.

Changing labels to the crisp type is straightforward: find the maximum value and set it to one, i.e., \mu_i(x) = \max\{\mu_1(x), \mu_2(x), \ldots, \mu_c(x)\} = 1, and set the rest to 0. On the other hand, transforming crisp labels to possibility values differs from classifier to classifier and is not as easy as the reverse. For the transformation we must keep the related information, i.e., the votes for k-nn or the distances for distance-metric classifiers; otherwise there is no way to transform. For k-nn,

\mu_i(x) = \frac{vote(i)}{k} \qquad (4.47)

where vote(i) is the number of votes for class i. On the other hand, for classifiers using a distance metric,

\mu_i(x) = \frac{1/d_i(x)}{\sum_{j} 1/d_j(x)} \quad \text{or} \quad \mu_i(x) = \frac{1/d_i^{2}(x)}{\sum_{j} 1/d_j^{2}(x)} \qquad (4.48)

can be used for the transformation. Before combining decisions, the output types of all classifiers must be the same. Fusion methods can also be grouped according to the type of output they use in the process; Table 4.1 shows this grouping of combining methods, where type-2 and type-3 labels are grouped together as type-2 labels.
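A minimal sketch of these transformations is shown below, assuming NumPy. The helper names and the small epsilon guarding against zero distances are illustrative additions, not part of the thesis.

```python
import numpy as np

def knn_to_soft(votes, k):
    # Equation 4.47: mu_i(x) = vote(i) / k
    return np.asarray(votes, dtype=float) / k

def distance_to_soft(d, squared=False, eps=1e-12):
    # Equation 4.48: inverse (squared) distances, normalized to sum to one
    inv = 1.0 / (np.asarray(d, dtype=float) ** (2 if squared else 1) + eps)
    return inv / inv.sum()

def to_crisp(mu):
    # Crisp conversion: set the maximum label to 1 and the rest to 0
    crisp = np.zeros_like(mu)
    crisp[np.argmax(mu)] = 1.0
    return crisp

print(knn_to_soft([3, 1, 1], k=5))         # [0.6 0.2 0.2]
print(distance_to_soft([0.5, 2.0, 4.0]))   # favors the closest class
print(to_crisp(np.array([0.2, 0.7, 0.1]))) # [0. 1. 0.]
```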
