
A Classification Learning Algorithm Robust to Irrelevant Features

H. Altay Güvenir

Bilkent University,
Department of Computer Engineering and Information Science, 06533 Ankara, Turkey
guvenir@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~guvenir/

Abstract. Presence of irrelevant features is a fact of life in many real-world applications of classification learning. Although nearest-neighbor classification algorithms have emerged as a promising approach to machine learning tasks with their high predictive accuracy, they are adversely affected by the presence of such irrelevant features. In this paper, we describe a recently proposed classification algorithm called VFI5, which achieves accuracy comparable to nearest-neighbor classifiers while being robust with respect to irrelevant features. The paper compares the nearest-neighbor classifier and the VFI5 algorithm in the presence of irrelevant features on both artificially generated and real-world data sets selected from the UCI repository.

1 Introduction

Inductive classification or concept learning algorithms derive some form of classification knowledge from a set of training examples. In most real-world applications of classification learning, it is common to include all available information about the domain in the training data and to expect the learning algorithm to somehow select the relevant portions [2]. This is a reasonable expectation, since exactly which features are relevant to the target concept being learned may be unknown.

In recent years, instance-based nearest-neighbor (NN) classification algorithms have emerged as a promising approach to machine learning, with researchers reporting excellent results on many real-world induction tasks [1]. The nearest-neighbor algorithm normally represents instances as feature-value pairs. In order to predict the class of a novel instance, its distance to each of the training instances is first computed. The class of the test instance is then predicted to be the class of the training example with the shortest distance, that is, of its nearest neighbor. Learning in nearest-neighbor classifiers consists of simply storing the training instances in memory, leaving all the computation to the classification phase; for that reason, this kind of algorithm is called a lazy learner [8]. The kNN algorithm is a generalization of the NN algorithm, where the classification is based on a majority vote among the nearest k neighbors.
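As a concrete illustration of the procedure just described, the following Python sketch (our own, not from the paper) implements the basic kNN prediction step with a Euclidean distance over feature values; with k = 1 it reduces to the plain NN rule:

    import math
    from collections import Counter

    def knn_predict(train_X, train_y, query, k=3):
        """Predict the class of `query` by a majority vote among the k training
        instances closest to it (Euclidean distance over feature values)."""
        # Learning is lazy: the "model" is just the stored training instances.
        dists = []
        for x, label in zip(train_X, train_y):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
            dists.append((d, label))
        dists.sort(key=lambda pair: pair[0])          # nearest first
        votes = Counter(label for _, label in dists[:k])
        return votes.most_common(1)[0][0]

    # Two features, two classes; the query is surrounded by class-B instances.
    X = [(1.0, 2.0), (2.0, 1.5), (8.0, 9.0), (9.0, 8.5)]
    y = ["A", "A", "B", "B"]
    print(knn_predict(X, y, (8.5, 9.2), k=3))         # prints "B"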


One solution to the problem of irrelevant features is to learn a separate weight for each feature, so that irrelevant features are assigned low weight values and their effect on the distance measure is reduced. Feature selection is the extreme case of feature weighting, where only zero and one are used as weight values; the nearest-neighbor classifier is then run with only the selected features, those whose weight is one. Although feature selection is a special case of feature weighting, Kohavi et al. reported that increasing the number of possible weights beyond two (zero and one) has very little benefit and sometimes degrades performance [12]. Wettschereck et al. provide a good review and an empirical evaluation of feature weighting methods for a class of lazy learning algorithms [16]. Some researchers have developed algorithms solely for the selection of relevant features [3, 13-15].
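To make the weighting idea concrete, here is a minimal sketch (with hypothetical weights, chosen only for illustration) of a weighted Euclidean distance; setting a weight to zero reproduces the feature-selection case, where that feature is ignored entirely:

    def weighted_distance(x, y, weights):
        """Weighted Euclidean distance between two feature vectors.
        A weight of 0 removes a feature's influence; a low weight damps it."""
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)) ** 0.5

    # The third (irrelevant) feature differs wildly but carries weight 0,
    # so it does not affect the distance at all.
    print(weighted_distance((1.0, 2.0, 99.0), (1.5, 2.5, -7.0), weights=(1.0, 1.0, 0.0)))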

In this paper we present a classification learning algorithm that achieves high accuracy, comparable to that of the nearest-neighbor classifier, and is not adversely affected by the presence of irrelevant features. The VFI5 (Voting Feature Intervals) algorithm described here is quite robust with respect to irrelevant features, yet achieves good performance on existing real-world data sets. The VFI5 algorithm eliminates the adverse effect of irrelevant features through its inherent voting mechanism.

The rest of the paper is organized as follows. Section 2 explains the VFI5 classification learning algorithm in detail. Section 3 presents an evaluation of the VFI5 algorithm on artificially generated data sets that contain a varying number of irrelevant features. Section 4 evaluates the VFI5 algorithm on some existing data sets with artificially added irrelevant features. Section 5 concludes the paper.

2 VFI5 Classification Learning Algorithm

The VFI5 classification algorithm is an improved version of the earlier VFI1 algorithm [5, 7], which is a descendant of the CFP algorithm [11]. It has been applied to the differential diagnosis of erythemato-squamous diseases [6] and to arrhythmia analysis of ECG signals [9], with very promising results. Here, the VFI5 algorithm is described in detail.

The VFI5 classification learning algorithm represents a concept in terms of feature value intervals and makes a classification based on feature votes. It is a non-incremental learning algorithm; that is, all training examples are processed at once. Each training example is represented as a vector of feature values plus a label that gives the class of the example. From the training examples, the VFI5 algorithm constructs feature value intervals for each feature; the term interval is used for feature value intervals throughout the paper. An interval represents a set of values of a given feature over which the same set of class values is observed, so two neighboring intervals represent different sets of classes. For each interval, the lower bound of its values and the number of examples of each class in that interval are maintained. Thus, an interval may represent several classes by storing the number of examples of each class.
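A minimal sketch of how such an interval might be represented in code (names and structure are ours, used only for illustration):

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Interval:
        """One feature-value interval: its lower bound plus, for every class,
        the number of training examples of that class that fall into it."""
        lower: float
        class_count: Dict[str, int] = field(default_factory=dict)

        def add(self, cls: str) -> None:
            # Record one more training example of class `cls` in this interval.
            self.class_count[cls] = self.class_count.get(cls, 0) + 1

    iv = Interval(lower=2.0)
    iv.add("A"); iv.add("B"); iv.add("B")
    print(iv.class_count)   # {'A': 1, 'B': 2}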


Fig. 1. A sample training dataset with two features (f1, f2) and two classes (A, B).

Fig. 2. Intervals constructed by VFI5 with their class counts for the sample dataset.

Fig. 3. Class votes of the intervals constructed by VFI5 for the sample dataset.


In order to describe the VFI5 algorithm, consider the sample training dataset in Figure 1. In this dataset, we have two linear features f1 and f2, and there are 3 examples of class A and 4 examples of class B. There are 9 intervals for each feature. The intervals formed in the training phase of the VFI5 algorithm are shown in Figure 2.

The training process of the VFI5 algorithm is given in Figure 4. The lower bounds of the intervals are learned by finding the end points for each feature and for each class. The procedure find_end_points(TrainingSet, f, c) finds the lowest and the highest values of feature f among the examples of class c in the TrainingSet. These lowest and highest values are called the end points; for each feature there are 2C end points, where C is the number of distinct classes in the domain. VFI5 constructs a point interval at each distinct end point. Further, for linear features a range interval is constructed between every pair of consecutive end points. These range intervals do not cover the end point values. The maximum number of intervals constructed for a linear feature is 4C + 1.
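A rough sketch of this interval-construction step for a single linear feature follows (the helper name and representation are our own; the values below are hypothetical):

    def build_intervals(values_by_class):
        """Given {class: [values of one linear feature]}, return the list of
        intervals: a point interval at each distinct end point and an open
        range interval between (and beyond) consecutive end points."""
        # Each class contributes two end points: its lowest and highest value.
        end_points = sorted({v for vals in values_by_class.values()
                             for v in (min(vals), max(vals))})
        intervals = [("range", float("-inf"), end_points[0])]
        for i, p in enumerate(end_points):
            intervals.append(("point", p, p))
            upper = end_points[i + 1] if i + 1 < len(end_points) else float("inf")
            intervals.append(("range", p, upper))
        return intervals   # at most 2C point + 2C + 1 range intervals = 4C + 1

    # Hypothetical per-class value lists for one linear feature:
    for iv in build_intervals({"A": [2, 4, 5], "B": [5, 6, 7, 8]}):
        print(iv)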

Each interval is represented by a vector <lower, vote_1, ..., vote_C>, where lower is the lower bound of that interval and vote_i is the vote given to class i by that interval. These votes are computed as

   interval_class_vote[f, i, c] = interval_class_count[f, i, c] / class_count[c]

where interval_class_count[f, i, c] is the number of training examples of class c that fall into interval i of feature f. The individual vote of feature f for class c, interval_class_vote[f, i, c], is then normalized so that the sum of the votes of feature f over all classes is equal to 1. Hence, the vote of feature f is a real value in [0, 1]. This normalization guarantees that, unless otherwise specified, each feature has the same weight in the voting. Class votes of the intervals for the data set given in Figure 1 are shown in Figure 3.

Note that, since each feature is processed separately, no normalization of feature values is required.
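The vote computation and normalization just described can be sketched as follows (a simplified rendering; in VFI5 these counts are kept per feature and per interval):

    def interval_votes(interval_class_count, class_count):
        """Turn the raw per-class counts of one interval into normalized votes.

        interval_class_count: {class: number of its examples in this interval}
        class_count:          {class: total number of its training examples}
        """
        # Divide by the class size so that frequent classes do not dominate.
        raw = {c: interval_class_count.get(c, 0) / class_count[c]
               for c in class_count}
        total = sum(raw.values())
        if total == 0:
            return {c: 0.0 for c in class_count}   # empty interval casts no vote
        # Normalize so the votes of this interval (and hence of the feature whose
        # value falls into it) sum to 1: every feature gets the same say.
        return {c: v / total for c, v in raw.items()}

    # E.g. an interval holding 2 examples of A and 1 of B, with 3 As and 4 Bs overall:
    print(interval_votes({"A": 2, "B": 1}, {"A": 3, "B": 4}))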

The VFI5 classifier is shown in Figure 5. The process starts by initializing the votes of each class to zero. The classification operation includes a separate preclassification step for each feature. The preclassification of feature f involves a search for the interval on feature f into which e_f falls, where e_f is the value of test example e for feature f. This search is performed by the find_interval function in Figure 5. If that value is unknown (missing), the feature does not participate in the classification process; features with missing values are simply ignored. Ignoring a feature about which nothing is known is a natural and plausible approach.

If the value for feature f of example e is known, the interval i into which e_f falls is determined. An interval may contain training examples of several classes. The classes in an interval are represented by their normalized votes, which are already stored as part of the interval's representation. These votes of the interval are used as the vote vector of the corresponding feature. After every feature completes its preclassification process, the individual vote vectors are summed up to obtain a total vote vector <vote_1, ..., vote_C>. Finally, the class with the highest vote in the total vote vector is predicted to be the class of the test instance.


train(TrainingSet):
begin
   for each feature f
      if f is linear
         for each class c
            EndPoints[f] = EndPoints[f] U find_end_points(TrainingSet, f, c);
         sort(EndPoints[f]);
         for each end point p in EndPoints[f]
            form a point interval from end point p
            form a range interval between p and the next end point
      else /* f is nominal */
         form a point interval for each value of f
      for each interval i on feature f
         for each class c
            interval_class_count[f, i, c] = 0
      count_instances(f, TrainingSet);
      for each interval i on feature f
         for each class c
            interval_class_vote[f, i, c] = interval_class_count[f, i, c] / class_count[c]
         normalize interval_class_vote[f, i, c];
         /* such that the sum over c of interval_class_vote[f, i, c] = 1 */
end.

Fig. 4. Training in the VFI5 Algorithm.

classify(e):
/* e: example to be classified */
begin
   for each class c
      vote[c] = 0
   for each feature f
      for each class c
         feature_vote[f, c] = 0   /* vote of feature f for class c */
      if e_f value is known
         i = find_interval(f, e_f)
         for each class c
            feature_vote[f, c] = interval_class_vote[f, i, c]
      for each class c
         vote[c] = vote[c] + feature_vote[f, c];
   return the class c with the highest vote[c];
end.

Fig. 5. Classification in the VFI5 Algorithm.


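Putting these steps together, a compact sketch of the voting-based classification is given below (hypothetical helper names; intervals[f] is assumed to hold, for each feature, its intervals sorted by lower bound together with their normalized votes learned in training):

    def find_interval(feature_intervals, value):
        """Return the last interval whose lower bound does not exceed `value`
        (point vs. range intervals are glossed over in this sketch)."""
        chosen = feature_intervals[0]
        for iv in feature_intervals:
            if iv["lower"] <= value:
                chosen = iv
        return chosen

    def vfi5_classify(example, intervals, classes):
        """Sum, over every feature with a known value, the normalized votes of
        the interval that value falls into; predict the class with the most votes."""
        total_vote = {c: 0.0 for c in classes}
        for f, value in example.items():
            if value is None:                 # missing value: the feature abstains
                continue
            iv = find_interval(intervals[f], value)
            for c in classes:
                total_vote[c] += iv["votes"].get(c, 0.0)
        return max(total_vote, key=total_vote.get)

    # Tiny hand-made model with a single feature f1 and classes A and B:
    model = {"f1": [{"lower": float("-inf"), "votes": {"A": 0.5, "B": 0.5}},
                    {"lower": 3.0,           "votes": {"A": 0.8, "B": 0.2}}]}
    print(vfi5_classify({"f1": 4.2}, model, classes=["A", "B"]))   # prints "A"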

3 Empirical Evaluation on Artificial Data Sets

For an empirical comparison of the kNN and VFI5 algorithms, we artificially generated data sets with varying numbers of relevant and irrelevant features and measured the predictive accuracy of each algorithm.

We generated data sets in which the number of relevant features ranges from 1 to 6. We call these data sets Rn, where n is the number of relevant features. Each artificial data set contains two classes. The instance space is divided into two regions of equal volume, and 50 randomly generated instances are distributed uniformly over each region, so each data set contains 100 instances. Once an artificial data set Rn with n relevant features was generated, we added a varying number of irrelevant features, ranging from 0 to 20, to the data set. For each such data set, we computed the 5-fold cross-validation accuracies of both the kNN and VFI5 algorithms. We repeated this process 100 times and report the results in Figure 6. The kNN algorithm was run with k values of 1, 3 and 5.
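One way such an Rn data set with added irrelevant features might be generated is sketched below; the exact generator used in the paper is not specified beyond the description above, so the region construction here (class A occupies the lower half of every relevant feature's range, class B the upper half) is our own assumption:

    import random

    def make_rn_dataset(n_relevant, n_irrelevant, n_per_class=50, seed=0):
        """Two classes of n_per_class instances each; relevant features separate
        the classes, irrelevant features are uniform noise for both classes."""
        rng = random.Random(seed)
        data = []
        for cls, lo, hi in (("A", 0.0, 0.5), ("B", 0.5, 1.0)):
            for _ in range(n_per_class):
                relevant = [rng.uniform(lo, hi) for _ in range(n_relevant)]
                noise = [rng.uniform(0.0, 1.0) for _ in range(n_irrelevant)]
                data.append((relevant + noise, cls))
        return data   # 100 instances in total, as in the experiments above

    dataset = make_rn_dataset(n_relevant=3, n_irrelevant=10)
    print(len(dataset), len(dataset[0][0]))   # 100 instances, 13 features each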

It is clear from Figure 6 that VFI5 is much less affected by the existence of irrelevant features in the data set. The predictive accuracy of the kNN algorithm, on the other hand, drops almost linearly as the number of irrelevant features increases. As expected, this drop becomes less steep as the number of relevant features increases.

4 Empirical Evaluation on Existing Data Sets

In order to compare the kNN and VFI5 classifiers, we also tested them on six existing data sets selected from the UCI repository [4]. Since most of the data sets in the UCI repository have been carefully constructed by eliminating irrelevant features, we modified the data sets by artificially adding an increasing number of irrelevant features. We used 1, 3 and 5 as the values of k in the kNN algorithm. The comparison of the classification accuracies of the kNN and VFI5 algorithms on six UCI repository data sets with an increasing number of artificially added irrelevant features is depicted in Figure 7.

The experiments indicate that, although both algorithms achieve about the same predictive accuracy before any irrelevant features are added, the accuracy of the nearest-neighbor classifier drops quickly when irrelevant features are added, whereas the accuracy of the VFI5 classifier remains at about the same level as in the case without the added irrelevant features.

This shows that the VFI5 algorithm is robust with respect to the existence of irrelevant features. The robustness of the VFI5 algorithm is due to the voting mechanism used in the classification. Since the votes of an interval, and in turn of a feature, are normalized, an irrelevant feature gives about the same vote to all the classes in the domain and therefore has no effect on the outcome of the voting. The main advantage of the VFI5 algorithm is that it achieves this robustness without requiring external help for feature selection.


Fig. 6. The comparison of the average classification accuracies for kNN and VFI5 on some artificially generated data sets. Rn represents a data set with n relevant features and two classes. Accuracy value is the average of 100 5-fold cross-validation accuracies. (Panels: R1 through R6; x-axis: number of irrelevant features added.)


Fig. 7. The comparison of the average classification accuracies for kNN and VFI5 on some UCI repository data sets with an increasing number of artificially added irrelevant features. Accuracy given is the average of 100 5-fold cross-validation accuracies. (Panels: Dermatology, Iris, Vehicle, Glass, New-thyroid and Wine data sets; x-axis: number of irrelevant features added.)


These experiments also indicate that, for higher values of k, the kNN algorithm becomes more robust to irrelevant features.

5 Conclusion

In this paper, a voting-based classification algorithm called VFI5 has been described. The VFI5 algorithm is compared with the nearest-neighbor algorithm, which has been reported to achieve high accuracy. These algorithms were tested on both artificially generated and existing data sets with an increasing number of artificially added irrelevant features. Our experiments showed that, on most data sets, both algorithms achieve about the same predictive accuracy when no irrelevant features are added. However, when irrelevant features are added, the accuracy of the VFI5 algorithm remains at about the same level or decreases only slightly, while the accuracy of the nearest-neighbor classifier drops quickly. This shows that the VFI5 algorithm is robust with respect to the existence of irrelevant features. VFI5 achieves this through the voting mechanism used in classification, where the votes of an irrelevant feature are about the same for all classes and therefore have no effect on the outcome. The main advantage of the VFI5 algorithm is that it achieves this robustness without requiring external help for feature selection.

References

1. Aha, D., Kibler, D., Albert, M.: Instance-based Learning Algorithms. Machine Learning 6 (1991) 37-66

2. Almuallim, H., Dietterich, T.G.: Learning with many irrelevant features. In: Proceedings of the 9th National Conference on Artificial Intelligence. AAAI Press, Menlo Park (1991) 547-552

3. Cardie, C.: Automating Feature Set Selection for Case-Based Learning of Linguistic Knowledge. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania (1996) 113-126

4. Merz, C.J., Murphy, P.M.: UCI Repository of machine learning databases. At http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)

5. Demiröz, G.: Non-Incremental Classification Learning Algorithms based on Voting Feature Intervals. MSc. Thesis, Bilkent University, Dept. of Computer Engineering and Information Science, Ankara, Turkey (1997)

6. Demiröz, G., Güvenir, H.A., İlter, N.: Differential Diagnosis of Erythemato-Squamous Diseases using Voting Feature Intervals. In: Ciftcibasi, T., Karaman, M., Atalay, V. (Eds.): New Trends in Artificial Intelligence and Neural Networks (TAINN'97), Kızılcahamam, Turkey (May 22-23, 1997) 190-194

7. Demiröz, G., Güvenir, H.A.: Classification by Voting Feature Intervals. In: van Someren, M., Widmer, G. (Eds.): Machine Learning: ECML-97. Lecture Notes in Computer Science, Vol. 1224. Springer-Verlag, Berlin (1997) 85-92


8. Domingos, P.: Context-sensitive feature selection for lazy learners. Artificial Intelligence Review 11 (1997) 227-253

9. Güvenir, H.A., Acar, B., Demiröz, G., Çekin, A.: A Supervised Machine Learning Algorithm for Arrhythmia Analysis. In: Computers in Cardiology, Vol. 24, Lund, Sweden (1997) 433-436

10. Güvenir, H.A., Akkuş, A.: Weighted K Nearest Neighbor Classification on Feature Projections. In: Kuru, S., Çağlayan, M.U., Akın, H.L. (Eds.): Proceedings of the Twelfth International Symposium on Computer and Information Sciences (ISCIS XII), Antalya, Turkey (1997) 44-51

11. Güvenir, H.A., Şirin, İ.: Classification by Feature Partitioning. Machine Learning 23 (1996) 47-67

12. Kohavi, R., Langley, P., Yun, Y.: The Utility of Feature Weighting in Nearest-Neighbor Algorithms. In: van Someren, M., Widmer, G. (Eds.): Machine Learning: ECML-97. Lecture Notes in Computer Science, Vol. 1224. Springer-Verlag, Berlin (1997) 85-92

13. Langley, P.: Selection of Relevant Features in Machine Learning. In: Proceedings of the AAAI Fall Symposium on Relevance, New Orleans, USA. AAAI Press (1994)

14. Liu, H., Setiono, R.: A probabilistic approach to feature selection - A filter solution. In: Saitta, L. (Ed.): Proceedings of the Thirteenth International Conference on Machine Learning (ICML'96), Italy (1996) 319-327

15. Skalak, D.: Prototype and feature selection by sampling and random mutation hill climbing algorithms. In: Proceedings of the Eleventh International Machine Learning Conference (ICML-94). Morgan Kaufmann, New Brunswick (1994) 293-301

16. Wettschereck, D., Aha, D.W., Mohri, T.: Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 11 (1997) 273-314
