• Sonuç bulunamadı

Online anomaly detection with kernel density estimators

N/A
N/A
Protected

Academic year: 2021

Share "Online anomaly detection with kernel density estimators"

Copied!
55
0
0

Yükleniyor.... (view fulltext now)

Tam metin

(1)

ONLINE ANOMALY DETECTION WITH

KERNEL DENSITY ESTIMATORS

a thesis submitted to

the graduate school of engineering and science

of bilkent university

in partial fulfillment of the requirements for

the degree of

master of science

in

electrical and electronics engineering

By

Mine Kerpi¸c¸ci

July 2019

(2)

Online Anomaly Detection with Kernel Density Estimators By Mine Kerpi¸c¸ci

July 2019

We certify that we have read this thesis and that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

S¨uleyman Serdar Kozat(Advisor)

H¨useyin ¨Ozkan(Co-Advisor)

Sinan Gezici

Umut Orguner

Approved for the Graduate School of Engineering and Science:

Ezhan Kara¸san

(3)

ABSTRACT

ONLINE ANOMALY DETECTION WITH KERNEL

DENSITY ESTIMATORS

Mine Kerpi¸c¸ci

M.S. in Electrical and Electronics Engineering Advisor: S¨uleyman Serdar Kozat

Co-Advisor: H¨useyin ¨Ozkan July 2019

We study online anomaly detection in an unsupervised framework and introduce an algorithm to detect the anomalies in sequential data. We first sequentially learn the density for the observed data with a novel kernel based hierarchical approach for which we also provide a regret bound in a competitive manner against an exponentially large class of estimators. In our approach, we use a binary partitioning tree and apply the nonparametric Kernel Density Estimation (KDE) method at each node of the introduced tree. The use of the partitioning tree allows us not only to generate a large class of estimators of size doubly exponential in the depth that we compete against in estimating the density, but also to hierarchically organize the class to obtain a computationally efficient final estimation. Moreover, we do not assume any underlying distribution for the data so that our algorithm can work for data coming from any unknown arbitrarily complex distribution. Note that the end-to-end processing in our work is truly online. For this, we exploit a random Fourier kernel expansion for sequentially exact kernel evaluations without a repetitive access to past data. Our algorithm learns not only the optimal partitioning of the observation space but also the optimal bandwidth, which is locally tuned for the optimal partition. Thus, we solve the bandwidth selection problem in KDE methods in a highly novel and computationally efficient way. Finally, as the data density is sequentially being learned in the stream, we compare the estimated density with a threshold to detect the anomalies. We also adaptively learn the threshold in time to achieve the optimal threshold. In our experiments with synthetic and real datasets, we illustrate significant performance improvements achieved by our method against the state-of-the-art anomaly detection algorithms.

(4)

iv

Keywords: Online anomaly detection, kernel density estimation, bandwidth se-lection, regret analysis.

(5)

¨

OZET

C

¸ EK˙IRDEK YO ˘

GUNLUK TAHM˙INC˙ILER˙I ˙ILE

C

¸ EVR˙IM˙IC

¸ ˙I ANOMAL˙I TESP˙IT˙I

Mine Kerpi¸c¸ci

Elektrik ve Elektronik M¨uhendisli˘gi, Y¨uksek Lisans Tez Danı¸smanı: S¨uleyman Serdar Kozat

˙Ikinci Tez Danı¸smanı: H¨useyin ¨Ozkan Temmuz 2019

C¸ evrimi¸ci anomali tespitini g¨ozetimsiz ¸cer¸cevede ¸calı¸smakta ve ardı¸sık olarak g¨ozlemlenen veride anomali tespiti yapan bir algoritma tanıtmaktayız. ¨Oncelikle, ¨

ozg¨un ¸cekirdek temelli hiyerar¸sik yakla¸sım ile ardı¸sık g¨ozlemlenen verinin yo˘gunluk tahminini yapmakta ve ¨ustel olarak geni¸s bir sınıf olu¸sturan yo˘gunluk tahmincilerine kar¸sı yarı¸san bu yakla¸sımın pi¸smanlık sınırlarını vermekteyiz. Yakla¸sımımızda, ikili b¨ol¨umleme a˘gacı kullanmakta ve a˘gacın her d¨u˘g¨um¨une parametrik olmayan C¸ ekirdek Yo˘gunluk Tahmini (C¸ YT) y¨ontemi uygulamak-tayız. B¨ol¨umleme a˘gacının kullanımı, hem yo˘gunluk tahmininde yarı¸stı˘gımız ve a˘gacın derinli˘giyle ¸cift ¨ustel orantılı boyuttaki geni¸s tahminciler sınıfını olu¸sturmamızı sa˘glamakta hem de son tahmini verimli bir hesaplama ile yap-mamız i¸cin bu sınıfı hiyerar¸sik olarak organize etmemizi sa˘glamaktadır. Ayrıca, veri da˘gılımıyla ilgili herhangi bir varsayımda bulunmadı˘gımız i¸cin algoritmamız bilinmeyen herhangi karma¸sık bir da˘gılıma sahip veri i¸cin ¸calı¸sabilmektedir. C¸ alı¸smamızda s¨ure¸c ba¸stan sona ¸cevrimi¸ci ilerlemektedir. Bunun i¸cin, ardı¸sık ¸cekirdek hesaplamaları yerine rastgele Fourier ¸cekirdek a¸cılımını ge¸cmi¸s verilere tekrar eri¸sim gerektirmeden uygulamaktayız. Algoritmamız g¨ozlem uzayının en iyi b¨ol¨unmesini ¨o˘grenmesinin yanında en iyi b¨ol¨unme i¸cin b¨olgesel olarak en iyi bant geni¸sli˘gini de ¨o˘grenmektedir. B¨oylece, C¸ YT y¨ontemlerindeki bant geni¸sli˘gi se¸cimi problemini de olduk¸ca ¨ozg¨un ve hesaplama olarak verimli bir y¨ontemle ¸c¨ozmekteyiz. Son olarak, verinin yo˘gunlu˘gunu ardı¸sık olarak ¨o˘grenirken, yo˘gunluk tahminini e¸sik de˘geriyle kar¸sıla¸stırarak anomalileri tespit etmekteyiz. Ayrıca, en uygun e¸sik de˘gerine ula¸smak i¸cin de bu e¸sik de˘gerini zaman i¸cinde ¨

o˘grenmekteyiz. Sentetik ve ger¸cek veri setleri ile ger¸cekle¸stirdi˘gimiz deneyler-imizle, en geli¸skin anomali tespit y¨ontemlerine kar¸sı elde etti˘gimiz performans kazan¸clarını g¨ostermekteyiz.

(6)

vi

Anahtar s¨ozc¨ukler : C¸ evrimi¸ci anomali tespiti, ¸cekirdek yo˘gunluk tahmini, bant geni¸sli˘gi se¸cimi, pi¸smanlık analizi.

(7)

Acknowledgement

First of all, I would like to thank my advisor, Prof. S¨uleyman Serdar Kozat for his great guidance and support throughout my M.S. study. I would like to express my profound gratitude to him for sharing his vision and immense knowledge. I would also like to thank my co-advisor, Asst. Prof. H¨useyin ¨Ozkan for sharing his invaluable knowledge and support during my studies.

I would like to thank T ¨UB˙ITAK and acknowledge that this work is supported by T ¨UB˙ITAK B˙IDEB 2210-A Scholarship Programme.

I would like to thank and express my deepest gratitude to my mother Aynur for her endless support and encouragement throughout my life. I owe her everything. I wish to thank my sister Sibel and my brother-in-law Mustafa Can for their invaluable support and continuous encouragement. I would also like to thank my lovely family: my uncle Ergin and my aunt ¨Ulviye. They have been always with me whenever I need them.

Last, but not the least, I thank Mustafa Sak for his unconditional support, encouragement and devotion during this whole adventure.

(8)

Contents

1 Introduction 1

1.1 Related Work . . . 3

1.2 Organization of the Thesis . . . 5

2 Problem Description 7 3 Novel Density Estimator 12 3.1 Bandwidth Optimized Piecewise Online Kernel Density Estimator (KDE) . . . 13

3.1.1 Conventional KDE . . . 14

3.1.2 Online KDE . . . 15

3.1.3 Piecewise Online KDE . . . 17

3.1.4 Bandwidth optimized piecewise online KDE . . . 17

3.2 Efficient Combination of Hierarchically Organized Bandwidth Op-timized Piecewise Online KDEs . . . 20

(9)

CONTENTS ix 4 Anomaly Detector 26 5 Experiments 28 5.1 Artificial Dataset . . . 30 5.2 Real Datasets . . . 33 6 Conclusion 38

(10)

List of Figures

2.1 In this example, a depth-2 binary tree partitions the observation space S = [−A, A] × [−A, A]. (a) Partition regions are illustrated shaded at their corresponding nodes. (b) Every pruning of the introduced partitioning tree defines a unique partition of the ob-servations space. The set of all possible such partitions are shown for the depth-2 tree. . . 9

5.1 The ROC performance of the compared algorithms on a mixture of two Gaussian components. . . 31 5.2 The ROC performance of the compared algorithms on the

Occu-pancy dataset. . . 32 5.3 For the Occupancy dataset, kernel bandwidth probability

varia-tions of our algorithm in time. . . 33 5.4 For the Occupancy dataset, ROC performances of our algorithm

and its static bandwidth version with fixed bandwidths. . . 34 5.5 The ROC performances of the algorithms for the Breast Wisconsin

dataset. . . 35 5.6 The ROC performances of the algorithms for the Pen digit dataset. 36

(11)

LIST OF FIGURES xi

(12)

Chapter 1

Introduction

Anomaly detection, as an unsupervised one-class machine learning problem, finds use in various applications ranging from cyber security [1], surveillance [2], failure detection [3], novelty detection [4] and data imputation [5]. We study this problem for sequential data in a truly online setting; and propose a novel algorithm that observes xt ∈ Rd at each time t, and decides whether xt is anomalous based on

the past observations {x1, x2, . . . xt−1}. In our framework, each instance xt is

processed only once without being stored and no label information is required. Therefore, the proposed algorithm is online and unsupervised, and appropriate for real time anomaly detection applications.

One of the main difficulties in anomaly detection problems is typically the ex-treme rarity of anomalies and their unpredictable nature [6], which differentiates the anomaly detection from the binary classification. In the binary classification problem, one has observations from both classes. However, in anomaly detection, there are no or extreme rare observations from the anomaly class. This essentially poses a one-class classification problem where it is reasonable to assume unifor-mity for anomalies [7]. Then, the Neyman-Pearson (NP) test [8] corresponds to comparing the density f (xt) with an appropriate threshold [9, 10] that maximizes

the detection power at a desired false alarm rate. This in turn amounts to density estimation and learning the desired threshold. Therefore, a common approach

(13)

in the literature [6] consists of two steps, which are assigning a density score to the observed data instance, and then, comparing this score with a threshold [6]. Scores below the threshold indicate anomaly.

Our technique is also two fold. We first estimate the probability density of the data in a nonparametric manner without any model assumption, and then find the anomalies by thresholding the estimated density for each instance.

Our density estimator is regret-optimal [11], where the loss is the accumulated negative log-likelihood, against a large class of estimators. To be more precise, our density estimator asymptotically performs at least as well as the best class estimator in terms of the data likelihood. To obtain an estimator in this class, we use a kernel density estimator (KDE) [12] with a different bandwidth parameter in each region of a given partition of the observation space. Hence, the class of estimators are restricted to KDEs defined on the tree where varying the band-width parameter in each region and also varying the partition itself yield the aforementioned class. The proposed algorithm learns the optimal partitioning (that can be defined over the introduced tree) of the observation space and also learns the optimal region-specific bandwidth parameters for the optimal partition even in a time-varying manner. Namely, the optimal bandwidth of the KDE that we use for each region in the optimal partition might well be different as well as nonstationary (the distribution that the data instances are drawn/coming from might be changing in time, i.e., might be time-varying).

The introduced density estimation for anomaly detection is essentially a mix-ture of experts [13] based approach. These experts are “piecewise” KDEs, “pieces” of which are generated by a hierarchical observation space partition-ing via a depth-D binary tree. Pieces correspond to the partition regions; and for each region, the optimal time varying KDE bandwidth from a finite set with cardinality k is separately learned. Our algorithm produces its density estima-tion by combining these expert estimaestima-tions. The proposed combinaestima-tion is based on a time-varying weighting that is used to mitigate overfitting/overtraining is-sues, and to asymptotically pick the optimal partition. The purpose in hierar-chically organizing the KDEs is to achieve computational efficiency while still

(14)

maintaining an exponentially large competition class of estimators. Specifically, the introduced tree based hierarchical structure provides us with a class of size ap-proximately (1.5)2D whereas the overall processing complexity in our framework scales only linearly with O(T Dk) where T is the number of processed instances. The result is a novel computationally efficient combination of hierarchically organized piecewise online KDEs with bandwidths that are optimally varying in time and space. We mathematically prove and guarantee the convergence of the density estimation performance of our algorithm to the best piecewise constructed class estimator in terms of the data likelihood. Finally, we decide whether the observed data instance is anomalous or not by comparing the estimated probabil-ity with a threshold. In our approach, an optimal NP test can be achieved with an appropriate threshold choice for a constant false alarm rate. Alternatively, we provide an algorithm to adaptively learn the optimal threshold in terms of the detection accuracy when the true labels are available. Also, we demonstrate sig-nificant performance gains by our algorithm in comparison to the state-of-the-art methods through extensive set of experiments.

1.1

Related Work

Kernel based techniques are extensively used in the machine learning literature and in particular for density estimation [14] due to their nonparametric high mod-eling power. Nevertheless, kernel bandwidth selection is one of the most critical issues in these approaches [15] since the bandwidth choice significantly affects the performance of the density estimator. There are two main approaches to handle this issue, which are based on cross validation and plug-in [16]. Cross validation based approaches mostly depend on leave-one-out scheme to find the bandwidth [16]. Plug-in based methods are based on minimization of asymptotic mean integrated squared error (AMISE) [16], where pilot bandwidths are introduced for an iterative solution. Another method is proposed in [17], which pointwise varies the kernel bandwidth based on the minimization of error such as AMISE. Although these techniques and others such as [18, 19] are impressive in their

(15)

own batch processing setting, they suffer from one or both of the following two issues: i) they use one global time-invariant bandwidth for the complete observa-tion space and time span, and ii) their computaobserva-tional complexity is prohibitively large, preventing their use in our sequential setting. There are also online KDE methods, which use compression techniques to adapt the kernel method to an online framework, with time-varying bandwidth optimization as in [20, 21, 22]. These methods aim to reduce the number of stored data instances to lower the complexity. However, this compression approach requires iterative steps to up-date the compressed model and bandwidth at each time, which is prohibitively costly in an online framework. An online adapted version of AMISE-optimal bandwidth approach is proposed in [20], which requires iterative plug-in process for each observed data instance. Compressed model of samples is used in [21] to find the AMISE-optimal bandwidth at each time after the model is updated with the observed data instance. However, they are not able to learn the local variations of data due to their bandwidth optimization approaches. Our method, however, rigorously address all of these issues by the introduced highly adaptive hierarchical KDEs and their locally optimized bandwidths with low computa-tional complexity.

There exists numerous methods [23, 24, 9, 25] that solve the anomaly detection problem in the nonsequential batch setting. For example, [24, 9] compute pair-wise cross kth distances between the observed instance and training samples for detecting an anomaly. Another method is proposed in [25], which is less sensitive to the neighborhood parameter k compared to the conventional distance based techniques. Since these methods are all batch processors, repeated use of all of the available data are required that leads to a too large memory and computational complexity. Among the Neyman-Pearson based anomaly detection techniques [10, 7, 26], minimum volume set approach is used in [10] for detecting anomalies. Also, local nearest neighbor distances are used for the anomalies in video observations in [7]. These are again both batch techniques for which an online extension is proposed in [26] that can only work when the data is Markovian. In contrast, our technique is truly online and model free, which is appropriate for real time processing of challenging data sources.

(16)

Anomaly detection problem is solved in the online setting in [27, 28, 29, 4, 30]. For example, [28] proposes a k-nearest neighbor technique, with required training steps. Online training is introduced in [30] to adapt the batch algorithms such as support vector machine (SVM) to real time applications. Since these meth-ods [28, 30] still require quadratic optimization, the computational running time significantly increases with the data length. Anomaly detection is considered as a classification problem in [29] and it is solved via a cost-sensitive approach, which requires true label information, whereas our technique is completely un-supervised. [27, 4] estimate the probability density of the observed data, and compare it with a threshold. For this, [27] assumes that the data distribution is from an exponential family. An information theoretic approach is proposed in [4] to increase the detection performance when the number of training data samples is small. These methods [27, 4] assume that the data distribution is Gaussian or a mixture of Gaussians, and solves the anomaly detection problem with a parametric approach. Unlike these methods, we do not assume any para-metric distribution shape for the observed sequential data, hence our technique can model any unknown arbitrarily complex distribution in a data driven online manner with random Fourier features [31]. Kernel approximation with random Fourier features [31] is used for binary classification problem in [32] where the proposed algorithm learns (in online manner) the linear discriminant in the ap-proximated and linear kernel space with online gradient descent based on the true label information. Our method also uses random features for obtaining an online implementation but, on the other hand, solves the online anomaly detec-tion problem by learning both the optimal partidetec-tion in a hierarchical tree and the optimal bandwidths for all regions in this partition without requiring any true label information.

1.2

Organization of the Thesis

The organization of the thesis is as follows. In Chapter 2, we first define our problem setting where we also introduce the hierarchical partitioning of the ob-servation space. Next, we present our combination of hierarchically organized

(17)

online KDEs for our density estimation in Chapter 3. Chapter 4 introduces our optimal time-varying thresholding for the final anomaly detection. In chapter 5, we demonstrate our performance improvements on artificial and real datasets. We conclude with final remarks in Chapter 6.

(18)

Chapter 2

Problem Description

We sequentially observe1 xt ∈ Rd at each time t. Our aim is to decide whether

the observed data xt is anomalous or not based on the past observations, i.e.,

x1, x2, . . . xt−1. For this purpose, we introduce a two step approach for anomaly

detection where we first sequentially estimate the probability density of xt by

using previously observed data sequence xt−1 = {xi}t−1i=1 as ft(xt). Then, we

compare the estimated probability ft(xt) with a threshold to decide whether xt

is anomalous or not. We declare a decision ˆdt at each time t as

ˆ dt =    +1, ft(xt) < ς (anomalous) −1, ft(xt) ≥ ς (normal) , (2.1)

where ς is the threshold. We also provide an algorithm that updates the threshold ςt in time to reach the best choice in terms of the detection accuracy, when

the true label dt is available. We emphasize that our framework is completely

unsupervised, i.e., no knowledge of the true label is required; but if the true label is available for an instance, then it is incorporated.

1In this thesis, all vectors are column vectors, and they are denoted by boldface lower case

letters. Matrices are denoted by boldface uppercase letters. For a vector w, w0 is its transpose. Also, for a scalar x, |x| represents its absolute value and for a set X, |X| represents the cardinality of the set. The time index is given as subscript, and a sequence of {xt}Nt=1is represented as xN.

Also, 1{·}is the indicator function returning 1 if its argument condition holds, and returning 0

(19)

We assume that the observation sequence {xt}t≥1 ∈ S ⊂ Rd is generated from

an i.i.d. source where S is a bounded compact domain. In order to estimate the probability density of xt, we introduce a novel algorithm based on the context tree

weighting method [33], [34] in which we construct a class of density estimators and asymptotically achieve -at least- the performance of the best class estimator. A hierarchical partitioning tree of depth-D is constructed to partition the observation space S into various regions such that each tree node θ corresponds2

to a specific region in S. For an illustration, in Fig. 2.1a, we consider two dimensional bounded xt ∈ R2, i.e., xt = [xt,1, xt,2]0 and |xi| < A, and use a

depth-2 tree. There exist 2D+1 − 1 nodes for a depth-D tree in general, and each node region is the union of the regions of its left and right children. For example, the children of the root node θo are defined as θol = [−A, 0] × [−A, A]

and θor = [0, A] × [−A, A] in Fig. 2.1a. The root node θo corresponds to the

complete observation space such that θo = θol∪ θor = S = [−A, A] × [−A, A].

We combine ni disjoint regions on the tree to form a partition Pi such that

Pi = {θi,1, θi,2, . . . , θi,ni} with θi,k∩ θi,j = ∅ if k 6= j and Sni

j=1θi,j = S. Examples

of such partitions in the case of a depth-2 tree are given in Fig. 2.1b. In general, we construct np ≈ (1.5)2

D

[33] partitions for depth-D tree. Note that each of these partitions can be obtained by collecting the regions of the leaves after an appropriate pruning. One can obtain 5 different partitions for D = 2 as in Fig. 2.1b. We assign a kernel density estimator (KDE) to each region θi,j in each

partition Pi such that the KDE is trained with only those instances falling in

the corresponding region. Note that each partition essentially collects ni many

region-specific KDEs yielding a “piecewise” constructed KDE (where pieces are regions). Consequently, the set of (1.5)2D partitions from all possible prunings of the tree defines a class of piecewise constructed KDEs. For each assigned region-specific KDE, we also sequentially learn the optimal bandwidth in a data driven and time varying manner instead of setting it a priori before the processing starts. Hence, the resulting piecewise KDE corresponding to each partition is a collection of KDEs of locally and temporally tuned bandwidths.

2Here, θ denotes both the node and the corresponding region for notational simplicity, where

(20)

Our goal is first to sequentially estimate the density of each instance xtcausally

in time by combining the class estimators and asymptotically achieve at least the performance of the best (in terms of the accumulated likelihoods) class estima-tor. This combination can be seen as a mixture of experts with each expert being a class estimator. Second, we optimally (in terms of the anomaly detection accuracy) threshold our estimated density for anomaly detection.

A -A A -A X1 X2 θo θol θor A -A X2 -A 0 A A -A X2 -A 0 A X1 X1

θoll θolr θorl θorr

A -A -A A -A A -A A -A A A -A A -A A -A 0 X1 0 X1 0 X1 0 X1 0 X2 X20 X20 X20 (a) A -A A -A X2 X1 Ƥ1 A -A A -A X2 X1 Ƥ2 θol θo θor A -A A -A X2 X1 Ƥ3 θol θorr θorl A -A A -A X2 X1 Ƥ4 θor θolr θoll A -A A -A X2 X1 Ƥ5 θolr θoll θorr θorl (b)

Figure 2.1: In this example, a depth-2 binary tree partitions the observation space S = [−A, A] × [−A, A]. (a) Partition regions are illustrated shaded at their corresponding nodes. (b) Every pruning of the introduced partitioning tree defines a unique partition of the observations space. The set of all possible such partitions are shown for the depth-2 tree.

(21)

Remark: Instead of the introduced combination, i.e., or mixture of experts, of KDEs for density estimation, one can perhaps think of applying one KDE for the whole observation space. However, we emphasize that a single KDE with a single time invariant, i.e., fixed, bandwidth is highly likely to be suboptimal due to the local variations in the targeted density as well as the continuous increase in the available data size in the stream. On the contrary, our approach sequentially learns the optimal time varying bandwidth for each local region of the support of the targeted density in close accordance with the data manifold. This is another novel contribution of our presented work regarding the well-known bandwidth selection issue in KDEs [16].

We use log-loss, i.e., negative log likelihood, as our pointwise loss function: l(ft(xt)) = − log(ft(xt)) since this loss successfully encodes the data likelihood.

Then, our goal of asymptotically performing at least as well as the best density estimator in the introduced class is equivalent to the minimization of the following regret R(T ), resulting in a regret-optimal final estimator:

R(T ) = T X t=1 − log(ft(xt))  − min Pi  T X t=1  − log( ˆfPi t (xt))  , (2.2)

where T is the length of the sequence xT = {x

i}Ti=1 and the regret is the

perfor-mance (in terms of the data likelihood) difference between our proposed estimator ft and the best class estimator. Here, ˆfPi is the piecewise constructed KDE for

the partition Pi. In the next chapter, we achieve the introduced goal by proving

that this regret is bounded by a term that is convergent to 0 after normalization with T .

We compare the estimated density ft(xt) with a time-varying threshold ς to

optimally decide (in terms of the detection accuracy) whether the observed data xt is anomalous as in (2.1). For this purpose, by using the logistic function

rt = log(1 + exp(−(ςt− ft(xt))dt), we update the threshold to maximize the

detection power, when the true label dt ∈ {−1, +1} is available. If the true label

is never available, then one can straightforwardly adjust the threshold to bound the false alarm rate in a similar technical manner. In this work, we consider the possibility of occasional availability and opt to not update when no label

(22)

is provided for an instance. Recall that the proposed algorithm is completely unsupervised and can certainly operate without requiring the label information.

(23)

Chapter 3

Novel Density Estimator

In this chapter, we propose a novel density estimator that is based on hierarchical partitioning of the observation space using a binary tree. Our approach is to sequentially train nonparametric kernel density estimators (KDE) [12] at each node of the introduced tree by those instances falling in the corresponding region, while separately optimizing (also sequentially) the bandwidth parameter [15] for each region. This process and thus the proposed algorithm is online and sequential with no storing. For each observation xtat time t, our algorithm computationally

efficiently and optimally combines the sequentially trained KDEs to obtain the final density estimation. Our optimal combination is also learned sequentially in a truly online manner, which is shown both mathematically and experimentally to yield a highly superior anomaly detection performance after thresholding.

We opt to use the nonparametric kernel density estimation (KDE) for estimat-ing the density of the data, since the shape of the underlyestimat-ing true distribution is unknown and complex in most of the real life applications [6]. However, the bandwidth selection is an important issue in KDEs [15] and the existing solutions often attempt to select a single time invariant bandwidth (via, e.g., leave-one-out cross validation [16]) for the whole observation space, which is certainly not the best strategy. Consider a dataset consisting of instances from a bimodal (two-component with different covariance structures) distribution. Then the optimal

(24)

KDE for each component would employ different bandwidths and hence using a single bandwidth for the combined data (mixture density) is suboptimal. In addi-tion, the bandwidth parameter should be time-varying and inversely proportional to the size of the available data. To address both of these issues, we partition the whole observation space into small regions, train a KDE with a time varying bandwidth for each region and combine our locally trained KDEs for a strong final estimation. The result is a mixture of local KDEs where the bandwidths are locally tuned to the curvatures of the local density and the data manifold with adaptive scaling with respect to the number of instances.

In the following, we start with explaining the conventional KDE technique that we use at each node of the introduced binary partitioning tree, i.e., in regions of the observation space. Note that this conventional kernel density estimation technique requires to calculate the pairwise kernel evaluations between the current instance xtand all of the previous instances {xi}t−1i=1, which obviously hinders the

online processing as another issue in addition to the bandwidth selection. To also solve this, we exploit a computationally efficient kernel expansion [31, 35] that projects the data onto a space of random Fourier features. As a result, by only keeping track of the mean of projection, we not only achieve an arbitrarily well approximation (almost exact) of the kernel evaluations but also drop the requirement of storing previous observations while also effectively addressing the bandwidth selection issue. Hence, the overall process in our framework is locally bandwidth optimized (in a time-varying manner w.r.t. the available data size), truly online and unsupervised end-to-end, and our algorithm successfully exploits the modeling power of kernels.

3.1

Bandwidth

Optimized

Piecewise

Online

Kernel Density Estimator (KDE)

The main building block of our anomaly detection algorithm is the online band-width optimized KDE employed in regions of the observation space, i.e., at each

(25)

node of our partitioning tree. In Section 3.1.1 and 3.1.2, our discussion about this building block of kernel density estimation is confined to a single region from the observation space. In Section 3.2, we introduce the combination of KDEs from all possible regions.

3.1.1

Conventional KDE

We use the KDE method at each node θk to estimate the local density in the

corresponding region of the observation space based on the instances falling in that region. Let x be any point in θk, i.e., x ∈ θk, at time t before observing xt

along with the past data {xi}t−1i=1= xt−1. Then, KDE is obtained as

ˆ fθk t (x; δ) = 1 τθk t−1δd X r:xr∈θk,1≤r≤t−1 K x − xr δ  , (3.1) where τθk t−1 = Pt−1

r=11{xr∈θk} is the number of instances falling in θk until time t − 1, δ is the bandwidth and K(·) is a kernel function [12]. In this work, we use the Gaussian kernel (or radial basis) function [31]

K(x) , 1 (2π)d2 e− x 2 2 . (3.2)

After inserting (3.2) into (3.1), we obtain ˆ fθk t (x; g) = 1 τθk t−1 X r:xr∈θk,1≤r≤t−1 k(x, xr; g), (3.3) where k(x, xr; g) , 1 Ng e−gkx−xrk2 (3.4) with g = 12 being the bandwidth parameter and Ng = (πg)

d

2 is the normalization constant.

(26)

3.1.2

Online KDE

The summation in (3.3) requires all of the pairwise kernel evaluations between the test instance x and all the previous instances, which is computationally pro-hibitively complex. Therefore, for an efficient kernel evaluation, we explicitly map the data points (x, xr) in the kernel evaluation into a high dimensional

ker-nel feature space with an explicit and randomized “compact” (cf. the remark below for compactness) mapping φ : Rd → Rm such that the kernel evaluation

can be arbitrarily well approximated as an inner product of the mapped points [31]

k(x, xr; g) ≈ φ(x; g)0φ(xr; g). (3.5)

In order to find such a mapping φ(·), we exploit the fact that Fourier transform of a symmetric shift invariant and continuous positive definite kernel represents a probability distribution [31], yielding

k(x, xr; g) = Z (w,b)∈(Rd×1×R)fw ,b(w, b; g)× g π d/2√ 2 cos(w0x + b)g π d/2√ 2 cos(w0xr+ b)dwdb =Ew,b  g π d/2√ 2 cos(w0x + b) g π d/2√ 2 cos(w0xr+ b)  (3.6) where fw,b(w, b; g) = (4gπ)− d 2(e)− w 2 4g × 1

2π1{0≤b≤2π} is a valid probability

den-sity function obtained as a result of the Fourier transform of the kernel being used together with an auxiliary uniform randomization over b [31]. Then, by approximating the expectation result for the kernel evaluation in (3.6) through a sample mean (due to law of large numbers), we obtain the final approximation as given in (3.5) with the mapping function φ(·)

φ(x; g) , r 2 m  g π d/2 cosp2gw01x + b1  , cosp2gw02x + b2  , ..., cosp2gw0mx + bm 0 , (3.7) where {(wi, bi)}mi=1 is a random sample from fw,b(w, b;12) and (wi, bi) are i.i.d.

(27)

(i.e. not only Gaussian) symmetric shift invariant and positive definite kernel by only using the correct randomization fw,b(w, b; 12) (which is the Fourier transform

of the kernel) and modifying φ(x; g) accordingly [31].

Using the introduced sample mean approximation, we can obtain the density estimation in (3.3) as ˆ fθk t (x; g) = 1 τθk t−1 X r:xr∈θk,1≤r≤t−1 k(x, xr; g) (3.8) ' 1 τθk t−1 X r:xr∈θk,1≤r≤t−1 φ(x; g)0φ(xr; g) (3.9) = t − 1 τθk t−1 φ(x; g)0φθk t (g). (3.10) where φθk t (g) , 1 t − 1 X r:xr∈θk,1≤r≤t−1 φ(xr; g),

which can be recursively implemented in an online manner through sequential updates (after normalization with the total probability mass in θk) as in (3.12)

or line 11 in Algorithm 1.

Remark: This approximate (but asymptotically exact as m → ∞) form in (3.10) for the density estimation in (3.3) elegantly removes the pairwise kernel evaluations, and allows us to evoke only a single dot product for the whole ex-pression. This leads to a super efficient algorithmic description (cf. Algorithm 1) and an online processing framework at the cost of the explicit and randomized mapping of each instance through φ(·). At this point, we emphasize that this mapping is “compact” and the addition of mapping cost is fairly low: the qual-ity of the approximation of the expectation through the sample mean in (3.10) improves at an exponential rate in m due to the Hoeffding’s inequality (and it is asymptotically exact). Namely, m can be chosen small, when compared to the conventional mappings to the high dimensional space; for instance, when compared to the mapping based on the Taylor expansion in the case of Gaus-sian kernel. Consequently, we obtain an online kernel density estimation at a negligible transformation cost.

(28)

3.1.3

Piecewise Online KDE

We point out that every pruning of our tree defines a partition of the observation space with regions attached to the leaves of the pruned tree. In addition, at each region, i.e., at each node of our tree, there is an online KDE as explained. Fol-lowing this observation, for a given partition P = {θ1, θ2, · · · , θN}, we combine

the online KDEs ˆfθk

t ’s at the partition regions θk’s, i.e., pieces, and form a

piece-wise online KDE ˆftP to model the density for the complete data. Recall that an online KDE is responsible for only those instances falling in its region, hence each online KDE must be normalized with the corresponding probability mass τθk

t−1 in

its region θk while combining them. As a result, for each partition P (there are

approximately (1.5)2D

many partitions that can be possibly defined over our tree via prunings), we define a piecewise online KDE ˆftP(x; g) as

ˆ ftP(x; g) = τ θk t t − 1 ˆ fθk t−1(x; g) = φ(x; g) 0 φθk t (g), (3.11)

which matches the estimation (after normalization) by the local online KDE in region θk3 x.

The updates for the piecewise online KDE are based on the recursions of the local online KDEs. Note that a local online KDE needs to be updated only when its region observes an instance. Hence, we keep the last update time τθ0

k and use it for the next coming observation in that region with the recursion

φθk t+1(g) = τθ0 kφ θk t (g) + φ(x; g) t . (3.12)

3.1.4

Bandwidth optimized piecewise online KDE

In addition, to best account for the possibly non-stationary and complex data distributions, our piecewise online KDE also learns the optimal bandwidths sep-arately for its constituent local estimators. For this purpose, we allow the band-widths for the constituent local estimators to be different from region to region,

(29)

Algorithm 1 Hierarchical Kernel Density Estimator (HKDE)

1: Set tree depth D, bandwidth set G and learning rate h 2: Initialize bandwidth priors αθk

0 (g) = 1 |G|

3: Draw m i.i.d. samples (w, b) ∼ fw,b(w, b,12): w ∈ Rd and b ∈ [0, 2π]

4: Evaluate φθ0k(g) = q 2 m g π d/2

[cos(b1), cos(b2), ..., cos(bm)]0 for all (k, g)

5: for t = 1, 2, . . . do

6: Calculate φ(xt; g), ∀g ∈ G

7: Calculate the loss lθk

t = − log τ0 θk t−1 P ∀gGα θk t (g)φ(xt; g)0φθtk(g)  , ∀k

8: Obtain and return the estimation ft(xt) =

PD k=0c θk t exp(−l θk t ) 9: Update τθ0 k = t, ∀k 10: Update Zθk t+1(g) = Z θk

t (g) exp(h log φ(xt; g)0φθtk(g)) for all (k, g)

11: Update φθk t+1(g) = τ0 θkφ θk t (g)+φ(xt;g) t for all (k, g) 12: Update αθk t+1(g) = Zt+1θk (g) P ∀gZt+1θk (g) for all (k, g) 13: Update cθk

t+1 for k = 0, ..., D, i.e., nodes for xt in each level

14: end for

and also time varying in line with the increasing data length. Hence, our tech-nique is tuned to local variations of the data manifold as well as the rate of increase in the data size of the stream even in a non-stationary environment. In contrast, conventional approaches use one global KDE with a fixed time invariant bandwidth which is certainly a strong limitation that we remove.

For this purpose, we use a finite set of bandwidths, and run our online lo-cal KDEs in parallel with all bandwidth values in the set. The computational complexity of this parallel running in our approach is only O(D), since the local KDEs of only those regions containing xt must be updated. In time, our

ap-proach identifies the optimal bandwidth g∗ in node θk, which has the highest log

likelihood (after scaling with h that is exactly the log-likelihood when h = 1) Zθk t (g ∗ ) = max g Y j:xtj∈θk,1≤tj<t exp(h log ˆfθk tj (xtj; g)).

Note that the introduced optimality can be improved to any desired degree by using a larger finite bandwidth set G = {g1, g2, . . . , g|G|}, from which the optimal

(30)

Our convergence result is in line with the weighted majority approach in the mixture of experts [33, 36], details of which is also explained in Section 3.2 in the context of combining piecewise online KDEs. Briefly, we assign time varying probabilities αθk

t (g) to each node on the tree for all bandwidth values in the

set G such that Pg|G|

g=g1α

θk

t (g) = 1, α θk

t ≥ 0, ∀t. Initially, all bandwidths at a

node θk have equal priors, i.e., αθ0k(g) = |G|1 , ∀g ∈ G. Then, we update these

probabilities as in line 10 and 12 in Algorithm 1 at each time t to converge to the best bandwidth option in the set in the sense of likelihood maximization by learning the combination parameters αθk

t (g) sequentially in time. Namely, the

final density estimation of the node θk is a combination of density estimations

evaluated with different bandwidth values as ˆ fθk t (x) = X ∀g∈G αθk t (g) ˆf θk t (x; g), (3.13) where ˆfθk

t (x; g) is the density estimation evaluated with bandwidth g as in (3.10)

and ˆfθk

t (x) is asymptotically tuned to the best g∗ since α θk

t (g∗) → 1 as t →

∞. Similarly, for a given partition P = {θ1, θ2, · · · , θN}, we obtain bandwidth

optimized piecewise online KDE as ˆ ftP(x) = X ∀g∈G αθk t (g) ˆf P t (x; g). (3.14)

Hence, we emphasize that, since αθk

t (g∗) → 1 as t → ∞, our approach

success-fully identifies the best bandwidth in region θk in a time varying manner as time

progresses. The presented convergence result here indicates that we essentially favor the KDEs with smoother bandwidths in the beginning of the stream and gradually switch to the ones with steeper bandwidths as the data overwhelm, which is specifically to handle the overfitting issue directly from the bandwidth perspective. This is separately true for all partition regions for a given partition, leading to a piecewise online KDE with bandwidths that are chosen optimal in space and time as desired.

(31)

3.2

Efficient Combination of Hierarchically

Or-ganized Bandwidth Optimized Piecewise

Online KDEs

We point out that one can produce a bandwidth optimized piecewise online KDE for every given partition of the observation space. On the other hand, it is possible to obtain a doubly exponential large number, i.e., (1.5)2D with D being

the depth, of many different partitions by all possible prunings of our partitioning tree. Hence, it is possible for one to approximate any desired partition with a tree-defined partition by using a relatively small D.

It is certainly critical to use the right partition in the case of anomaly detec-tion with density estimadetec-tion. The most important reasons are as follows. First, the right partition, for instance in the case of a mixture of Gaussian components, should obviously have its regions well-aligned in space with the mixture compo-nents. Second, for a given batch set of data, one should avoid partitions with unnecessarily large number of regions because the learning rate or the estimation quality degrades with complexity of the partition that is employed. The situa-tion is more challenging in our case of streaming data where the instances are received sequentially. In this case, the optimal partition is also not fixed but time varying. Simpler partitions with less number of regions are advantageous at the beginning of the stream, whereas one needs and can robustly use partitions with more complexities as more data become available. For these reasons, by exploit-ing the introduced large class of tree-defined partitions of size doubly exponential in depth, our goal is to produce a bandwidth optimized piecewise online KDE that is also partition optimized. Our algorithm produces a weighted combination of the bandwidth optimized piecewise online KDEs each of which corresponds to a specific class partition. The combination weights are obtained based on the performance, i.e., the data likelihood, of each partition. Hence, the weights are time-varying (since the individual partition performances are time-varying) and

(32)

our algorithm relies mostly on the best-performing partitions due to the weigth-ing. Hence, a better performing partition always receives a higher weight. Con-sequently, our proposed algorithm as a combination of the bandwidth-optimized piecewise online KDEs elegantly switches from less complex partitions to more complex ones in an optimal manner as the number of processed instances in-creases. In addition, we mathematically prove that our algorithm asymptotically picks the best partition (in the sense of data likelihood and by the fact that the best partition asymptotically gets the full weight), i.e., it asymptotically performs at least as well as the best bandwidth optimized piecewise online KDE from the introduced large class of tree-defined partitions.

To this end, we introduce a computationally highly efficient combination of all of the bandwidth optimized piecewise online KDEs of prunings of our partitioning tree. 1 The resulting final density estimator is given as

ft(x) = np X k=1 ˜ cPk t fˆ Pk t (x). (3.15) Here, np ≈ (1.5)2 D and ˜cPk

t is the weight of the partition Pk at time t as

˜ cPk t = 2−ρ(Pi)exp − hP u:xu∈θku,1≤u<t(l θku u )  ˜ C with normalization constant ˜C = P(1.5)2D

i=1 2 −ρ(Pi)exp − hP u:xu∈θk,1≤u<t(l θiu u ) where P u:xu∈θku,1≤u<t(l θku

u ) is the total loss of the partition Pk and θku is the corresponding node of partition Pk at time u. We define our loss for the node θk

as

lθk

t = − log( ˆf θk

t (x)), (3.16)

which is the point-wise log-likelihood based on the estimator at that node. Also, ρ(Pi) = ni+li−1 corresponds to the number of bits required to represent the

par-tition Piwhere ni is the total number of nodes in partition Pi and liis the number

of nodes with depth smaller than D in the partition. As discussed, the weights are obtained in proportional to the exponentiated dynamic performances of the 1Note that for notational simplicity, we represent the normalized bandwidth optimized

den-sity estimation of the node θkat time t as ˆftθk(x) in the rest of the paper by ˆf θk t (x) ← τt−1θk t−1fˆ θk t (x).

(33)

partitions (hence better partitions get higher weights) in a time varying manner with special regards to overfitting and ρ(Pi) measures the partition complexity

that is used to obtain an initial weighting, which is eventually overwhelmed by the data.

On the contrary, combining (1.5)2D partitions or running (1.5)2D many

band-width optimized piecewise online KDEs in parallel before combining them at each time is computationally intractable. However, there exists an efficient solution [34, 33] to combine them with computational complexity only O(D). Instead of combining the partition specific estimations as in (3.15), one can obtain the same exact final estimation result of (3.15) by only combining D +1 many node specific estimations as ft(x) = D X k=0 cθk t fˆ θk t (x) (3.17) where ˆfθk

t (x) is the density estimation of the corresponding node at the kth layer

of the tree, and cθk

t is the weight of the node θk at time t.

Remark: Recall that the density estimation of a partition is equal to the estimation of the corresponding node in that partition. Hence, although we have (1.5)2D

many different partitions, we only have D + 1 unique estimations since the instance travels through only D layers. Therefore, combining the partitions as in (3.15) exactly matches to combining the corresponding nodes at layers as in (3.17) with re-collected weights in (3.18). This allows us to solve the problem in a more efficient way since it requires combining only D + 1 density estimations.

The weight cθk

t of the node θk can be collected as [37]

cθk t = X Pk: ˆftPk(x)= ˆftθk(x) ˜ cPk t . (3.18) The weight cθk

t can also be recursively calculated by the evaluated loss or reward

of the nodes as explicitly shown in [34]. We directly provide the recursion here while referring the reader to [34] for the details of the proof of this recursion.

(34)

Then, one can obtain the weight of the node θk as cθk t = 2−(Dk+1)×QDk j=1Lθj(x t−1) Lθo(x t−1) × exp − h X u:xu∈θk,1≤u<t (lθk u ),

where θo corresponds to the root node, Dk is the layer of θk on the tree, θj is

the other child of the parent of θk at the jth layer of the tree and Lθj(x

t−1) is an

auxiliary variable2 that allows the recursive calculation [34] for any node θ

j using Lθj(x t−1 ) =                exp(−hP u:xu∈θj,1≤u<tl θj u ), (θj is leaf) 1 2Lθjl(x t−1)L θjr(x t−1) +12exp(−hP u:xu∈θj,1≤u<tl θj u ), (otherwise) (3.19)

with θjl and θjr being the left and right child nodes of the node θj, respectively.

We point out that this recursion is of computational complexity O(D).

Based on this recursion (3.19) (cf. line 13 in Algorithm 1) and the description in (3.17) (cf. line 7 and 8 in Algorithm 1), which is based on the bandwidth probability assignments (3.13) (cf. line 7, 10 and 12 in Algorithm 1), we obtain our computationally efficient algorithm presented in Algorithm 1 (HKDE).

The following theorem provides the bound for the regret of our final density estimator (2.2), which combines the density estimations of the partitions.

Theorem 1. When our KDE based hierarchical model is applied with Algorithm 1, the regret defined in (2.2) is upper bounded as

R(T ) = T X t=1 − log(ft(xt)) − min Pi nXT t=1 − log( ˆfPi t (xt)) o ≤ ρ(Pi) log(2) h ≤ (2D− 1) log(2) h , ∀i (3.20)

where Pi is the ith partition in the competition class of estimators, T is the length

of sequence xT = {x

i}Ti=1, h is the learning rate and ρ(Pi) = ni+ li− 1 measures

2This variable can be interpreted, cf. [34], as a universal probability assignment based on

the losses accumulated over the tree and any legitimate weighting of the prunings depending on the complexities of the resulting pruned trees, i.e., prior probabilities before observing the data.

(35)

the complexity of each partition by the number of bits required to represent that partition Pi. Here, ρ(Pi) is defined based on the total number ni of nodes in

the partition and also the number li of nodes with depth smaller than D in the

partition.

Proof of Theorem 1. The proof follows similar lines to the one in [33], with the difference that we use the negative log-loss of the point-wise likelihood instead of the squared error loss. Following the definition in (3.19), Lθo(x

T) for the root

node can be written as [34]

Lθo(x T) = (1.5)2D X i=1 2−ρ(Pi)exp(−h T X t=1 lθti t ),

where θti is the corresponding node of partition Pi at time t and l

θk t = − log( ˆfθk t (x)) as in (3.16). Then, we obtain Lθo(x T) ≥ 2−ρ(Pi)exp(−h T X t=1 lθti t ). (3.21)

Also, when we apply (3.19) recursively to the root node, we observe that Lθo(x t) Lθo(x t−1) = D X k=0 cθk t exp(h log ˆf θk t (xt)). Since PD k=0c θk

t = 1 and exp(h log(·)) is concave for the learning rate 0 ≤ h ≤ 1,

we obtain by Jensen’s inequality Lθo(x t) Lθo(xt−1) ≤ exp  h log D X k=0 cθk t fˆ θk t (xt)  , where ft(xt) = PDk=0cθtkfˆ θk

t (xt) is final density estimation (3.17). Since we have

Lθo(x T) = T Y t=1 Lθo(x t) Lθo(xt−1) ≤ T Y t=1

exp h log(ft(xt)) = exp

 h T X t=1 logft(xt))  ,

(36)

where Lθo(x 0) = 1, we write (3.21) as exp  h T X t=1 logft(xt))  ≥ Lθo(x T) ≥ 2−ρ(Pi)exp(−h T X t=1 lθtti). Then, T X t=1  − log(ft(xt))  ≤ log(2)ρ(Pi) h − (−hPT t=1l θti t ) h . Finally, we obtain T X t=1 − log(ft(xt))  − T X t=1  − log( ˆfPi t (xt))  ≤ ρ(Pi) log(2) h .

Since ρ(Pi) ≤ 2D− 1, we obtain the following regret bound

R(T ) ≤ (2

D− 1) log(2)

h ,

where h must be chosen in [0, 1] and it defines the learning rate.

Hence, the normalized regret R(T )T is convergent to 0 at a O(1/T ) rate. This directly shows that our density estimation algorithm is regret optimal, i.e., it asymptotically performs (in terms of the accumulated log-likelihood) as well as the piecewise online KDE that corresponds to the best tree-defined partition.

In the following, we describe our anomaly detection algorithm, which decides whether the observed data instance is anomalous based on the final density esti-mation in (3.17).

(37)

Chapter 4

Anomaly Detector

We compare the estimated probability density ft(xt) at each time t with a

thresh-old to decide if xt is anomalous. A data instance xt having a sufficiently low

probability density is considered anomalous with respect to the rule ˆ dt =    +1, ft(xt) < ςt (anomalous) −1, ft(xt) ≥ ςt (normal) . (4.1)

Based on this rule, we find the optimal threshold that achieves the best anomaly detection performance by updating its value in time. For this pur-pose, we use the logistic loss function of τt that is to be minimized [38]:

rt= log(1 + exp(−(ςt− ft(xt))dt).

Then, the threshold is updated by using the Online Gradient Descent (OGD) method [11], which learns the optimal parameter minimizing the desired loss, as in Algorithm 2. We calculate the gradient of the loss function with respect to ςt

as ∇ςtrt= −dt 

1 + exp (ςt− ft(xt))dt

−1

and the update follows as ςt+1 = PT(ςt− ηt∇ςtrt),

where ηt is the learning rate, PT(·) is the projection onto convex set T such that

PT(x) = arg min

y∈T

(38)

Algorithm 2 Anomaly Detector

1: Specify ˜η and initialize the learning rate η1 = 1η˜

2: Specify the threshold set T and initialize ς1

3: for t = 1, 2, . . . do

4: Calculate the learning rate ηt= ηt˜1

5: Compare ft(xt) with ςt and declare the decision ˆdt

6: Observe the actual dt and calculate ∇ςtrt

7: Update the threshold ςt+1 = PT(ςt− ηt∇ςtrt)

8: end for

Since the loss function rtis convex, our algorithm chooses to the optimal threshold

(39)

Chapter 5

Experiments

In this chapter, we extensively evaluate the performance of our anomaly de-tector on an artificial dataset of a mixture of two Gaussian components, and several other real datasets such as Occupancy [39], Breast Cancer Wisconsin [40], pen digits [40] and Susy [41]. Comparisons with one-class Support Vec-tor Machine (SVM) [23], incrementally trained SVM (we represent as “I-SVM” throughout this section) [42], K-nearest neighbors (K-NN) [24], K-D tree near-est neighbor search (K-D Tree) [43], K-Localized p-Value Estimation (K-LPE) [9], online oversampling Principal Component Analysis (online osPCA) [44] and Fourier Online Gradient Descent (FOGD) [32] are presented, which are commonly used for anomaly detection. We also include comparisons with anomaly detec-tion method based on KDE with local bandwidths where the bandwidth depends on the distance to the nearest neighbor of the observed point as in [45], and we represent this technique as “K-KDE”. Finally, we include the performance of our method without partitioning, which is represented as “Kernel Density Es-timation with Adaptive Bandwidth (KDE-AB)”. We use the receiver operating characteristics (ROC) and the Area Under Curve (AUC) to compare the anomaly detection performances of the algorithms. The ROC curves are obtained by plot-ting the rates of true detection versus false alarm while varying the corresponding

(40)

threshold1 in each tested algorithm. For this, we use the scores, i.e., values that

are thresholded to detect anomalies, assigned to the observed instances by the algorithms except K-NN and K-D Tree. For them, we consider the score as the proportion of the neighbors with positive class label. The parameters (ν and γ for the one-class SVM, I-SVM and K for the K-NN, K-D Tree and K-LPE) of the compared algorithms2 are optimized separately with 10-fold cross validation (within the training set), where we reserve the 75% of each dataset for training, and the remaining for the test. Since the other methods including our algorithm AD-HKDE does not need cross validation as they are online, we set their param-eters based on a small preliminary experiment. In Algorithm 1, the tree depth and the learning rate are fixed as D = 3 and h = 0.01, respectively, in all experi-ments except the Breast Cancer Wisconsin dataset for which the learning rate is increased to h = 0.75 since its size is significantly smaller. Since in general the higher-layer nodes on our partitioning tree observe exponentially smaller number of data instances compared to the lower-layer ones, we form the bandwidth set G = {g1, · · · , gk} with relatively small g values for the root node, and relatively

large g values for the leaf nodes. We use |G| = k = 4, and form the sets such that G =  × {20, 21, · · · , 2k−1}, where we increase  = L(l) (l denotes the layer and L = {0.01, 0.5, 1.5, 2}) from the root to the leaves. Note that, in general, opti-mal kernel bandwidth δ is proportional to N−15 (N : number of data points) [12] (g = 1δ is proportional to N15). Since our method works in an online framework, the number of data points changes from 1 to T in our case (T : total number of data instances). Therefore, in our experiments, we define bandwidth set such that its maximum is proportional to T15 based on our largest dataset which has 106 data points. Hence, the defined bandwidth set covers all datasets that we use

in our experiments.

Finally, the dimensionality is reduced to d = 3 for all datasets (except the artificial one which is already 2-dimensional) by using PCA for only partitioning purposes since we fix the tree depth as D = 3, otherwise the full dimension is 1For all experiments, SVM, I-SVM, K-LPE, K-NN and K-D Tree are trained with the

training sets, and the ROC/AUC results are obtained based on the test sets. Since the other methods including our method AD-HKDE does not need training, all results for them are based on all data points.

(41)

used in the remaining parts such as estimating the density. In Algorithm 2, the learning rate parameter is set as ˜η = 0.001 ∈ (0, 1) and the threshold set is T = {10−5, 2 × 10−5, · · · , 1}.

Throughout this chapter, “Anomaly Detector with Hierarchical Kernel Density Estimators (AD-HKDE)” represents our technique.
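To make the evaluation procedure explicit, the following Python sketch computes an ROC curve and its AUC from anomaly scores and ground-truth labels by sweeping the detection threshold; it is a generic illustration of the procedure described above rather than the exact evaluation code, and it assumes that higher scores indicate anomalies (for density-based methods, the negated density estimate can serve as the score).

import numpy as np

def roc_curve_and_auc(scores, labels):
    # scores: anomaly scores, higher means more anomalous;
    # labels: 1 for anomalous, 0 for normal (both classes must be present).
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)            # true positives as the threshold is lowered
    fp = np.cumsum(1 - y)        # false positives as the threshold is lowered
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    auc = np.trapz(tpr, fpr)     # area under the ROC curve
    return fpr, tpr, auc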

5.1 Artificial Dataset

In the first part of our experiments, we generate a 2-dimensional dataset of length T = 2000, whose probability density is a mixture of two Gaussian components,

$$
f \sim \frac{1}{2}\,\mathcal{N}\!\left(\begin{bmatrix}-1\\-1\end{bmatrix},\ \begin{bmatrix}1 & 0\\0 & 4\end{bmatrix}\right) + \frac{1}{2}\,\mathcal{N}\!\left(\begin{bmatrix}1\\1\end{bmatrix},\ \begin{bmatrix}4 & 0\\0 & 8\end{bmatrix}\right),
$$

as the normal data. Our algorithm does not need anomalous data to work; however, we replace the normal data at 400 randomly picked instances with uniformly distributed anomalous observations to test the anomaly detection performance.
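A minimal Python sketch of this data generation is given below; the support [-10, 10]^2 chosen for the uniform anomalies and the random seed are assumptions for illustration, as the exact values are not specified above.

import numpy as np

rng = np.random.default_rng(0)
T = 2000

# Normal data: equal-probability mixture of the two Gaussian components above.
means = np.array([[-1.0, -1.0], [1.0, 1.0]])
covs = np.array([[[1.0, 0.0], [0.0, 4.0]],
                 [[4.0, 0.0], [0.0, 8.0]]])
comp = rng.integers(0, 2, size=T)
data = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])

# Replace 400 randomly picked instances with uniformly distributed anomalies;
# the uniform support [-10, 10]^2 is an illustrative assumption.
labels = np.zeros(T, dtype=int)
anomalous = rng.choice(T, size=400, replace=False)
data[anomalous] = rng.uniform(-10.0, 10.0, size=(400, 2))
labels[anomalous] = 1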

Figure 5.1: The ROC performance of the compared algorithms on a mixture of two Gaussian components. (AUC values shown in the figure: AD-HKDE 0.9201, SVM 0.9040, K-LPE 0.9058, KDE-AB 0.9076, I-SVM 0.9027, K-NN 0.8957, K-D Tree 0.8957, FOGD 0.8778, Online osPCA 0.8665, K-KDE 0.8590.)

As can be seen in Fig. 5.1, our algorithm AD-HKDE outperforms all the others in terms of the AUC score, while being largely superior in both the relatively small and the relatively high false alarm rate regions. To understand the reason, note that the data consists of two Gaussian components, where the second component has 4 and 2 times larger variance in its two dimensions compared to the first component. This particularly degrades the methods K-NN, K-D Tree and K-LPE: in the regions of the higher-variance component, one has to base the estimations on wider neighborhoods (compared to the other component) in order to observe the same number k of instances, which degrades their estimation since the underlying density is far from uniform over such wide neighborhoods. K-KDE suffers from the placement of the data points in the observation space due to its point-wise bandwidth; since its kernel bandwidth is directly proportional to the distance between the observed instance and its nearest neighbor, it is not able to learn the optimal bandwidth and estimate the density correctly. Online osPCA uses only the first principal direction, and hence it loses the information carried by the second component, which decreases its performance significantly. FOGD, on the other hand, suffers from its fixed kernel bandwidth, which does not allow the algorithm to learn the local variations. For the same reason, our algorithm AD-HKDE outperforms SVM and I-SVM, since they fit symmetric rbf kernels with equal bandwidths at each instance for each component, regardless of the component covariances and the time-varying data size. In contrast, our algorithm exploits the local structure and the curvature of the density and learns the optimal bandwidths for each component, even in a time-varying manner. Therefore, our algorithm AD-HKDE also outperforms its non-partitioned version, i.e., KDE-AB, which learns the bandwidth based on the whole observation space.

We emphasize that the algorithms SVM, incrementally trained SVM, K-NN, K-D Tree and K-LPE, which have separate training and test phases, are specifically chosen in our experiments due to their competitive performance and comparable model-free working environment; however, their computational complexity is significantly larger than that of our online detector. For example, SVM has complexity O(T_tr^3) in training and O(T_tr T_test) in the test phase (both in the worst case). Incrementally trained SVM aims to reduce the computational complexity of SVM by dividing the training data into N subsets, which yields complexity O(T_tr^3 / N^2) in training and O(T_tr T_test / N) in the test phase. The complexity of K-LPE is approximately O(T_tr^2) in training and O(T_tr T_test) in the test phase. The complexity of K-NN is O(T_tr T_test), which is prohibitively costly when the number of data points is high. K-D Tree decreases the computational complexity of K-NN by introducing a tree structure; this method has complexity O(T_tr log(T_tr)) in training and O(T_test log(T_tr)) in the test phase, which is still high for a large number of data points. On the other hand, the complexity of our detector is only O(DTk) (T: data length, D: depth of the partitioning tree, k: number of g values in the bandwidth set). Moreover, our algorithm is truly online without separate training, test or cross validation phases, where updates with new data are straightforward; in contrast, such incremental updates in the competing methods are computationally demanding.
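The low per-step cost stems from the sequential kernel evaluations. As a rough illustration (not the exact AD-HKDE recursion), the following Python sketch maintains a Gaussian KDE online through a random Fourier feature map, so that each new observation is absorbed with a fixed-size update and past data never needs to be revisited; the feature dimension m and the stand-in data stream are illustrative choices.

import numpy as np

def make_rff(d, m, g, rng):
    # Random Fourier features for the kernel k(x, y) = exp(-g * ||x - y||^2);
    # the corresponding spectral density is N(0, 2g I).
    W = rng.normal(0.0, np.sqrt(2.0 * g), size=(m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

rng = np.random.default_rng(0)
d, m, g = 2, 200, 0.08
phi = make_rff(d, m, g, rng)
c = (g / np.pi) ** (d / 2)            # normalization of the Gaussian kernel as a density
s, t = np.zeros(m), 0                 # running sum of feature maps and sample count
densities = []

for x in rng.normal(size=(1000, d)):  # stand-in data stream
    if t > 0:
        densities.append(c * max(phi(x) @ s, 0.0) / t)   # estimate before seeing x
    t += 1
    s += phi(x)                        # O(m) update, independent of the stream length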

Figure 5.2: The ROC performance of the compared algorithms on the Occupancy dataset. (AUC values shown in the figure: AD-HKDE 0.9907, K-NN 0.9854, K-D Tree 0.9854, SVM 0.9828, I-SVM 0.9750, K-LPE 0.9708, KDE-AB 0.9531, FOGD 0.9490, K-KDE 0.9368, Online osPCA 0.9292.)

Figure 5.3: For the Occupancy dataset, kernel bandwidth probability variations of our algorithm in time (the curves show the probabilities of the bandwidths g = 0.01, 0.02, 0.04 and 0.08 at the root node).

5.2 Real Datasets

In this part, we present the ROC performances of the compared algorithms on several real datasets that are widely used in the anomaly detection literature. We first perform an experiment on the Occupancy dataset [39], which consists of five features: temperature, relative humidity, light, CO2 and humidity ratio. This dataset has 10808 data points (8107 normal, 2701 anomalous), where the labels correspond to occupied (normal) and unoccupied (anomalous) room statuses.

As can be seen in Fig. 5.2, our technique AD-HKDE again achieves the highest AUC score, which indicates that the intrinsic covariance structure within this dataset heavily requires local as well as temporal bandwidth adaptation; hence, the superiority of our algorithm follows. In all of our experiments, including the Occupancy dataset and the others, our algorithm AD-HKDE is generally highly superior in the relatively small false alarm regions (except in a few cases).

Figure 5.4: For the Occupancy dataset, ROC performances of our algorithm and its static bandwidth version with fixed bandwidths. (AUC values shown in the figure: AD-HKDE 0.9907, g = 0.01: 0.9168, g = 0.02: 0.8626, g = 0.04: 0.8190, g = 0.08: 0.7460.)

A small false alarm rate requires the estimation of low densities, which is in turn largely susceptible to insufficient data in the corresponding low-density regions. This necessitates adjusting the bandwidths or the neighborhood sizes intelligently in those regions of the support. Our algorithm elegantly addresses this need, which underlies its superiority.

In addition, we plot the temporal variations of the bandwidth probabilities/weights of our algorithm in Fig. 5.3 to show the bandwidth adaptation power of our method. As an example, we specify the bandwidth set as G = {0.01, 0.02, 0.04, 0.08} for the root node θ_r and apply our algorithm on the Occupancy dataset. Since there are four possible bandwidths, the initial probabilities are α_0^{θ_k}(g) = 1/4, ∀g ∈ G. Then, as can be seen in Fig. 5.3, our algorithm asymptotically chooses g = 0.08, i.e., the smallest bandwidth, since the probability of the bandwidth g = 0.08 converges to one in time while the others converge to zero. (As we define the bandwidth parameter as g = 1/(2δ^2), the largest value g = 0.08 corresponds to the smallest actual bandwidth δ in the set.)

Figure 5.5: The ROC performances of the algorithms for the Breast Cancer Wisconsin dataset. (AUC values shown in the figure: K-NN 0.9778, K-D Tree 0.9778, AD-HKDE 0.9672, FOGD 0.9555, SVM 0.9483, I-SVM 0.9480, K-LPE 0.9408, KDE-AB 0.9083, K-KDE 0.8363, Online osPCA 0.7248.)

This shows the efficacy of our algorithm since KDEs generally achieve higher performance with small bandwidths when the number of instances increases.
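A minimal Python sketch of one way to maintain such bandwidth probabilities is given below; it uses a generic likelihood-weighted mixture update purely for illustration, so the exact recursion of Algorithm 1 may differ, and the toy likelihood values are invented only to show how the probability mass can concentrate on one bandwidth over time.

import numpy as np

G = np.array([0.01, 0.02, 0.04, 0.08])   # candidate bandwidth parameters g
alpha = np.full(len(G), 1.0 / len(G))    # initial probabilities, 1/4 each

def update_weights(alpha, likelihoods, eps=1e-12):
    # Scale each weight by how well its bandwidth explained the newest point,
    # then renormalize so the weights remain a probability vector.
    alpha = alpha * (likelihoods + eps)
    return alpha / alpha.sum()

rng = np.random.default_rng(0)
for _ in range(200):
    # Toy likelihoods: the last bandwidth explains the data slightly better on average.
    likelihoods = np.array([0.10, 0.11, 0.12, 0.13]) * rng.uniform(0.9, 1.1, size=4)
    alpha = update_weights(alpha, likelihoods)
print(alpha)   # the probability mass concentrates on the last entry, g = 0.08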

In Fig. 5.4, we provide the ROC of our algorithm and its static bandwidth version with fixed bandwidths, i.e., with singleton G's at the nodes. The number of data instances is low initially and then increases, since the algorithms work in the online framework. Also, the Occupancy dataset has a relatively large intrinsic covariance. Therefore, the static version with the lower g is observed to have the best performance only among the static versions with fixed bandwidths. In contrast, our algorithm outperforms all of the static versions and has overall the best performance with the highest AUC, since it adaptively changes its bandwidths in time in accordance with the size of the available data.

We also apply the algorithms on the Breast Cancer Wisconsin dataset [40]. This dataset has 30 features and 569 instances (357 normal, 212 anomalous). As shown in Fig. 5.5, K-NN and K-D Tree are the best performing ones in this case (note that K-D Tree is K-NN with a lower computational complexity), and our method AD-HKDE is the second best (the other methods are strongly outperformed). The reason is that the data size is not large enough to perfectly learn the bandwidths in all regions separately across time. Namely, our method requires learning more parameters (compared to K-NN and K-D Tree) and hence demands more instances. Even when the data size is not sufficient, as in this case, the difference is quite small, as demonstrated in Fig. 5.5. Also, note that our method works sequentially in the truly online setting with no separate training or test phases or even cross validation, while K-NN and K-D Tree need training data, making them impractical in our processing setting. In our targeted data streaming applications, the data size is typically not an issue.

Figure 5.6: The ROC performances of the algorithms for the Pen Digits dataset. (AUC values shown in the figure: AD-HKDE 0.9913, K-LPE 0.9902, K-NN 0.9858, K-D Tree 0.9858, KDE-AB 0.9769, FOGD 0.9503, SVM 0.9362, I-SVM 0.9334, K-KDE 0.9145, Online osPCA 0.8951.)

Moreover, we perform an experiment on the Pen Digits dataset [40], which is composed of handwritten digits from 0 to 9. This dataset has 6870 instances of dimensionality 16, and the 0 digits are considered anomalous (156 instances in total). The rest of the data includes the digits from 1 to 9 in equal numbers, and they are considered normal. This dataset consists of 10 digits, which can also be considered as a mixture of 10 distributions. Since our technique AD-HKDE is able to localize these components piecewise in its density estimation, even with adaptive bandwidths, we obtain a decent performance on this dataset with a better AUC than all the others, as can be seen in Fig. 5.6.

Figure 5.7: The ROC performances of the algorithms for the Susy dataset. (AUC values shown in the figure: AD-HKDE 0.9078, K-NN 0.8569, K-D Tree 0.8569, FOGD 0.8474, K-LPE 0.7951, K-KDE 0.6997, KDE-AB 0.6928, SVM 0.6821, I-SVM 0.6343, Online osPCA 0.5747.)

Finally, we perform an experiment on the Susy dataset [41], which has 18 features and 10^6 instances (54 × 10^4 normal, 46 × 10^4 anomalous). As shown in Fig. 5.7, our algorithm AD-HKDE has the highest AUC score among all methods. In this case, we also clearly observe the advantage of partitioning, since our method AD-HKDE significantly outperforms the other kernel-based methods, where FOGD uses a fixed bandwidth and K-KDE uses a point-wise variable bandwidth based on the nearest neighbor distance. The large number of data points allows our algorithm AD-HKDE to perfectly learn the local structure of the dataset and, hence, the optimal partitioning with the corresponding bandwidths.
