Week 02 - Multivariate Regression

BGM 565 - Machine Learning Methods for Cyber Security
Information Security Engineering, Master's Program

Dr. Ferhat Özgür Çatak, ozgur.catak@tubitak.gov.tr

İstanbul Şehir Üniversitesi, Spring 2018

Contents

1 Linear Regression
  Multivariate Linear Regression
  Normal Equations

2 Regularized Linear Regression
  Introduction
  Regularized Gradient Descent
  Regularization Approaches
  Python

3 Logistic Regression
  Logistic Regression
  Cost Function
  Python

4 KDDCUP'99 Dataset


Multivariate Linear Regression I

Univariate linear regression:

$$h(x) = w_0 + w_1 x \tag{1}$$

Multivariate linear regression:

$$h(x) = w_0 + w_1 x_1 + \cdots + w_n x_n$$

$$y = w_0 + w_1 x_1 + \cdots + w_n x_n \tag{2}$$

- $y$: dependent variable
- $x_1, \cdots, x_n$: independent variables
- $w_0$: constant
- $w_1, \cdots, w_n$: coefficients

Multivariate Linear Regression II

New variable: x0

x0    x1    x2    x3    x4    y
1.00  0.54  0.17  0.93  0.58  3.74
1.00  0.85  0.35  0.84  0.45  4.55
1.00  0.97  0.74  0.44  0.30  5.24
1.00  0.62  0.68  0.67  0.98  5.92
1.00  0.59  0.88  0.09  0.89  5.75
1.00  0.66  0.83  0.92  0.82  6.43
1.00  0.64  0.04  0.82  0.84  3.91
1.00  0.85  0.83  0.95  0.07  5.31
1.00  0.74  0.16  0.71  0.57  3.89
1.00  0.32  0.33  0.13  0.59  3.02

Old hypothesis: $h(x) = w_0 + w_1 x$

New hypothesis: $h(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$

Example: $h(x) = 0.2 + 2 x_1 + 0.1 x_2 + 0.5 x_3 + 1.2 x_4$

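Multivariate Linear Regression III

Hypothesis, with $x_0 = 1$:

$$h(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

$$w^T x = \begin{bmatrix} w_0 & w_1 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \tag{3}$$

$$h(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T x \tag{4}$$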
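The inner-product form of the hypothesis maps directly onto a vector dot product. The following is a minimal sketch in NumPy, assuming the example weights and the first data row from the earlier slide; the variable names `w`, `x`, and `h` are chosen here for illustration only.

```python
import numpy as np

# Example weights from the slide: h(x) = 0.2 + 2*x1 + 0.1*x2 + 0.5*x3 + 1.2*x4
w = np.array([0.2, 2.0, 0.1, 0.5, 1.2])

# First row of the example table, with the constant x0 = 1 in front
x = np.array([1.0, 0.54, 0.17, 0.93, 0.58])

# h(x) = w^T x
h = w @ x
print(h)
```

Because $x_0 = 1$, the intercept $w_0$ needs no special handling: it is just one more term of the dot product.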

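Multivariate Linear Regression IV

Hypothesis: $h(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$

Cost function:

$$C(w_0, w_1, \cdots, w_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$$

Repeat {
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} C(w_0, w_1, \cdots, w_n)$$
}

New algorithm (gradient descent):

$$w_0 = w_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$w_1 = w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_1^{(i)}$$

$$w_2 = w_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_2^{(i)}$$

$$\cdots$$

$$w_n = w_n - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_n^{(i)}$$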

Multivariate Linear Regression V

General notation:

repeat {
$$w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
}
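The general update rule above can be sketched as a short loop over the weights. This is an illustrative implementation, not the course's reference code: the function name `gradient_descent_step`, the toy data, the learning rate, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def gradient_descent_step(X, y, w, alpha):
    """One simultaneous update of every w_j with the general rule."""
    m, n_weights = X.shape
    errors = X @ w - y                      # h(x^(i)) - y^(i) for every example i
    new_w = w.copy()
    for j in range(n_weights):
        grad_j = np.sum(errors * X[:, j]) / m
        new_w[j] = w[j] - alpha * grad_j
    return new_w

# Hypothetical toy data generated from y = 1 + 2*x1 (first column is x0 = 1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(2)
for _ in range(5000):
    w = gradient_descent_step(X, y, w, alpha=0.1)
print(w)  # approaches [1, 2]
```

All errors are computed from the old weights before any weight changes, which is what makes the update simultaneous.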

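Gradient Descent - Matrix Notation

Gradient descent rule:

$$w = w - \alpha \nabla C(w) \tag{5}$$

$\nabla C(w)$ can be written as a column vector:

$$\nabla C(w) = \begin{bmatrix} \frac{\partial C(w)}{\partial w_0} \\ \frac{\partial C(w)}{\partial w_1} \\ \vdots \\ \frac{\partial C(w)}{\partial w_n} \end{bmatrix}$$

$$\frac{\partial C(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \cdot \left( h(x^{(i)}) - y^{(i)} \right) = \frac{1}{m} x_j^T (Xw - y) \tag{6}$$

Here $x_j$ denotes the $j$-th column of $X$. Stacking the components gives

$$\nabla C(w) = \frac{1}{m} X^T (Xw - y) \tag{7}$$

Matrix form of the gradient descent rule:

$$w = w - \frac{\alpha}{m} X^T (Xw - y) \tag{8}$$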
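As a quick check on the matrix form, the sketch below computes the gradient both as $\frac{1}{m} X^T (Xw - y)$ and one partial derivative at a time, then takes a single descent step. The random data, the seed, and the function names are assumptions made for this illustration.

```python
import numpy as np

def gradient(X, y, w):
    """Matrix form of the gradient: (1/m) X^T (Xw - y)."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m

def gradient_componentwise(X, y, w):
    """The same gradient, computed one partial derivative at a time."""
    m = X.shape[0]
    errors = X @ w - y
    return np.array([np.sum(errors * X[:, j]) / m for j in range(X.shape[1])])

# Hypothetical random data; the first column is the constant x0 = 1
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.random((20, 3))])
y = rng.random(20)
w = rng.random(4)

print(np.allclose(gradient(X, y, w), gradient_componentwise(X, y, w)))

# One step of the matrix update rule w = w - (alpha/m) X^T (Xw - y)
alpha = 0.1
w_next = w - alpha * gradient(X, y, w)
```

The vectorized version avoids the explicit loop over $j$, which usually matters once the number of columns grows.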

Normal Equation I

[Figure: Gradient Descent]

Normal equations

- Gradient descent is one of the ways to minimize $C$.
- Alternative, non-iterative method: the normal equation

$$w = \left( X^T X \right)^{-1} X^T y \tag{9}$$

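Normal Equation II

Deriving the normal equation for linear regression

Hypothesis function: $h(x) = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n$

Cost function: $C(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$

In matrix form the hypothesis is $h(x) = w^T x$ and the cost is

$$\begin{aligned} C(w) &= \frac{1}{2m} (Xw - y)^T (Xw - y) \\ &= \frac{1}{2m} \left( (Xw)^T - y^T \right) (Xw - y) \\ &= \frac{1}{2m} \left[ (Xw)^T Xw - (Xw)^T y - y^T (Xw) + y^T y \right] \\ &= \frac{1}{2m} \left[ w^T X^T X w - 2 (Xw)^T y + y^T y \right] \end{aligned} \tag{10}$$

The last step uses the fact that $(Xw)^T y$ and $y^T (Xw)$ are the same scalar. Setting the derivative to zero:

$$\frac{\partial C}{\partial w} = \frac{1}{2m} \left( 2 X^T X w - 2 X^T y \right) = 0 \quad \Rightarrow \quad X^T X w = X^T y \tag{11}$$

Multiplying both sides by $(X^T X)^{-1}$ gives

$$w = (X^T X)^{-1} X^T y \tag{12}$$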
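The derivation can be checked numerically: at the solution of $X^T X w = X^T y$ the gradient should vanish. The sketch below uses hypothetical random data and solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe.

```python
import numpy as np

# Hypothetical random data; the first column is the constant x0 = 1
rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.random((30, 2))])
y = rng.random(30)

# Solve X^T X w = X^T y directly
w = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution, the gradient (1/m) X^T (Xw - y) should be numerically zero
grad = X.T @ (X @ w - y) / X.shape[0]
print(np.allclose(grad, 0.0))

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```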

Normal Equation III

x1    x2    x3    x4    y
0.54  0.17  0.93  0.58  3.74
0.85  0.35  0.84  0.45  4.55
0.97  0.74  0.44  0.30  5.24
0.62  0.68  0.67  0.98  5.92
0.59  0.88  0.09  0.89  5.75
0.66  0.83  0.92  0.82  6.43
0.64  0.04  0.82  0.84  3.91
0.85  0.83  0.95  0.07  5.31
0.74  0.16  0.71  0.57  3.89
0.32  0.33  0.13  0.59  3.02

$$X = \begin{bmatrix} 1.00 & 0.54 & 0.17 & 0.93 & 0.58 \\ 1.00 & 0.85 & 0.35 & 0.84 & 0.45 \\ 1.00 & 0.97 & 0.74 & 0.44 & 0.30 \\ 1.00 & 0.62 & 0.68 & 0.67 & 0.98 \\ 1.00 & 0.59 & 0.88 & 0.09 & 0.89 \\ 1.00 & 0.66 & 0.83 & 0.92 & 0.82 \\ 1.00 & 0.64 & 0.04 & 0.82 & 0.84 \\ 1.00 & 0.85 & 0.83 & 0.95 & 0.07 \\ 1.00 & 0.74 & 0.16 & 0.71 & 0.57 \\ 1.00 & 0.32 & 0.33 & 0.13 & 0.59 \end{bmatrix} \in \mathbb{R}^{10 \times 5}$$

$$y = \begin{bmatrix} 3.74 & 4.55 & 5.24 & 5.92 & 5.75 & 6.43 & 3.91 & 5.31 & 3.89 & 3.02 \end{bmatrix}^T \in \mathbb{R}^{10}$$

$$w = (X^T X)^{-1} X^T y \Rightarrow w = [0.12490622, 1.9516536, 2.98882317, 0.97638019, 1.96358802]$$

$$h(x) = 0.12490622 + 1.9516536 \cdot x_1 + 2.98882317 \cdot x_2 + 0.97638019 \cdot x_3 + 1.96358802 \cdot x_4 \tag{13}$$

Normal Equation IV

Gradient Descent
- A value for $\alpha$ must be chosen
- The number of iterations is fairly large
- Well suited to high-dimensional datasets (large number of columns)

Normal Equation
- No parameter (such as $\alpha$) to choose
- No iterations
- Not suited to high-dimensional datasets: computing $(X^T X)^{-1}$ has complexity $O(n^3)$

Normal Equation V

import numpy as np
import pandas as pd

# Read the dataset (tab-separated) and prepend the constant column x0 = 1
verikumesi = pd.read_csv("ds2.txt", delimiter="\t")
verikumesi.insert(loc=0, column='x0', value=1)

# Every column except the last is a feature; the last column is the target y
X = verikumesi.iloc[:, :-1].values
y = verikumesi.iloc[:, -1].values

# Normal equation: w = (X^T X)^-1 X^T y
tmp = np.linalg.inv(np.matmul(X.T, X))
w = np.dot(np.matmul(tmp, X.T), y)
print(w)
# [2.06239085 2.99213354 0.98455834 2.02928992]

# Predictions for the training examples, shown next to the true values
y_pred = np.matmul(X, w)
df = pd.DataFrame({"y": y, "y_pred": y_pred})
print(df)


Regularized Linear Regression I

Regularization

- Used to reduce model complexity.
- Used to address the overfitting (memorization) problem.
- Overfitting: the model fits the training set very well but gives wrong results on new examples.
- Solution: reduce the effect of some of the weights used by the hypothesis.
- $h(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$. For example, if we want to reduce the effect of $w_2$ and $w_3$ (push them toward 0), a component can be added to $C$.

Regularization

Sum of squared errors + λ × model complexity penalty

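Regularized Linear Regression II

$$C(w_0, w_1, \cdots, w_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \tag{14}$$

$$C(w_0, w_1, \cdots, w_n) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \right] \tag{15}$$

$\lambda$: regularization parameter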
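A minimal sketch of the regularized cost follows, assuming the penalty sum starts at $j = 1$ so that the intercept $w_0$ is left unpenalized. The function name `regularized_cost` and the tiny dataset are hypothetical.

```python
import numpy as np

def regularized_cost(X, y, w, lam):
    """C(w) = 1/(2m) * [ sum of squared errors + lam * sum_{j>=1} w_j^2 ]."""
    m = X.shape[0]
    errors = X @ w - y
    penalty = lam * np.sum(w[1:] ** 2)   # w_0 is not penalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)

# Hypothetical data that w = [1, 1] fits exactly, so the squared errors are zero
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([2.0, 3.0])
w = np.array([1.0, 1.0])

cost_plain = regularized_cost(X, y, w, lam=0.0)
cost_penalized = regularized_cost(X, y, w, lam=4.0)
print(cost_plain, cost_penalized)
```

With a perfect fit, any remaining cost comes from the penalty term alone, which makes the effect of λ easy to see.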

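Regularized Gradient Descent

repeat {

$$w_0 = w_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$w_j = w_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} w_j \right], \quad j = 1, \cdots, n \tag{16}$$

}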
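The regularized update can be sketched in vectorized form as below. This is an illustration under assumptions: the toy data, learning rate, λ values, and the helper names `regularized_step` and `fit` are all made up for the example, and $w_0$ is excluded from the penalty.

```python
import numpy as np

def regularized_step(X, y, w, alpha, lam):
    """One regularized gradient descent update; w_0 is not penalized."""
    m = X.shape[0]
    grad = X.T @ (X @ w - y) / m
    reg = (lam / m) * w
    reg[0] = 0.0
    return w - alpha * (grad + reg)

def fit(X, y, lam, alpha=0.1, steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = regularized_step(X, y, w, alpha, lam)
    return w

# Hypothetical toy data generated from y = 1 + 2*x1 (first column is x0 = 1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_plain = fit(X, y, lam=0.0)
w_ridge = fit(X, y, lam=5.0)
print(w_plain, w_ridge)
```

With λ > 0 the slope is pulled toward zero, and the unpenalized intercept shifts to compensate.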

Regularization Approaches I

- L2 regularization (Ridge)

$$C(w_0, w_1, \cdots, w_n) = \frac{1}{2m} \left( \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 + \lambda \|w\|_2^2 \right)$$

- L1 regularization (Lasso)

$$C(w_0, w_1, \cdots, w_n) = \frac{1}{2m} \left( \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 + \lambda \|w\|_1 \right)$$

Norm: distance from the origin

- Absolute-value norm: $\|w\|_1 = \sum_{i=1}^{N} |w_i|$
- Euclidean norm: $\|w\|_2 = \sqrt{\sum_{i=1}^{N} |w_i|^2}$
- General vector norm: $\|w\|_p = \left( \sum_{i=1}^{N} |w_i|^p \right)^{1/p}$
- Example: $w = [-2, 3, -1] \Rightarrow \|w\|_2 = 3.7417, \; \|w\|_1 = 6$
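The norms for the example vector can be reproduced with NumPy. `np.linalg.norm` is NumPy's norm routine; the helper `p_norm` is a hypothetical function written here only to mirror the general vector norm formula.

```python
import numpy as np

def p_norm(v, p):
    """General vector norm: (sum_i |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

w = np.array([-2.0, 3.0, -1.0])

l1 = np.linalg.norm(w, ord=1)   # sum of absolute values
l2 = np.linalg.norm(w, ord=2)   # Euclidean length
print(l1, round(l2, 4))         # 6.0 3.7417
print(p_norm(w, 1), p_norm(w, 2))
```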

Regularization Approaches II

L1 vs L2

- Example model 1: $y = 1 \times x_1 + 1 \times x_2$. In this case:
  - L1 = $(1 + 1) \times \lambda = 2 \times \lambda$
  - L2 = $(1^2 + 1^2) \times \lambda = 2 \times \lambda$
- Example model 2: $y = 2 \times x_1 + 0 \times x_2$. In this case:
  - L1 = $(2 + 0) \times \lambda = 2 \times \lambda$
  - L2 = $(2^2 + 0^2) \times \lambda = 4 \times \lambda$
- When L1 regularization is applied, more of the feature coefficients become exactly 0.
- The L1 penalty is therefore better suited to feature selection on a dataset (sparse solution).
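The two example models can be compared in a few lines. The helper names `l1_penalty` and `l2_penalty` and the choice λ = 1 are assumptions for illustration; the weight vectors are the ones from the slide.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

lam = 1.0
model_1 = np.array([1.0, 1.0])   # y = 1*x1 + 1*x2
model_2 = np.array([2.0, 0.0])   # y = 2*x1 + 0*x2

for name, w in [("model 1", model_1), ("model 2", model_2)]:
    print(name, "L1:", l1_penalty(w, lam), "L2:", l2_penalty(w, lam))
```

L1 charges both models the same amount, so it does not discourage putting all the weight on one feature and zeroing the other. L2 charges the concentrated model more, so it favors spreading weight across features.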

Python I

Scikit-learn Regression

class sklearn.linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, warm_start=False, average=False, n_iter=None)

- loss: squared loss
- penalty='l2'
- alpha (regularization term)
- max_iter
- tol
- learning_rate

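Python II

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor

# Read the dataset (tab-separated)
verikumesi = pd.read_csv("ds2.txt", delimiter="\t")

# Every column except the last is a feature; the last column is the target y
X = verikumesi.iloc[:, :-1].values
y = verikumesi.iloc[:, -1].values

# Define the model without regularization
# (older scikit-learn releases spell this penalty='none')
clf = SGDRegressor(penalty=None, verbose=1, max_iter=100000)

# Train the model
clf.fit(X, y)
print(clf.intercept_, clf.coef_)
# Grad. Dc.: [ 0.14157558 1.91993045 2.99348001 0.98027489 1.94982636]
# Norm. Eq.: [ 0.12490622 2.06239085 2.99213354 0.98455834 2.02928992]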


Logistic Regression I

Logistic Regression, Logit Regression

- Can $P(y = 1|x)$ be modeled as a linear function?
- Problem: the probability $P(y = 1|x)$ cannot be a linear model, because $P(y = 1|x)$ must lie in the interval $[0, 1]$.
- As $x$ changes, the result must stay within the probability range $[0, 1]$.
- When $P(y = 1|x)$ is already close to 1 or 0, a larger change in $x$ should be needed to move it by the same amount.
- Solution: the logit transformation

$$\mathrm{Logit}(p) = \log\left( \frac{p}{1 - p} \right)$$

[Figure: Logit]

$$\mathrm{Logistic}(x) = \frac{1}{1 + e^{-x}}$$

[Figure: Logistic]
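The logit and logistic functions are inverses of each other: the logit maps a probability in (0, 1) to the whole real line, and the logistic function maps it back. A minimal sketch, with function names chosen for this example:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print(z)             # negative, zero, positive
print(logistic(z))   # recovers the original probabilities
```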

Logistic Regression II

Reminder: Euler's number (e)

- e = 2.7182818284590452353602874713527
- A constant used frequently in mathematics and engineering

$$e = \lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n \tag{17}$$

n        (1 + 1/n)^n
1        2.00000
2        2.25000
5        2.48832
10       2.59374
100      2.70481
1000     2.71692
10000    2.71815
100000   2.71827
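The table of approximations to e can be regenerated with a short loop. The function name `euler_approx` is hypothetical.

```python
import math

def euler_approx(n):
    """Approximation of e from the limit definition: (1 + 1/n)^n."""
    return (1 + 1 / n) ** n

for n in [1, 2, 5, 10, 100, 1000, 10000, 100000]:
    print(f"{n:>7} {euler_approx(n):.5f}")

print(f"{'e':>7} {math.e:.5f}")
```

Convergence is slow: even at n = 100000 only about four decimal places agree with e.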

Logistic Regression III

- A regression model in which the dependent variable is categorical.
- Product purchased / not purchased
- E-mail reply received / not received
- Disease present / absent

[Figure: exam result (passed/failed) against hours studied; legend: Logistic, Linear, Sample]

Logistic Regression IV

Linear regression:

$$h(x) = w_0 + w_1 x_1 + \cdots + w_n x_n \tag{18}$$

Sigmoid function:

$$p = \frac{1}{1 + e^{-h(x)}} \tag{19}$$

$$\ln\left( \frac{p}{1 - p} \right) = w_0 + w_1 x_1 + \cdots + w_n x_n \tag{20}$$

[Figures: two plots of y against x]

Logistic Regression V

- The dependent variable is $y \in \{0, 1\}$.
- Examples with the negative class label are denoted 0, and examples with the positive class label are denoted 1.
- The classification model to be built must satisfy: $0 \le h(x) \le 1$
- To obtain a discrete 0-1 classification:
  - $h(x) \ge 0.5 \rightarrow y = 1$
  - $h(x) < 0.5 \rightarrow y = 0$
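The thresholding rule can be sketched as follows. The weights and inputs are hypothetical values picked for illustration, and `predict_proba` and `predict` are names chosen here, not the scikit-learn methods of the same name.

```python
import numpy as np

def predict_proba(X, w):
    """h(x) = 1 / (1 + exp(-w^T x)), always between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def predict(X, w, threshold=0.5):
    """Class 1 when h(x) >= threshold, otherwise class 0."""
    return (predict_proba(X, w) >= threshold).astype(int)

# Hypothetical weights and inputs (first column is x0 = 1)
w = np.array([-3.0, 2.0])
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])

print(predict_proba(X, w))
print(predict(X, w))   # [0 1 1]
```

The middle example has $w^T x = 0$, so $h(x) = 0.5$ and it falls on the positive side of the rule.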

Cost Function I

Cost function

- The approach used for linear regression is not suitable for logistic regression.
- Because the resulting function is not convex, it has more than one local minimum, which can lead to wrong results.

[Figure: two panels, Non-Convex and Convex, each plotting y against x]

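Cost Function II

Linear regression cost function:

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h(x^{(i)}) - y^{(i)} \right)^2 \tag{21}$$

Replace the term $\frac{1}{2} \left( h(x^{(i)}) - y^{(i)} \right)^2$ in this cost function:

$$\frac{1}{2} \left( h(x^{(i)}) - y^{(i)} \right)^2 \Rightarrow \mathrm{Loss}(h(x), y)$$

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Loss}(h(x^{(i)}), y^{(i)}) \tag{22}$$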

Cost Function III

For $\mathrm{Loss}(h(x), y)$ to be convex:

$$\mathrm{Loss}(h(x), y) = \begin{cases} -\log(h(x)), & \text{if } y = 1 \\ -\log(1 - h(x)), & \text{if } y = 0 \end{cases} \tag{23}$$

Properties
- If $h(x) = y$, then $\mathrm{Loss}(h(x), y) = 0$
- If $y = 0$ and $h(x) \to 1$, then $\mathrm{Loss}(h(x), y) \to \infty$
- If $y = 1$ and $h(x) \to 0$, then $\mathrm{Loss}(h(x), y) \to \infty$

The loss can be rewritten in a form more convenient for gradient descent:

$$\mathrm{Loss}(h(x), y) = -y \log(h(x)) - (1 - y) \log(1 - h(x)) \tag{24}$$

The full cost function then becomes:

$$C(w) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h(x^{(i)}) \right) \right] \tag{25}$$
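The cost in equation (25) can be sketched as below. Clipping the predictions away from 0 and 1 is an assumption added here to avoid evaluating log(0); it is not part of the formula. The function name and the sample predictions are hypothetical.

```python
import numpy as np

def log_loss_cost(h, y, eps=1e-12):
    """C(w) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    h = np.clip(h, eps, 1.0 - eps)   # assumption: keep log() finite
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0, 1, 0])

cost_unsure = log_loss_cost(np.array([0.5, 0.5, 0.5, 0.5]), y)  # equals log(2)
cost_good = log_loss_cost(np.array([0.9, 0.1, 0.8, 0.2]), y)    # confident and right
cost_bad = log_loss_cost(np.array([0.1, 0.9, 0.2, 0.8]), y)     # confident and wrong
print(cost_unsure, cost_good, cost_bad)
```

Confident wrong predictions are penalized far more heavily than confident right ones are rewarded, matching the limits listed under Properties.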

Cost Function IV

Convex

- Other cost functions can be used for logistic regression.
- This one, however, is derived from the principle of maximum likelihood estimation and is convex.
- For this reason it is the cost function normally used for logistic regression.

Gradient descent: to compute $\min C(w)$,

repeat {
$$w_j = w_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
}

This has the same form as the gradient descent algorithm used for linear regression, but $h(x)$ is no longer linear:

$$h(x) = \frac{1}{1 + e^{-w^T x}}$$
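A minimal sketch of gradient descent for logistic regression follows. The update has the same shape as in linear regression; only the hypothesis changes to the sigmoid. The data (hours studied against pass/fail), the learning rate, the iteration count, and the function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iter=5000):
    """Batch gradient descent on the logistic regression cost."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        h = sigmoid(X @ w)                      # nonlinear hypothesis
        w = w - (alpha / m) * (X.T @ (h - y))   # same update form as linear regression
    return w

# Hypothetical data: hours studied and exam outcome (1 = passed, 0 = failed)
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
passed = np.array([0, 0, 1, 0, 1, 1])
X = np.column_stack([np.ones_like(hours), hours])   # x0 = 1 column

w = fit_logistic(X, passed)
print(w)
print(sigmoid(X @ w))   # estimated probability of passing for each student
```

The classes overlap on purpose: with perfectly separable data the weights would keep growing without bound.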

Python I

Scikit-learn Logistic Regression

class sklearn.linear_model.LogisticRegression

- penalty='l2'
- C: inverse of regularization strength (smaller values mean stronger regularization)
- max_iter
- tol

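Python II

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the dataset (tab-separated)
verikumesi = pd.read_csv("ds_logreg.txt", delimiter="\t")

# Assumption: the last column holds the class label, the others are features
X = verikumesi.iloc[:, :-1].values
y = verikumesi.iloc[:, -1].values

# Define the model
clf = LogisticRegression(verbose=1)

# Train the model
clf.fit(X, y)
print(clf.intercept_, clf.coef_)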


KDDCUP'99 Dataset I

- The dataset used in the International Knowledge Discovery and Data Mining Tools Competition, held in 1999 at a conference (The Fifth International Conference on Knowledge Discovery and Data Mining - KDD).
- Goal: to build an IDS (Intrusion Detection System), a predictive model able to distinguish "bad" connections (attacks) from "good" connections (normal).
- It was created in a lab environment by recording a large number of simulated attacks.
- The attacks it contains fall into 4 main categories:
  - DOS: denial-of-service, e.g. syn flood;
  - R2L: unauthorized access from a remote machine, e.g. guessing password;
  - U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
  - Probing: surveillance and other probing, e.g. port scanning.

KDDCUP'99 Dataset II

Table: Basic features of TCP connections.

Feature name     Description                                                    Type
duration         length (number of seconds) of the connection                   continuous
protocol_type    type of the protocol, e.g. tcp, udp, etc.                      discrete
service          network service on the destination, e.g., http, telnet, etc.   discrete
src_bytes        number of data bytes from source to destination                continuous
dst_bytes        number of data bytes from destination to source                continuous
flag             normal or error status of the connection                       discrete
land             1 if connection is from/to the same host/port; 0 otherwise     discrete
wrong_fragment   number of "wrong" fragments                                    continuous
urgent           number of urgent packets                                       continuous

KDDCUP'99 Dataset III

Table: Content features within a connection, suggested by domain knowledge.

Feature name         Description                                              Type
hot                  number of "hot" indicators                               continuous
num_failed_logins    number of failed login attempts                          continuous
logged_in            1 if successfully logged in; 0 otherwise                 discrete
num_compromised      number of "compromised" conditions                       continuous
root_shell           1 if root shell is obtained; 0 otherwise                 discrete
su_attempted         1 if "su root" command attempted; 0 otherwise            discrete
num_root             number of "root" accesses                                continuous
num_file_creations   number of file creation operations                       continuous
num_shells           number of shell prompts                                  continuous
num_access_files     number of operations on access control files             continuous
num_outbound_cmds    number of outbound commands in an ftp session            continuous
is_hot_login         1 if the login belongs to the "hot" list; 0 otherwise    discrete
is_guest_login       1 if the login is a "guest" login; 0 otherwise           discrete

KDDCUP'99 Dataset IV

Table: Traffic features computed using a two-second time window.

Feature name         Description                                                                                   Type
count                number of connections to the same host as the current connection in the past two seconds      continuous
serror_rate          % of connections that have "SYN" errors                                                       continuous
rerror_rate          % of connections that have "REJ" errors                                                       continuous
same_srv_rate        % of connections to the same service                                                          continuous
diff_srv_rate        % of connections to different services                                                        continuous
srv_count            number of connections to the same service as the current connection in the past two seconds   continuous
srv_serror_rate      % of connections that have "SYN" errors                                                       continuous
srv_rerror_rate      % of connections that have "REJ" errors                                                       continuous
srv_diff_host_rate   % of connections to different hosts                                                           continuous

KDDCUP'99 Dataset V

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Read the dataset; the file has no header row, so column names are given here
kolon_adlari = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent',
                'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations',
                'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count',
                'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
                'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
                'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                'dst_host_srv_rerror_rate', 'label']

verikumesi = pd.read_csv("kddcup99.tar.gz", compression="gzip", names=kolon_adlari, low_memory=False, skiprows=1)

# Select the relevant (numeric) columns
secilecek_kolonlar = ['duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
                      'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
                      'num_access_files', 'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
                      'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
                      'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
                      'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                      'dst_host_srv_rerror_rate']

# .values replaces DataFrame.as_matrix(), which newer pandas releases no longer provide
X = verikumesi[secilecek_kolonlar].values

# Binary target: 0 for normal traffic, 1 for any attack
y = verikumesi['label'].apply(lambda d: 0 if d == 'normal.' else 1).values

KDDCUP'99 Dataset VI

# Create the training and test sets (33% of the rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Define the model
clf = LogisticRegression(verbose=0)

# Train the model
clf.fit(X_train, y_train)

# Confusion matrix on the held-out test set
y_hat = clf.predict(X_test)
cm = confusion_matrix(y_test, y_hat)
print(cm)
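A confusion matrix is usually summarized with a few ratios. The sketch below uses a small set of hypothetical labels, not KDDCUP'99 results, and builds the matrix by hand in the same layout that `confusion_matrix` returns for binary labels (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical labels for illustration only (0 = normal, 1 = attack)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# 2x2 confusion matrix: rows = true class, columns = predicted class
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted attacks that are real attacks
recall = tp / (tp + fn)      # share of real attacks that were detected
print(cm)
print(accuracy, precision, recall)
```

For an intrusion detection system, recall on the attack class is often the number to watch, since a missed attack (false negative) tends to cost more than a false alarm.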