
ISTANBUL TECHNICAL UNIVERSITY GRADUATE SCHOOL OF SCIENCE ENGINEERING AND TECHNOLOGY

A Taxonomy of Artificial Neural Networks

M.Sc. THESIS Alp Eren YILMAZ

Department of Mathematics Mathematical Engineering Programme


ISTANBUL TECHNICAL UNIVERSITY GRADUATE SCHOOL OF SCIENCE ENGINEERING AND TECHNOLOGY

A Taxonomy of Artificial Neural Networks

M.Sc. THESIS Alp Eren YILMAZ

(509171246)

Department of Mathematics Mathematical Engineering Programme

Thesis Advisor: Prof. Dr. Atabey KAYGUN


İSTANBUL TEKNİK ÜNİVERSİTESİ FEN BİLİMLERİ ENSTİTÜSÜ

Yapay Sinir Ağları'nın bir taksonomisi

YÜKSEK LİSANS TEZİ
Alp Eren YILMAZ

(509171246)

Matematik Anabilim Dalı
Matematik Mühendisliği Programı

Tez Danışmanı: Prof. Dr. Atabey KAYGUN


Alp Eren YILMAZ, an M.Sc. student of the ITU Graduate School of Science Engineering and Technology (student number 509171246), successfully defended the thesis entitled "A Taxonomy of Artificial Neural Networks", which he/she prepared after fulfilling the requirements specified in the associated legislation, before the jury whose signatures are below.

Thesis Advisor : Prof. Dr. Atabey KAYGUN, Istanbul Technical University

Jury Members : Assoc. Prof. Özgür MARTİN, Mimar Sinan Fine Arts University

Asst. Prof. Gül İNAN, Istanbul Technical University

Date of Submission : 11 May 2020
Date of Defense : 29 May 2020


Dedicated to the memory of my father, Sebahattin Yılmaz, who always believed in my ability to be successful in the academic arena


FOREWORD

I would like to express my sincere gratitude to my advisor Prof. Dr. Atabey Kaygun for his continuous support of my research, and for his patience, motivation, and immense knowledge. Prof. Kaygun's office was always open whenever I ran into trouble or had a question about my research. He consistently allowed this thesis to be my own work, but steered me in the right direction whenever I needed it.

I would like to thank my mother, Gönül Yılmaz, and my sister, Ecem Nur Yılmaz, for their encouragement, motivation, and psychological support. They were always with me throughout this process.

May 2020 Alp Eren YILMAZ


TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET
1. Machine Learning
1.1 Supervised Learning
1.2 Unsupervised Learning
1.3 Classification vs Clustering
1.3.0.1 Classification
1.3.0.2 Clustering
1.4 Bias-Variance Tradeoff
1.5 Cross Validation
1.5.1 Resubstitution Validation
1.5.2 Hold-out Validation
1.5.3 K-fold Cross Validation
1.5.4 Leave-one-out Cross Validation
2. Basic Machine Learning Algorithms
2.1 The Gradient Descent
2.2 Regression
2.2.1 Multiple Linear Regression
2.2.2 The Regularization
2.2.2.1 The L2 Regularization
2.2.2.2 The L1 Regularization
2.2.3 The Ridge Regression
2.2.4 The Lasso Regression
2.2.5 The Elastic Net Regression
2.2.6 Nonlinear Regression
2.2.6.1 Nonlinear Least Squares
2.2.6.2 Maximum Likelihood Function
2.3 Logistic Regression and Multinomial Regression
2.3.1 Odds
2.3.2 The Sigmoid Function
2.3.3 The Logistic Regression
2.3.4 Multinomial Logistic Regression
2.4 The K-Means
2.4.1 The K-means Objective Function
2.4.2 The K-means Algorithm
2.4.3 Choosing the right K
2.5 The K-Nearest Neighbors
2.6 Support Vector Machines
2.6.1 Classification for Linearly Separable Data
2.6.2 Classification for Not Completely Linearly Separable Data
2.6.3 Regression on Support Vector Machines
2.6.4 Nonlinear Regression on Support Vector Machines
3. Artificial Neural Networks
3.1 Computation Graphs
3.1.1 Forward Pass
3.1.2 Backward Pass
3.2 Perceptron
3.2.1 The Single Layer Perceptron
3.2.2 Activation Functions
3.2.3 The Perceptron Learning Algorithm
3.2.4 Adaline and The Least Mean Square
3.2.5 The Parallel Perceptron
3.2.5.1 The P-delta Rule
3.3 Neural Networks
3.3.1 Forward Propagation
3.3.2 Back Propagation
3.4 Training Artificial Neural Networks
3.4.0.1 Online Learning
3.4.0.2 Batch Learning
4. Types of Neural Networks
4.1 The Convolutional Neural Networks
4.1.1 The ReLU Layer
4.1.2 The Sigmoid Layer
4.1.3 The Convolution Layer
4.1.4 Back Propagation on The Convolution Layer
4.1.5 The Pooling Layer
4.2 The Recurrent Neural Networks
4.2.1 Back Propagation Through Time
4.2.2 The Long Short Term Memory
4.2.3 Back Propagation Through Time in the LSTM
4.3 Autoencoders
4.3.1 Denoising Autoencoders
4.3.2 Contractive Autoencoders
4.3.3 Sparse Autoencoders
4.3.4 Variational Autoencoders
5. Experiments
5.1 Experiments on the SVHN data set
5.1.1 Experiments with the Convolutional Neural Networks
5.1.1.1 Results
5.1.2 Experiments with the Recurrent Neural Networks
5.1.2.1 Results
5.1.3 Experiments with the Autoencoders
5.1.3.1 Results
5.2 Experiments on the GTZAN data set
5.2.1 Experiments with the Convolutional Neural Network
5.2.1.1 Results
5.2.2 Experiments with the Recurrent Neural Networks
5.2.2.1 Results
5.2.3 Experiments with Autoencoders
5.2.3.1 Results
5.3 Experiments on the Extended MNIST data set
5.3.1 Experiments with the Convolutional Neural Networks
5.3.1.1 Results
5.3.2 Experiments with the Recurrent Neural Networks
5.3.2.1 Results
5.3.3 Experiments with the Autoencoders
5.3.3.1 Results
5.4 Experiments on the Apple Stock Prices data set
5.4.1 Experiments with the Convolutional Neural Networks
5.4.1.1 Results
5.4.2 Experiments with the Recurrent Neural Networks
5.4.2.1 Results
5.4.3 Experiments with the Autoencoders
5.4.3.1 Results
6. Conclusion
6.1 Results of our experiments
6.2 An analysis of the results of future work
REFERENCES
APPENDICES
APPENDIX A.1
APPENDIX A.2
APPENDIX A.3
APPENDIX A.4
APPENDIX B.1
APPENDIX B.2
APPENDIX B.3
APPENDIX B.4
APPENDIX C.1
APPENDIX C.2
APPENDIX C.3
APPENDIX C.4


ABBREVIATIONS

ANN : Artificial Neural Networks

NN : Neural Networks

BP : Backpropagation

MLE : Maximum Likelihood Estimation

CNN : Convolutional Neural Networks

RNN : Recurrent Neural Networks

LSTM : Long Short Term Memory

AE : Autoencoders


LIST OF TABLES

Table 5.1 : CNN on the data set of SVHN
Table 5.2 : AE on the data set of SVHN
Table 5.3 : CNN on the data set of GTZAN
Table 5.4 : RNN on the data set of GTZAN
Table 5.5 : AE on the GTZAN data set
Table 5.6 : CNN on the EMNIST data set
Table 5.7 : RNN on the EMNIST data set
Table 5.8 : AE on the EMNIST data set
Table 5.9 : CNN on the Apple Stock Prices data set
Table 5.10 : RNN on the Apple Stock Prices data set
Table 5.11 : AE on the Apple Stock Prices data set


LIST OF FIGURES

Figure 1.1 : Tradeoff
Figure 2.1 : The sigmoid function
Figure 2.2 : Log-loss
Figure 2.3 : Elbow method
Figure 2.4 : Support vectors
Figure 2.5 : Support vectors
Figure 2.6 : Error function in SVM regression
Figure 2.7 : Epsilon sensitive tube in SVM regression
Figure 2.8 : SVM kernel
Figure 3.1 : Perceptron
Figure 3.2 : Adaline
Figure 3.3 : Neural network
Figure 4.1 : CNN
Figure 4.2 : ReLu
Figure 4.3 : RNN
Figure 4.4 : LSTM block
Figure 5.1 : CNN loss curve on the SVHN data set
Figure 5.2 : CNN accuracy curve on the SVHN data set
Figure 5.3 : Summary of AE on SVHN
Figure 5.4 : AE loss curve on the SVHN data set
Figure 5.5 : AE accuracy curve on the SVHN data set
Figure 5.6 : CNN loss curve on the GTZAN data set
Figure 5.7 : CNN accuracy curve on the GTZAN data set
Figure 5.8 : RNN loss curve on the GTZAN data set
Figure 5.9 : RNN accuracy curve on the GTZAN data set
Figure 5.10 : Summary of AE on GTZAN
Figure 5.11 : AE loss curve on the GTZAN data set
Figure 5.12 : CNN loss curve on the EMNIST data set
Figure 5.13 : CNN accuracy curve on the EMNIST data set
Figure 5.14 : RNN loss curve on the EMNIST data set
Figure 5.15 : RNN accuracy curve on the EMNIST data set
Figure 5.16 : Summary of AE on EMNIST
Figure 5.17 : AE curves on the EMNIST data set
Figure 5.18 : CNN loss curves on the Apple Stock Prices data set
Figure 5.19 : CNN on the Apple Stock Prices data set
Figure 5.20 : RNN loss curves on the Apple Stock Prices data set
Figure 5.21 : CNN on the Apple Stock Prices data set
Figure 5.22 : RNN on the Apple Stock Prices data set
Figure 5.23 : Summary of AE on Apple
Figure 5.24 : AE on the Apple Stock Prices data set
Figure 5.25 : AE on the Apple Stock Prices data set


A Taxonomy of Artificial Neural Networks

SUMMARY

In this dissertation, we study Artificial Neural Networks and their taxonomy. We also analyze the results of our own numerical experiments, in which we apply three different types of neural network architectures to four different data sets.

In Chapter I, we study the stochastic and statistical processes used in machine learning. We introduce the statistical terms we need and the two main types of machine learning methods: supervised and unsupervised learning. We finish the first chapter with a quick study of various cross validation methods.

Studying the algorithms commonly used in machine learning tasks is crucial for understanding the notion of machine learning. In Chapter II, we study the most basic machine learning algorithms and techniques. We start with regression and classification methods, which are two important types of supervised learning. We then look at regularization techniques, and at other algorithms used for supervised and unsupervised learning tasks.

Chapter III is the section in which we study the fundamental structures of Artificial Neural Networks. We introduce the notions of computation graphs, the Perceptron, and neural networks. We also look at forward propagation, back propagation, batch learning, and stochastic learning.

The main theoretical section of our thesis is Chapter IV. In this chapter, we study the most important types of Artificial Neural Network architectures: Convolutional Neural Networks, Recurrent Neural Networks, the Hopfield Network, and Autoencoders.

In Chapter V, we present our computational experiments on four different data sets using the three main types of Artificial Neural Networks we studied in Chapter IV. The first data set we use is SVHN, which includes images of house numbers photographed by Google Street View. The second data set we use is GTZAN, which contains 1000 half-minute music audio files chosen from ten music genre categories. Our third data set is EMNIST, an extended version of the popular MNIST data set which includes images of handwritten letters and numbers. The last data set we consider is the Apple Stock Prices data set, which contains the time-series data of the historical prices of Apple Inc. at daily frequency.

The last chapter is the conclusion section of the thesis. In this section, we summarize the results of our experiments in a table. As a consequence of these experiments, we conclude that the Convolutional Neural Network architecture works well on the SVHN and EMNIST data sets. The results of our experiments also show that Recurrent Neural Networks are suitable for the GTZAN and Apple Stock Prices data sets, which are of sequential type. Finally, the Autoencoders are not as successful as the CNNs and the RNNs for classification problems on image data sets and time-series data sets.


A Taxonomy of Artificial Neural Networks

ÖZET

In this study, the basic structure and taxonomy of Artificial Neural Networks are investigated. In addition to the theoretical investigation, experiments are carried out on four different data sets using algorithms from three different types of neural networks.

In the first chapter, the statistical and stochastic processes applied in machine learning are investigated. Statistical terms such as variance and bias are examined, and the two main types of machine learning, supervised learning and unsupervised learning, are introduced; classification and clustering are compared. The relationship between bias and variance is examined in terms of overfitting and underfitting. In addition, cross validation, which is used to split data sets into training and test subsets of suitable sizes, is examined.

The algorithms used in machine learning are of great importance for truly understanding the logic of machine learning. In the second chapter, the algorithms and techniques of machine learning are examined. Classification and regression, the two main kinds of supervised learning, are investigated, and the algorithms used in these two subfields are introduced. Understanding regression is important for understanding the learning processes that follow, because the input, output, weight, and bias terms used in regression equations will reappear in the techniques presented in the later chapters of the thesis. It is therefore quite important to understand what these terms are and how learning from data works in regression techniques. In this chapter, the algorithms and parameters used in regression are presented to the reader in detail. In addition to multiple linear regression, Ridge regression, Lasso regression, and Elastic Net regression are examined, together with the two corresponding regularization techniques, L1 and L2. For classification, Logistic Regression is investigated; in this subsection, the notion of the logit and its transformations and the sigmoid function are explained in detail so that the cost function can be understood correctly, and the cost function is then introduced in the light of this information. It is explained how the objective function used in k-means, one of the methods of unsupervised learning, is obtained. Choosing the right value of k is important in this method, and the Elbow method, one technique used for this choice, is investigated. The k-NN method is also among the topics examined. Support Vector Machines are examined in detail in order to make the logic of the Perceptron used in Artificial Neural Networks more understandable. Besides the workings of the Support Vector Machine algorithm, the regression and classification techniques used in this area are explained and the corresponding algorithms are investigated.

The third chapter is the part in which the basic structure of Artificial Neural Networks is examined. Here, computation graphs, the Perceptron (the earliest form of artificial neural networks), Parallel Perceptrons, the p-delta rule used as the learning rule of Parallel Perceptrons, activation functions, and Multi-Layer Neural Networks are examined. The Perceptron architecture forms the foundation of Artificial Neural Networks, so the basic notions explained in this chapter are quite important for understanding how the learning process takes place in Artificial Neural Networks. The input layer, hidden layer, and output layer introduced with the Perceptron architecture are the most important units of an Artificial Neural Network architecture. In this subsection, the functions belonging to the layers and how the algorithms used there operate during the learning process are explained clearly to the reader. How learning from data proceeds in neural networks is introduced through the forward propagation and back propagation algorithms. The back propagation algorithm in particular is an important algorithm that describes how the weight parameters, which initially have random values, are updated; this algorithm is examined in mathematical detail. In addition, stochastic learning and batch learning, and the mathematical background of the algorithms used in these learning methods, are investigated.

The main part of the thesis is the fourth chapter, in which the most important network types of Artificial Neural Networks are examined: Convolutional Neural Networks, Recurrent Neural Networks, the Hopfield Network, and Autoencoders. The algorithms used in these four network types and various sub-architectures of these networks are investigated. Convolutional Neural Networks are a network type that can be used for input data of different dimensions; they are used to process any two-dimensional data for machine learning and to produce an output. In this subsection, important notions used in Convolutional Neural Network algorithms, such as pooling, stride, and padding, are explained, together with how they affect the computation. The layers with four different activation behaviours used in Convolutional Neural Networks, namely the rectified linear unit (ReLU) layer, the sigmoid layer, the convolution layer, and the pooling layer, are examined separately, and the back propagation algorithm in the convolution layer is investigated in detail. Recurrent Neural Networks have an architecture built on nodes and the connections between them; the node connections form a directed cycle through which the data is processed, and this data processing exhibits dynamic temporal behaviour. The forward propagation and back propagation algorithms used in this architecture are constructed according to this time-dependent behaviour. In this subsection, how the time-dependent behaviour is carried through the data by the forward and back propagation algorithms is examined in detail, and the mathematical background of the computation is presented to the reader. In addition, the Long Short-Term Memory, a sub-architecture of Recurrent Neural Networks, is examined. The Long Short-Term Memory architecture is not very different from the architecture used in Recurrent Neural Networks; however, this sub-architecture has gates that act as hidden layers for the time-dependent processing of data. We describe how the data is processed and how learning takes place in these hidden units, namely the input gate, the forget gate, the input modulation gate, and the output gate. The forward propagation and back propagation algorithms used in the learning process of this network type are explained in detail, taking their mathematical background into account.

In the fifth chapter, experiments are carried out on four different data sets using the algorithms of the three network types; that is, each type of neural network is run on each data set and the results are analyzed. The "Street View House Numbers" data set contains house numbers photographed by Google Street View. The "GTZAN" data set contains 1000 half-minute music files from 10 different music genres. The "EMNIST" data set is an extended version of the "MNIST" data set and contains images of handwritten letters as well as digits. The last data set, "Apple Stock Prices", contains daily stock data of the Apple company. Coding the three network types on the four data sets yields 12 experiments in total. The experiments use the Python programming language and the Keras library. The accuracy and loss values obtained at the end of each experiment are reported in a table; to validate them, each experiment was repeated three times and the results are expressed in the table. The resulting plots of each experiment are also presented.

The final chapter is the conclusion of the thesis. In this chapter, a table summarizing the experiments of the previous chapter and their results is prepared, and the results of the experiments are interpreted on the basis of this table. As a result of these experiments, we can say that Convolutional Neural Networks work very well on the "EMNIST" and "SVHN" data sets, which contain image data. Recurrent Neural Networks give successful results on the "GTZAN" data set, which has sequential data, and on the "Apple Stock Prices" data set, which contains time-dependent sequential data. Finally, Autoencoders could not produce results as successful as those of Convolutional Neural Networks and Recurrent Neural Networks on these data sets.


1. Machine Learning

Machine learning is an experimental field which aims to learn something useful [1] by building a computational model from data using computational methods and algorithms. The parameters of the model are updated using statistical methods and suitable algorithms during the learning process. The aim is to obtain useful information from the model and new data at the end of the learning process.

Machine learning accomplishes this task by finding an optimal solution for a task by applying suitable algorithms. This process of finding an optimal solution is called optimization. Optimization here means minimizing an error function or maximizing a utility function. The utility may be any function based on a particular task. The error is the difference between the actual data and the learned data, which may also depend on the task.

Another important ingredient of machine learning algorithms is statistics. Statistics and machine learning are two fields which are built on similar notions, but their purposes are different. Statistics is a field that aims to design models for inference about the relationships between the variables of the data, whereas the aim of machine learning is to obtain accurate predictions.

1.1 Supervised Learning

Suppose that we have a data set consisting of n ordered pairs of the form
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$
where each $x_i$ is a set of measurements for a data point, and each $y_i$ is a label for that data point. For the purposes of this section, we assume each $x_i$ and $y_i$ are numerical vectors.

Supervised learning is a type of machine learning where one constructs a model of a function which maps input data to output data, and also tests any such model for fitness. The main distinguishing feature of supervised learning is the fact that one has a cost function or a fit function that measures a model's fitness as an approximation. In order to measure the fitness of a model, we split our data into two disjoint subsets: a training set and a test set. For the test data set, the most common practice is to sample m points
$$(x_{i_1}, y_{i_1}), \ldots, (x_{i_m}, y_{i_m})$$
at random. The remaining points are set aside as the training set. We will investigate this process in detail in Section 1.5 below. The training set provides the source from which we learn the mapping function, using a mixture of statistics, optimization, linear algebra, and computer science. We build a model on the training set, and finally we measure its fitness on the test data set [2, 3].

The two most important applications of supervised learning methods are classification and regression problems. Classification aims to split the data set into disjoint subsets through a constructed model; in these cases $y_i$ comes from a discrete set such as $\{-1, 1\}$. In regression, on the other hand, we aim to construct a linear function $f(x) = \alpha \cdot x + \beta$ to approximate the mapping $y_i \sim f(x_i)$ from numerical vector input variables $x$ to a continuous output variable $y$ [3].

1.2 Unsupervised Learning

We know that supervised learning aims to build a model of a mapping function from an input, also known as the independent variables, to an output, also known as the dependent variable or the target variable. However, in unsupervised learning there is no target variable; one has only the input data. The main objective of unsupervised learning is to find regularity and structure in the input data without an externally set fitting criterion [4].

The most common applications of unsupervised learning are as follows [4]:

1. Association: This is an unsupervised learning method where one explores rules that describe large parts of the data by finding relations between data subsets [5]. It is concerned with the observation that actions performed on the elements of one set tend to be accompanied by actions performed on the elements of another set. For instance, people who adopt a cat from a pet shop also tend to adopt a dog.

2. Dimensionality Reduction: In this method, we reduce the number of features (the dimension of the input variables) and represent the same data with fewer dimensions. Here, by feature we mean the columns of the data matrix, and by dimension we mean the number of features of the data set. Dimensionality reduction is a useful method when the data set has a large number of features. Any such reduction renders algorithms faster and helps us avoid costly storage. Principal Component Analysis, Linear Discriminant Analysis, and Singular Value Decomposition are commonly used algorithms for reducing dimensionality [6].

3. Clustering: Clustering is the most commonly used method in unsupervised learning. Clustering aims to split the data set into disjoint subsets where each subset is internally similar: it groups similar items in the same set and dissimilar items in different sets. We mostly use clustering methods to solve problems encountered in social networks, bioinformatics, marketing, and segmentation [6].

1.3 Classification vs Clustering

1.3.0.1 Classification

Classification is a supervised learning method used to predict a qualitative outcome for a dependent variable which is categorical or discrete [3]. A classification problem can be a binary classification problem or a multivariate classification problem.

In binary classification, the dependent variable can take only two different values, usually coded numerically as 0 and 1. For instance, we may use binary classification to detect whether an e-mail message is spam or not. The labels spam and not-spam define a dependent variable on our data set which we aim to predict from the content of the e-mails. So, we create a new feature SPAM which takes the value 1 for spam and 0 for not-spam for each e-mail message.

In a multivariate classification problem, we create models where the dependent variable is discrete but has more than two possible values. For example, we may try to predict the medical condition of a patient among three possible diagnoses, such as stroke, drug overdose, and epileptic seizure, on the basis of his/her symptoms. One can then encode the diagnoses numerically as 1 for stroke, 2 for drug overdose, and 3 for epileptic seizure.

There are several methods one can use for classification. In this thesis we concentrate on Logistic Regression, K-Nearest Neighbours, Naive Bayes, Decision Trees, and Support Vector Machines.

1.3.0.2 Clustering

Clustering aims to group similar objects in the same cluster and dissimilar objects in different clusters [7].

Suppose that we have a data set with n observations and p features collected from tissue samples of breast cancer patients. Here n is the number of breast cancer samples, and p is the number of measurements collected from each tissue sample. We may encounter different unknown subtypes of breast cancer, and clustering is used to determine these subgroups. This is an unsupervised machine learning problem since the data set has no labeled target variable.

In general, observations are clustered on the basis of features in an attempt to determine groups within the observations, while features are clustered on the basis of observations for the purpose of exploring subgroups within the data set. A commonly used method for clustering is the k-means method, which we explore in this thesis.

Classification vs Clustering

Attribute          | Classification                                                                               | Clustering
Learning type      | Supervised Learning                                                                          | Unsupervised Learning
Data type          | Labeled data                                                                                 | Unlabeled data
Knowledge of data  | Prior knowledge                                                                              | No prior knowledge
Algorithms         | Logistic regression, Support Vector Machines, K-Nearest Neighbours, Naive Bayes, Decision Trees | K-means
Aim                | Classify observations based on independent variables                                        | Group samples based on observations or features in the data set

1.4 Bias-Variance Tradeoff

Bias is the difference between the expected prediction and the true value. As the complexity of the model increases, the bias of the model decreases [8].

Variance measures the variability of the model prediction for a given set of data points. Variance gives us a numerical value for how widely the expected outcomes vary for a particular data set. In contrast to the bias, as the complexity of the model increases, the variance increases [8].

Let us explain these two terms from a mathematical point of view. Suppose that we try to predict the output y from the available input data x, and assume that there is a relationship between x and y of the form
$$y = f(x) + e \qquad (1.1)$$
where e is an error term, normally distributed with mean 0 and some variance, and f is a function defined on the input data. Let us assume that $f(x)$ is a polynomial of degree 1, and that we approximate it by a function $\hat{f}(x)$ obtained by linear regression; $\hat{f}(x)$ gives the predicted value. The bias is then calculated as
$$\text{bias} = f(x) - E[\hat{f}(x)] \qquad (1.2)$$
where $E[\hat{f}(x)]$ is the expected prediction and $f(x)$ is the true value. The variance, on the other hand, is the expectation of the square of the difference between the prediction and the expected prediction:
$$\text{variance} = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] \qquad (1.3)$$
The expected squared error at a point x is
$$\text{Err}(x) = E\big[(y - \hat{f}(x))^2\big] \qquad (1.4)$$
which can also be expressed as
$$\text{Err}(x) = \big(E[\hat{f}(x)] - f(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma_e^2 \qquad (1.5)$$
where $\sigma_e^2$ is the irreducible error. It is the error that cannot be reduced by constructing better models; it measures the amount of noise in the data.

We see that the expected squared error is the sum of the square of the bias, the variance, and the irreducible error. We therefore aim to decrease both the variance and the bias in order to obtain a low expected error and a high predictive performance: a low variance and a low bias increase the predictive performance of the model in supervised learning. High and low values of the bias and the variance indicate whether the model is underfitting or overfitting. If the model shows a high bias and a low variance, the situation is described as underfitting. On the other hand, if the model shows a high variance and a low bias, we describe the situation as overfitting [8].

Figure 1.1 : Bias-Variance Tradeoff.

Source: [9] https://arxiv.org/pdf/1810.08591.pdf

We aim for low bias and low variance in order to achieve high predictive performance and to reduce the expected error in supervised learning. However, we cannot reduce the bias and the variance at the same time, because a reduction in variance usually increases the bias, and a reduction in bias usually increases the variance. Hence, we need to find an optimal balance between the bias and the variance to avoid overfitting or underfitting. This is usually referred to as the bias-variance tradeoff: the process of finding the right balance between the bias and the variance to achieve a good predictive performance.
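As a small illustration of this decomposition, the following sketch (a minimal example, not taken from the thesis) estimates the bias and variance of a degree-1 fit at a single point by refitting the model on many resampled training sets; the underlying function, the noise level, and the sample sizes are assumed values chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * x + 2.0           # assumed true underlying function
sigma_e = 0.5                         # assumed noise standard deviation
x_test = 1.5                          # point at which bias and variance are measured

preds = []
for _ in range(2000):                 # many independent training sets
    x = rng.uniform(0, 3, size=30)
    y = f(x) + rng.normal(0, sigma_e, size=30)
    b1, b0 = np.polyfit(x, y, deg=1)  # degree-1 (linear) fit
    preds.append(b0 + b1 * x_test)

preds = np.array(preds)
bias = f(x_test) - preds.mean()                   # as in Equation (1.2)
variance = np.mean((preds - preds.mean()) ** 2)   # as in Equation (1.3)
expected_err = bias**2 + variance + sigma_e**2    # as in Equation (1.5)
print(f"bias={bias:.4f}  variance={variance:.4f}  Err(x)={expected_err:.4f}")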


1.5 Cross Validation

Cross validation is a statistical method which measures the fitness of a model by segmenting the data into two folds [10]. One of these folds, the training set, is used to build a model while the other fold, the test data set, is used for measuring the fitness of the model.

There are two main reasons why one might need cross validation [10]:

(i) Estimating performance of a learned model from available data, i.e. measuring generalizability of the model.

(ii) Comparing performances of two or more models in order to find the optimal algorithm for the available data. Alternatively, comparing performances of two or more variants of a parametrized model.

The methods we use in this thesis are

1. resubstitution validation,
2. hold-out validation,
3. k-fold cross validation, and
4. leave-one-out cross validation.

In this section, we will investigate these methods in detail.

1.5.1 Resubstitution Validation

In resubstitution validation, the model is trained on all available data and then tested on the same data set. This validation process uses all of the data, but it might lead to over-fitting: the model can perform well on the available data, but there is no guarantee for its performance on an unseen data point [10].

1.5.2 Hold-out Validation

In order to avoid over-fitting, the available data is divided into two disjoint subsets, a training set and a test set, and the model is built on the training set only.

Hold-out validation avoids an overlap between the training and the test data sets, which usually yields more accurate estimates of performance. However, since we do not use all of the data to build the model, the results depend on the choice of the train/test split. The data in the test fold could be valuable for training, and not using it may affect the prediction performance negatively. To avoid this, the hold-out validation process must be repeated many times, with the splits chosen randomly rather than systematically. Even so, some subsets of the data may never get a chance to appear in the test set or in the training set during the learning phase. We use k-fold cross validation to overcome this problem [10].
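A minimal sketch of a random hold-out split (an illustrative example, not code from the thesis); the 80/20 ratio, the seed, and the helper name holdout_split are assumptions made for the example.

import numpy as np

def holdout_split(X, y, test_ratio=0.2, seed=0):
    """Randomly split (X, y) into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffle the indices
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# usage on a toy data set
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, y_train, X_test, y_test = holdout_split(X, y)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)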

1.5.3 K-fold Cross Validation

In the k-fold cross validation method, the data is divided into k equal parts. Training and validation are performed over k iterations: in each iteration a different fold of the data is used for validation, and the remaining k − 1 folds are used for training [10].

Algorithm 1 k-fold cross validation
Divide the data set into K disjoint groups $D_1, \ldots, D_K$.
for i := 1 to K do
    Let the test set be $D_i$ and the training set be $T_i = \bigcup_{j \neq i} D_j$.
    Build a model $f_i$ on the training set $T_i$.
    Calculate $MSE_i$ on the test set $D_i$.
end for
Summarize the evaluation scores.

The average of the cross validation values in the k-fold cross validation method is
$$CV_k = \frac{1}{k}\sum_{i=1}^{k} MSE_i \quad \text{where} \quad MSE_i = \frac{1}{|D_i|}\sum_{(x,y)\in D_i}\big(y - f_i(x)\big)^2 \qquad (1.6)$$
where $f_i(x)$ is the model constructed on $T_i$ and evaluated at the data points $(x, y)$ in $D_i$. It is important that we choose the right value for k: a poorly chosen k value may lead to a model which has high variance or high bias [11].
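The following sketch (an illustrative example, not code from the thesis) implements Algorithm 1 and Equation (1.6) for a plain least-squares line fit; the helper names fit_line and k_fold_cv are hypothetical and defined here only for the example.

import numpy as np

def fit_line(x, y):
    """Hypothetical helper: least-squares fit y ~ b0 + b1*x."""
    b1, b0 = np.polyfit(x, y, deg=1)
    return lambda t: b0 + b1 * t

def k_fold_cv(x, y, k=5, seed=0):
    """Return CV_k as in Equation (1.6)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)               # D_1, ..., D_K
    mses = []
    for i in range(k):
        test = folds[i]                          # D_i
        train = np.hstack([folds[j] for j in range(k) if j != i])  # T_i
        f_i = fit_line(x[train], y[train])       # model built on T_i
        mses.append(np.mean((y[test] - f_i(x[test])) ** 2))        # MSE_i
    return np.mean(mses)                         # CV_k

x = np.linspace(0, 5, 50)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.3, 50)
print("CV_5 =", k_fold_cv(x, y, k=5))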

1.5.4 Leave-one-out Cross Validation

Leave-one-out cross validation (also known as one-versus-all cross validation) is a special version of the k-fold cross validation in which k is equal to the number of data points. All data except for one observation is used to train the model, and the remaining single observation is used to test it. The accuracy estimates obtained by leave-one-out cross validation are almost unbiased, but they have high variance, which can make them unreliable. Leave-one-out cross validation is commonly used when data sets are very sparse, especially in bioinformatics [10].

The average of the cross validation scores in the leave-one-out cross validation method is calculated as
$$CV_n = \frac{1}{n}\sum_{i=1}^{n} MSE_i = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 \qquad (1.7)$$
where $\hat{y}_i$ is the prediction for the i-th observation when it is held out. The mean squared error is used as in k-fold cross validation, but the average of the cross validation scores is divided by n, the size of the data set, instead of k, because leave-one-out cross validation is the special case of k-fold cross validation with k = n [11].


2. Basic Machine Learning Algorithms

2.1 The Gradient Descent

Most of the algorithms we are going to cover in this section are about finding optimal parameters for a model type fitting the data at hand. This is usually accomplished by solving an equivalent optimization problem: finding the absolute extrema of a cost or a fit function. One of the most commonly used techniques is gradient descent.

Given a cost function $J(\theta)$ to be minimized with respect to the model parameter $\theta$, the optimal value of the parameter is found through an iterative procedure. We set
$$\theta_j^{(t+1)} := \theta_j^{(t)} - \alpha \frac{\partial}{\partial \theta_j} J(\theta), \qquad j = 1, 2, \ldots, m \qquad (2.1)$$
where t indicates the iteration index. In every step, the partial derivative of the cost function is multiplied by a learning rate $\alpha$ and subtracted from the old parameter value to obtain the new parameter value. This update is applied to every parameter $\theta_j$, for j = 1 to m. As we iterate, the resulting sequence converges to the optimal value of $\theta$.
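As a concrete illustration of the update rule (2.1), the sketch below (a minimal example, not taken from the thesis) runs gradient descent on the mean squared error of a linear model $h_\theta(x) = \theta_0 + \theta_1 x$; the learning rate, the number of iterations, and the synthetic data are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 3.0 + 4.0 * x + rng.normal(0, 0.2, 100)    # synthetic data, assumed true theta = (3, 4)

theta = np.zeros(2)                            # theta_0 (intercept), theta_1 (slope)
alpha = 0.1                                    # learning rate
m = len(x)

for t in range(2000):                          # iterative updates as in Equation (2.1)
    h = theta[0] + theta[1] * x                # model predictions h_theta(x)
    grad0 = (1 / m) * np.sum(h - y)            # dJ/dtheta_0
    grad1 = (1 / m) * np.sum((h - y) * x)      # dJ/dtheta_1
    theta = theta - alpha * np.array([grad0, grad1])

print("estimated theta:", theta)               # close to (3, 4)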

There are two variants of the gradient descent: stochastic gradient descent and batch gradient descent.

In stochastic gradient descent, the cost gradient of only one example is computed at each iteration, instead of the sum of the cost gradients of all examples.

Algorithm 2 Stochastic Gradient Descent
for i := 1 to m do
    $\theta_j := \theta_j - \alpha \frac{1}{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$
end for

In batch gradient descent, the sum of the cost gradients over all examples is computed at each step. This process is repeated until convergence, the update being applied to each parameter $\theta_j$ for j = 1 to m.

Algorithm 3 Batch Gradient Descent
while not converged do
    $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$
end while

2.2 Regression

The earliest form of regression is the method of least squares, published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss applied the method to problems related to the orbits of planets and celestial bodies around the Sun, based on astronomical observations. The term regression was first suggested by Francis Galton in the 19th century [12], in a biological case study where he showed that the heights of descendants of tall ancestors tend to regress down towards the average. Since then, the regression method has been used in many disciplines. Later, Yule and Pearson were able to place regression within a more general statistical framework in which the joint distribution of the response and explanatory variables was assumed to be Gaussian [13, 14]. Fisher then showed that the conditional distribution of the response variable needs to be Gaussian, but the joint distribution need not be [15].

As we described earlier, in supervised learning algorithms there are two types of variables: independent variables and dependent variables. One can say that changes in the independent variables cause changes in the dependent variables. Given a data set, the independent variables are referred to as the features or the explanatory variables of the underlying data set, and the dependent variables are referred to as the target variables of the data set.

Regression models are statistical models which assume that the relationship between the explanatory variables and the target variables is linear. In other words, there is a linear relationship which describes the association between the amount of change in the independent variables and the amount of change in the dependent variable. Statistical significance is then used to determine and examine the chance effects observed in these linear relationships [16, 17].

In the following, we assume that y is a response variable and x is an explanatory variable. The regression models set the relationship between x and y as
$$y = \beta_0 + \beta_1 x \qquad (2.2)$$
where $\beta_0$ is called the intercept and $\beta_1$ is called the slope. If there are deviations between the value $\beta_0 + \beta_1 x_i$ computed from the model and the actual response value $y_i$ for a data point $(x_i, y_i)$, then the equation must account for the error. So, we write
$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.3)$$
as a more suitable expression for the model, where $\varepsilon$ accounts for the statistical error of the model. The updated model is called a simple linear regression model [16]. We think of both x and y as random variables. Then the conditional expected value of the response variable is
$$E(y \mid x) = \beta_0 + \beta_1 x$$
while its variance is
$$\operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \sigma^2$$
if we assume $\varepsilon \sim N(0, \sigma^2)$, that is, normally distributed with mean 0 and variance $\sigma^2$. We observe that the conditional mean of y is a linear function of x, but the variance does not depend on x.

Now, we use the least squares method to estimate the unknown regression constants $\beta_0$ and $\beta_1$. That is to say, we minimize the sum of squared differences between the response variable and the model to estimate these constants. Consider S as a function,
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \qquad (2.4)$$
We take the derivatives of S with respect to $\beta_0$ and $\beta_1$ and set them equal to 0:
$$\frac{\partial S}{\partial \beta_0}\bigg|_{\hat\beta_0,\hat\beta_1} = -2\sum_{i=1}^{n}(y_i - \hat\beta_0 - \hat\beta_1 x_i) = -2\sum_{i=1}^{n} y_i + 2\sum_{i=1}^{n}\hat\beta_0 + 2\sum_{i=1}^{n}\hat\beta_1 x_i = 0$$
$$\frac{\partial S}{\partial \beta_1}\bigg|_{\hat\beta_0,\hat\beta_1} = -2\sum_{i=1}^{n}(y_i - \hat\beta_0 - \hat\beta_1 x_i)x_i = -2\sum_{i=1}^{n} y_i x_i + 2\sum_{i=1}^{n}\hat\beta_0 x_i + 2\sum_{i=1}^{n}\hat\beta_1 x_i^2 = 0$$
Rearranging these equations, we obtain
$$n\hat\beta_0 + \hat\beta_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \qquad (2.5)$$
and
$$\hat\beta_0\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \qquad (2.6)$$
The Equations (2.5) and (2.6) are called the least squares normal equations. Now, we need to obtain $\hat\beta_0$ and $\hat\beta_1$, the least squares estimators. First, take the Equation (2.6),
$$\hat\beta_0\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2$$
and divide both sides by $\sum_{i=1}^{n} x_i$:
$$\hat\beta_0 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i} - \hat\beta_1\frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i}$$
Next, consider the Equation (2.5),
$$n\hat\beta_0 = \sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i$$
and divide both sides by n:
$$\hat\beta_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \hat\beta_1\frac{\sum_{i=1}^{n} x_i}{n}$$
Since $\bar y$ is the mean of y and $\bar x$ is the mean of x, we obtain
$$\hat\beta_0 = \bar y - \hat\beta_1\bar x \qquad (2.7)$$

We can rearrange the Equation (2.6) as
$$\hat\beta_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i - \hat\beta_0\sum_{i=1}^{n} x_i$$
Again, we divide both sides by $\sum_{i=1}^{n} x_i^2$:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2} - \hat\beta_0\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2} \qquad (2.8)$$
Before, we obtained $\hat\beta_0$ as
$$\hat\beta_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \hat\beta_1\frac{\sum_{i=1}^{n} x_i}{n}$$
Now, substituting this expression for $\hat\beta_0$ from the Equation (2.7) into the Equation (2.8),
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2} - \left(\frac{\sum_{i=1}^{n} y_i}{n} - \hat\beta_1\frac{\sum_{i=1}^{n} x_i}{n}\right)\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}$$
Since we want to collect $\hat\beta_1$ on one side of the equation, we get
$$\frac{\sum y_i x_i}{\sum x_i^2} - \frac{\sum y_i \sum x_i}{n\sum x_i^2} = \hat\beta_1 - \hat\beta_1\frac{\sum x_i \sum x_i}{n\sum x_i^2} = \hat\beta_1\left(1 - \frac{\sum x_i \sum x_i}{n\sum x_i^2}\right)$$
In the end, we obtain $\hat\beta_1$ as
$$\hat\beta_1 = \frac{\frac{1}{\sum x_i^2}\left(\sum y_i x_i - \frac{\sum y_i \sum x_i}{n}\right)}{\frac{1}{n\sum x_i^2}\left(n\sum x_i^2 - (\sum x_i)^2\right)} \qquad (2.9)$$
$$= \frac{\sum y_i x_i - \frac{\sum y_i \sum x_i}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} \qquad (2.10)$$
Thus, we found that $\hat\beta_0$ and $\hat\beta_1$ are the least squares estimators of the intercept ($\beta_0$) and the slope ($\beta_1$). So, the fitted linear equation can be stated as
$$\hat y = \hat\beta_0 + \hat\beta_1 x \qquad (2.11)$$

The Equation (2.11) now gives a point estimate of the mean of the response variable y.

Let $S_{xx}$ be the sum of squares of the differences between x and the mean of x:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 \qquad (2.12)$$
$$= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \qquad (2.13)$$
Let $S_{xy}$ be defined as
$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) \qquad (2.14)$$
$$= \sum_{i=1}^{n} y_i x_i - \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n} \qquad (2.15)$$
By the Equation (2.9) together with the Equations (2.12) and (2.14), we get
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} \qquad (2.16)$$
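The closed-form estimators (2.7) and (2.16) can be computed directly; the sketch below (a minimal illustration, not code from the thesis) does so on synthetic data with assumed true coefficients.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, 200)    # assumed true intercept 1.5, slope 0.8

# S_xx and S_xy as in Equations (2.12) and (2.14)
S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((y - y.mean()) * (x - x.mean()))

beta1_hat = S_xy / S_xx                        # Equation (2.16)
beta0_hat = y.mean() - beta1_hat * x.mean()    # Equation (2.7)
print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")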

The difference between the observed value $y_i$ and the corresponding fitted value $\hat y_i$ is called the residual [16, 17]:
$$e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \qquad i = 1, 2, \ldots, n$$
The residual values are important for assessing the adequacy of the model.

The bias of an estimator is the difference between the expected value of the estimator and the actual value of the parameter being estimated. If the bias of an estimator is zero, then the estimator is called unbiased. An unbiased estimator is an accurate statistic for approximating a population parameter; by accuracy we mean that the statistic does not overestimate or underestimate the population parameter. If there is an overestimation or underestimation, then we say that the estimator is biased [16, 17].

So, if
$$E[u(X_1, X_2, \ldots, X_n)] = \theta$$
holds, then the statistic $u(X_1, X_2, \ldots, X_n)$ is an unbiased estimator of the parameter $\theta$.

The least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of the parameters $\beta_0$ and $\beta_1$ of the model.

If we assume that the model is correct, then we can say that
$$E(y_i) = \beta_0 + \beta_1 x_i \qquad (2.17)$$
and that $\hat\beta_1$ is an unbiased estimator of $\beta_1$ and $\hat\beta_0$ is an unbiased estimator of $\beta_0$. Before, we obtained $\hat\beta_1$ as
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} k_i y_i$$
where $k_i = (x_i - \bar x)/S_{xx}$ for $i = 1, 2, \ldots, n$ and n is the number of data points. It can be seen that
$$\sum_{i=1}^{n} k_i = 0 \qquad (2.18)$$
$$\sum_{i=1}^{n} k_i x_i = 1 \qquad (2.19)$$
because $\sum_{i=1}^{n}(x_i - \bar x) = 0$. Now, consider that
$$E(\hat\beta_1) = E\Big(\sum_{i=1}^{n} k_i y_i\Big) = E\Big(\sum_{i=1}^{n} k_i(\beta_0 + \beta_1 x_i)\Big) = E\Big(\beta_0\sum_{i=1}^{n} k_i + \beta_1\sum_{i=1}^{n} k_i x_i\Big)$$
Because of the Equations (2.17), (2.18), and (2.19),
$$E(\hat\beta_1) = E\Big(\beta_0\sum_{i=1}^{n} k_i + \beta_1\sum_{i=1}^{n} k_i x_i\Big) = E(\beta_1)$$
and since $\beta_1$ is a constant, $E(\beta_1) = \beta_1$. Finally, we obtain
$$E(\hat\beta_1) = \beta_1 \qquad (2.20)$$

The Equation (2.20) says that $\hat\beta_1$ is an unbiased estimator of $\beta_1$. Now, we consider the relationship between $\beta_0$ and $\hat\beta_0$. We know from the Equation (2.7) that
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
So,
$$E(\hat\beta_0) = E(\bar y - \hat\beta_1 \bar x)$$
We know that
$$\bar y = \frac{1}{n}\sum y_i \qquad \text{and} \qquad \bar x = \frac{1}{n}\sum x_i$$
Therefore,
$$E(\hat\beta_0) = E(\bar y - \hat\beta_1 \bar x) = E\Big(\frac{1}{n}\sum y_i\Big) - E(\hat\beta_1)\frac{\sum x_i}{n} = \frac{1}{n}\sum E(y_i) - E(\hat\beta_1)\frac{\sum x_i}{n}$$
We have already found that $\hat\beta_1$ is an unbiased estimator of $\beta_1$, and we know that $E(y_i) = \beta_0 + \beta_1 x_i$. Hence,
$$E(\hat\beta_0) = \frac{1}{n}\sum(\beta_0 + \beta_1 x_i) - \beta_1\frac{\sum x_i}{n} = \frac{1}{n}\sum\beta_0 + \frac{1}{n}\beta_1\sum x_i - \beta_1\frac{1}{n}\sum x_i = \frac{1}{n}\sum\beta_0$$
Since the size of the data set is n, we have $\sum\beta_0 = n\beta_0$. In the end, we obtain
$$E(\hat\beta_0) = \frac{1}{n}\,n\beta_0 = \beta_0 \qquad (2.21)$$
Thus, we showed that $\hat\beta_0$ is an unbiased estimator of $\beta_0$ by proving the Equation (2.21), and that $\hat\beta_1$ is an unbiased estimator of $\beta_1$ through the proof of the Equation (2.20).
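A quick numerical check of this unbiasedness (an illustrative sketch, not part of the thesis): we repeatedly simulate data from a model with assumed true coefficients and average the least squares estimates.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, -1.3, 1.0      # assumed true parameters
x = np.linspace(0, 5, 40)                 # fixed design points

estimates = []
for _ in range(5000):                     # many simulated samples
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))

mean_b0, mean_b1 = np.mean(estimates, axis=0)   # averages are close to the true values
print(f"E[b0_hat] = {mean_b0:.3f} (true {beta0}), E[b1_hat] = {mean_b1:.3f} (true {beta1})")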

2.2.1 Multiple Linear Regression

The models which involve more than one regressor are called multiple regression models [16]. In a multiple regression model, the association between the response variable and the regressor variables is stated as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon \qquad (2.22)$$
where $\varepsilon$ is the error, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \ldots, \beta_k$ are the regression constants. Models which have interaction effects can also be analyzed by multiple linear regression methods [16]. For instance, consider the equation
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon \qquad (2.23)$$
The Equation (2.23) can be expressed as a multiple linear regression model by setting $x_3 = x_1 x_2$ and $\beta_3 = \beta_{12}$. Similarly, second order models with interaction terms and polynomial models can be set up as multiple linear regression models.

In multiple linear regression, we use the least squares method to estimate the regression constants $\beta_0, \beta_1, \ldots, \beta_k$, as in the simple linear regression. The equation for a single sample can be written as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n, \quad k, n \in \mathbb{N}$$
In the simple linear regression we estimated the regression constants through the least squares function; again, we will use the least squares function to estimate the regression constants in the multiple linear regression. Let S be the least squares error function:
$$S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2$$

The function S must be minimized with respect to the parameters $\beta_0, \beta_1, \ldots, \beta_k$. So, we differentiate S with respect to $\beta_0, \ldots, \beta_k$ and set the derivatives equal to zero. Taking the derivative of S with respect to $\beta_0$,
$$\frac{\partial S}{\partial \beta_0}\bigg|_{\hat\beta_0, \ldots, \hat\beta_k} = -2\sum_{i=1}^{n}\Big(y_i - \hat\beta_0 - \sum_{j=1}^{k}\hat\beta_j x_{ij}\Big) = 0$$
we obtain
$$n\hat\beta_0 + \hat\beta_1\sum x_{i1} + \ldots + \hat\beta_k\sum x_{ik} = \sum y_i$$
Now, we take the derivative of S with respect to $\beta_j$ for $j = 0, 1, \ldots, k$, where $k + 1$ is the number of parameters:
$$\frac{\partial S}{\partial \beta_j}\bigg|_{\hat\beta_0, \ldots, \hat\beta_k} = -2\sum_{i=1}^{n}\Big(y_i - \hat\beta_0 - \sum_{j=1}^{k}\hat\beta_j x_{ij}\Big)x_{ij} = 0$$
This yields the normal equations
$$\hat\beta_0\sum x_{i1} + \hat\beta_1\sum x_{i1}^2 + \hat\beta_2\sum x_{i2}x_{i1} + \ldots + \hat\beta_k\sum x_{ik}x_{i1} = \sum x_{i1} y_i$$
$$\hat\beta_0\sum x_{i2} + \hat\beta_1\sum x_{i1}x_{i2} + \hat\beta_2\sum x_{i2}^2 + \ldots + \hat\beta_k\sum x_{ik}x_{i2} = \sum x_{i2} y_i$$
$$\vdots$$
$$\hat\beta_0\sum x_{ik} + \hat\beta_1\sum x_{i1}x_{ik} + \hat\beta_2\sum x_{i2}x_{ik} + \ldots + \hat\beta_k\sum x_{ik}^2 = \sum x_{ik} y_i$$
There is one such equation for each unknown regression constant. We can state these equations in vector and matrix form:

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1k} \\ 1 & x_{21} & x_{22} & \ldots & x_{2k} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \end{bmatrix}$$
$$y = X\beta + \varepsilon \qquad (2.24)$$

The least squares function is minimized with respect to the regression constant vector $\beta$:
$$S(\beta) = \sum\varepsilon_i^2 = \varepsilon^{\top}\varepsilon = (y - X\beta)^{\top}(y - X\beta) = y^{\top}y - y^{\top}X\beta - \beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta$$
The term $\beta^{\top}X^{\top}y$ is a $1 \times 1$ matrix, i.e. a scalar, so its transpose is equal to itself. In that case, the S function can be written as
$$S(\beta) = y^{\top}y - 2\beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta$$
In the simple linear regression we found the estimators for only one regressor; now we have multiple regressors, so we calculate the estimator as a vector. We need to minimize the S function, and the minimization is carried out by differentiating S with respect to the vector $\beta$:
$$\frac{\partial S}{\partial \beta}\bigg|_{\hat\beta} = -2X^{\top}y + 2X^{\top}X\hat\beta = 0, \qquad X^{\top}X\hat\beta = X^{\top}y, \qquad \hat\beta = (X^{\top}X)^{-1}X^{\top}y \qquad (2.25)$$
The Equation (2.25) requires that the inverse $(X^{\top}X)^{-1}$ exists. The inverse always exists if the regressors are linearly independent, in other words, if no column of X is a linear combination of the other columns.
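The estimator (2.25) can be computed directly with NumPy; the sketch below (an illustrative example with made-up data, not code from the thesis) also shows the numerically preferable least squares solver.

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with intercept column
true_beta = np.array([2.0, 1.0, -0.5, 0.3])                  # assumed true coefficients
y = X @ true_beta + rng.normal(0, 0.1, n)

# Equation (2.25): beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice a least squares solver is preferred over an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.round(3), beta_lstsq.round(3))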

Collinearity is the phenomenon in which a feature variable is highly correlated with another feature variable in a regression model. In this case, the regression coefficients cannot be determined uniquely, which damages the interpretability of the model, because the regression coefficients are not unique and are affected by the other features. If collinearity occurs, X may have less than full rank, so the columns of X may not be linearly independent and $X^{\top}X$ may not be invertible. In that situation the term $(X^{\top}X)^{-1}X^{\top}y$ may not exist and the estimator of $\beta$ cannot be found. Therefore, collinearity should be avoided [12, 16–18].

2.2.2 The Regularization

Regularization is the process of modifying the learning algorithm in an attempt to obtain a better prediction [17, 19]. In regularization, we modify the loss function to penalize the learned weight values. The learning algorithm is regularized by adding to the loss function regularization terms which involve the parameters. Explicitly, we will consider
$$L(w; X) + \lambda R(w)$$
where $R(w)$ is a regularization term and $\lambda$ is defined as the strength of the regularization. The regularization term $R(w)$ is defined differently based on the norm of the weights [19]. We will review these different regularization terms.

2.2.2.1 The L2 Regularization

A norm measures the length of a vector. In the L2 regularization we use the L2 norm. The L2 norm of the weight vector is computed as
$$\|w\|_2 = \sqrt{\sum_{i=1}^{k} w_i^2}$$
where k is the total number of weights [19].

If we add the squared L2 norm to the loss function, the combined function is stated as
$$L(w; X) + \lambda\|w\|_2^2$$
where X represents the data, w is the weight vector, $L(w; X)$ is the loss function, and $\lambda$ is the tuning parameter.

The L2 regularization is the most common type of regularization, and it is called ridge regression when it is used with linear regression. The regularization term $\|w\|_2^2$ is a convex function, and when we add this term to a convex loss function the combined function is still convex [19].

Through the regularization term we prefer weights which are close to zero, since they minimize the expression $\lambda\|w\|_2^2$. The parameter $\lambda$ controls the trade-off between the weight penalty and the loss function.

2.2.2.2 The L1 Regularization

Another commonly used type of regularization is the L1 regularization, in which we use the L1 norm. The norm used in the L1 regularization is defined as
$$\|w\|_1 = \sum_{j=1}^{k} |w_j|$$
where k is the total number of weights. If the L1 regularization is used in regression, the resulting method is called lasso regression [19].


Let us consider the loss function with the L1 penalty. We can state the combined function as
$$L(w; X) + \lambda\|w\|_1$$
where X represents the data, w is the weight vector, $L(w; X)$ is the loss function, and $\lambda$ is the tuning parameter.

The tuning parameter $\lambda$ is used to control the strength of the penalty term; it determines the amount of shrinkage [19].

In the L1 regularization, many weights can become exactly zero. If the tuning parameter $\lambda$ is equal to zero, no weights are eliminated from the model. As $\lambda$ increases, more and more weights become zero, and all weights are eliminated from the model as $\lambda$ goes to infinity.
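A small numerical illustration of the two penalties (a sketch with made-up numbers, not taken from the thesis): the same squared-error loss is combined with the L1 and L2 penalty terms for a fixed weight vector.

import numpy as np

w = np.array([0.0, 0.7, -1.2, 0.05])        # an example weight vector
residuals = np.array([0.3, -0.1, 0.2])      # example residuals of some model
loss = np.sum(residuals ** 2)               # squared-error loss L(w; X)
lam = 0.1                                   # tuning parameter lambda

l2_penalty = lam * np.sum(w ** 2)           # lambda * ||w||_2^2  (ridge-style penalty)
l1_penalty = lam * np.sum(np.abs(w))        # lambda * ||w||_1    (lasso-style penalty)

print("loss + L2 penalty:", loss + l2_penalty)
print("loss + L1 penalty:", loss + l1_penalty)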

2.2.3 The Ridge Regression

We need to use alternatives to the least squares estimators when the regressor variables carry some collinearity [16, 17]. One of these alternative methods is the ridge regression, a regularization method used to avoid over-fitting. In the ridge regression, we introduce a penalty term which restricts the size of the parameter vector $\beta$. Accordingly, the ridge estimate is defined as
$$\hat\beta_R = \arg\min_{\beta}\Big\|y - \sum_{j=1}^{k}\beta_j x_j\Big\|^2 \quad \text{s.t.} \quad \sum_{j=1}^{k}\beta_j^2 \le t, \quad t \ge 0 \qquad (2.27)$$
where k is the number of predictors and $\|\cdot\|^2$ is the squared Euclidean norm. A penalized regression formulation of the Equation (2.27) is given as
$$\hat\beta_R = \arg\min_{\beta}\Big(\Big\|y - \sum_{j=1}^{k}\beta_j x_j\Big\|^2 + \lambda_1\sum_{j=1}^{k}\beta_j^2\Big), \qquad \lambda_1 > 0 \qquad (2.28)$$
where $\lambda_1$ is the penalty term. Based on the Equation (2.28), the regression constant vector is determined as
$$\hat\beta_R = (X^{\top}X + \lambda I)^{-1}X^{\top}y \qquad (2.29)$$
A regular, invertible matrix is obtained by adding $\lambda I$ to $X^{\top}X$, where I is the $k \times k$ identity matrix.

The estimator obtained by the ridge regression is not unbiased, in contrast with the least squares estimators used in both the simple linear regression and the multiple linear regression methods. The regularization used in the ridge regression allows some bias in order to reduce the mean squared error and the variance; correspondingly, the generated model is less sensitive to changes in the data [16]. In the ridge regression, the regression coefficients do not become exactly zero but approach zero when $\lambda$ is large. The regularization has no effect when $\lambda$ is zero, in which case we recover ordinary linear regression [17].

The interpretability of a model decreases if the target variable depends on too many features, and increases if we decrease the number of features. The ridge regression provides a more stable estimation by shrinking the coefficients, but it has no ability to decrease the number of features [16]. Therefore, it does not yield an easily interpretable model, and for the same reason no feature selection is possible with the ridge regression. Hence, we need other regularization methods such as the lasso regression.
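The closed-form ridge estimator (2.29) is easy to compute; the sketch below (an illustrative example with an arbitrary lambda and synthetic collinear data, not code from the thesis) compares it with ordinary least squares.

import numpy as np

rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, n)          # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(0, 0.1, n)

lam = 1.0
I = np.eye(X.shape[1])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # Equation (2.29)

print("OLS  :", beta_ols.round(2))    # unstable under collinearity
print("Ridge:", beta_ridge.round(2))  # shrunken, more stable coefficients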

2.2.4 The Lasso Regression

The lasso regression introduces a regularization technique that bounds the size of the regression constants and, at the same time, performs a feature selection. The least squares estimators are obtained by minimizing the residual sum of squares, and the ridge regression introduced a regularization constraint based on the L2 norm. In the lasso regression, we use the L1 norm instead of the L2 norm for the regularization; in other words, the sum of the absolute values of the coefficients is constrained [19, 20]:
$$\hat\beta_L = \arg\min_{\beta}\Big\|y - \sum_{j=1}^{k}\beta_j x_j\Big\|^2 \quad \text{s.t.} \quad \sum_{j=1}^{k}|\beta_j| \le t, \quad t \ge 0 \qquad (2.30)$$
where k is the number of predictors and $\|\cdot\|^2$ is the squared Euclidean norm. Equivalently, the lasso method optimizes
$$\hat\beta_L = \arg\min_{\beta}\Big(\Big\|y - \sum_{j=1}^{k}\beta_j x_j\Big\|^2 + \lambda_2\sum_{j=1}^{k}|\beta_j|\Big), \qquad \lambda_2 > 0 \qquad (2.31)$$
where $\lambda_2$ is the penalty term.

The purpose of the lasso regression is to fit a linear model while controlling the size of the regression coefficients. The estimated lasso regression coefficients shrink towards zero as the value of the constraint parameter t decreases, and coefficients can become exactly zero because of the L1 regularization term. More coefficients may become zero as $\lambda$ approaches infinity; if $\lambda$ is zero, ordinary linear regression is recovered because there is no regularization. The lasso regression is a powerful technique which makes feature selection possible [19, 20].

2.2.5 The Elastic Net Regression

The Lasso regression may not be sufficient for regularization for high dimensional data with few instances. Also, if the data set has highly correlated variables then the lasso regression tries to select one variable but ignores others. To avoid this problem, the elastic net method adds a penalty term to the ridge regression [17, 19]. We can say that the elastic net regression is a regularization method that involves both L1 and L2 penalties used in the lasso regression and the ridge regression. So, the estimates from the elastic net regression is described as

\hat{\beta}_E = \operatorname*{argmin}_{\beta} \Big( \sum_{i=1}^{n} \big( y_i - \sum_{j=1}^{k} \beta_j x_{ij} \big)^2 \Big) \quad \text{s.t.} \quad \sum_{j=1}^{k} |\beta_j| \le t \ \text{and} \ \sum_{j=1}^{k} \beta_j^2 \le t, \quad t \ge 0 \qquad (2.32)

Equivalently, the elastic net method defines the constant vector \hat{\beta} satisfying the equation

\hat{\beta}_E = \operatorname*{argmin}_{\beta} \Big( \sum_{i=1}^{n} \big( y_i - \sum_{j=1}^{k} \beta_j x_{ij} \big)^2 + \lambda_1 \sum_{j=1}^{k} |\beta_j| + \lambda_2 \sum_{j=1}^{k} \beta_j^2 \Big), \quad \lambda_1, \lambda_2 > 0 \qquad (2.33)

where λ_1 and λ_2 are penalty terms [21]. We can tune the size of these terms to

find the best fit of the model through cross-validation. The elastic net regression is a regularization method which is used efficiently on high-dimensional data sets.
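As an illustration of tuning the two penalty terms by cross-validation, the following Python sketch uses scikit-learn's ElasticNetCV on hypothetical data with correlated features; the grid of l1_ratio values and the data set are arbitrary choices rather than prescriptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# hypothetical high-dimensional data with strongly correlated features
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       effective_rank=10, noise=1.0, random_state=0)

# ElasticNetCV selects the overall penalty strength (alpha) and the
# L1/L2 mixing ratio (l1_ratio) by cross-validation
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print("selected alpha:", model.alpha_)
print("selected l1_ratio:", model.l1_ratio_)
print("nonzero coefficients:", int((model.coef_ != 0).sum()))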

2.2.6 Nonlinear Regression

There are many problems which are not suitable for linear regression models. For instance, the equation defining the relationship between the response variable and the regressors may be a differential equation, or the model may require solving a differential equation. In such a situation, linear regression is not adequate for solving the problem [17].

A model which is not linear in the unknown parameters is called a nonlinear model. For example, an equation of the form y = α_1 e^{α_2 x} + ε (cf. Equation (2.37) below) is not linear in the parameters α_1 and α_2. In general, the nonlinear model is

defined as [21]

y= f (x, α) + ε (2.35)

where α is the vector of unknown parameters and ε is an uncorrelated random error term such that E(ε) = 0 and Var(ε) = σ².

As in linear regression, it is also assumed that the errors are normally distributed, so that E(ε) = 0. Consequently,

E(y) = E[f(x, \alpha) + \varepsilon] = E(f(x, \alpha)) + E(\varepsilon) = f(x, \alpha)

where f(x, α) is called the expectation function.

In linear regression, the derivatives of the expectation function with respect to the parameters do not depend on the parameters. In nonlinear regression, however, the derivatives taken with respect to the model parameters are functions that may still depend on those parameters.

Consider a linear model as

y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + \varepsilon, \qquad \frac{\partial f(x, \beta)}{\partial \beta_j} = x_j

In a linear model, the derivatives with respect to the regression coefficients do not involve the unknown coefficients; they are constant with respect to the parameters.

Now, consider a nonlinear model as

y = f(x, \alpha) + \varepsilon \qquad (2.36)
= \alpha_1 e^{\alpha_2 x} + \varepsilon \qquad (2.37)

where we have

\frac{\partial f(x, \alpha)}{\partial \alpha_1} = e^{\alpha_2 x}, \qquad \frac{\partial f(x, \alpha)}{\partial \alpha_2} = \alpha_1 x e^{\alpha_2 x}

So, in a nonlinear model the derivatives of the expectation function with respect to the parameters are themselves functions of the unknown parameters (α_1, α_2).
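These derivatives can be checked symbolically. The following small Python sketch, given purely as a verification aid, uses SymPy to differentiate the expectation function f(x, α) = α_1 e^{α_2 x} with respect to each parameter.

import sympy as sp

x, a1, a2 = sp.symbols('x alpha1 alpha2')
f = a1 * sp.exp(a2 * x)   # the expectation function of Equation (2.37)

print(sp.diff(f, a1))     # exp(alpha2*x): depends on the parameter alpha2
print(sp.diff(f, a2))     # alpha1*x*exp(alpha2*x): depends on alpha1 and alpha2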


2.2.6.1 Nonlinear Least Squares

The sum of square errors function was defined as

S(\beta) = \sum_{i=1}^{n} \Big( y_i - \big( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \big) \Big)^2

in the linear regression model. To optimize the sum of square errors, one calculates the derivatives of this function with respect to the regression coefficients and sets them to zero. The resulting normal equations are linear, so solving them is easy.

Now, consider the nonlinear situation:

y_i = f(x_i, \alpha) + \varepsilon_i, \qquad x_i^{\top} = [1 \; x_{i1} \; \dots \; x_{ik}], \qquad i = 1, 2, \dots, n

S(\alpha) = \sum_{i=1}^{n} \big( y_i - f(x_i, \alpha) \big)^2 \qquad (2.38)

The derivatives of (2.38) with respect to each α_j must be equal to zero:

\sum_{i=1}^{n} \big( y_i - f(x_i, \alpha) \big) \left. \frac{\partial f(x_i, \alpha)}{\partial \alpha_j} \right|_{\alpha = \hat{\alpha}} = 0 \quad \text{for } j = 1, \dots, p

Consider again the example of Equation (2.37), y = α_1 e^{α_2 x} + ε.

We shall take the derivatives of the equation with respect to the parameters α_1 and α_2 and then set them to zero. First, taking the derivative with respect to α_1 and setting it to zero gives

\sum_{i=1}^{n} \big( y_i - \hat{\alpha}_1 e^{\hat{\alpha}_2 x_i} \big) e^{\hat{\alpha}_2 x_i} = 0

Then we apply the same process for the α_2 parameter:

\sum_{i=1}^{n} \big( y_i - \hat{\alpha}_1 e^{\hat{\alpha}_2 x_i} \big) \hat{\alpha}_1 x_i e^{\hat{\alpha}_2 x_i} = 0

Rearranging these two equations, the normal equations

\sum_{i=1}^{n} y_i e^{\hat{\alpha}_2 x_i} - \hat{\alpha}_1 \sum_{i=1}^{n} e^{2\hat{\alpha}_2 x_i} = 0

\sum_{i=1}^{n} y_i x_i e^{\hat{\alpha}_2 x_i} - \hat{\alpha}_1 \sum_{i=1}^{n} x_i e^{2\hat{\alpha}_2 x_i} = 0

are obtained.


In the linear regression model, the residual sum of squares function depends only on the regression coefficients for a given sample, so it can be represented with a contour plot. Each contour on the surface is a curve of constant residual sum of squares. The contours are ellipsoidal and have a unique global minimum at the least squares estimator [17].

In the nonlinear regression model, the contours are not elliptical; they may be elongated and irregular, and their appearance depends on the form of the nonlinear model and on the available data. Many values of α may produce a residual sum of squares close to the global minimum, which makes it difficult to locate the global minimizer and renders the problem ill-conditioned. When the contours are very irregular, the surface may have several local minima and even more than one global minimum [17].
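In practice the nonlinear normal equations are solved numerically. The following Python sketch, given only as an illustration with arbitrarily chosen true parameters and synthetic data, uses scipy.optimize.curve_fit to minimize S(α) for the model of Equation (2.37).

import numpy as np
from scipy.optimize import curve_fit

def f(x, a1, a2):
    # expectation function f(x, alpha) = alpha_1 * exp(alpha_2 * x)
    return a1 * np.exp(a2 * x)

# synthetic data generated from known parameters alpha_1 = 2.0, alpha_2 = 1.3
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 50)
y = 2.0 * np.exp(1.3 * x) + rng.normal(scale=0.2, size=x.size)

# curve_fit minimizes the residual sum of squares S(alpha) iteratively;
# a sensible starting point p0 matters because the surface can be irregular
alpha_hat, _ = curve_fit(f, x, y, p0=(1.0, 1.0))
print(alpha_hat)   # should be close to (2.0, 1.3)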

2.2.6.2 Maximum Likelihood Function

In the nonlinear situation, if the error terms are normally distributed with constant variance, then one can use the maximum likelihood method to estimate the parameters [19]. Let us consider the equation

y_i = \alpha_1 e^{\alpha_2 x_i} + \varepsilon_i, \qquad i = 1, 2, \dots, n

If the errors are normally distributed with mean zero and variance σ2, then the likelihood function is found as

L(\alpha, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - \alpha_1 e^{\alpha_2 x_i} \big)^2 \Big)

Maximizing the likelihood function is equivalent to minimizing the residual sum of square errors. In other words, the least square estimate is the same as the maximum likelihood estimate.
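This equivalence can be seen by taking the logarithm of the likelihood function; the derivation below is a standard step written out for completeness:

\ln L(\alpha, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big( y_i - \alpha_1 e^{\alpha_2 x_i} \big)^2

For fixed σ², the first term does not depend on α, so maximizing ln L over α is the same as minimizing the residual sum of squares S(α).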

2.3 Logistic Regression and Multinomial Regression

The logistic regression proposes a model relating a categorical dependent variable to independent variables. There are two types of models: the logistic regression and the multinomial logistic regression [3]. We use the logistic regression when the dependent variable can have only two values, such as 0 and 1 or "Yes" and "No", while we use the multinomial logistic regression when the dependent variable has three or more unique values.

[Figure 2.1: The sigmoid function σ(x) = e^x / (1 + e^x)]

2.3.1 Odds

Assume that 0 and 1 are the two outcomes of a binary dependent variable. In general, 0 symbolizes a negative response and 1 symbolizes a positive response. The mean of this variable is the proportion of positive responses. If p is the proportion of observations with a positive outcome, then 1 − p is the proportion with a negative outcome. The ratio of these proportions, p/(1 − p), is called the odds, and the logarithm of the odds is called the logit [3].

\text{odds}(p) = \frac{p}{1 - p}, \qquad \ell = \text{logit}(p) = \ln\Big( \frac{p}{1 - p} \Big)

The value p is between 0 and 1, and logit(p) ranges between minus infinity and plus infinity.
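For a concrete example with an arbitrarily chosen value of p:

p = 0.8 \;\Rightarrow\; \text{odds}(p) = \frac{0.8}{0.2} = 4, \qquad \text{logit}(p) = \ln 4 \approx 1.386

while p = 0.5 gives odds equal to 1 and logit equal to 0.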

2.3.2 The Sigmoid Function

The sigmoid function is the inverse of the logit transformation:

\sigma(\ell) = \frac{e^{\ell}}{1 + e^{\ell}}

The sigmoid function is a non-linear function used in the logistic regression for predictions. It generates values between 0 and 1, modelling a probability. However, when we must produce an outcome of 1 or 0, we determine a threshold p_0: if the model produces a probability p greater than the threshold value p_0, then we set the output to 1, and the output is set to 0 otherwise.
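The following Python sketch, given only as an illustration with an arbitrary threshold of p_0 = 0.5, shows how the sigmoid maps real values to probabilities and how the threshold turns them into 0/1 outputs.

import numpy as np

def sigmoid(z):
    # sigma(z) = e^z / (1 + e^z), written equivalently as 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(z, threshold=0.5):
    # output 1 when the modelled probability exceeds the threshold p0, else 0
    return (sigmoid(z) >= threshold).astype(int)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z))         # values strictly between 0 and 1
print(predict_label(z))   # hard 0/1 decisions with p0 = 0.5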

2.3.3 The Logistic Regression

In the logistic regression, the model has a dependent variable with two unique values, and this variable is regressed on n independent variables denoted by x_1, x_2, ..., x_n, where n ∈ N. The regression model estimates the logarithm of the odds, which is expressed as

\ln\Big( \frac{p}{1 - p} \Big) = \alpha + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \qquad (2.40)

The model parameters are θ = (θ1, . . . , θn) and α. Thus, the equation is expressed as:

\ln\Big( \frac{p}{1 - p} \Big) = \alpha + \theta x \qquad (2.41)

When we apply the sigmoid function on both sides, we get

p = \frac{e^{\alpha + \theta x}}{1 + e^{\alpha + \theta x}} = \sigma(\alpha + \theta x)

We can think of p as the probability of the target value Y being 1 given that the independent variables take the value X. So, we can state that

p = P(Y \mid X) = \frac{e^{\alpha + \theta x}}{1 + e^{\alpha + \theta x}} = \sigma(\alpha + \theta x) \qquad (2.42)
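As a small illustration of Equation (2.42), the Python sketch below fits a logistic regression with scikit-learn on hypothetical synthetic data generated from known values of α and θ; the data set and the true parameter values are arbitrary.

import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: one feature, true alpha = 0.5 and theta = 2.0
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimates of alpha and theta
print(model.predict_proba(x[:3]))      # P(Y=0|X) and P(Y=1|X) for three samples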

2.3.3.1 The Cost Function

The cost function is used for reducing the difference between the target value and the output value to a minimum. For example, the mean squared error is the cost function we use in the linear regression. However, we cannot use this cost function for the logistic regression because the function we use in the logistic regression is not linear [18, 22]. We see that

P(y = 1 \mid x) = \sigma(\theta \cdot x + \alpha) = \frac{1}{1 + e^{-(\theta \cdot x + \alpha)}}

P(y = 0 \mid x) = 1 - \sigma(\theta \cdot x + \alpha) = 1 - \frac{1}{1 + e^{-(\theta \cdot x + \alpha)}} = \frac{1}{1 + e^{\theta \cdot x + \alpha}}
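These two expressions can be combined into a single formula for the probability of one observation; the negative logarithm of the resulting likelihood, shown below as a brief aside, is the cost function commonly used for logistic regression, namely the cross-entropy:

P(y \mid x) = \sigma(\theta \cdot x + \alpha)^{\,y} \, \big( 1 - \sigma(\theta \cdot x + \alpha) \big)^{1 - y}, \qquad y \in \{0, 1\}

J(\theta, \alpha) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \ln \sigma(\theta \cdot x_i + \alpha) + (1 - y_i) \ln\big( 1 - \sigma(\theta \cdot x_i + \alpha) \big) \Big]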
