Primitive Diabetes Prediction using Machine Learning Models: An Empirical
Investigation
1Priyabrata Sahoo, 2Prachet Bhuyan
1. KIIT Deemed to be University, Bhubaneswar, Odisha, India. [email protected] 2. KIIT Deemed to be University, Bhubaneswar, Odisha, India. [email protected]
Article History: Received: 11 January 2021; Revised: 12 February 2021; Accepted: 27 March 2021; Published online: 10 May 2021
Abstract: Diabetes is a prevalent metabolic disease that causes a large number of deaths each year. It affects patients of all age groups and large populations around the world. Early detection enables early treatment, which improves the prognosis. Untreated diabetes impairs the proper functioning of other organs in the human body. The contemporary lifestyle has led to unhealthy eating habits that cause type 2 diabetes. Various computational intelligence techniques have been proposed for predicting diabetes based on historical records of related symptoms. Early detection is therefore important for maintaining a healthy lifestyle. Because diabetes cannot be cured and places a great burden on the health system, machine learning support for detecting the disease at an early stage is especially valuable. In this research article, a novel model is generated for the prediction of diabetes using Gaussian Naive Bayes (GNB), Random Forest (RF), Support Vector Classifier (SVC) and Multinomial Naïve Bayes (MNB), based on the significant attributes and the relationships among the attributes. The computation of the model is simple, which enables an efficient prediction process. Computer simulation experiments are performed on the diabetes dataset to compare the performance of the different models.
Keywords: Support Vector Classifier, Random Forest, Gaussian Naive Bayes, Multinomial Naïve Bayes, Diabetes.
1. Introduction
A disease or condition that is continual, or whose effects are permanent, is a chronic condition. Such diseases have a major adverse effect on quality of life. Diabetes is one of the most serious of these diseases and is present worldwide. This chronic condition is a major cause of death in adults across the globe. Chronic conditions are also costly: governments and individuals spend a major portion of their budgets on chronic diseases [1,2].
Diabetes is detected when the glucose level of the blood becomes high, which eventually leads to additional health problems such as kidney disease and heart disease. Numerous data mining projects have implemented algorithms to predict diabetes in a patient. Data mining has been successfully applied to different fields in human society, such as market analysis, weather prognosis, customer relationship management and engineering diagnosis. Its application to disease prediction and medical data analysis still has scope for improvement in accuracy.
Machine learning is closely related to Artificial Intelligence (AI) and builds software applications that forecast outcomes through statistical analysis. The techniques used aim to reach an optimal accuracy rate in forecasting the output from the input data. Machine learning follows processes similar to those of data mining and predictive modeling: the algorithms identify patterns in the input data and then adjust the actions of the program accordingly.
Machine learning algorithms are classified as supervised learning and unsupervised learning. Supervised learning needs input data and the desired output data to construct a training model. The training model is constructed by a data analyst or a data scientist. Feedback is then provided on the accuracy of the model and other output metrics during training, and the model is revised as required. Once the training phase is finished, the model can forecast outcomes for new data.
Classification is one of the major data mining tasks. It falls under supervised learning, which means that the machine learns through labeled examples.
Research on biological data has been limited, but over time computational and statistical models have become available for its analysis. A sufficient amount of data is also being gathered by healthcare organizations. New knowledge is gained when models are developed to learn from the observed data using data mining techniques. Data mining is the process of extracting knowledge from data and can be used to create an efficient decision-making process in the medical domain [3]. Several data mining techniques have been used for disease prediction as well as for knowledge discovery from biomedical data [4,5].
Diabetes causes death in its final stage; before that, it leads to several disorders. It is therefore important to detect this disease at an early stage, which allows physicians to treat the patient with a proper diagnosis. In many cases, physicians without adequate experience of diabetes provide erroneous treatment [7]. The contribution of this paper is a novel feature selection approach that removes unwanted features and provides better classification accuracy, which aids early prediction of diabetes and helps reduce the death rate due to diabetes.
2. Background
2.1 Support Vector Classifier
SVMs (Support Vector Machines) are a useful technique for data classification [8]. A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one “target value” (i.e. the class label) and several “attributes” (i.e. the features or observed variables). In past years, SVMs have become a centre of research attention because they rest on a rigorous theoretical foundation and also handle small-sample, high-dimensional, nonlinear and local-minima problems well. Support vector machines were originally designed for two-class classification, but with the fast development of computer, database and network technology, and the need to classify and manage huge amounts of information, two-class classification can no longer meet practical needs [9]. Extending the technique to multi-class classification is presently an active research topic. The support vector machine is a small-sample learning method; because it depends on the principle of structural risk minimization rather than the traditional empirical risk minimization principle, it outperforms existing methods in many respects. The support vector machine was developed from the optimal separating surface in the linearly separable case. The goal of SVM is to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes.
2.2 Gaussian Naïve Bayes
The Naïve Bayes algorithm uses the probabilities of each attribute belonging to each class in the training set to predict the class of new data instances. A Bayesian Network (BN) establishes a relationship between probability distributions and graphs [10]. BNs have primarily been applied for knowledge representation and reasoning. Recent years have seen several successful applications of BNs in classification, among which the Naïve Bayes classifier has proved remarkably effective in spite of its simple mechanism. It is built upon the strong assumption that the attributes are independent of each other. In spite of its many advantages, a key limitation of the Naïve Bayes classifier is that real-world data seldom satisfy the independence assumption among attributes.
This strong assumption can make the prediction accuracy of the Naïve Bayes classifier highly sensitive to correlated attributes. To overcome this drawback, many approaches have been proposed to improve the performance of the Naïve Bayes classifier [11]. Naïve Bayes makes predictions under the assumption that the attributes are independent of each other given the class. This study uses the Gaussian Naïve Bayes algorithm, which models each continuous attribute with a class-conditional normal distribution.
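As an illustration of the independence assumption described above, the following is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. It is not the implementation used in this study; the function names, the variance floor of 1e-9 and the toy glucose/BMI values are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-attribute mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # A small constant (an assumption) keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one instance."""

    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the normal density; attributes are treated as independent,
            # so their log-likelihoods are simply added.
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total

    return max(model, key=lambda label: log_posterior(*model[label]))


# Hypothetical toy data: [glucose, BMI], with 1 = diabetic and 0 = normal.
X = [[85, 22.0], [90, 24.5], [95, 23.1], [150, 33.0], [165, 35.2], [170, 31.8]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
prediction_low = predict_gaussian_nb(model, [88, 23.0])
prediction_high = predict_gaussian_nb(model, [160, 34.0])
```

Working with log probabilities avoids numerical underflow when many attribute likelihoods are multiplied together.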
2.3 Multinomial Naïve Bayes
Multinomial Naïve Bayes estimates the conditional probability of a particular term t given a class c as the relative frequency of t in documents belonging to class c. This variant takes into account the number of occurrences of term t in training documents from class c, including multiple occurrences. Multinomial Naïve Bayes is a specialized version of Naïve Bayes (NB) that is designed primarily for text documents [12]. The multinomial model is now considered the prevailing modeling approach, and it is reported to perform considerably better in classification tasks than the multivariate Bernoulli model, which introduced language modeling in information retrieval. More generally, NB is a machine learning algorithm used mainly for classification and prediction, including on multidimensional training data sets. Well-known applications include disease prediction, document classification, spam filtering and sentiment analysis. With the NB algorithm, one can build models and make predictions quickly. Only a small amount of training data is needed to estimate the required parameters. The NB technique is called "naive" because it assumes that the value of a feature is independent of the values of the other features.
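The relative-frequency estimate described above can be sketched as follows. This is an illustrative example only: the function name, the toy corpus and the add-one (Laplace) smoothing parameter `alpha` are assumptions, the last being a common addition that keeps unseen terms from receiving zero probability and is not stated in the text.

```python
from collections import Counter


def term_probabilities(documents, labels, target_class, vocabulary, alpha=1.0):
    """Estimate P(t | c) as a smoothed relative frequency of each term t."""
    counts = Counter()
    for tokens, label in zip(documents, labels):
        if label == target_class:
            # Every occurrence is counted, including repeats within a document.
            counts.update(tokens)
    total = sum(counts[t] for t in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / denominator for t in vocabulary}


# Hypothetical toy corpus of tokenized documents and their class labels.
docs = [["high", "glucose", "glucose"], ["normal", "glucose"], ["high", "bmi"]]
labels = ["pos", "neg", "pos"]
vocab = sorted({token for doc in docs for token in doc})

probs = term_probabilities(docs, labels, "pos", vocab)
```

With `alpha=0` the function reduces to the plain relative frequency; the smoothed estimates still sum to one over the vocabulary.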
2.4 Random Forest
The random forest method is a flexible, fast and simple machine learning algorithm that combines tree predictors. Random forest produces satisfactory results most of the time [13]. It is difficult to improve on its performance, and it can handle different types of data, including numerical, binary and nominal. Random forest builds multiple decision trees and aggregates them to achieve more stable and accurate results. It has been used for both classification and regression. Classification is a major task of machine learning. In random forest, a random subset of attributes gives more accurate results on large datasets, and more random trees can be generated by fixing a random threshold for all attributes instead of finding the most accurate threshold. Although the mechanism appears simple, it involves many different driving forces, which makes it hard to analyze. In detail, its mathematical properties remain largely unknown to date, and many theoretical studies so far have concentrated on isolated parts or stylized versions of the algorithm. The statistical behavior of “true” random forests is not yet completely understood and is still under active investigation.
3. Related Work
Most of the work related to machine learning in the domain of diabetes diagnosis has concentrated on the study of the Pima Indian Diabetes dataset in the UCI repository. In this context, Shanker [6] used neural networks to predict the onset of diabetes mellitus among the Pima Indian female population near Phoenix, Arizona. This dataset has been widely used in machine learning experiments and is available through the UCI repository of standard datasets.
Alam et al. [7] showed that machine learning and data mining techniques are valuable in disease diagnosis. The authors used association rule mining, and the results showed a strong association of BMI and glucose with diabetes. In [14], the main contribution of the study was two predictive models built with machine learning techniques, Gradient Boosting Machine and Logistic Regression, in order to identify patients at high risk of developing diabetes mellitus (DM). Neuro-fuzzy systems have also been used by Dazzi et al. [15] for the control of blood glucose level (BGL) in critical diabetic patients, with the main objective of predicting the exact dosage of insulin with the least number of invasive blood tests. A combination of back propagation (BEP) neural networks and fuzzy logic was used to predict the variation in insulin dosage. In [16], the authors make a comparative study of association rules and decision trees to predict the occurrence of certain diseases prevalent in diabetic patients. In [17], the authors deal solely with association rule mining on diabetes patient data, to derive new rules for the prediction of specific diseases in such patients. Singh et al. [18] applied different algorithms to datasets of different types. They used the KNN, random forest and Naïve Bayes algorithms, and the K-fold cross-validation technique was used for evaluation. Amina et al. [19] compared different data mining algorithms on the PID dataset for early prediction of diabetes. Anuja Kumari and R. Chitra [20] used the SVM model to diagnose diabetes using a high-dimensional medical dataset.
It is vital to monitor the glucose levels of patients suffering from diabetes, as this gives them better control over their condition. Glucose monitoring can be used to optimize patient treatment strategies by showing the effect of medication, exercise and/or diet. A self-monitoring of blood glucose (SMBG) device allows the patient to monitor glucose changes and to respond immediately with the appropriate action [21]. SMBG uses glucose sensors based on electrochemical methods [22] and gives patients the ability to self-monitor glucose levels in order to manage insulin levels.
Several investigations of diabetes classification have used clinical datasets such as the PIMA Indian dataset. The dataset covers 768 patients, of whom 268 are diabetic and 500 are normal. Maniruzzaman et al. proposed a Gaussian process (GP)-based classification method using three kernels, namely linear, polynomial and radial basis [23]. The proposed model was compared with existing techniques such as linear discriminant analysis, quadratic discriminant analysis and Naïve Bayes (NB). The outcome showed that the GP-based model performs better than the other methods. A Rule Miner for diabetes classification was proposed by Cheruku et al. [24]. The rule miner was tested against rule-based algorithms such as C4.5, ID3 and CART, along with some meta-heuristic rule mining algorithms. The results showed that the proposed Rule Miner outperformed the other algorithms in terms of average classification accuracy and average sensitivity.
Additionally, Wu et al. presented a novel model based on data mining techniques for predicting type 2 diabetes (T2D) [25]. It consists of two parts: an improved K-means algorithm and a logistic regression algorithm. The improved K-means algorithm was applied to eliminate incorrectly clustered data, after which the logistic regression technique was used to classify the remaining data. The results showed that the presented model achieved higher prediction accuracy than previous studies.
4. Materials & Methods
4.1 Dataset
The dataset used in this study is taken from the Pima Indian Diabetes dataset in the UCI repository. The main objective of using this dataset is to predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. The Pima Indian Diabetes (PID) dataset has 9 attributes (8 predictive attributes plus 1 class attribute) and 768 records describing female patients, of which 500 are negative instances (65.1%) and 268 are positive instances (34.9%).
4.2 Data Preparation
Real-world data can contain missing values as well as noisy and inconsistent data. If data quality is low, quality results cannot be expected. It is necessary to map the relevant entities to the objective of the problem statement in order to achieve quality results.
One of the primary steps of machine learning is data cleaning. It is considered one of the crucial steps of the workflow, because it can make or break the model. Data cleaning consists of filling in the missing values and removing noisy data. Noisy data contains outliers, which are removed to resolve inconsistencies.
There are several factors to consider in the data cleaning process.
• Duplicate or irrelevant observations.
• Bad labeling of data, same category occurring multiple times.
• Missing or null data points.
• Unexpected outliers.
Since we are using a standard data set, we can safely assume that the first two factors have already been dealt with.
Using the pandas library, we checked the data set for missing or null data points. From Table-1 we observed that no data points are missing in the data set.
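The null check described above can be sketched with pandas as follows. The rows below are hypothetical stand-ins, not the actual PID records, and the column subset is chosen only for illustration. Because a null check does not flag placeholder values, the sketch also counts zeros, under the assumption that a zero in these particular columns is an invalid reading rather than a real measurement.

```python
import pandas as pd

# Hypothetical rows standing in for the PID data; the real file is not reproduced here.
df = pd.DataFrame(
    {
        "Glucose": [148, 85, 0, 89],
        "BloodPressure": [72, 66, 64, 0],
        "BMI": [33.6, 26.6, 23.3, 0.0],
        "Outcome": [1, 0, 1, 0],
    }
)

# Count null entries per column.
missing_per_column = df.isnull().sum()

# Assumption: a zero in these columns marks an invalid reading.
suspect_columns = ["Glucose", "BloodPressure", "BMI"]
zero_per_column = (df[suspect_columns] == 0).sum()
```

A data set can therefore report no missing values and still contain readings that need attention before modelling.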
Table-1: Observing Missing Data
Attribute                   Missing values
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
To detect unexpected outliers, we plotted histograms of the features and identified outliers in some columns. By the end of the data cleaning process, we concluded that the data set is incomplete, in that it contains invalid readings in the BloodPressure, BMI and Glucose columns. We therefore proceeded with the given data after some minor adjustments. Figure-1 shows the distribution of the features available in the dataset.
Fig-1: Visualizing the feature distribution
We computed the correlation of every pair of features (and the outcome variable) and visualized the correlations using a heatmap.
Fig-2: Heatmap of feature correlations
In the heatmap, brighter colors indicate stronger correlation. From the heatmap we can say that glucose level, age, BMI and number of pregnancies all have significant correlation with the outcome variable.
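The pairwise correlations behind such a heatmap can be sketched as below. The text does not state which correlation coefficient was used; Pearson's coefficient is assumed here, and the column values are hypothetical toy numbers rather than PID records.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Correlation of every pair of columns, as a nested dictionary."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values for three of the attributes.
data = {
    "Glucose": [85, 90, 120, 150, 165],
    "BMI": [22.0, 24.5, 28.0, 33.0, 35.2],
    "Outcome": [0, 0, 0, 1, 1],
}
corr = correlation_matrix(data)
```

Each cell of the heatmap corresponds to one entry such as `corr["Glucose"]["Outcome"]`; the matrix is symmetric with ones on the diagonal.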
5. Results and discussion
Different classification algorithms were applied to our dataset, and the results differed slightly between techniques because the working criteria of each algorithm are different. The outcome of this study is evaluated using performance metrics such as precision, recall, F-measure, accuracy and ROC. The dataset was divided into a training set and a testing set. The training set is used to train the model, and the testing set is used to test the model and evaluate its accuracy. The performance metrics of the classification models were calculated based on precision, recall and accuracy, and are defined in Table 2. TP and TN specify the numbers of diabetes and normal patients that were correctly classified, respectively, while FN and FP specify the numbers of diabetes and normal patients that were incorrectly classified, respectively. 10-fold cross-validation was used to train and test all of the classification models.
Table-2: Performance metrics for the classification model.
Performance Metric    Formula
Precision             TP/(TP + FP)
Recall                TP/(TP + FN)
Accuracy              (TP + TN)/(TP + TN + FP + FN)
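The formulas in Table-2 can be checked with a short sketch such as the one below. The function names and the toy label vectors are assumptions made for illustration; the positive class (diabetic) is assumed to be encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical true and predicted labels (1 = diabetic, 0 = normal).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = {
    "precision": precision(tp, fp),
    "recall": recall(tp, fn),
    "accuracy": accuracy(tp, tn, fp, fn),
}
```

For these toy labels there are three true positives, three true negatives, one false positive and one false negative, so all three metrics equal 0.75.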
For the first run, all the attributes of the dataset are fed to the models for prediction. The performance plot is shown in figure-3.
Fig-3: Performance plot taking all the features of the dataset
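A comparison of the four classifiers under 10-fold cross-validation could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions, not the authors' code: the data are synthetic, and the hyperparameters (default settings, 50 trees, fixed random seeds) are illustrative choices that the text does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC

# Hypothetical synthetic data: eight non-negative attributes and a binary outcome.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 8))
y = (X[:, 1] + rng.normal(0, 10, size=200) > 50).astype(int)

models = {
    "GNB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(),
    "MNB": MultinomialNB(),  # requires non-negative feature values
}

# Mean accuracy over 10 folds for each classifier.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}
```

Restricting `X` to a subset of columns and repeating the loop gives the kind of per-feature-set comparison reported in the later runs.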
Some factors in the dataset were not found to influence the outcome. Parameters like Diabetes Pedigree Function (DPF) do not have a normal distribution in the dataset. As DPF increases, the likelihood of being diabetic appears to increase, but this needs statistical validation. Hence DPF is dropped.
There is a tendency for people to become diabetic as they age. But diabetes itself does not seem to have an influence on longevity. It may affect quality of life, which is not measured in this data set. This needs statistical validation, hence age is dropped.
2-hour serum insulin is expected to be between 16 and 166. Clearly there are outliers in the insulin data. These outliers are a concern, and most of the patients with higher insulin values are also diabetic. The attribute is therefore suspect and is not taken into consideration.
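A range check of this kind can be sketched as follows. The function name and the insulin readings are hypothetical; only the 16 to 166 interval comes from the text.

```python
def out_of_range(values, low, high):
    """Return the indices of readings outside the closed interval [low, high]."""
    return [i for i, v in enumerate(values) if not low <= v <= high]


# Hypothetical 2-hour serum insulin readings.
insulin = [94, 168, 88, 543, 0, 130, 846]
flagged = out_of_range(insulin, 16, 166)
```

The same helper applies to any attribute with a known expected range, for example a BMI range.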
In the second run, fewer attributes are used and the performance of the models is evaluated, as shown in figure-4. There is an increase in accuracy for the Random Forest classifier and Multinomial Naïve Bayes.
Fig-4: Performance plot taking "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "BMI" as attributes.
Again, there are outliers present in attributes like Skin Thickness and BMI. For BMI (shown in figure-5):
• 1st graph: It is evident that there are a few outliers. A few patients in the dataset are obese. The expected range is between 18 and 25. In general, people are obese.
• 2nd graph: Diabetic people seem to be only on the higher side of BMI. They also contribute more to the outliers.
• 3rd graph: Same inference as the 2nd graph.
Fig-5: Screening of variable: BMI
For Skin Thickness (figure-6):
• 1st graph: Skin thickness seems to be skewed a bit.
• 2nd graph: As with BP, people who are not diabetic have lower skin thickness. This is a hypothesis that has to be validated.
Fig-6 Screening of variable: Skin Thickness
The attributes BMI and Skin Thickness are dropped, and the models are evaluated using three parameters: pregnancies, glucose and blood pressure. The accuracy of the models is shown in figure-7.
Fig-7: Performance plot taking "Pregnancies", "Glucose", "Blood Pressure" as attributes.
The SVC shows better performance compared with the other classifiers. Three variables influence the outcome, namely Pregnancies, Glucose and Blood Pressure, and none of these parameters are correlated with each other. Hence the model is not inflated.
6. Conclusion and Future Works
The present investigation focuses on machine learning based techniques to predict diabetes. For disease diagnosis, the application of machine learning techniques is regarded as very valuable. The ability to predict the early signs of diabetes plays a crucial role in identifying the correct procedure to be followed for a patient’s treatment. In this paper, machine learning algorithms were applied to the diabetes dataset, trained, and validated against a testing dataset. A structured data set is used here for diabetes prediction. The results show that the SVC model outperforms the other models. In future, the proposed models will be applied to other medical domains, such as the prediction of cancer and Parkinson’s disease. A real dataset obtained from a real case implementation would be expected to improve the accuracy of the classification model.
References
1. Falvo D, Holland BE. Medical and psychosocial aspects of chronic illness and disability. Jones & Bartlett Learning; 2017.
2. Skyler JS, Bakris GL, Bonifacio E, Darsow T, Eckel RH, Groop L, et al. Differentiation of diabetes by pathophysiology, natural history, and prognosis. Diabetes 2017; 66:241–55.
3. Diwani S, Mishol S, Kayange DS, Machuve D, Sam A. Overview applications of data mining in health care: the case study of Arusha region. Int J Comput Eng Res 2013; 3:73–7.
4. Alam TM, Awan MJ. Domain analysis of information extraction techniques. Int J Multidiscip Sci Eng 2018; 9:1–9.
5. Alam TM, Khan MMA, Iqbal MA, Wahab A, Mushtaq M. Cervical cancer prediction through different screening methods using data mining. Int J Adv Comput Sci Appl 2019; 10:388–96.
6. M. Shanker. Using neural networks to predict the onset of diabetes mellitus. J Chem Inf Comput Sci, 36:35–41, 1996.
7. Alam, T.M., Iqbal, M.A., Ali, Y., Wahab, A., Ijaz, S., Baig, T.I., Hussain, A., Malik, M.A., Raza, M.M., Ibrar, S. and Abbas, Z., 2019. A model for early prediction of diabetes. Informatics in Medicine Unlocked, 16, p.100204.
8. Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, C.J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM", Journal of Machine Learning Research, vol. 11, pp. 1471-1490, 2010.
9. H. Han, J. Xiaoqian, "Overcome Support Vector Machine Diagnosis Overfitting", Cancer Informatics, vol. 13, no. 11, pp. 145-158, 2014.
10. Bo LJ. Song: Naive bayesian classifier based on genetic simulated annealing algorithm. Procedia Eng. 2011; 23:504–9.
11. Jianga L, Zhang L, Yu L, Wang D. Class-specific attribute weighted naive bayes. Pattern Recogn. 2019; 88:321–30.
12. Nima Shiri Harzevili SHA. Mixture of latent multinomial naive bayes classifier. Appl Soft Comput. 2018; 69:516–27.
13. Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. Vol. 1. IEEE: 1995. p. 278–82
14. Hang Lai, Huaxiong Huang et al. Predictive models for diabetes mellitus using machine learning techniques, 2019 Oct 15. doi: 10.1186/s12902-019-0436-6
15. D. Dazzi, F. Taddei, A. Gavarini, E. Uggeri, R. Negro, and A. Pezzarossa. The control of blood glucose in the critical diabetic patient: a neuro-fuzzy method. Journal of Diabetes Complications, 15(2):80–87, Mar-Apr 2001.
16. M. Zorman, G. Masuda, P. Kokol, R. Yamamoto, and B. Stiglic. Mining diabetes database with decision trees and association rules. In 15th IEEE Symposium on Computer-Based Medical Systems, pages 134–139, 2002.
17. W. Hsu, M. L. Lee, B. Liu, and T. W. Ling. Exploration mining in diabetic subjects databases: Findings and conclusion. In 6th ACM SIGKDD international conference on Knowledge discovery and data mining, 2000.
18. Singh A, Halgamuge MN, Lakshmiganthan R. Impact of different data types on classifier performance of random forest, naive Bayes, and K-nearest neighbors algorithms. Int J Adv Comput Sci Appl 2017; 8:1–10.
19. Azrar A, Ali Y, Awais M, Zaheer K. Data mining models comparison for diabetes prediction. Int J Adv Comput Sci Appl 2018; 9.
20. Kumari VA, Chitra R. Classification of diabetes disease using support vector machine. Int J Eng Res Afr 2013; 3:1797–801.
21. Yoo, E.H.; Lee, S.Y. Glucose biosensors: An overview of use in clinical practice. Sensors 2010, 10, 4558–4576.
22. Clark, L.C., Jr.; Lyons, C. Electrode systems for continuous monitoring in cardiovascular surgery. Ann. N. Y. Acad. Sci. 1962, 102, 29–45.
23. Maniruzzaman, M.; Kumar, N.; Menhazul, A.M.; Shaykhul, I.M.; Suri, H.S.; El-Baz, A.S.; Suri, J.S. Comparative approaches for classification of diabetes mellitus data: Machine learning paradigm. Comput. Methods Programs Biomed. 2017, 152, 23–34.
24. Cheruku, R.; Edla, D.R.; Kuppili, V. SM-RuleMiner: Spider monkey based rule miner using novel fitness function for diabetes classification. Comput. Biol. Med. 2017, 81, 79–92.
25. Wu, H.; Yang, S.; Huang, Z.; He, J.; Wang, X. Type 2 diabetes mellitus prediction model based on data mining. Inform. Med. Unlocked 2018, 10, 100–107.