A predictive model for liver disease progression based on logistic regression algorithm

Department of Software Engineering, Faculty of Computer Science and Information Technology, Wasit University, Iraq. North Technical University, Iraq. Department of Software Engineering, Faculty of Computer Science and Information Technology, Wasit University, Iraq. Department of Computers, College of Education for Pure Sciences, Wasit University, Iraq. Department of Electrical Engineering, Faculty of Engineering, Wasit University, Iraq. Department of Software Engineering, Faculty of Computer Science and Information Technology, Wasit University, Iraq.


INTRODUCTION
Liver disease (hepatic disease) is any disease that impairs the normal, healthy function of the liver. It is considered a significant risk factor for mortality and morbidity, and it substantially affects patients' quality of life. For example, a study that followed 11,448 subjects for 5 years found a disease incidence of 12% (n = 1,418) [1]. Another cohort study followed 77,425 subjects who were free of liver disease for 4.5 years, and 10,340 of them developed the disease [2]. In general, more than 75% of liver tissue must be affected before liver function decreases [3]. Several symptoms indicate liver disease, such as fatigue, jaundice, nausea, vomiting, back and abdominal pain, fluid in the abdominal cavity, general itching, and weight loss [4]. Many factors contribute to the development of the disease, including genes, viruses, drugs, inhalation of harmful gases, alcohol, and consumption of contaminated food and pickles [5]. Obesity, cholesterol, and triglycerides are also associated with liver damage. Chemicals and minerals, or infiltration by abnormal cells such as cancer cells, can also damage liver tissue [3].
It is therefore essential to diagnose the disease in a timely manner in order to prevent its occurrence or reduce its possible risks and consequences. Although extensive research has sought a definitive therapeutic approach for liver diseases, no definitive method for predicting disease progression has been developed so far. Moreover, the efficacy of antiviral treatment in preventing disease progression in patients with chronic hepatitis B and cirrhosis or advanced fibrosis is unclear [6]. Many studies have been carried out to diagnose liver disease using blood tests combined with various risk factors [7][8][9][10][11][12][13][14]. Although some of these models have produced excellent results, the progression of the disease remains obscure. This study therefore aims to address this gap, which can help health-care specialists provide timely intervention and increase patient awareness.
Machine learning is a field of computer science that employs algorithms to find patterns in data and to predict various outcomes from those data [15]. Such algorithms have emerged as reliable methods for estimation and decision-making in many fields of real life [16]. Owing to the availability of medical data, they contribute significantly to medical decision-making [17,18]. Using machine learning to develop an efficient predictive model would provide valuable assistance in identifying the disease and making effective real-time medical decisions.
In this paper, we develop a model that helps predict the occurrence of liver disease. The proposed predictive model is based on the logistic regression method and predicts the likelihood of liver disease occurrence from blood tests. Our main contributions are as follows:
• We develop a predictive model that can predict the probability of liver disease occurrence.
• We investigate the significance of the included tests in predicting the disease.
• We analyze the performance of the proposed predictive model using the ILPD dataset.
The remainder of this paper is organized as follows: Section 2 reviews the existing related work. Section 3 explains the proposed predictive model. Section 4 presents the analysis and results of the proposed predictive model. Finally, Section 5 presents the conclusion.

Related Works
In this section, relevant prior work on liver disease prediction is reviewed, as summarized in Table 1.
Table 1: Review of prior work on diagnosing liver disease.

Study                           Prediction of liver disease
                                Categories    Occurrence probability
Vijayarani, Dhayanand [7]       Yes           No
Dhamodharan [8]                 Yes           No
Karthik, Priyadarishini [9]     Yes           No
Nahar, Ara [10]                 Yes           No
Kumar, Sahoo [11]               Yes           No
Wu, Yeh [12]                    Yes           No
Hashem and Mabrouk [13]         Yes           No
Nagaraj and Sridhar [14]        Yes           No

Vijayarani, Dhayanand [7] compared the performance of classification algorithms for predicting liver diseases. The compared algorithms were Naïve Bayes (NB) and support vector machine (SVM). The experimental results showed that SVM was a better classifier than NB for predicting liver diseases. Another work predicted three major types of the disease, namely liver cancer, hepatitis, and cirrhosis, from their notable symptoms [8]. Naïve Bayes and FT Tree were used to predict the type of disease, and the experimental results showed that Naïve Bayes outperformed the FT Tree algorithm in accuracy. Karthik, Priyadarishini [9] applied a soft computing method in three phases to identify liver disease. In the first phase, an Artificial Neural Network (ANN) classification algorithm was used to classify liver disease. In the second phase, classification rules were produced using the Learn by Example (LEM) algorithm implemented in rough set rule induction. In the third phase, fuzzy rules were used to identify the types of liver disease. The work in [10] measured the performance of several decision tree (DT) methods and compared them in predicting liver disease. The methods used were J48, LMT, Random Forest (RF), Random Tree (RT), REPTree, Decision Stump (DS), and Hoeffding Tree. The comparison showed that DS outperformed the other methods, achieving the highest accuracy.
Kumar, Sahoo [11] proposed a rule-based model combined with various machine learning techniques to predict the diverse types of the disease. The techniques were Rule Induction (RI), Naïve Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN). The proposed model combined with DT showed better results than with any of the other methods (RI, SVM, ANN, and NB). Another work [12] developed and compared different machine learning methods for predicting fatty liver disease (FLD). The evaluated methods were ANN, RF, NB, and logistic regression (LR). The results revealed that RF outperformed the other methods. Hashem and Mabrouk [13] evaluated the performance of SVM in classifying liver disease patients using two datasets with diverse test combinations such as SGOT, SGPT, and alkaline phosphatase. Nagaraj and Sridhar [14] proposed a hybrid model, known as NeuroSVM, that uses SVM and a feed-forward artificial neural network (ANN) to diagnose liver disease.
Despite the excellent performance of the models proposed in the literature [7][8][9][10][11][12][14], these models do not address the problem of predicting the probability of liver disease occurrence. They are also more general, in that they predict only the category of the disease (e.g., healthy or patient). More specifically, they do not show the progress of the disease for each patient.

The Framework of the Predictive Model
This section introduces the predictive model that we propose to predict the probability of liver disease occurrence. Figure 1 shows the steps of building the predictive model in this study. The model can readily be run on various platforms and tools, such as the R language. It can give both specialists and patients a clear view of the progress of the disease. Consequently, knowing the degree of [...] Whereas the probability of patient C developing liver disease can be calculated as follows
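The fitted equation is not reproduced in this section, so the following is only a minimal sketch of the general logistic regression form, p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))). The study itself used SPSS and R; Python is used here purely for illustration. The intercept, coefficients, feature names, and patient values below are hypothetical placeholders, not the values estimated in this paper.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Return the logistic regression probability for one patient.

    intercept    -- b0, the constant term of the model
    coefficients -- dict mapping feature name to its weight b_i
    features     -- dict mapping feature name to the patient's value x_i
    """
    # Linear predictor: b0 + sum of b_i * x_i over the patient's features.
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    # The logistic (sigmoid) function maps the linear predictor into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical values for illustration only (NOT the paper's fitted model).
HYPOTHETICAL_INTERCEPT = -1.0
HYPOTHETICAL_COEFFICIENTS = {
    "age": 0.02,
    "direct_bilirubin": 0.5,
    "albumin": -0.3,
}

# Hypothetical patient record.
patient = {"age": 50, "direct_bilirubin": 1.0, "albumin": 3.0}

p = logistic_probability(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFFICIENTS, patient)
print(f"Predicted probability of liver disease: {p:.3f}")
```

A positive coefficient raises the predicted probability as the test value increases, and a negative coefficient lowers it; a linear predictor of zero corresponds to a probability of exactly 0.5.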

The Evaluation of the Predictive Model
We analyzed and evaluated the proposed predictive model. This section presents the dataset, the experimental setup, and the metrics used.

Experimental Setup
All experiments were carried out using SPSS version 20.0 (SPSS, Inc., Chicago, IL, USA) and the R language on a PC with an Intel Core 5 CPU at 3.4 GHz and 16 GB of RAM, running the Windows 10 operating system. Standard measures, namely accuracy, sensitivity, specificity, Type I Error, and Type II Error, together with the confusion matrix, were calculated to assess the performance of the predictive model. For validation, we employed 10-fold cross-validation (CV) to obtain a balanced estimate of the generalization error. The 10-fold CV randomly divided the entire dataset into 10 sub-sets: 9 sub-sets (90%) were used in the training stage, and the remaining sub-set (10%) was used in the testing stage. This procedure was repeated ten times, with a different sub-set held out for testing each time.
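The 10-fold splitting procedure can be sketched as follows. This is an illustrative Python version using only the standard library, not the SPSS/R procedure used in the study; the function name `k_fold_indices` and the random seed are assumptions made for the example, and 583 is the number of ILPD instances.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so that the folds are random sub-sets of the data.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # All remaining folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 583 instances, as in the ILPD dataset.
splits = list(k_fold_indices(583, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train)} training, {len(test)} testing")
```

Each instance appears in exactly one test fold, so every record is used once for testing and nine times for training.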

Dataset
In this study, the well-known, real-world Indian Liver Patient Dataset (ILPD) was used. ILPD is a public dataset available in the UCI Machine Learning Repository. The data were collected in the Andhra Pradesh region of India. ILPD contains 583 instances with various tests, as listed in Table 4. The dependent variable was determined from these tests: 416 cases had liver disease and 167 cases were healthy.

Performance Metrics
The area under the receiver operating characteristic (ROC) curve, known as the AUC, is used to assess the performance of the predictive model in the current study. The ROC curve is an essential tool for evaluating diagnostic tests. The evaluation measures applied in the current study are described below:
• Sensitivity: the proportion of participants with the disease who are correctly predicted as positive.
• Specificity: the proportion of participants without the disease who are correctly predicted as negative.
• Accuracy: the proportion of all participants, with and without the disease, who are correctly predicted.
• Type I Error (α): the probability of assigning a patient to the control group.
• Type II Error: the probability of assigning a control subject to the patient group.
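As a minimal sketch, the measures listed above can be computed from the four cells of a confusion matrix. The error definitions follow this paper's usage (Type I: a patient assigned to the control group; Type II: a control subject assigned to the patient group), under which Type I Error equals 1 - sensitivity and Type II Error equals 1 - specificity. The counts in the example are hypothetical and are not the study's results.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute evaluation measures from confusion-matrix counts.

    tp -- patients correctly predicted as patients
    fn -- patients predicted as healthy
    tn -- healthy participants correctly predicted as healthy
    fp -- healthy participants predicted as patients
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        # Paper's definition: patients assigned to the control group.
        "type_i_error": 1.0 - sensitivity,
        # Paper's definition: control subjects assigned to the patient group.
        "type_ii_error": 1.0 - specificity,
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```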

Result
This section presents the experimental results and discussion. The sensitivity achieved by the predictive model was 90.3%, which is the proportion of patients correctly identified as having the disease. Meanwhile, the specificity obtained by the model was 78.3%. The Type I Error and Type II Error of the predictive model were 9.7% and 21.7%, respectively. This demonstrates that the model can be efficiently employed to predict the probability of liver disease occurrence. Such a model could also be applied to other health conditions. Figure 2 shows the ROC curve of the predictive model, and the details are listed in Table 6. As shown in Table 6, the AUC of the predictive model is 0.758, with a 95% confidence interval of 0.719-0.797. The AUC is also significantly different from 0.5, since the P value (asymptotic significance) is < 0.05, meaning that the proposed predictive model discriminated between the groups significantly better than chance.
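To illustrate what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen patient receives a higher predicted probability than a randomly chosen healthy participant, with ties counted as one half. This pairwise formulation is one standard way to compute the area under the ROC curve; it is not necessarily the routine SPSS uses internally, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels -- 1 for patients, 0 for healthy participants
    scores -- predicted probabilities of disease, in the same order
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute the AUC.")
    wins = 0.0
    for p_score in positives:
        for n_score in negatives:
            if p_score > n_score:
                wins += 1.0   # patient ranked above healthy participant
            elif p_score == n_score:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for illustration only.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups, which is why the reported AUC is tested against 0.5.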

Conclusion
This paper presented a model to predict the probability of liver disease occurrence. The analysis and evaluation of the proposed model show that it is efficient and easy to implement and use. The model achieved an accuracy of 72.4%, a sensitivity of 90.3%, and a specificity of 78.3%, as well as an area under the ROC curve (AUC) of 0.758. This makes the model suitable for use by healthcare providers to facilitate the planning of timely intervention and to create greater awareness of the risk of the disease. The predictive model also confirmed that variables such as Age, Direct Bilirubin (DB), Alamine_Aminotransferase (SGPT), Total_Protiens (TP), and Albumin (ALB) were significant predictors of the categories of liver disease. Thus, the model can be used by healthcare providers once the results of the conducted tests are available. It could also be used to monitor the progress of liver disease, for example as a web-based or mobile phone tool. This study, however, has limitations related to feasibility and resource constraints. For example, the small sample size is a limitation of the present study; the sample would be more representative if more data from liver patients were collected. In future work, we therefore aim to increase the number of patients and to use a wider variety of tests.

Figure 1: A graphical explanation of the predictive model.

Table 4. ILPD description

Table 5 displays the confusion matrix of the predictive model. As is evident from the confusion matrix, the model achieved a predictive accuracy of 72.4%. It correctly predicted 381 of 416 patients and 41 of 167 healthy participants. This performance is attributed to the fact that all the included tests are significantly associated with the disease (P < 0.05).
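As a quick arithmetic check, the reported accuracy follows from the confusion-matrix counts quoted above, assuming that 381 and 41 are the correctly classified patients and healthy participants among the 583 instances:

```latex
\[
\text{Accuracy} = \frac{381 + 41}{416 + 167} = \frac{422}{583} \approx 0.724
\]
```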

Table 6. The AUC of the Predictive Model.
Area Under the Curve. Test Result Variable(s): Predicted probability. Columns: ROC, Std. Error (a), Asymptotic Sig. (b), Confidence interval (%).

Figure 2. ROC curve of the Predictive Model.