A classification model for tumor cancer disease based on mutual information and the firefly algorithm

Received May 2019

Cancer is a globally recognized cause of death. A proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expression have proven to be a successful platform for studying genetic diseases. Although standard machine learning (ML) approaches have been effective in identifying significant genes and in classifying new cancer cases, their medical and practical application faces several drawbacks, chief among them the nature of DNA microarray data, which combines an extremely large number of features with a relatively small number of instances. To extract useful information from a DNA microarray dataset efficiently, the interpretability of the prediction approach must be extended while maintaining a high level of precision. In this work, a novel approach to cancer classification based on gene expression profiles is presented. The method combines the Firefly algorithm with the Mutual Information method. First, Mutual Information is used to select the features, before the Firefly algorithm is applied for feature reduction. Finally, a Support Vector Machine is used to classify the cancer types. The performance of the proposed system was evaluated by classifying a colon cancer dataset, and the results were compared with several recent approaches.


Introduction
Cancer is a term that refers to diseases that result from uncontrolled body cell division. Such uncontrolled cell division results in the development of lumps known as tumors. These abnormal cells invade the immune system and spread through the blood vessels and the lymph system to other parts of the body. There are more than 100 types of tumors, which are mainly named based on where they form in the body or on the type of tissue they develop from. Cancers can be classified into six major classes based on the type of tissue involved: Carcinoma, Leukemia, Sarcoma, Myeloma, Lymphoma, and Mixed Types. Based on the initial place of origin, cancers can be specified as breast, lung, prostate, liver, kidney, oral, or brain cancers [1], [2].
DNA microarray technologies make it possible to observe the level of expression of several genes [3] and have facilitated the ascent of computational analysis, including ML techniques. These techniques are useful for pattern extraction and for developing classification models from gene expression databases; they have also been helpful in cancer prognosis and management [4]. DNA microarray technologies have found application in the prediction of cancer diseases and have served as an effective platform for gene expression analysis in several experimental studies. With microarray technologies, the level of expression of several genes can be analyzed from two test cells. Regardless of how the sample cells are obtained, essential investigations such as accurate diagnosis, monitoring of disease progression, response to medication, and post-treatment prognosis are necessary [5]-[7].
Several feature selection frameworks have been developed over time, and a review of them may be necessary to understand their successes and limitations [6]. Several studies have previously examined the effectiveness of an attribute subset in determining an optimal solution. Essentially, feature selection is an optimization problem; accordingly, a well-organized process for selecting discriminative genes from microarray gene expression data for cancer diagnosis has been recommended. Further, the need for dimension reduction prior to classification with the gene expression microarray approach has been outlined. Recently, swarm-based and evolutionary methods such as Ant Colony Optimization (ACO) [8]-[10], Genetic Algorithm (GA) [11]-[13], Artificial Bee Colony (ABC) [14], [15], Particle Swarm Optimization (PSO) [16], [17], and Harmony Search Algorithm (HSA) have been used to handle feature selection problems [18], [19]. The Firefly algorithm (FA) was developed by Yang [20], [21] as a swarm-based metaheuristic. FA has attracted much research interest owing to its efficiency in handling optimization issues [22]-[25].
In this paper, the issue of human cancer disease classification is tackled using gene expression profiles. This study presents a new approach for the analysis of microarray datasets and for an efficient classification of cancer [26], [27]. In the new approach, Mutual Information (MI) is used as a feature selection model for selecting the most relevant features, while the Firefly algorithm (FA) is deployed for feature reduction. A Support Vector Machine (SVM) is used as the classification model for evaluating each solution produced by the Firefly algorithm. The remainder of this paper is arranged as follows: section 2 provides an overview of FA and MI, while section 3 explains the proposed classification algorithm. In section 4, the results of the study are presented, and the conclusions drawn from the study are provided in section 5.

Overview
Optimization means determining the best available solution to problems that cannot be solved in deterministic polynomial time, i.e., NP-hard problems [28]-[31]. In most real-world optimization tasks, finding optima requires an expensive computation process. Limitations such as computation resource constraints and project time requirements have necessitated making the optimization process less complicated and more rapid [32], [33]. Most standard optimization frameworks require several function evaluations and usually produce satisfactory results due to their inherent information transfer mechanism of propagating several first-choice solutions through a range of fitness evaluations. The need to evaluate each candidate solution makes these processes demand considerable computing resources and execution time. Consequently, efforts have been devoted to developing optimization algorithms that are efficient in terms of function evaluations. Several novel approaches have been proposed in recent times with satisfactory performance using fewer function evaluations [28], [34]-[38].
Feature selection is a pre-processing method for selecting the most informative genes that can distinguish groups (cancer subtypes in this case). The major aim of feature selection is to establish a reduced set of features from a dataset in a bid to reduce the initial dimensionality of the feature space. In general, studies on cancer classification require formal feature selection strategies to lower the computational demands of the experimental tasks, which helps to evaluate and explore data in the domain, as well as to provide a programmable solution in terms of the structure and the number of features [39]-[41].

Mutual Information
In filter-based methods, variables are ranked without any dependence on the classifier; examples of such performance measures are the Fisher score, the Pearson correlation coefficient, and information theory-based measures [42], [43]. Such techniques are advantageous because they are easy to implement, computationally less expensive, and provide a more generalizable feature subset since they are not dependent on any classifier [44]. That said, their major problem is that they cannot exploit the characteristics of the specific ML algorithm intended for use, and as such, they rarely achieve the highest classification accuracies.
MI refers to the specific information shared by two variables. Through entropy, the information conveyable from a variable can be quantified, but the major point of interest is the degree of overlap between the recorded variables. This is important when considering the effectiveness of one variable in predicting the other; a higher level of shared information implies that a similar information source is being measured:

w = H(F) + H(C) - H(F, C)    (1)

where w represents the weight value of an individual feature F with respect to the class variable C, while H denotes the entropy value. Entropy is calculated by summing, over all values of the feature, the probability of each value multiplied by the natural logarithm of that probability, as follows:

H(X) = - Σ_{x ∈ X} p(x) ln p(x)    (2)

where x represents a value of the set X, while p(x) represents the probability distribution of x.
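The MI-based feature weighting described above (a feature's weight computed from entropies, with natural-log entropy) can be sketched in pure Python. This is an illustrative sketch, not the authors' implementation; discrete feature values are assumed, and names such as `mi_weight` are placeholders.

```python
# Sketch of MI-based feature weighting: weight = H(F) + H(C) - H(F, C).
# Illustrative only; assumes discrete feature values and class labels.
from collections import Counter
from math import log

def entropy(values):
    """H(X) = -sum(p(x) * ln p(x)) over the distinct values in the list."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X, Y) computed over the paired values."""
    return entropy(list(zip(xs, ys)))

def mi_weight(feature, labels):
    """Weight of a feature: shared information between feature and class."""
    return entropy(feature) + entropy(labels) - joint_entropy(feature, labels)

# A perfectly informative binary feature carries all the label information,
# so its weight equals H(C) = ln 2:
f = [0, 0, 1, 1]
c = ["neg", "neg", "pos", "pos"]
print(round(mi_weight(f, c), 4))  # 0.6931
```

A feature that is independent of the class would get a weight near zero, so ranking features by this weight directly implements the filter stage.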

Firefly Algorithm
The FA is a recently developed nature-inspired metaheuristic that was inspired by the light flashing pattern of fireflies [20], [45]. Three general rules guide the fireflies in the FA:
- The fireflies are attracted to each other regardless of sex (they are unisex).
- The attractiveness of a firefly depends on its brightness; attractiveness relates inversely with the distance between any two fireflies.
- The brightness of a firefly is directly related to the value of its objective function.
There are two important issues in the FA: the variation of light intensity and the formulation of attractiveness. The attractiveness of a firefly depends on its brightness, but since attractiveness decreases with the distance between two fireflies, weaker intensities imply less attractiveness. Thus, both the light intensity variation and the attractiveness should be monotonically decreasing functions, often formulated as:

I(r) = I_0 e^(-γ r²)    (3)

where r = inter-firefly distance, I_0 = initial light intensity, and γ = coefficient of light absorption, which controls the light intensity variation (the value of γ is usually fixed in the FA). The attractiveness of a firefly in the FA is directly related to the light intensity and is defined analogously:

β(r) = β_0 e^(-γ r²)    (4)

where β_0 = attractiveness at r = 0. The attraction of the i-th firefly to the j-th firefly can then be formulated as:

x_i = x_i + β_0 e^(-γ r_ij²) (x_j - x_i) + α ε_i    (5)

where α = randomization parameter and ε_i = a uniformly distributed random number in the range [0,1]. The distance r_ij between the i-th and j-th fireflies is formulated as:

r_ij = ||x_i - x_j|| = sqrt( Σ_{k=1}^{d} (x_{i,k} - x_{j,k})² )    (6)

where d is the dimension of the optimization problem. The pseudocode of the FA is given in the following figure.

Firefly Algorithm
Input: swarm size n, maximum number of iterations MaxGen
1.  Initialize all fireflies x_i, i = 1, ..., n
2.  Initialize β_0, γ, α, and δ
3.  Calculate the fitness and the light intensity of each firefly
4.  While (t < MaxGen)
5.      For each firefly (i) in the swarm
6.          For each firefly (j) in the swarm
7.              If the intensity of (i) is less than the intensity of (j)
8.                  Calculate the distance between i and j
9.                  Calculate the attractiveness between i and j
10.                 Update the position of (i)
11.                 Evaluate the new solution, update the light intensity
12.             End If
13.         End For
14.     End For
15.     Rank the swarm and find the current best firefly
16.     t = t + 1
17. End While
18. Return the best firefly

Figure 1. The pseudocode of FA
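The pseudocode above can be sketched as a minimal continuous FA in Python. This is an illustrative sketch, not the paper's implementation: the sphere function stands in for the real objective, and the parameter values (beta0, gamma, alpha, the decay rate delta) are assumptions chosen for the demo.

```python
# Minimal continuous firefly algorithm following the pseudocode above.
# Illustrative only: parameter values are assumptions, and the sphere
# function f(x) = sum(x_k^2) stands in for the real objective.
import math
import random

def sphere(x):
    return sum(v * v for v in x)

def firefly(n=15, dim=2, max_gen=100, beta0=1.0, gamma=0.01,
            alpha=0.2, delta=0.97, seed=1):
    random.seed(seed)
    swarm = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    fitness = [sphere(x) for x in swarm]           # lower value = brighter firefly
    for _ in range(max_gen):
        for i in range(n):
            for j in range(n):
                if fitness[j] < fitness[i]:        # firefly j is brighter than i
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)   # attractiveness
                    swarm[i] = [a + beta * (b - a)
                                + alpha * (random.random() - 0.5)
                                for a, b in zip(swarm[i], swarm[j])]
                    fitness[i] = sphere(swarm[i])  # evaluate the new solution
        alpha *= delta                             # decay the randomness over time
    return min(swarm, key=sphere)                  # current best firefly

best = firefly()
print(round(sphere(best), 3))
```

Because a firefly only moves toward strictly brighter ones, the best fitness in the swarm never worsens from one generation to the next, which is what makes the loop converge toward the optimum.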

The Proposed Algorithm
Feature selection algorithms are mainly developed out of the need to find better feature subsets that offer better performance accuracies. The proposed model is divided into two main stages. In the first stage, the features in the dataset are filtered using MI; the selection process depends on the features with high weights. The second stage is the wrapper feature selection, represented by the Firefly algorithm, which selects the most relevant features from the results of the first stage. Figure 1 portrays the general structure of the wrapper feature selection method, while Figure 2 displays the proposed model. In the second stage, all the fireflies in the swarm are initialized as binary sequences, unlike in the traditional wrapper model where the fireflies are initialized with randomly selected features. The proposed BFA comprises four major steps: initialization, fitness function, attractiveness calculation, and positional updating. These steps are further explained in the following subsections.

A. BFA Initialization
All the fireflies in the search space are initiated in this step using random numbers in the range [0,1]. These random numbers represent the position of each firefly in the search space. The position of each firefly is calculated using equation (1):

x_i = lb + (ub - lb) × rand    (1)

where ub = upper bound (1.0) and lb = lower bound (0.0). The generated sequence is then converted into a binary sequence using the relation in equation (2):

x_i^bin = 1, if sigmoid(x_i) > rand
x_i^bin = 0, otherwise    (2)

where x_i = position of each firefly, sigmoid(x_i) = 1 / [1 + e^(-x_i)], rand = a number drawn from a uniform distribution, and x_i^bin = the binary value; 1 means the corresponding feature is selected, and 0 means it is not selected. Each initiated firefly in the swarm has its own position based on its generated numbers. In the proposed algorithm, the fitness function (FF) is designed to minimize the classification error rate over the validation set of the training data, as shown in equation (3), while maximizing the number of irrelevant or non-selected features. The FF of the algorithm was calculated using a classifier; here, the Support Vector Machine (SVM) was applied to determine the classification accuracy.
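The initialization step described above can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that each sigmoid threshold is compared against a fresh uniform draw; names such as `init_binary_swarm` are placeholders, not the authors' code.

```python
# Sketch of the BFA initialization: continuous positions drawn in [lb, ub]
# are mapped to a binary feature mask with the sigmoid rule described above.
import math
import random

def init_binary_swarm(n_fireflies, n_features, lb=0.0, ub=1.0, seed=7):
    random.seed(seed)
    swarm = []
    for _ in range(n_fireflies):
        # Real-valued position in [lb, ub], one component per feature
        pos = [lb + (ub - lb) * random.random() for _ in range(n_features)]
        # Bit = 1 (feature selected) if sigmoid(x) exceeds a uniform draw
        bits = [1 if 1.0 / (1.0 + math.exp(-x)) > random.random() else 0
                for x in pos]
        swarm.append(bits)
    return swarm

swarm = init_binary_swarm(n_fireflies=5, n_features=10)
print(len(swarm), len(swarm[0]))                      # 5 10
print(all(b in (0, 1) for ff in swarm for b in ff))   # True
```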

B. Calculation of the fitness function (FF)
Fitness_i = 1 - A_i    (3)

where A_i = accuracy rate of the classifier for firefly i, obtained from 5-fold cross-validation after training the SVM; the fitness is thus the classification error rate. Equation (4) is used to calculate the intensity of each firefly based on its error value:

I_i = 1 / (1 + Fitness_i)    (4)

To calculate the attractiveness β of each firefly, equation (5) is deployed:

β(r) = β_0 e^(-γ r²)    (5)

where r = the distance between two fireflies (calculated using equation (6)) and β_0 = the attractiveness of a firefly in the initial case (r = 0). The distance between two fireflies is calculated using the Hamming distance, where each bit of firefly i is compared with the corresponding bit of firefly j:

r_ij = Σ_{k=1}^{d} | x_{i,k} - x_{j,k} |    (6)

where x_{i,k} and x_{j,k} are the binary positional values of fireflies i and j. In this method, the distance is represented by the number of differing bits between the binary strings of the two fireflies. This improves the ability of the FA to work with binary features rather than continuous values.
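The fitness and distance computations can be sketched as follows. The accuracy value would come from the 5-fold cross-validated SVM described above; it is abstracted here as a plain number, so this is an illustrative sketch rather than the full evaluation pipeline.

```python
# Illustrative fitness and Hamming-distance computations for binary fireflies.
# The accuracy argument stands in for the SVM's 5-fold cross-validation score.
def fitness(accuracy):
    """Error rate to be minimized: 1 - accuracy."""
    return 1.0 - accuracy

def hamming_distance(ff_a, ff_b):
    """Number of differing bits between two binary fireflies."""
    return sum(a != b for a, b in zip(ff_a, ff_b))

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(hamming_distance(a, b))       # 2
print(round(fitness(0.9), 2))       # 0.1
```

Using the bit-count distance keeps the attractiveness term meaningful for binary strings, since Euclidean distance between feature masks carries no extra information here.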

C. Updating the positions of the fireflies
In the swarm, each firefly is attracted to a brighter firefly. In the algorithm, the position of a firefly attracted to a brighter one is updated using equation (7):

x_i = x_i + β_0 e^(-γ r_ij²) (x_j - x_i) + α ε_i    (7)

where the first part of the relation is the current position of the firefly, the second part expresses the attractiveness between positions x_i and x_j, and the third part expresses the randomization with α, where 0 ≤ α ≤ 1. This randomness is decremented by a constant rate δ, as in equation (8), so that at the final optimization stage the value of α is minimized:

α_{t+1} = α_t × δ,  0 < δ < 1    (8)
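A sketch of the position update in the binary setting: the real-valued attraction step of equation (7) is computed bitwise and then re-binarized with the sigmoid rule from the initialization step. Mapping the moved position back to bits this way, and the parameter values used, are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of the binary position update: attraction step, then sigmoid
# re-binarization. Parameter values are illustrative assumptions.
import math
import random

def update_position(bits_i, bits_j, beta0=1.0, gamma=1.0, alpha=0.5):
    r = sum(a != b for a, b in zip(bits_i, bits_j))   # Hamming distance
    beta = beta0 * math.exp(-gamma * r * r)           # attractiveness
    new_bits = []
    for a, b in zip(bits_i, bits_j):
        # Real-valued move toward the brighter firefly plus randomization
        x = a + beta * (b - a) + alpha * (random.random() - 0.5)
        sig = 1.0 / (1.0 + math.exp(-x))
        new_bits.append(1 if sig > random.random() else 0)  # back to bits
    return new_bits

random.seed(0)
moved = update_position([0, 1, 0, 1], [1, 1, 1, 1])
print(all(b in (0, 1) for b in moved))  # True
```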

Results and Discussion
This section evaluates the performance of the proposed methodology on the colon cancer dataset. Two important criteria were considered for the evaluation: the number of genes selected and the predictive accuracy on the selected genes.

Experimental Settings
The dataset used in this study was downloaded from the Kent Ridge Bio-Medical Dataset website (http://datam.i2r.astar.edu.sg/dataset/krbd/). This dataset is characterized as an N × M matrix, with N and M being the number of experimental samples and the number of genes involved, respectively. In the matrix, each cell represents the level of expression of a particular gene in a particular experiment; the total number of samples is 62 and the number of genes is 2000. There are 22 positive (P) and 40 negative (N) samples in the dataset. The measurement of the results was based on the following diagnostic performance measures: positive instances (P) and negative instances (N); True Positive (TP): the number of correctly diagnosed positive instances; True Negative (TN): the number of correctly diagnosed negative instances; False Positive (FP): the number of negative instances wrongly identified as positive (Type I error); False Negative (FN): the number of positive instances wrongly identified as negative (Type II error). These performance measures were first computed and then used to compute the algorithms' classification accuracy (CA) as follows:

CA = (TP + TN) / (P + N) × 100    (9)
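The diagnostic counts above combine into the classification accuracy of equation (9). A small worked example follows; the TP/TN counts are made-up illustrative numbers for the 22 positive and 40 negative samples, not results from the paper.

```python
# Classification accuracy from confusion counts: CA = (TP + TN) / (P + N) * 100
def classification_accuracy(tp, tn, p, n):
    return 100.0 * (tp + tn) / (p + n)

# e.g. 20 of 22 positives and 36 of 40 negatives diagnosed correctly:
print(round(classification_accuracy(tp=20, tn=36, p=22, n=40), 2))  # 90.32
```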

MIFA Results
In this section, the proposed algorithm is compared with other related works. Table 1 shows that the proposed hybrid algorithm attained higher classification accuracy than the other methods. The proposed algorithm (MIFA-SVM) also selected a small number of features, namely 41, which demonstrates the ability of the algorithm to solve the gene selection problem.
The boxplot comparing the accuracy of the standard Firefly algorithm with the proposed hybrid algorithm (MIFA) is given in Figure 4. It can be seen that MIFA is more stable than FA across all experiments, which indicates that MIFA selected the most relevant features from the dataset more reliably than the Firefly algorithm.
In addition to the boxplot, the performance of MIFA is compared with the original FA in terms of ROC and Recall-Precision curves in Figure 5 and Figure 6, respectively. It can be seen that the Recall-Precision curve of MIFA is consistently higher than that of FA.

Conclusion
Cancer classification based on gene expression data is a trending research field in the area of data mining. The algorithm proposed in this study addresses the problem of early cancer detection, which is important for proper cancer management. This paper presented a new methodology for the classification of human cancers based on their gene expression patterns. The proposed method deployed MI for feature selection, FA for feature reduction, and finally SVM for the classification task. The performance of the method was evaluated on a colon microarray dataset and compared with other recent methods. The results proved the robustness of the approach, as it achieves good predictions using a lower number of genes than the other methods. In conclusion, the results of this study showed that the proposed method can maximize the accuracy of cancer classification while minimizing the number of selected genes compared to the other methods.