Mining Method for Cancer and Pre-Cancer Detection Caused by Mutant Codon 248 in TP53 Deeman

Received Dec 8, 2018

Prediction plays a substantial role in the detection and effective prevention of cancer. The tumor suppressor P53 is implicated in approximately 50% of all human tumors through mutations that appear in the TP53 gene, as recorded in the updated UMD TP53 Mutation Database (Oct. 2017) [1]. Working with the raw data (in Excel) to predict and diagnose cancers is difficult. This research proposes a functional model that combines a data mining approach with an Artificial Neural Network to predict cancer and pre-cancer caused by mutations of a specific codon of tumor protein P53 (each codon has hundreds of cancer-causing mutations), and applies this approach to the mutability of hotspot codon 248 (exon 7), CGG. The Quick Propagation algorithm is used to train and test the Neural Network structure and to determine the accuracy of the proposed architecture. The results demonstrate that Neural Network based prediction of Cancer and Premalignant Disease (pre-cancer) for mutated codon 248 achieves near-perfect performance in classifying a mutation as pre-cancer or cancer. Data mining preprocessing steps and pattern extraction are used to construct the prediction model by selecting 8 of the 132 fields of the new TP53 gene database in order to classify the cases into the target class Pathology (Cancer, Precancer). A professional Neural Network simulation package (Alyuda NeuroIntelligence) is used to build the classifier. The experimental results show that the Quick Propagation algorithm is highly accurate, with accuracies of 99.97%, 100% and 99.85% and error rates of 0.0003, 0 and 0.0015 for the training, validation and test phases respectively.


Introduction
Cancer is among the most widely known and complex diseases today, since it arises from a variety of biological and physical processes and responses. Computer-based models of disease detection are being developed to support both clinical decisions and biological discovery [2]. The p53 gene carries mutations in about 50-60% of human cancers, and around 90% of these mutations encode missense mutant proteins spanning about 190 different codons located in the DNA-binding domain of the gene and protein. These alterations result in a protein with a reduced ability to bind the specific DNA sequence that regulates the p53 transcriptional pathway [3].

The tumor suppressor gene p53 is located on the short arm of chromosome 17 (17p13); more precisely, the TP53 gene spans base pair 7,571,719 to base pair 7,590,867 on chromosome 17, as shown in Fig. 1. Since its discovery in 1979, p53 has been the subject of intense scrutiny, yet such is the complexity of this protein that it continues to defy researchers despite roughly four decades of research [4]. The p53 tumor suppressor/transcription factor controls a wide variety of cellular functions, such as cell cycle, apoptosis, DNA repair and metabolism [5].

The full-length isoform produces a protein of 393 amino acids (codons). The majority of TP53 mutations alter single amino acids, the building blocks of the protein. The result is a modified protein that cannot bind DNA effectively; the defective protein can accumulate in the nucleus of the cells and prevent them from undergoing apoptosis in response to DNA damage, while the damaged cells carry on growing and dividing in an unregulated manner, which may result in cancerous tumors [4]. Codon 248 is located on exon 7 of the p53 gene sequence, as shown in Fig. 2, according to the International Agency for Research on Cancer (IARC). It acts as a hotspot mutant codon that affects normal protein activities and functions and contributes to tumor malignancy and chemoresistance [6].

Related Works
Ayad G. Ismaeel and Raghad Z. Yousif (2015) suggested a mechanism for classifying and diagnosing patient mutations and for predicting the location of mutations in the diseased person. Six selected columns of the UMD_Cell_line_2010 TP53 database were used for training and testing a Quick Propagation Network (QPN), a refinement of the traditional back propagation network, with (283-141-1) nodes in the (input-hidden-output) layers respectively. The learning phase produced the following results for the train, test and validation sets: R-squared = 0.9987, correlation = 0.9993, and mean absolute relative error = 0.0057 [7].

Zahraa N. Shahweli, Ban N. Dhannoon and Rehab S. Ramadhan (2017) proposed an in silico molecular classification approach for breast and prostate cancers based on a Back Propagation Network. Seven datasets were adopted to assess the proposed model: five from the UMD TP53 database and two from the IARC TP53 database. Back propagation with a hybrid pattern, 5-fold cross validation and validation sets was used to predict and classify breast and prostate cancers of patients from molecular mutations located in the TP53 gene. For the UMD TP53 and IARC TP53 databases respectively, the model achieved accuracy = (98, 96.7), specificity = (96.6, 97.3) and sensitivity (recall) = (97, 96.5) [2].

Zahraa N. Shahweli and Ban N. Dhannoon (2017) proposed an enhanced version of the Relief algorithm named ReliefK, which produced favorable results for feature selection. A Back Propagation Neural Network (BPNN) was used with this method to predict and classify cancer based on the mutations listed in the somatic and germline mutation sets of the IARC TP53 database. Five measures were used to assess the proposed algorithm: accuracy (Acc), sensitivity (Sn), F-measure, Matthews correlation coefficient (MCC), and specificity (Sp). The combination of ReliefK feature selection and BPNN yielded an MCC of 1 and 0.88 for the IARC TP53 somatic and germline mutations, respectively [8].

Chaurasia and Pal (2014) analyzed the performance of several data mining techniques. Breast cancer data classification can be used to predict the outcome of a disease or to discover the common nature of cancer. Several data mining algorithms were used to analyze cancer, and the approach compared the performance of classifiers such as K-Nearest Neighbor, Sequential Minimal Optimization (SMO), and Best First Tree. The results show that SMO performs well compared to the other classifiers in terms of accuracy and low error rate [9].

Dataset and PreProcessing
A variety of raw biological databases exist, such as protein datasets and genome data in Excel form; the TP53 gene, with its tumor protein and its disease-causing (cancer) mutations, is one of them. This study uses a recent bio-database of TP53 (tumor protein P53), last updated in Oct. 2017, which contains a large volume of raw mutation data as an Excel file. The database holds about 80,400 tumor (mutated) records across different codons, in a file called UMD_mutations_US within the "UMD TP53 Mutation Database" [1], [10]. The new P53 dataset consists of 132 fields. Because many fields are irrelevant and data quality must be assured, the following data mining selection and preprocessing steps are required:

1-Data Selection
Since the study focuses on codon 248 mutations, the target data are all records for this codon, about 5120 records from the whole dataset. In addition to selecting data, this step serves as dataset reduction at the record (row) level.

2-Data Cleaning
This step eliminates missing values from the newly selected dataset (named the Codon 248 dataset) and removes noise and inconsistent data.
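Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are plain dictionaries, and the field names `Codon` and `Pathology` and the sample values are hypothetical stand-ins, not the actual column names of UMD_mutations_US.

```python
def select_codon(records, codon=248, codon_field="Codon"):
    """Step 1 (row-level reduction): keep only records for one codon."""
    return [r for r in records if r.get(codon_field) == codon]


def drop_missing(records, required_fields):
    """Step 2 (cleaning): drop records with a missing or empty required field."""
    def is_complete(record):
        return all(record.get(f) not in (None, "") for f in required_fields)
    return [r for r in records if is_complete(r)]


# Hypothetical toy records; the real dataset has 132 fields per record.
records = [
    {"Codon": 248, "Pathology": "Cancer"},
    {"Codon": 248, "Pathology": ""},          # missing target value
    {"Codon": 175, "Pathology": "Cancer"},    # different codon
    {"Codon": 248, "Pathology": "Precancer"},
]

codon_248 = select_codon(records)
clean = drop_missing(codon_248, required_fields=["Codon", "Pathology"])
```

In this toy run the selection keeps the three codon 248 records and the cleaning step then drops the one with an empty target, mirroring the reduction from 5120 to 5095 records reported for the real data.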

3-Data Normalization
Data normalization is an essential part of knowledge discovery in data: it changes the scale of the data and standardizes its format. Z-score normalization is applied to the raw data used in this research, since it is an effective normalization method for homogeneous data belonging to the same range (in our case, all data parameters relate to codon 248). The method derives the normalized values from the original data using the mean and the standard deviation, as given in the following equation [11]:

$$x'_{ij} = \frac{x_{ij} - m_j}{s_j}$$

This form of normalization scales and translates the feature value $x_{ij}$, where $m_j$ is the sample mean and $s_j$ is the standard deviation of attribute $j$.
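As a minimal sketch of this step, the Z-score can be computed per column with the Python standard library. The function name `z_score_columns` and the toy rows are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def z_score_columns(rows):
    """Return rows with every column rescaled to (x - mean) / std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    stds = [stdev(col) for col in columns]
    return [
        # A constant column has zero spread; map its values to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical numeric rows: one row per record, one column per attribute j.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = z_score_columns(rows)
```

After normalization both columns have mean 0 and unit standard deviation, so attributes measured on different scales contribute comparably to the network input.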

4-Feature Subset Selection
Feature selection is used to identify input fields that are not useful and do not contribute significantly to the performance of the system [12]. The UMD TP53 Mutation Database consists of 132 features, most of which are irrelevant to the prediction task and would make computation time-consuming if all were considered. Removing trivial fields improves the Neural Network evaluation metrics. Backward stepwise feature selection is used to determine the importance of the input features: it starts with all features and eliminates one input feature at each step, choosing the feature whose removal causes the least deterioration in network performance [12], as presented in Fig. 3.
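The backward stepwise procedure described above can be sketched as follows. The scoring function is a hypothetical stand-in: in the study the score would come from retraining and evaluating the network on each candidate subset, whereas here each feature simply contributes a made-up fixed amount so that the example stays self-contained.

```python
def backward_stepwise(features, score_fn, keep):
    """Drop one feature per step until only `keep` features remain.

    At each step the feature whose removal leaves the highest score
    (least deterioration) is eliminated.
    Returns (selected, removed), with `removed` in elimination order.
    """
    selected = list(features)
    removed = []
    while len(selected) > keep:
        def score_without(feature):
            return score_fn([f for f in selected if f != feature])
        least_useful = max(selected, key=score_without)
        selected.remove(least_useful)
        removed.append(least_useful)
    return selected, removed


# Hypothetical stand-in for "retrain the network and measure its performance":
# each feature contributes a fixed, made-up amount to the score.
toy_importance = {"f1": 0.50, "f2": 0.10, "f3": 0.30, "f4": 0.05}


def toy_score(subset):
    return sum(toy_importance[f] for f in subset)


selected, removed = backward_stepwise(list(toy_importance), toy_score, keep=2)
```

With a real model, each call to `score_fn` is a full train-and-evaluate cycle, which is why reducing 132 fields to 8 this way is costly but only needs to be done once.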
The features selected as the input set for the model, based on their importance, are shown in Table 1.

Table 1. Input Set features

The field of artificial neural networks has been developed extensively and is widely used in many applications of artificial intelligence. Because of their outstanding ability for "self learning and adapting", neural networks attract much attention from the scientific community; nowadays they are pivotal components of many systems and are regarded as a robust tool for solving various types of problems [13], [14]. There are two major types of learning process: supervised learning and unsupervised learning. In supervised learning the neural network knows the output of the learning function (the target), and the weight coefficients are fitted so that the computed outputs are as close as possible to the desired outputs [15]. This paper is concerned with supervised learning: a Quick Propagation Neural Network is implemented as a mining method for training and testing Cancer and Pre-Cancer diagnosis using the selected input features and the target output. The case study assumes that there is a mutation in codon 248; the role of the neural network is to classify the mutation as Cancer or Pre-Cancer via the Quick Propagation algorithm. Quick propagation is based on Newton's method and is an enhancement of the back propagation algorithm. Its learning criterion chooses the change of each link coefficient (weight) by combining two traditional approaches used in [16]:

1. Learning rate regulation based on the history of the learning rate, adjusting it up to the effective value for each weight, either globally or separately.
2. Calculation of the second derivative of the error with respect to each weight.
All weights are updated separately in the quick propagation algorithm, as in [16], [17]. In the resulting weight update rule, L is the total loss ("cost") function, the gradient is calculated with respect to the weight of the connection between neurons i and j (i belongs to the higher layer and j to the lower layer), and the learning rate is denoted by "alpha". Quick propagation is implemented here on a simple neural network, as shown in Fig. 4. The bias b can be learned like any other weight, and the total loss function is based on cross-entropy, from which the generic gradient is calculated.
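The following is a minimal sketch of these ingredients, not the implementation used in the study (which relied on Alyuda NeuroIntelligence). It assumes a single sigmoid neuron with binary cross-entropy loss, and it assumes the commonly cited Quickprop step attributed to Fahlman, in which each weight moves by grad / (prev_grad - grad) times its previous step. The function names, the default `alpha` and `mu` values, and the symmetric clamping of the factor are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Hypothesis of a single sigmoid neuron: y_hat = sigmoid(w . x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def cross_entropy(y, y_hat):
    """Binary cross-entropy loss for one sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def gradient(weights, bias, x, y):
    """dL/dw_i = (y_hat - y) * x_i and dL/db = (y_hat - y) for the loss above."""
    error = predict(weights, bias, x) - y
    return [error * xi for xi in x], error


def quickprop_step(grad, prev_grad, prev_step, alpha=0.1, mu=1.75):
    """Quickprop step for one weight, updated independently of the others.

    Falls back to plain gradient descent (-alpha * grad) when there is no
    usable previous step; otherwise uses the secant estimate
    grad / (prev_grad - grad) * prev_step, with the factor clamped to
    [-mu, mu] (a simplification of the usual growth limit).
    """
    denominator = prev_grad - grad
    if prev_step == 0.0 or denominator == 0.0:
        return -alpha * grad
    factor = max(-mu, min(mu, grad / denominator))
    return factor * prev_step


def minimize_1d(grad_fn, w=0.0, steps=10):
    """Apply quickprop_step repeatedly to a single weight."""
    prev_grad, prev_step = 0.0, 0.0
    for _ in range(steps):
        grad = grad_fn(w)
        step = quickprop_step(grad, prev_grad, prev_step)
        w, prev_grad, prev_step = w + step, grad, step
    return w


# Toy check on L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = minimize_1d(lambda w: 2.0 * (w - 3.0))
```

On a quadratic loss the secant estimate is exact, so the toy run reaches the minimum at w = 3 in a handful of steps; this use of two successive gradients to approximate the second derivative is what makes the method a cheap relative of Newton's method.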

Results and Performance Measures
The dataset was analyzed using the Alyuda NeuroIntelligence tool [18]. After eliminating records with missing values, the dataset size was fixed at 5095 of the 5120 records. In the data analysis, the Pathology column is set as the target and the other eight columns are the inputs. The dataset is divided into training, validation and test subsets, and the dataset errors are presented in Fig. 5. The precision and recall measures are independent of the training and testing sample size. These metrics are derived from the confusion matrix, which is a standard data structure [19]. A simple two-class confusion matrix (Premalignant Disease (Pre-Cancer) and Cancer) for our case is represented in Table 2.
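To make the derivation of these measures concrete, the sketch below computes accuracy, precision, recall and specificity from the four cells of a two-class confusion matrix. Treating Cancer as the positive class is an assumption, and the counts are hypothetical and are not the results of this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and specificity from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical counts with Cancer as the positive class; not the paper's results.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
```

Precision and recall are ratios within a row or column of the matrix, which is why they do not depend on the absolute size of the training or testing sample.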
Results were obtained for each of the learning (training), validation and testing phases as follows:

1-Learning (Training) Phase: The learning phase was carried out in three ways, using the same dataset to train the network and construct the model with three different input activation functions:
i. Logistic
ii. Hyperbolic Tangent
iii. Linear
In this step the Quick Propagation neural network is trained on the training set, which consists of 60% of the whole data. All three activation functions yield the same prediction rates, since the sample dataset is well normalized and all parameters belong to the same codon 248 (i.e. homogeneous data).

The ROC curve of the testing phase is presented in Fig. 8.

Conclusion
The proposed mining method for predicting cancer and pre-cancer from mutations in codon 248 of TP53 leads to the following conclusions:
1-The Quick Propagation Neural Network was implemented to classify mutant codon 248 and achieves high accuracy in classifying mutations into Cancer and Premalignant Disease.
2-The mining method is accurate in predicting cancer and pre-cancer via codon 248 mutations, training and testing quick propagation with a minimal number of inputs: 8 features out of the 132 fields of the UMD_mutations_US TP53 dataset.
3-The objective of this work is to mine a large dataset, the recent TP53 (tumor protein P53) biodata named UMD_mutations_US; the research focuses on hotspot mutant codon 248 and on detecting whether the disease caused by this mutant codon is cancer or pre-cancer.
4-Separating codon 248 from the whole dataset enhances the learning and testing of the constructed model. The experiments and the results obtained from the model indicate that the updated P53 dataset is rich biodata for mining purposes and can support studying each codon (amino acid) individually to determine the types of cancer that could be caused by TP53.

Figure 1. TP53 gene position on the short arm of chromosome 17

Figure 2. Position of codon 248 on exon 7 in the P53 gene sequence using the p53 cDNA as reference (location 1 points to the A of the start of ATG)

Input set features (Table 1), by field number:
18 End_cDNA: Mutation end coordinate using the p53 cDNA as reference (location 1 points to the A of the start of ATG).
21 Genome_base_coding: Nucleotide at the start position of the mutation.
22 cDNA_variant: Mutation nomenclature according to HGVS standards using the coding sequence as reference (location 1 points to the A of the start of ATG) of the mutated codon. Del: exonic deletion. Ins: exonic insertion. Indel: complex event that involves an exonic insertion and a deletion. Splice: mutation that targets the essential AG splice acceptor site or GT splice donor site. NR: not relevant, mutations targeting intronic sequence, 5'UTR or 3'UTR.
49 Sample_origin: Adjacent tissue, Adjacent tissue (stroma), Cell line, Circulating tumour cells, Extra cellular DNA, Normal tissue, Pathological tissue, Peripheral blood lymphocytes, Tumour, Xenograft.

Figure 4. Simple neural network

Figure 8. ROC curve of the testing phase

Table 3. Comparison of the mining method with other techniques