Volume 14, number 1

Tamilvanan T, Bhaskaran V. M. A New Feature Selection Techniques Using Genetics Search and Random Search Approaches for Breast Cancer. Biosci Biotechnol Res Asia 2017;14(1)
Manuscript received on : 06 January 2017
Manuscript accepted on : 24 January 2017
Published online on:  --

Plagiarism Check: Yes


A New Feature Selection Techniques Using Genetics Search and Random Search Approaches for Breast Cancer

Tamilvanan1 and V. Murali Bhaskaran2

1Research and Development Centre, Bharathiar University, Coimbatore-641046, Tamil Nadu, India.

2Dhirajlal Gandhi College of Technology, Salem-636290, Tamil Nadu, India.

DOI : http://dx.doi.org/10.13005/bbra/2459

ABSTRACT: This paper deals with several classification algorithm techniques combined with a feature selection step used to improve predictive accuracy. Correlation-based feature selection is applied as the feature evaluator, with genetic and random search methods. The classification models are compared on sensitivity, specificity, precision, time, and accuracy. Finally, it concludes that the proposed CFS-NB algorithm performs better than the other classification algorithm techniques for breast cancer disease.

KEYWORDS: Correlation-based Feature Selection; Data Mining; Genetic Algorithm; Random Search and Naive Bayes Algorithm


Introduction

Data mining techniques and applications are utilized in a wide range of fields, including banking, social science, education, business, bioinformatics, weather forecasting, healthcare, and big data. Nowadays the healthcare industry generates a large amount of data about patients, disease diagnoses, etc. Several different approaches to building accurate classifiers have been proposed (e.g., NB, MLP, SMO, RF). In classification, we are given a breast cancer data set of example records, or the input data, called the test data set, with each record consisting of various attributes.

An attribute can be either a numerical attribute or a categorical attribute. If the values of an attribute belong to an ordered domain, the attribute is called a numerical attribute (e.g., Age, Menopause, Tumor-Size, Inv-Nodes, Deg-Malig); otherwise it is a categorical attribute (e.g., Node-Caps, Breast, Breast-Quad, Irradiat, Class). Classification is the process of splitting a dataset into mutually exclusive groups, called classes, based on suitable attributes.

This paper is organized as follows: Section 1 reviews related work and describes the technical aspects of the data mining techniques used. Section 2 elaborates on the classification algorithms: Naive Bayes, multi-layer perceptron, sequential minimal optimization, and random forest. Section 3 introduces the breast cancer dataset. Section 4 presents the experimental results and discussion. Finally, we conclude the paper and outline future work.

Correlation Based Feature Selection

Correlation-based feature selection (CFS)9,10,18 is one of the well-known methods to rank the relevance of features by measuring correlation between features and classes and between features and other features. Given a number of features k and classes c, CFS defines the merit of a feature subset using Pearson's correlation equation:

M_s = (k · r_cf) / √(k + k(k − 1) · r_ff)

where M_s is the merit of the feature subset, r_cf is the average linear correlation coefficient between the features and the classes, and r_ff is the average linear correlation coefficient between different features.

Typically, CFS adds (forward selection) or deletes (backward elimination) one feature at a time. In this work we used genetic search and random search algorithms to obtain better results.19,20
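The CFS merit measure is straightforward to evaluate once the average correlations are known. The following minimal Python sketch uses made-up correlation values purely for illustration; it shows how the formula rewards subsets whose features correlate with the class but not with each other.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Merit M_s of a k-feature subset, where r_cf is the average
    feature-class correlation and r_ff the average feature-feature
    (redundancy) correlation."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical values: equal relevance to the class, different redundancy.
low_redundancy = cfs_merit(5, r_cf=0.6, r_ff=0.1)
high_redundancy = cfs_merit(5, r_cf=0.6, r_ff=0.9)
# The less redundant subset receives the higher merit score.
```

For a single feature (k = 1) the merit reduces to the feature-class correlation itself, which matches the intuition behind the formula.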

Genetic Search Algorithm

Search strategies explore the attribute space to find a good subset, whose quality is measured by an attribute subset evaluator; here the CFS subset evaluator is used together with genetic search as the search technique. The parameters of the genetic algorithm are the number of generations, the population size, and the probabilities of mutation and crossover. A member of the initial population is created by selecting a list of attribute indices as a search point. Progress reports can be produced every set number of generations.20,21 The basic genetic search procedure is as follows:

Step 1: Start by randomly generating an initial population P.

Step 2: Calculate the fitness e(x) for each member x ∈ P.

Step 3: Define a probability distribution p over the members of P, where p(x) ∝ e(x).

Step 4: Using p, choose two population members x and y and recombine them to produce new members x′ and y′.

Step 5: Apply mutation to x′ and y′.

Step 6: Insert x′ and y′ into the new population P′.

Step 7: If |P′| < |P|, go to Step 4.

Step 8: Let P ← P′.

Step 9: If there are more generations to process, go to Step 2.

Step 10: Return the x ∈ P for which e(x) is highest.
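The steps above can be sketched in a few lines of Python. This is a toy illustration, not the Weka implementation: bit-vectors mark which features are kept, parents are drawn with probability proportional to fitness, and the fitness function e(x) used here is made up (it counts bits matching a hidden "ideal" pattern).

```python
import random

def genetic_search(e, n_bits, pop_size=20, generations=30,
                   p_crossover=0.6, p_mutation=0.033, seed=0):
    """Toy version of Steps 1-10: evolve bit-vectors (1 = keep the
    feature), selecting parents with probability proportional to e(x)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]                        # Step 1
    for _ in range(generations):
        fits = [e(x) for x in pop]                          # Step 2
        weights = fits if any(fits) else [1] * pop_size     # Step 3: p(x) ∝ e(x)
        new_pop = []
        while len(new_pop) < pop_size:                      # Steps 4-7
            x, y = rng.choices(pop, weights=weights, k=2)
            if rng.random() < p_crossover:                  # one-point crossover
                cut = rng.randrange(1, n_bits)
                x, y = x[:cut] + y[cut:], y[:cut] + x[cut:]
            x = [b ^ (rng.random() < p_mutation) for b in x]  # Step 5
            y = [b ^ (rng.random() < p_mutation) for b in y]
            new_pop += [x, y]
        pop = new_pop[:pop_size]                            # Steps 8-9
    return max(pop, key=e)                                  # Step 10

# Made-up fitness: number of bits of x that match a hidden target subset.
target = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
e = lambda x: sum(a == b for a, b in zip(x, target))
best = genetic_search(e, n_bits=10)
```

In the real pipeline the fitness e(x) would be the CFS merit of the subset encoded by x rather than this toy matching score.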

Random Search Algorithm

Step 1: Set the algorithm parameters Θ0, the initial points X0 ⊂ S, and the iteration index k = 0.

Step 2: Generate a collection of candidate points Vk+1 ⊂ S according to a specific generator and its associated sampling distribution.

Step 3: Update Xk+1 based on the candidate points Vk+1, the previous iterates, and the algorithm parameters; also update the parameters Θk+1.

Step 4: If a stopping criterion is met, stop. Otherwise, increment k and return to Step 2.
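A minimal Python sketch of this generic procedure, specialized to bit-vector feature subsets: the candidate generator and the keep-the-best update rule below are illustrative choices, not prescribed by the algorithm, and the evaluator e is a stand-in for the CFS merit used in this work.

```python
import random

def random_search(e, n_bits, n_candidates=10, iterations=50, seed=0):
    """Toy version of Steps 1-4 for bit-vector feature subsets, using a
    uniform candidate generator and a keep-the-best update rule."""
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n_bits)]        # Step 1: X0
    for _ in range(iterations):
        candidates = [[rng.randint(0, 1) for _ in range(n_bits)]
                      for _ in range(n_candidates)]          # Step 2: Vk+1
        challenger = max(candidates, key=e)                  # Step 3: update
        if e(challenger) > e(best):
            best = challenger
    return best                                              # Step 4: stop

# Made-up evaluator for the demo: prefer subsets selecting more features.
best = random_search(e=sum, n_bits=8)
```

Unlike the genetic search, no information flows between candidates; each iteration simply samples fresh points and keeps the best seen so far.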

Naive Bayes Classification

Naive Bayes27-30 implements a probabilistic naive Bayes classifier; "naive" refers to the assumption of conditional independence among the attributes. This assumption greatly reduces computational complexity to a simple multiplication of probabilities. Naive Bayes handles numeric attributes using supervised discretization, and kernel density estimators can improve its performance. It requires only a small training set to produce accurate parameter estimates, since it needs only the frequencies of attribute values and attribute-value/class combinations in the training data.30,31
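The frequency-counting view described above can be made concrete with a small pure-Python sketch for categorical attributes. The add-one (Laplace) smoothing is an added assumption to avoid zero probabilities, and the smoothing denominator here assumes two possible values per attribute, as in the tiny made-up example.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Collect the frequencies Naive Bayes needs: class counts (for the
    priors) and, per (class, attribute) pair, counts of each value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attr index) -> Counter
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(y, i)][v] += 1
    return class_counts, value_counts

def predict_nb(model, row):
    """Pick the class maximizing P(y) * prod_i P(v_i | y), with add-one
    smoothing (assuming two possible values per attribute)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / total                                   # prior P(y)
        for i, v in enumerate(row):                      # × Π P(v_i | y)
            p *= (value_counts[(y, i)][v] + 1) / (cy + 2)
        if p > best_p:
            best, best_p = y, p
    return best

# Tiny made-up two-attribute example (Node-Caps, Breast):
rows = [("yes", "left"), ("yes", "right"), ("no", "left"), ("no", "left")]
labels = ["recurrence", "recurrence", "no-recurrence", "no-recurrence"]
model = train_nb(rows, labels)
```

Training is a single pass over the data to tally frequencies, which is why Naive Bayes needs so little training data and time compared with the other classifiers considered here.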

Dataset

The dataset used in this model was collected from the UCI machine learning repository.32 The data should be as accurate and precise as possible in order to improve the predictive accuracy of data mining tasks. The data set may have missing or irrelevant attributes, and these must be handled effectively by the data mining process.

Attribute Identification

The breast cancer dataset consists of 286 instances and 10 attributes, with the class stating the prognosis (recurrence: yes or no), as shown in Table 1.

Table 1: Dataset for Breast cancer

Attribute Name Description
Age Age (years)
Inv-Nodes 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
Node-Caps yes, no
Menopause lt40, ge40, premeno
Tumor-Size 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
Deg-Malig 1, 2, 3
Breast left, right
Breast-Quad left-up, left-low, right-up, right-low, central
Irradiat yes, no
Class no-recurrence-events, recurrence-events

 

Methodology of Proposed Systems

The research procedure has two phases. In the first, genetic-search-based CFS is applied to the breast cancer dataset, reducing the attributes from 10 to 5, and the Naive Bayes algorithm is then applied for classification. In the second phase, random-search-based CFS is applied to the same data set, reducing the attributes from 10 to 6, after which the Naive Bayes classification algorithm is applied for better prediction.

Experimental Results and Discussion

The experimental results illustrate the different measures used to evaluate the models for classification and prediction. In this work, specificity, precision, sensitivity, and accuracy are reported.

Sensitivity, Accuracy, Specificity and Precision

Accuracy = (TP + TN)/(TP + TN + FP + FN)       (1)

Sensitivity = TP/(TP + FN)                                (2)

Specificity = TN/(TN + FP)                                   (3)

Precision = TP/(TP + FP)                                      (4)
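Equations (1)–(4) translate directly into code; the counts in the example call below are made up purely for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations (1)-(4): the four evaluation measures computed from the
    confusion-matrix counts (true/false positives and negatives)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),   # Eq. (1)
        "sensitivity": tp / (tp + fn),                    # Eq. (2)
        "specificity": tn / (tn + fp),                    # Eq. (3)
        "precision":   tp / (tp + fp),                    # Eq. (4)
    }

# Made-up counts for illustration: 80 of 100 instances correct.
m = classification_metrics(tp=50, tn=30, fp=10, fn=10)
```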

Table 2: Before Feature Selection

Classifier Precision Sensitivity Specificity Time (s) Accuracy
NB 83% 77% 52% 0.00 71%
SMO 85% 75% 48% 0.09 69%
MLP 74% 75% 40% 2.18 64%
RF 86% 74% 48% 0.1 69%

 

Figure 1: Before Feature Selection

 

Table 3: After Feature Selection using Genetics Search

Classifier Precision Sensitivity Specificity Time (s) Accuracy
NB 83% 78% 54% 0.00 72%
SMO 81% 73% 41% 0.01 66%
MLP 84% 77% 53% 0.99 71%
RF 88% 74% 50% 0.03 70%

 

Table 4: After Feature Selection using Random search

Classifier Precision Sensitivity Specificity Time (s) Accuracy
NB 39% 51% 82% 0.00 77%
SMO 30% 53% 80% 0.06 76%
MLP 32% 41% 80% 1.11 73%
RF 33% 52% 81% 0.13 76%

 

Figure 2: After Feature Selection using Random Search


Figure 3: After Feature Selection using Genetics Search

 

10-Fold Cross-Validation

The classification algorithm is trained and tested 10 times. Cross-validation divides the data into 10 subgroups, and each subgroup is tested against a classifier built from the remaining 9 groups. Ten distinct test results are obtained, one per train-test configuration, and their average gives the test accuracy of the algorithm.
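The fold construction can be sketched as follows; the stride-based split after a single shuffle is an illustrative choice, not the exact (stratified) split Weka uses.

```python
import random

def ten_fold_indices(n, k=10, seed=0):
    """Shuffle the n instance indices once, then deal them into k
    near-equal folds; each fold serves as the test set while the other
    k-1 folds train the model."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = ten_fold_indices(286)        # 286 instances in this dataset
for test_fold in folds:
    train_idx = [i for f in folds if f is not test_fold for i in f]
    # train on train_idx, evaluate on test_fold; average the 10 accuracies
```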

Confusion Matrix

The confusion matrix7 shows how many instances have been assigned to each class: each element of the matrix gives the number of test cases whose actual class is the row and whose predicted class is the column. Tables 5, 6 and 7 show the confusion matrices computed for Naive Bayes, the genetic-search-based CFS-NB, and the random-search-based CFS-NB algorithms.

Table 5: Before Feature Selection based on Naive Bayes

a b Classified as
168 33 a = no-recurrence-events
48 37 b = recurrence-events
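As a quick arithmetic check, the counts in Table 5 reproduce the Naive Bayes accuracy reported in Table 2: the diagonal entries (correct predictions) sum to 168 + 37 = 205 out of 286 instances. Sensitivity, specificity, and precision additionally depend on which class is treated as positive, so only the accuracy is verified here.

```python
# Counts from Table 5 (rows = actual class, columns = predicted class)
correct_no, wrong_no = 168, 33      # actual no-recurrence-events
wrong_rec, correct_rec = 48, 37     # actual recurrence-events

total = correct_no + wrong_no + wrong_rec + correct_rec   # 286 instances
accuracy = (correct_no + correct_rec) / total             # diagonal / total
print(round(accuracy, 3))   # 0.717, i.e. the 71% NB accuracy in Table 2
```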

 

Table 6: After Feature Selection for Genetic search Based on Naive Bayes

a b Classified as
27 41 a = Yes
27 191 b = No

 

Table 7: After Feature Selection for Random search Based on Naive Bayes

a b Classified as
27 41 a = Yes
25 193 b = No

 

Graph Results

Figure 3 shows the accuracy of the different classification algorithms achieved through genetic-search-based CFS. Figure 2 shows the accuracy of the different classification algorithms achieved through random-search-based CFS.

Conclusion

In this work, an enhanced technique was developed for breast cancer diagnosis. The results show that random-search-based CFS-NB achieved comparable classification accuracy with a reduced feature subset of six features, while genetic-search-based CFS-NB delivered better classification accuracy with a reduced subset of five features. A comparative study was conducted on the breast cancer data using random- and genetic-search-based CFS together with other classification algorithms, namely multilayer perceptron, sequential minimal optimization, and random forest. The experimental results show that genetic-search-based CFS with Naive Bayes performed better than the other classification algorithms in terms of time and accuracy.

References

  1. Yua Q., Tan K. C., Toeh E. J., Goh K. C.  A hybrid evolutionary algorithm for attribute selection in data mining. Expert Systems with Applications. 2009;36:8616–30.
  2. Kim J. S., Yoon H., Park C. S., Baek J. G.  Algorithm learning based neural network integrating feature selection and classification. Expert Systems with Applications. 2013;40:231–41.
  3. Sahin F., Chandrasekar G.  A survey on feature selection methods. Computers and Electrical Engineering. 2014;40:16–24.
  4. Gupta M., Chandra B. An efficient statistical feature selection Approach for classification of gene expression data.Journal of Bio medical Informatics. 2011;44:529–35.
  5. Yu L., Liu H. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering. 2005;17(4).
  6. Krishnakumari P., Purusothaman G. A Survey of Data Mining Techniques on Risk Prediction: Heart Disease. Indian Journal of Science and Technology. 2015;8(12).
  7. Sumathi C. P., Suganya P. A Novel Metaheuristic Data Mining Algorithm for the Detection and Classification of Parkinson Disease. Indian Journal of Science and Technology. 2015;8(14).
  8. Nasira G. M., Kalaiselvi C. Prediction of Heart Diseases and Cancer in Diabetic Patients Using Data Mining Techniques.Indian Journal of Science and Technology. 2015;8(14).
  9. Hall M. A., Smith L. A. Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper. FLAIRS Conference. 1999.
  10. Hall M. Correlation-based feature selection for discrete and numeric class machine learning. Proceedings of the Seventeenth International Conference on Machine Learning. 2000;359–66.
  11. Deng Y., Lu X., Peng X., Feng B., Liu Ping, Liao B. A Novel Feature Selection Method Based on Correlation-Based Feature Selection in Cancer Recognition. Journal of computational and Theoretical nano science. 2014;11(2):427– 33.
  12. Kennedy J., Eberhart R. C.  A new optimizer using particle swarm theory. In Proceedings of the sixth international symposium on micro machine and human science. 1995;1:39–43.
  13. Eberhart R. C., Shi Y. Empirical study of particle swarm optimization. IEEE Proceedings of the 1999 Congress on Evolutionary Computation. 1999;3.
  14. Eberhart R. C., Shi Y. Particle swarm optimization: developments, applications and resources. IEEE Proceedings of the 2001 Congress on in Evolutionary Computation. 2001;1:81–6.
  15. Sudha S., Vijayarani S. An Efficient Clustering Algorithm for Predicting Diseases from Hemogram Blood Test Samples. Indian Journal of Science and Technology. 2015;8(17).
  16. García-Saez G., Caballero-Ruiz E., Balsells M., Pons B., Morillo M., Gomez E. J., Hernando M. E.  Automatic Blood Glucose Classification for Gestational Diabetes with Feature Selection: Decision Trees vs. Neural Networks. XIII Mediterranean Conference on Medical and Biological Engineering and Computing, 2013. IFMBE Proceedings. 2014; 41:1370–3.
  17. Gao Y., Xu J., Sun L. X. T.  An ensemble feature selection technique for cancer recognition. Bio-Medical Materials and Engineering. 2014;24:1001–8.
  18. Murugan A., Sridevi T. A Novel Feature Selection Method for Effective Breast Cancer Diagnosis and Prognosis. International Journal of Computer Applications. 2014;88(11).
  19. Yi Y., Wang J., Zhou S., Kong J. An Improved Feature Selection Based on Effective Range for classification. The Scientific World Journal. 2014;1–8.
  20. Yang A., Zang Y., Xiong C., Wang T., Zhang Z. Feature Selection for Data Envelopment Analysis. Knowledge Based Systems. 2014;64:70–80.
  21. Li Y., Ding S., Shi Z., Yan S. A Protein Structural Classes Prediction Method based on Predicted Secondary Structure and PST-BLAST Profile. Biochimie. 2014;97:60-5.
  22. Kennedy J., Poli R., Blackwell T. Particle swarm optimization. Swarm Intelligence. 2007;1(1):33–57; Kennedy J. Particle swarm optimization. Encyclopedia of Machine Learning. Springer US. 2010;760–6.
  23. Mukhopadhyay A., Mandal M. A novel PSO-based graphtheoretic approach for identifying most relevant and non-redundant gene markers from gene expression data. International Journal of Parallel, Emergent and Distributed Systems. 2014;1–18.
  24. Wang B. , Ji Z. Identifying potential clinical syndromes of Hepatocellular carcinoma Using PSO-based hierarchical feature selection algorithm. Bio Med research international.  2014;1–12.
  25. Ananthanarayanan N. R.,Yasodha P. Analyzing Big Data to Build Knowledge Based System for Early Detection of Ovarian Cancer. Indian Journal of Science and Technology.2015;8(14).
  26. Dumitru D. Prediction of recurrent events in breast cancer using the Naive Bayesian classification. Annals of University of Craiova Math Comp Sci Ser. 2009;36(2):92–6.
  27. Chandrasekaran R. N., Nagarajan S. Design and Implementation of Expert Clinical System for Diagnosing Diabetes using Data Mining Techniques. Indian Journal of Science and Technology. 2015;8(8):771–6.
  28. Velmurugan T., Anuradha C. A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Students Performance. Indian Journal of Science and Technology. 2015;8(15).
  29. Lee K. H., Leung K. S., Wang J. F., Ng E. Y., Chan H. L., Tsui S. K.,Mok T. S., Tse p. C., Sung J. J.  Data Mining on DNA Sequences of Hepatitis B Virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011;8(2):428–40.
  30. Thangaraju P., Karthikeyan T. Analysis of Classification Algorithms Applied to Hepatitis Patients. International Journal of Computer Applications. 2013;62(5):25–30.
  31. Available from:http://archieve.ics.uci.edu/ml/datasets/ Breast Cancer/ Accessed on: 2015;27:5.
  32. Thangaraju P., Karthikeyan T. PCA-NB Algorithm to Enhance the Predictive Accuracy. International Journal of Engineering and Technology. 2014;6(1):381–7.
  33. Thangaraju P., Karthikeyan T. A Combined Approach of CFS and Naive bayes Algorithm to Enhance the Prediction Accuracy. IEEE International Conference on Control Instrumentation Communication and Computational Technologies (ICCICCT 2014). 2014.
  34. Thangaraju P., Karthikeyan T. Best First and Greedy Search based CFS-Naive Bayes Classification Algorithms for Hepatitis Diagnosis. Biosciences and Biotechnology Research Asia. 2015;12(1):983–90.
  35. Frank E., Witten I. H. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers. 2011.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.