Comparative Study of Classification Techniques For Large Scale Data - Case Study

Nigar M.Shafiq Surameery; Dana Lattef Hussein

doi:10.24017/science.2017.3.2

Authors

Nigar M.Shafiq Surameery, Building and Construction Engineering Dept, College of Engineering, University of Garmian, Kalar, Sulaimani, Iraq
Dana Lattef Hussein, Database Dept, Computer Science Institute, Sulaimani Polytechnic University, Sulaimani, Iraq

Abstract

Massive datasets generated in many applications present both opportunities and challenges. In particular, scalable mining of such large-scale datasets is a challenging problem that has attracted recent research. The present study analyses classification techniques using the WEKA machine learning workbench on a large-scale dataset from the protein structure prediction field. The dataset had already been partitioned into training and test sets using the ten-fold cross-validation methodology. Nine different methods were tested in this experiment. The results showed that it was not feasible to test more than one classifier from the tree family in the same experiment. In addition, using the NaiveBayes classifier with the default properties of the attribute selection filter was very time-consuming. Finally, varying the parameters of the attribute selection should be prioritized to obtain more accurate results.
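As an illustration of the ten-fold cross-validation partitioning mentioned in the abstract, the following is a minimal Python sketch. It is not the WEKA procedure used in the study: the function name `ten_fold_splits`, the fixed seed, and the instance count of 25 are assumptions chosen only for illustration.

```python
import random


def ten_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Hypothetical helper for illustration; not part of WEKA.
    """
    indices = list(range(n_instances))
    # Shuffle with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training set: every instance that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: partition 25 instances (an assumed toy size) into ten folds.
splits = list(ten_fold_splits(25))
```

Each instance appears in exactly one test fold and in the training set of the other nine folds, so every classifier is evaluated on data it was not trained on.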

Keywords:

classification techniques, WEKA, data mining, bioinformatics, knowledge discovery, large-scale data.
