Successful Data Science Projects: Lessons Learned from Kaggle Competition

https://doi.org/10.24017/science.2017.3.18

Authors

  • Mohammed Zuhair Al-Taie, Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
  • Naomie Salim, Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
  • Adekunle Isiaka Obasa, Department of Computer Science, College of Science and Technology, Kaduna Polytechnic, Kaduna, Nigeria

Abstract

A data science project begins with framing the problem at hand, a task that is typically business-oriented and requires human-to-human interaction. The next three steps in the pipeline (data understanding, feature extraction, and model building) are, however, the key to a successful project. Failing to fully understand the requirements of any of these three steps can degrade the performance of the resulting system. Hence, the current study addresses the following question: “What are the requirements of a successful data science project?” To answer it, we examine the solution that we built to measure the relevance of local search results of small online e-businesses and submitted to the Kaggle data science platform, and we explain why it did not achieve a top position among the competing entries. We evaluate our submitted design against the three winning submissions. Our results reveal that well-performed data preprocessing, well-defined features, and model ensembling are critical for building successful data science projects. These findings provide insight into specific aspects of model design and can help others, including Kagglers, avoid possible mistakes when approaching their own data science projects.
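
The three ingredients named above (data preprocessing, feature definition, and model ensembling) can be illustrated with a minimal sketch. It is not the submitted solution or any of the winning ones: the function names, the two toy scorers, and the equal-weight averaging are all assumptions chosen only to show how the steps fit together for a query/result relevance task.

```python
import re
from statistics import mean


def preprocess(text):
    """Preprocessing step: lowercase, drop HTML-like tags, keep alphanumeric tokens."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return re.findall(r"[a-z0-9]+", text)


def extract_features(query, title):
    """Feature step: two hypothetical overlap features between a query and a result title."""
    q, t = set(preprocess(query)), set(preprocess(title))
    union = q | t
    return {
        # Shared tokens relative to all distinct tokens.
        "jaccard": len(q & t) / len(union) if union else 0.0,
        # Fraction of query tokens that appear in the title.
        "coverage": len(q & t) / len(q) if q else 0.0,
    }


# Two toy "models" (assumptions): each maps the features to a score in [0, 1].
def model_jaccard(features):
    return features["jaccard"]


def model_coverage(features):
    return features["coverage"]


def ensemble_predict(features, models=(model_jaccard, model_coverage)):
    """Ensembling step: plain average of the individual model scores."""
    return mean(model(features) for model in models)


if __name__ == "__main__":
    feats = extract_features("LED TV", "<b>Samsung</b> LED TV 40 inch")
    print(feats, ensemble_predict(feats))
```

In practice each toy scorer would be replaced by a trained model, and the averaging weights would be tuned on held-out data rather than fixed.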

Keywords:

Data Science Pipeline, E-businesses, Kaggle Competition, Model Ensembling, Relevance Prediction.

Published

27-08-2017

Section

Pure and Applied Science

How to Cite

[1] M. Z. Al-Taie, N. Salim, and A. I. Obasa, “Successful Data Science Projects: Lessons Learned from Kaggle Competition”, KJAR, vol. 2, no. 3, pp. 40–49, Aug. 2017, doi: 10.24017/science.2017.3.18.