Performance Comparison of Naive Bayes and Complement Naive Bayes Algorithms


Seref B., BOSTANCI G. E.

6th International Conference on Electrical and Electronics Engineering (ICEEE), İstanbul, Türkiye, 16-17 April 2019, pp. 131-138

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/iceee2019.2019.00033
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Pages: pp. 131-138
  • Keywords: big data, training time, testing time, mahout
  • Ankara University Affiliated: Yes

Abstract

Big data is characterized by the three Vs: volume, velocity and variety. Its size and complexity make it hard to analyze, store and process, and traditional tools take too long to analyze it. Tools and libraries designed for big data reduce this processing time. For example, Hadoop is an open-source library that enables distributed computing over large datasets, Mahout provides machine learning, Hive provides querying and Kafka provides messaging. In this paper, Hadoop and Mahout are used to compare the performance of the Naive Bayes and Complement Naive Bayes algorithms in terms of average percentage of correctly classified instances, average training time and average testing time across datasets of different sizes. The "20 Newsgroups" dataset is used, and its size is increased by scaling factors of 2, 4 and 8, producing datasets of size 37692, 75384 and 150768. Every experiment is run 10 times on each dataset with different smoothing, weight and normalization parameters, and the results are averaged. The experiments show that Naive Bayes outperforms Complement Naive Bayes in average training time, whereas Complement Naive Bayes outperforms Naive Bayes in average percentage of correctly classified instances.
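
The measurement protocol described in the abstract (scaled dataset sizes, 10 repetitions, three averaged metrics) can be illustrated with a minimal Python sketch. It is not the authors' Hadoop/Mahout setup: the names `scaled_sizes`, `average_runs`, `train_fn` and `test_fn` are hypothetical, and the base size of 18846 is an assumption derived from 37692 / 2.

```python
import time
from statistics import mean

# Assumption: unscaled document count, inferred from 37692 / 2 in the abstract.
BASE_SIZE = 18846


def scaled_sizes(base_size, factors=(2, 4, 8)):
    """Return the dataset size produced by each scaling factor."""
    return [base_size * f for f in factors]


def average_runs(train_fn, test_fn, runs=10):
    """Repeat a train/test cycle `runs` times and average the three metrics.

    train_fn() is a hypothetical callable that returns a trained model.
    test_fn(model) is a hypothetical callable that returns the fraction
    of correctly classified instances (between 0.0 and 1.0).
    """
    train_times, test_times, accuracies = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        model = train_fn()
        train_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        accuracy = test_fn(model)
        test_times.append(time.perf_counter() - start)
        accuracies.append(accuracy)

    return {
        "avg_correct_pct": 100.0 * mean(accuracies),
        "avg_train_s": mean(train_times),
        "avg_test_s": mean(test_times),
    }


print(scaled_sizes(BASE_SIZE))  # [37692, 75384, 150768]
```

In such a harness, `train_fn` and `test_fn` would wrap whichever classifier is under comparison (Naive Bayes or Complement Naive Bayes), once per dataset size and parameter setting.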