6th International Conference on Electrical and Electronics Engineering (ICEEE), İstanbul, Türkiye, 16-17 April 2019, pp. 131-138
Big data is commonly characterized by three Vs: volume, velocity and variety. Its size and complexity make it hard to store, process and analyze, and traditional tools take too long to execute on it. Dedicated tools and libraries reduce the time needed to analyze and process big data. For example, Hadoop is an open-source library that enables distributed computing over large datasets, Mahout is a library for machine learning, Hive supports querying and Kafka supports messaging. In this paper, Hadoop and Mahout are used to compare the performance of the Naive Bayes and Complement Naive Bayes algorithms in terms of average percentage of correctly classified instances, average training time and average testing time for different dataset sizes. The "20 Newsgroups" dataset is used, and its size is increased by scaling it by factors of 2, 4 and 8, yielding datasets of size 37692, 75384 and 150768. Each experiment is run 10 times on every dataset with different smoothing, weight and normalization parameters, and the results are averaged. The experiments show that Naive Bayes outperforms Complement Naive Bayes in average training time, whereas Complement Naive Bayes outperforms Naive Bayes in average percentage of correctly classified instances.
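To make the difference between the two classifiers concrete, the following is a minimal single-machine sketch of their decision rules in their common textbook formulation. It is not Mahout's implementation and does not reproduce the paper's experiments: the function names (`train_counts`, `predict_nb`, `predict_cnb`), the raw term-count features and the single smoothing parameter `alpha` are assumptions made for illustration, and the weight and normalization parameters mentioned above are left out.

```python
import math
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Collect per-class term counts, per-class document counts and the vocabulary.

    docs is a list of token lists; labels is the matching list of class names.
    (Hypothetical helper for this sketch, not a Mahout API.)
    """
    term_counts = defaultdict(Counter)
    class_docs = Counter()
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        class_docs[label] += 1
        vocab.update(tokens)
    return term_counts, class_docs, vocab


def predict_nb(tokens, term_counts, class_docs, vocab, alpha=1.0):
    """Multinomial Naive Bayes: choose the class with the highest log posterior."""
    total_docs = sum(class_docs.values())
    best_class, best_score = None, -math.inf
    for c in class_docs:
        # Smoothed denominator: all term occurrences in class c plus alpha per vocabulary term.
        denom = sum(term_counts[c].values()) + alpha * len(vocab)
        score = math.log(class_docs[c] / total_docs)  # log prior
        for t in tokens:
            if t in vocab:  # terms unseen during training are ignored
                score += math.log((term_counts[c][t] + alpha) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


def predict_cnb(tokens, term_counts, class_docs, vocab, alpha=1.0):
    """Complement Naive Bayes: estimate term weights from every class except c,
    then choose the class whose complement fits the document worst."""
    all_counts = Counter()
    for c in class_docs:
        all_counts.update(term_counts[c])
    grand_total = sum(all_counts.values())
    best_class, best_score = None, math.inf
    for c in class_docs:
        # Term occurrences outside class c, smoothed in the same way as above.
        denom = grand_total - sum(term_counts[c].values()) + alpha * len(vocab)
        score = 0.0
        for t in tokens:
            if t in vocab:
                complement = all_counts[t] - term_counts[c][t]
                score += math.log((complement + alpha) / denom)
        if score < best_score:
            best_class, best_score = c, score
    return best_class


# Toy usage with made-up documents (not taken from "20 Newsgroups").
toy_docs = [["ball", "goal", "team"], ["goal", "match", "team"],
            ["cpu", "disk", "code"], ["code", "bug", "cpu"]]
toy_labels = ["sport", "sport", "tech", "tech"]
model = train_counts(toy_docs, toy_labels)
print(predict_nb(["goal", "team"], *model), predict_cnb(["cpu", "bug"], *model))
```

In this formulation both classifiers are trained from the same counting pass over the data. The complement variant additionally needs the counts summed over all classes, which here are computed at prediction time for brevity; a real implementation would precompute them once during training.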