Thursday, 21 November 2013

Performance of Naïve Bayesian (NB) & J48 Algorithm


Ms. Arpana Chaturvedi,


Due to the large volumes of data and its complex, dynamic properties, data-mining techniques have been applied to such datasets. With recent advances in computer technology, large amounts of data can be collected and stored. Machine-learning techniques can help integrate computer-based systems into any environment, providing opportunities to facilitate and enhance the work of professionals across industries, and ultimately improving the efficiency and quality of data and information. The objective of this paper is to analyse a large dataset using different supervised machine-learning algorithms and to obtain the maximum classification accuracy, thereby improving performance.

In this article, we discuss the performance of two supervised learning techniques.

The Naive Bayes classifier (a probabilistic learner) is based on Bayes' theorem and is used when the dimensionality of the inputs is high. Naïve Bayes classifiers assume that the value of a variable on a given class is independent of the values of the other variables. The Naive-Bayes inducer computes the conditional probabilities of the classes given the instance and picks the class with the highest posterior. Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
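The inducer described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the toy weather-style rows and labels are hypothetical, features are assumed categorical, and conditional probabilities are estimated from raw counts (no smoothing).

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors P(c) and per-feature conditionals P(x_i | c) from counts."""
    priors = Counter(labels)                 # class -> count
    cond = defaultdict(Counter)              # (feature index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond, len(labels)

def predict_nb(priors, cond, n, row):
    """Pick the class with the highest (unnormalised) posterior P(c) * prod_i P(x_i | c)."""
    best, best_p = None, -1.0
    for c, cnt in priors.items():
        p = cnt / n                          # prior P(c)
        for i, v in enumerate(row):
            p *= cond[(i, c)][v] / cnt       # conditional P(x_i = v | c); 0 if unseen
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical training data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
priors, cond, n = train_nb(rows, labels)
print(predict_nb(priors, cond, n, ("rain", "mild")))  # prints "yes"
```

Because the naïve independence assumption factorises the likelihood into per-feature terms, training is a single counting pass over the data, which is why these classifiers train so efficiently.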

J48 (an enhanced implementation of C4.5) is based on the ID3 algorithm developed by Ross Quinlan, with additional features to address problems that ID3 could not handle. In practice, C4.5 uses a successful method for finding high-accuracy hypotheses, based on pruning the rules derived from the tree constructed during the learning phase. However, the principal disadvantage of C4.5 rule sets is the amount of CPU time and memory they require. Given a set S of cases, J48 first grows an initial tree using the divide-and-conquer algorithm as follows:
• If all the cases in S belong to the same class, or S is small, the tree is a leaf labelled with the most frequent class in S.
• Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree, with one branch for each outcome of the test; partition S into corresponding subsets S1, S2, …, according to the outcome for each case, and apply the same procedure recursively to each subset.
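The two cases above map directly onto a recursive function. The sketch below is a simplification, not J48 itself: attributes are assumed categorical, the data and the `min_size` threshold are hypothetical, and the split is chosen naively as the first remaining attribute rather than by the heuristic criteria J48 actually uses.

```python
from collections import Counter

def grow_tree(rows, labels, attrs, min_size=2):
    """Divide-and-conquer tree growth: leaf if S is pure or small, else split and recurse."""
    # Leaf case: all cases share one class, S is small, or no attributes remain.
    if len(set(labels)) == 1 or len(labels) < min_size or not attrs:
        return Counter(labels).most_common(1)[0][0]   # most frequent class in S
    attr = attrs[0]                                   # naive test choice (J48 ranks tests)
    branches = {}
    for value in set(row[attr] for row in rows):
        # Partition S into the subset matching this outcome, then recurse on it.
        sub = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*sub)
        branches[value] = grow_tree(list(sub_rows), list(sub_labels), attrs[1:], min_size)
    return (attr, branches)                           # internal node: (tested attribute, outcome -> subtree)

# Hypothetical data: attribute 0 (outlook) separates the classes perfectly.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(grow_tree(rows, labels, [0, 1]))  # prints ('sunny' -> 'no', 'rain' -> 'yes') rooted at attribute 0
```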
There are usually many tests that could be chosen in this last step.
J48 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets {Si}, and the default gain ratio, which divides information gain by the information provided by the test outcomes.
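The information-gain criterion can be computed directly from its definition: the entropy of S minus the weighted entropy of the subsets {Si} produced by the split. The toy rows and labels below are hypothetical, used only to show the calculation.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, attr) = H(S) - sum_i |Si|/|S| * H(Si), splitting S on attribute `attr`."""
    n = len(labels)
    subsets = defaultdict(list)               # outcome -> labels of the subset Si
    for row, y in zip(rows, labels):
        subsets[row[attr]].append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical data: attribute 0 separates the classes perfectly, attribute 1 only partly.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # prints 1.0 (pure subsets, zero remaining entropy)
print(information_gain(rows, labels, 1))  # prints 0.5 (the "mild" subset stays mixed)
```

A test that yields pure subsets drives the total entropy of {Si} to zero, so its gain equals H(S); this is why gain-based ranking favours such splits.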
