
Classification Model from Scratch

A beginner’s guide to building a Naive Bayes classifier (a simple classification model) from scratch using Python.

In machine learning, we can use probability to make predictions. Perhaps the most widely used example is the Naive Bayes algorithm. Not only is it straightforward to understand, but it also achieves surprisingly good results on a wide range of problems.

The Naive Bayes algorithm is a classification technique based on Bayes’ Theorem. It assumes that each feature in a class is unrelated to the presence of any other feature. As shown in the following formula, the algorithm relies on the posterior probability of the class given a predictor:

P(c|x) = P(x|c) · P(c) / P(x)

where:

  • P(c|x) is the posterior probability of the class given the predictor
  • P(x|c) is the probability of the predictor given the class, also known as the likelihood
  • P(c) is the prior probability of the class
  • P(x) is the prior probability of the predictor

Or in plain English, the Naive Bayes classifier equation can be written as:

Posterior = (Likelihood × Prior) / Evidence

The good news is that the Naive Bayes classifier is easy to implement and performs well, even with a small training data set. It is one of the fastest solutions when it comes to predicting the class of new data. Scikit-learn offers different Naive Bayes algorithms for various types of problems. One of them is Gaussian Naive Bayes, which is used when the features are continuous variables and assumes that the features follow a Gaussian distribution. Let’s dig deeper and build our own Naive Bayes classifier from scratch using Python.

1. Load required libraries

The only library required to build your own Naive Bayes classifier is NumPy. NumPy is an open source project that enables numerical computing with Python, and we will use it for the arithmetic operations.
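A single import is enough for the whole implementation:

import numpy as np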

2. Define the class

The next step is to define our Naive Bayes classifier class. A class is like an object constructor, or a “blueprint” for creating objects. In an object-oriented programming language, almost everything is an object, with its own properties and methods.

“__init__” is a reserved method in Python classes. It is known as a constructor in object-oriented programming. This method is called when an object is created from the class, and it allows the class to initialize the object’s attributes.
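A minimal sketch of the class definition (the attribute name class_summaries is an illustrative assumption; it will hold the per-class statistics computed during training):

class NaiveBayesClassifier:

    def __init__(self):
        # Filled in during training: per-class feature statistics and priors
        self.class_summaries = {}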

3. Separate Classes

According to Bayes’ Theorem, we need to know the prior probability of each class before attempting to predict that class. To calculate it, we have to assign the feature values to their specific class. We can do this by separating the classes and saving them in a dictionary.

Dictionaries are Python’s implementation of a data structure that is more generally known as an associative array. A dictionary consists of a collection of key-value pairs. Each key-value pair maps the key to its associated value.
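A sketch of such a method on our class (separate_classes is a hypothetical name; it groups the rows of X by their label in y):

def separate_classes(self, X, y):
    separated = {}
    for features, target in zip(X, y):
        # Collect every feature vector under its class label
        separated.setdefault(target, []).append(features)
    return separated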

4. Feature Summary (Statistics info)

The likelihood, or the probability of the predictor given the class, is assumed to be normally distributed (Gaussian) and is calculated from each feature’s mean, μ = (1/n) Σ xᵢ, and standard deviation, σ = √((1/n) Σ (xᵢ − μ)²). We’re going to create a summary for each feature in the data set; doing this will make it easier for us to access the mean and standard deviation of each feature later.

Two built-in Python features keep this step compact: zip(), which lets us transpose the data so we can iterate over feature columns rather than rows, and yield, which turns the method into a generator that produces one feature summary at a time.
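A sketch of the summary method (summarize is a hypothetical name, added to our class):

def summarize(self, rows):
    # zip(*rows) transposes the rows so each iteration sees one feature column
    for column in zip(*rows):
        yield np.mean(column), np.std(column)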

5. Gaussian distribution function

The likelihood for features following a normal distribution is calculated using the Gaussian distribution function:

f(x) = (1 / (σ √(2π))) · e^(−(x − μ)² / (2σ²))

To use the formula for further calculations, we define a distribution method and embed the formula exactly as it exists above.
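A direct NumPy translation (distribution is a hypothetical method name):

def distribution(self, x, mean, std):
    # Gaussian probability density of x given a feature's mean and std
    exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (np.sqrt(2 * np.pi) * std)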

6. Train the model

Training the model means applying it to the data set so it can iterate through the data and learn its patterns. In the Naive Bayes classifier, training involves calculating the mean and standard deviation for each feature of each class. This allows us to calculate the likelihoods used for predictions.
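A sketch of a fit method assembled from the pieces above (the method and dictionary key names are assumptions):

def fit(self, X, y):
    self.class_summaries = {}
    for class_name, feature_values in self.separate_classes(X, y).items():
        self.class_summaries[class_name] = {
            # Prior probability of the class
            'prior_proba': len(feature_values) / len(X),
            # (mean, std) pair for every feature of this class
            'summary': list(self.summarize(feature_values)),
        }
    return self.class_summaries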

If we take a closer look at the code snippet above, we can see that we first separate the classes in the training data set, then calculate the mean and standard deviation for each class, and finally calculate the prior probability of each class using len(feature_values)/len(X).

7. Predict

To predict a class, we have to first calculate the posterior probability for each class. The class with the highest posterior probability will be the predicted class.

The posterior probability is the joint probability divided by the marginal probability. The marginal probability, or the denominator, is the total joint probability across all classes, so it is the same for every class. Since we only need the class with the highest posterior probability, it is enough to find the class with the greatest joint probability.

Joint probability

Joint probability is the numerator of the fraction used to calculate the posterior probability. For multiple features, the joint probability formula is:

P(c) × P(x₁|c) × P(x₂|c) × … × P(xₙ|c)

Applying the same formula in Python results in the following code snippet:
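A sketch, assuming row holds one sample’s feature values in the same order as the stored (mean, std) pairs:

joint_proba = {}
for class_name, class_summary in self.class_summaries.items():
    likelihood = 1
    for feature, (mean, std) in zip(row, class_summary['summary']):
        # Normal probability of this feature value given the class
        likelihood *= self.distribution(feature, mean, std)
    # Joint probability = prior probability × total likelihood
    joint_proba[class_name] = class_summary['prior_proba'] * likelihood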

Taking a closer look at the snippet above, we follow these steps for each class:

  • Get the summary (mean, standard deviation, and prior probability)
  • Calculate the normal probability of each feature
  • Get the total likelihood (the product of all normal probabilities)
  • Get the joint probability by multiplying the prior probability by the total likelihood

Predict the class

Once we have the joint probability of each class, we can select the class with the maximum joint probability:

max(joint_proba, key=joint_proba.get)

Passing the dictionary’s get method as the key makes max compare the joint probabilities while returning the corresponding class name.

Putting it all together

If we put the joint probability step and the class prediction step together, we can predict the class for each row in a test data set with the following code snippet.
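A sketch of the combined method (predict is a hypothetical name):

def predict(self, X):
    predictions = []
    for row in X:
        joint_proba = {}
        for class_name, class_summary in self.class_summaries.items():
            likelihood = 1
            for feature, (mean, std) in zip(row, class_summary['summary']):
                likelihood *= self.distribution(feature, mean, std)
            joint_proba[class_name] = class_summary['prior_proba'] * likelihood
        # The class with the greatest joint probability is the prediction
        predictions.append(max(joint_proba, key=joint_proba.get))
    return predictions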

8. Accuracy Score

Calculating the accuracy score is an essential part of testing any machine learning model. To test our Naive Bayes classifier’s performance, we divide the number of correct predictions by the total number of predictions, which gives us a number between 0 and 1.
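A sketch of the scoring method (accuracy is a hypothetical name):

def accuracy(self, y_test, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(1 for true, pred in zip(y_test, y_pred) if true == pred)
    return correct / len(y_test)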

Our NaiveBayesClassifier vs. Sklearn GaussianNB

Now that we have built our classification model, let’s use the UCI Wine data set to compare our model’s performance with that of the GaussianNB model from scikit-learn.

Naive Bayes Classifier
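A sketch of the evaluation, using scikit-learn only to load and split the data (the test size and random seed are assumptions, so the exact score may vary):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = NaiveBayesClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('NaiveBayesClassifier accuracy: %.3f' % model.accuracy(y_test, y_pred))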

NaiveBayesClassifier accuracy: 0.972

Sklearn Gaussian NB (Naive Bayes)
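The same split evaluated with scikit-learn’s GaussianNB:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

sk_model = GaussianNB()
sk_model.fit(X_train, y_train)
sk_pred = sk_model.predict(X_test)
print('Scikit-learn GaussianNB accuracy: %.3f' % accuracy_score(y_test, sk_pred))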

Scikit-learn GaussianNB accuracy: 0.972

As you can see, the accuracy of the two models is the same, meaning that we successfully implemented a Gaussian Naive Bayes model from scratch.

Feel free to use the GitHub repository here to find the full Python files and the notebook used to create this article.

References

How The Naive Bayes Classifier Works In Machine Learning

UCI Wine Data Set

Scikit-learn Naive Bayes

