This post introduces a common probability-based classification method in machine learning: Naive Bayes. The previously introduced KNN and decision-tree methods make hard decisions, since those classifiers output only a class label such as 0 or 1. Naive Bayes instead outputs the probability that a sample belongs to a class, a value between 0 and 1. Naive Bayes is very effective for text classification, for example spam detection.
Naive Bayes is based on the familiar Bayes' theorem:

P(c|x) = P(x|c) P(c) / P(x)

Suppose we have a two-class problem with classes c1 and c2, and we are given a sample, such as an email, represented by a vector x. An email is a piece of text, and text is composed of words, so x encodes which words appear in the email. Given the sample x, we want to decide whether it belongs to c1 or c2. In terms of probabilities, we compare the two posteriors:

P(c1|x) and P(c2|x)

Applying Bayes' theorem to each, we get:

P(ci|x) = P(x|ci) P(ci) / P(x), i = 1, 2

Since P(x) is the same for both classes, it is enough to compare P(x|c1) P(c1) against P(x|c2) P(c2). The "naive" part is the assumption that the words are conditionally independent given the class, so P(x|ci) factors into a product of per-word probabilities P(xj|ci). This is what we call Naive Bayes; everything that follows is just counting to estimate these probabilities.
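As a quick sanity check on the theorem, here is a tiny worked example with made-up numbers: suppose spam (c1) has prior 0.4, the word "free" appears in 80% of spam, and in only 10% of non-spam.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_c1 = 0.4      # prior P(spam)
p_c2 = 0.6      # prior P(not spam)
p_x_c1 = 0.8    # P('free' appears | spam)
p_x_c2 = 0.1    # P('free' appears | not spam)

# P(x) via the law of total probability
p_x = p_x_c1 * p_c1 + p_x_c2 * p_c2

# Bayes' theorem: P(c1|x) = P(x|c1) P(c1) / P(x)
p_c1_x = p_x_c1 * p_c1 / p_x
print(p_c1_x)   # posterior P(spam | 'free' appears), about 0.842
```

Seeing the word "free" raises the spam probability from the prior 0.4 to roughly 0.84.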
We give an example of using Naive Bayes for text categorization:
First create a database:
def Load_dataset():
    postingList = [['my', 'dog', 'has', 'flea',
                    'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him',
                    'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute',
                    'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                    'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive post, 0 = normal post
    return postingList, classVec
Next, we create a vocabulary so that each word has a position index in it. In general, the size of the vocabulary is the dimensionality of our sample vectors:
def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)  # union of all word sets
    return list(vocabSet)
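A quick check of the vocabulary builder on a two-document toy corpus (the function is restated so the snippet runs on its own):

```python
def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)  # union of all word sets
    return list(vocabSet)

corpus = [['my', 'dog', 'has', 'flea'],
          ['my', 'cat', 'is', 'cute']]
vocab = Create_vocablist(corpus)
print(sorted(vocab))
# 'my' appears in both documents but is stored only once
```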
With the vocabulary we can turn a sample into a vector. One method records only whether a word appears (set-of-words); the other counts how many times it appears (bag-of-words).
def Word2Vec(vocabList, inputSet):
    # set-of-words model: 1 if the word appears, 0 otherwise
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec
def BoW_Vec(vocabList, inputSet):
    # bag-of-words model: count the number of occurrences
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec
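To see the difference between the two representations, vectorize a document that repeats a word (both functions restated, without the warning branch, so the snippet is self-contained):

```python
def Word2Vec(vocabList, inputSet):
    # set-of-words: record presence only
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def BoW_Vec(vocabList, inputSet):
    # bag-of-words: record counts
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']
print(Word2Vec(vocab, doc))  # [1, 1, 0]
print(BoW_Vec(vocab, doc))   # [1, 2, 0]
```

The repeated word 'stupid' is clipped to 1 in the set-of-words vector but counted twice in the bag-of-words vector.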
Next, we build the classifier. Note that since each probability is a number between 0 and 1, multiplying many of them together drives the product toward zero (floating-point underflow), so we move to the log domain, where the product of probabilities becomes a sum of logs:
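The underflow problem is easy to demonstrate: multiplying 200 small word probabilities reaches exactly 0.0 in double precision, while the equivalent sum of logs stays finite.

```python
import math

probs = [0.01] * 200          # 200 per-word conditional probabilities

# direct product: 0.01**200 = 1e-400, below the smallest double,
# so the result underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0

# log-domain sum is perfectly representable
log_sum = sum(math.log(p) for p in probs)
print(log_sum)                # 200 * log(0.01), about -921.03
```

Because log is monotonic, comparing log posteriors gives the same decision as comparing the underlying products.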
import numpy as np

def Train_NB(trainMat, trainClass):
    Num_doc = len(trainMat)
    Num_word = len(trainMat[0])
    P_1 = sum(trainClass) / float(Num_doc)  # prior P(c = 1)
    # Laplace smoothing: start counts at 1 and denominators at 2
    # so that an unseen word never yields a zero probability
    P0_num = np.zeros(Num_word) + 1
    P1_num = np.zeros(Num_word) + 1
    P0_deno = 2.0
    P1_deno = 2.0
    for i in range(Num_doc):
        if trainClass[i] == 1:
            P1_num += trainMat[i]
            P1_deno += sum(trainMat[i])
        else:
            P0_num += trainMat[i]
            P0_deno += sum(trainMat[i])
    P1_vec = np.log(P1_num / P1_deno)  # log P(word | c = 1)
    P0_vec = np.log(P0_num / P0_deno)  # log P(word | c = 0)
    return P_1, P1_vec, P0_vec
import math

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    # compare log posteriors: sum of log P(word|c) plus log prior
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    if p1 > p0:
        return 1
    else:
        return 0
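A minimal check of the decision rule with hand-made log-probability vectors (classifier restated so the snippet runs on its own; the numbers are illustrative, not learned from data):

```python
import math
import numpy as np

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    # compare log posteriors: sum of log P(word|c) plus log prior
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    return 1 if p1 > p0 else 0

# vocabulary of 3 words; word 0 is far more likely under class 1
P1_vec = np.log(np.array([0.6, 0.2, 0.2]))
P0_vec = np.log(np.array([0.1, 0.45, 0.45]))

print(Classify_NB(np.array([1, 0, 0]), P0_vec, P1_vec, 0.5))  # 1
print(Classify_NB(np.array([0, 1, 1]), P0_vec, P1_vec, 0.5))  # 0
```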
def Text_parse(longstring):
    import re
    # split on runs of non-word characters, drop empty tokens, lowercase
    listOfTokens = re.split(r'\W+', longstring)
    return [tok.lower() for tok in listOfTokens if len(tok) > 0]
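A quick check of the tokenizer (restated so the snippet is self-contained):

```python
import re

def Text_parse(longstring):
    # split on runs of non-word characters, drop empty tokens, lowercase
    listOfTokens = re.split(r'\W+', longstring)
    return [tok.lower() for tok in listOfTokens if len(tok) > 0]

print(Text_parse('Hello, World!! This is SPAM.'))
# ['hello', 'world', 'this', 'is', 'spam']
```

Note that `\W+` (one or more non-word characters) is used rather than `\W*`: splitting on a pattern that can match the empty string would chop the text into single characters.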
Here is a simple test:
test_string = 'This book is the best book on Python or ML \
I have ever laid eyes upon.'
wordList = Text_parse(test_string)
Mydata, classVec = Load_dataset()
Vocablist = Create_vocablist(Mydata)
Wordvec = Word2Vec(Vocablist, Mydata[0])
trainMat = []
for doc in Mydata:
    trainMat.append(Word2Vec(Vocablist, doc))
P_1, P1_vec, P0_vec = Train_NB(trainMat, classVec)
print(Mydata)
print(classVec)
print(wordList)
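Putting the pieces together, here is a self-contained end-to-end run that trains on the toy posts and classifies two new documents (all function bodies restated so the snippet runs on its own):

```python
import math
import numpy as np

def Load_dataset():
    posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
             ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
             ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
             ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
             ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
             ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 = abusive post
    return posts, classVec

def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def Word2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def Train_NB(trainMat, trainClass):
    Num_doc = len(trainMat)
    Num_word = len(trainMat[0])
    P_1 = sum(trainClass) / float(Num_doc)
    P0_num = np.zeros(Num_word) + 1     # Laplace smoothing
    P1_num = np.zeros(Num_word) + 1
    P0_deno, P1_deno = 2.0, 2.0
    for i in range(Num_doc):
        if trainClass[i] == 1:
            P1_num += trainMat[i]
            P1_deno += sum(trainMat[i])
        else:
            P0_num += trainMat[i]
            P0_deno += sum(trainMat[i])
    return P_1, np.log(P1_num / P1_deno), np.log(P0_num / P0_deno)

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    return 1 if p1 > p0 else 0

Mydata, classVec = Load_dataset()
Vocablist = Create_vocablist(Mydata)
trainMat = [Word2Vec(Vocablist, doc) for doc in Mydata]
P_1, P1_vec, P0_vec = Train_NB(trainMat, classVec)

label_nice = Classify_NB(np.array(Word2Vec(Vocablist, ['love', 'my', 'dalmation'])),
                         P0_vec, P1_vec, P_1)
label_abusive = Classify_NB(np.array(Word2Vec(Vocablist, ['stupid', 'garbage'])),
                            P0_vec, P1_vec, P_1)
print(label_nice, label_abusive)   # 0 1
```

The document made of friendly words is labeled 0, and the one containing 'stupid' and 'garbage' is labeled 1, matching the training labels.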