This post introduces a common probability-based classification method in machine learning: Naive Bayes. The previously introduced KNN and decision-tree methods make hard decisions, since those classifiers output only a class label such as 0 or 1. Naive Bayes instead outputs the probability that a sample belongs to a class, a value between 0 and 1. Naive Bayes is very effective for text classification, for example spam detection.
Naive Bayes is based on the familiar Bayes' theorem:

P(c|x) = P(x|c) P(c) / P(x)

Suppose we have a two-class problem with classes c1 and c2, and we are given a sample, such as an email, represented by a vector x. An email is a piece of text, and text is composed of words, so x encodes which words appear in the email. Given the sample x, we want to decide whether it belongs to c1 or c2. In terms of probabilities, we compare the two posteriors:

P(c1|x) and P(c2|x)

Applying Bayes' theorem to each, we get:

P(ci|x) = P(x|ci) P(ci) / P(x), i = 1, 2

Since P(x) is the same for both classes, it is enough to compare P(x|c1) P(c1) against P(x|c2) P(c2). The "naive" part is the assumption that the words are conditionally independent given the class, so P(x|ci) factors into a product of per-word probabilities P(xj|ci). This is what we call Naive Bayes; everything that follows is just counting to estimate these probabilities.
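As a quick sanity check on the theorem, here is a tiny worked example with made-up numbers: suppose spam (c1) has prior 0.4, the word "free" appears in 80% of spam, and in only 10% of non-spam.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_c1 = 0.4      # prior P(spam)
p_c2 = 0.6      # prior P(not spam)
p_x_c1 = 0.8    # P('free' appears | spam)
p_x_c2 = 0.1    # P('free' appears | not spam)

# P(x) via the law of total probability
p_x = p_x_c1 * p_c1 + p_x_c2 * p_c2

# Bayes' theorem: P(c1|x) = P(x|c1) P(c1) / P(x)
p_c1_x = p_x_c1 * p_c1 / p_x
print(p_c1_x)   # posterior P(spam | 'free' appears), about 0.842
```

Seeing the word "free" raises the spam probability from the prior 0.4 to roughly 0.84.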
We give an example of using Naive Bayes for text categorization:
First create a database:
def Load_dataset():
    postingList = [['my', 'dog', 'has', 'flea',
                    'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him',
                    'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute',
                    'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                    'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive post, 0 = normal post
    return postingList, classVec
Next, we create a vocabulary so that each word has a position index in it. In general, the size of the vocabulary is the dimensionality of our sample vectors:
def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)  # union of all word sets
    return list(vocabSet)
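A quick check of the vocabulary builder on a two-document toy corpus (the function is restated so the snippet runs on its own):

```python
def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)  # union of all word sets
    return list(vocabSet)

corpus = [['my', 'dog', 'has', 'flea'],
          ['my', 'cat', 'is', 'cute']]
vocab = Create_vocablist(corpus)
print(sorted(vocab))
# 'my' appears in both documents but is stored only once
```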
With the vocabulary we can turn a sample into a vector. One method records only whether a word appears (set-of-words); the other counts how many times it appears (bag-of-words).
def Word2Vec(vocabList, inputSet):
    # set-of-words model: 1 if the word appears, 0 otherwise
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec
def BoW_Vec(vocabList, inputSet):
    # bag-of-words model: count the number of occurrences
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec
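To see the difference between the two representations, vectorize a document that repeats a word (both functions restated, without the warning branch, so the snippet is self-contained):

```python
def Word2Vec(vocabList, inputSet):
    # set-of-words: record presence only
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def BoW_Vec(vocabList, inputSet):
    # bag-of-words: record counts
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']
print(Word2Vec(vocab, doc))  # [1, 1, 0]
print(BoW_Vec(vocab, doc))   # [1, 2, 0]
```

The repeated word 'stupid' is clipped to 1 in the set-of-words vector but counted twice in the bag-of-words vector.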
Next, we build the classifier. Note that since each probability is a number between 0 and 1, multiplying many of them together drives the product toward zero (floating-point underflow), so we move to the log domain, where the product of probabilities becomes a sum of logs:
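The underflow problem is easy to demonstrate: multiplying 200 small word probabilities reaches exactly 0.0 in double precision, while the equivalent sum of logs stays finite.

```python
import math

probs = [0.01] * 200          # 200 per-word conditional probabilities

# direct product: 0.01**200 = 1e-400, below the smallest double,
# so the result underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0

# log-domain sum is perfectly representable
log_sum = sum(math.log(p) for p in probs)
print(log_sum)                # 200 * log(0.01), about -921.03
```

Because log is monotonic, comparing log posteriors gives the same decision as comparing the underlying products.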
import numpy as np

def Train_NB(trainMat, trainClass):
    Num_doc = len(trainMat)
    Num_word = len(trainMat[0])
    P_1 = sum(trainClass) / float(Num_doc)  # prior P(c = 1)
    # Laplace smoothing: start counts at 1 and denominators at 2
    # so that an unseen word never yields a zero probability
    P0_num = np.zeros(Num_word) + 1
    P1_num = np.zeros(Num_word) + 1
    P0_deno = 2.0
    P1_deno = 2.0
    for i in range(Num_doc):
        if trainClass[i] == 1:
            P1_num += trainMat[i]
            P1_deno += sum(trainMat[i])
        else:
            P0_num += trainMat[i]
            P0_deno += sum(trainMat[i])
    P1_vec = np.log(P1_num / P1_deno)  # log P(word | c = 1)
    P0_vec = np.log(P0_num / P0_deno)  # log P(word | c = 0)
    return P_1, P1_vec, P0_vec
import math

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    # compare log posteriors: sum of log P(word|c) plus log prior
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    if p1 > p0:
        return 1
    else:
        return 0
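A minimal check of the decision rule with hand-made log-probability vectors (classifier restated so the snippet runs on its own; the numbers are illustrative, not learned from data):

```python
import math
import numpy as np

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    # compare log posteriors: sum of log P(word|c) plus log prior
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    return 1 if p1 > p0 else 0

# vocabulary of 3 words; word 0 is far more likely under class 1
P1_vec = np.log(np.array([0.6, 0.2, 0.2]))
P0_vec = np.log(np.array([0.1, 0.45, 0.45]))

print(Classify_NB(np.array([1, 0, 0]), P0_vec, P1_vec, 0.5))  # 1
print(Classify_NB(np.array([0, 1, 1]), P0_vec, P1_vec, 0.5))  # 0
```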
def Text_parse(longstring):
    import re
    # split on runs of non-word characters, drop empty tokens, lowercase
    listOfTokens = re.split(r'\W+', longstring)
    return [tok.lower() for tok in listOfTokens if len(tok) > 0]
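A quick check of the tokenizer (restated so the snippet is self-contained):

```python
import re

def Text_parse(longstring):
    # split on runs of non-word characters, drop empty tokens, lowercase
    listOfTokens = re.split(r'\W+', longstring)
    return [tok.lower() for tok in listOfTokens if len(tok) > 0]

print(Text_parse('Hello, World!! This is SPAM.'))
# ['hello', 'world', 'this', 'is', 'spam']
```

Note that `\W+` (one or more non-word characters) is used rather than `\W*`: splitting on a pattern that can match the empty string would chop the text into single characters.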
Here is a simple test:
test_string = 'This book is the best book on Python or ML \
I have ever laid eyes upon.'
wordList = Text_parse(test_string)
Mydata, classVec = Load_dataset()
Vocablist = Create_vocablist(Mydata)
Wordvec = Word2Vec(Vocablist, Mydata[0])
trainMat = []
for doc in Mydata:
    trainMat.append(Word2Vec(Vocablist, doc))
P_1, P1_vec, P0_vec = Train_NB(trainMat, classVec)
print(Mydata)
print(classVec)
print(wordList)
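Putting the pieces together, here is a self-contained end-to-end run that trains on the toy posts and classifies two new documents (all function bodies restated so the snippet runs on its own):

```python
import math
import numpy as np

def Load_dataset():
    posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
             ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
             ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
             ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
             ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
             ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 = abusive post
    return posts, classVec

def Create_vocablist(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def Word2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def Train_NB(trainMat, trainClass):
    Num_doc = len(trainMat)
    Num_word = len(trainMat[0])
    P_1 = sum(trainClass) / float(Num_doc)
    P0_num = np.zeros(Num_word) + 1     # Laplace smoothing
    P1_num = np.zeros(Num_word) + 1
    P0_deno, P1_deno = 2.0, 2.0
    for i in range(Num_doc):
        if trainClass[i] == 1:
            P1_num += trainMat[i]
            P1_deno += sum(trainMat[i])
        else:
            P0_num += trainMat[i]
            P0_deno += sum(trainMat[i])
    return P_1, np.log(P1_num / P1_deno), np.log(P0_num / P0_deno)

def Classify_NB(testVec, P0_vec, P1_vec, P1):
    p1 = sum(testVec * P1_vec) + math.log(P1)
    p0 = sum(testVec * P0_vec) + math.log(1.0 - P1)
    return 1 if p1 > p0 else 0

Mydata, classVec = Load_dataset()
Vocablist = Create_vocablist(Mydata)
trainMat = [Word2Vec(Vocablist, doc) for doc in Mydata]
P_1, P1_vec, P0_vec = Train_NB(trainMat, classVec)

label_nice = Classify_NB(np.array(Word2Vec(Vocablist, ['love', 'my', 'dalmation'])),
                         P0_vec, P1_vec, P_1)
label_abusive = Classify_NB(np.array(Word2Vec(Vocablist, ['stupid', 'garbage'])),
                            P0_vec, P1_vec, P_1)
print(label_nice, label_abusive)   # 0 1
```

The document made of friendly words is labeled 0, and the one containing 'stupid' and 'garbage' is labeled 1, matching the training labels.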