Writing a Classification Decision Tree in Python

2020-02-16 11:15:58

Contributed by: a site user

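Decision trees are commonly used for classification in machine learning.

Advantages: low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.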

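1. Information gain

The purpose of splitting a data set is to make disordered data more ordered. One way to organize messy data is to measure its information using information theory. The usual measure is information gain, which is the reduction in information entropy between the data before a split and the data after it. The more disordered the information, the higher the entropy, so the feature that yields the highest information gain is the best choice for the split.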
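Entropy is defined as the expected value of information. The information of a symbol x_i is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of that class.
Entropy, the expected value of the information, is:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where the sum runs over all n classes.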
The code for computing the information entropy is as follows:

import math

def calcShannonEnt(dataSet):
    # Count how often each class label (the last column) occurs.
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Accumulate -p * log2(p) over all labels.
    shannonEnt = 0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt = shannonEnt - prob * math.log2(prob)
    return shannonEnt
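To make the entropy definition concrete, here is a minimal sketch that evaluates it on a tiny data set. The function name `shannon_entropy` and the rows in `toy_data` are hypothetical and are not part of the original article; the function is a compact variant of the same calculation built on `collections.Counter`.

```python
import math
from collections import Counter


def shannon_entropy(data_set):
    """Entropy (in bits) of the class labels stored in the last column."""
    label_counts = Counter(row[-1] for row in data_set)
    total = len(data_set)
    return -sum((n / total) * math.log2(n / total) for n in label_counts.values())


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

mixed = shannon_entropy(toy_data)                    # 2 "yes" vs 3 "no", about 0.971 bits
pure = shannon_entropy([[1, "yes"], [0, "yes"]])     # one class only, zero entropy
balanced = shannon_entropy([[1, "yes"], [0, "no"]])  # 50/50 split, exactly 1 bit
```

A data set with a single class has zero entropy, and an even two-class split has the maximum of one bit, which matches the idea that more disorder means higher entropy.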

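Using the entropy, the data set can be split so as to obtain the maximum information gain.

2. Splitting the data set

Splitting the data set means extracting every record whose chosen feature equals a given value. The chosen feature column is removed from the extracted records.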

def splitDataSet(dataSet, axis, value):
    retDataset = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Copy the record without the feature at index axis.
            newVec = featVec[:axis]
            newVec.extend(featVec[axis+1:])
            retDataset.append(newVec)
    return retDataset
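The following usage sketch shows what such a split returns. It assumes list-based rows with the class label in the last column; `split_data_set` is a hypothetical list-comprehension variant of the function above, and `toy_data` is made-up illustration data.

```python
def split_data_set(data_set, axis, value):
    """Keep rows whose feature `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

ones = split_data_set(toy_data, 0, 1)   # rows where feature 0 == 1
zeros = split_data_set(toy_data, 0, 0)  # rows where feature 0 == 0
```

Because slicing creates new lists, the original rows are left unchanged, which matters when the same data set is split repeatedly on different features.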

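3. Choosing the best way to split the data set

Information gain is the reduction in entropy, that is, the reduction in the disorder of the information.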

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    bestInfoGain = 0
    bestFeature = -1  # stays -1 if no feature reduces the entropy
    baseEntropy = calcShannonEnt(dataSet)
    for i in range(numFeatures):
        # List comprehension: collect every value of feature i.
        allValue = [example[i] for example in dataSet]
        # A set is the quickest way to get the unique values.
        allValue = set(allValue)
        newEntropy = 0
        for value in allValue:
            splitset = splitDataSet(dataSet, i, value)
            newEntropy = newEntropy + len(splitset)/len(dataSet)*calcShannonEnt(splitset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
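As a worked illustration of the selection step, the sketch below computes the information gain of each feature separately, so the comparison is visible. The helper names (`entropy`, `split`, `info_gain`) and the `toy_data` rows are hypothetical; the gain is the entropy before the split minus the size-weighted entropy of the subsets after it, as in the function above.

```python
import math
from collections import Counter


def entropy(rows):
    """Shannon entropy (bits) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows)) for n in counts.values())


def split(rows, axis, value):
    """Rows whose feature `axis` equals `value`, with that column removed."""
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]


def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in {r[axis] for r in rows}:
        subset = split(rows, axis, value)
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

gains = [info_gain(toy_data, i) for i in range(len(toy_data[0]) - 1)]
best = max(range(len(gains)), key=gains.__getitem__)  # index of the largest gain
```

On this data, feature 0 gains about 0.42 bits and feature 1 about 0.17 bits, so feature 0 would be chosen for the first split.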

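4. Building the decision tree recursively

The recursion stops when the program has used up every attribute available for splitting, or when all instances under a branch share the same class.
When all attributes have been processed but the class labels are still not unique, the class of the leaf node is decided by majority vote.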
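The stopping conditions and the majority vote described above can be sketched as follows. This is one possible implementation under stated assumptions, not the article's own code: the nested-dict tree representation, all function names, the feature names, and `toy_data` are hypothetical, and the extra check for "no feature reduces the entropy" is an added safeguard.

```python
import math
from collections import Counter


def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows)) for n in counts.values())


def split(rows, axis, value):
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]


def best_feature(rows):
    """Index of the feature with the highest information gain, or -1 if none helps."""
    base, best_gain, best = entropy(rows), 0.0, -1
    for i in range(len(rows[0]) - 1):
        new = 0.0
        for v in {r[i] for r in rows}:
            subset = split(rows, i, v)
            new += len(subset) / len(rows) * entropy(subset)
        if base - new > best_gain:
            best_gain, best = base - new, i
    return best


def majority_label(labels):
    """Most frequent class label (majority vote)."""
    return Counter(labels).most_common(1)[0][0]


def create_tree(rows, feature_names):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:   # every instance has the same class
        return labels[0]
    if len(rows[0]) == 1:       # all attributes used up: majority vote
        return majority_label(labels)
    i = best_feature(rows)
    if i == -1:                 # assumption: also vote when no split helps
        return majority_label(labels)
    remaining = feature_names[:i] + feature_names[i + 1:]
    return {feature_names[i]: {v: create_tree(split(rows, i, v), remaining)
                               for v in {r[i] for r in rows}}}


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

tree = create_tree(toy_data, ["feature_a", "feature_b"])
leaf = create_tree([["a"], ["a"], ["b"]], [])  # no attributes left: majority vote
```

Each internal node is a dict keyed by the feature name, whose values map each feature value to either a subtree or a class label, so the result can be inspected by simply printing it.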
