Python語言描述KNN算法與Kd樹

2020-02-16 11:03:49

字體：大中小

來源：轉載

供稿：網友

最近鄰法和k-近鄰法

下面圖片中只有三種豆，有三個豆是未知的種類，如何判定他們的種類？

提供一種思路，即：未知的豆離哪種豆最近就認為未知豆和該豆是同一種類。由此，我們引出最近鄰算法的定義：為了判定未知樣本的類別，以全部訓練樣本作為代表點，計算未知樣本與所有訓練樣本的距離，并以最近鄰者的類別作為決策未知樣本類別的唯一依據。但是，最近鄰算法明顯是存在缺陷的，比如下面的例子：有一個未知形狀(圖中綠色的圓點)，如何判斷它是什么形狀？

顯然，最近鄰算法的缺陷——對噪聲數據過于敏感，為了解決這個問題，我們可以可以把未知樣本周邊的多個最近樣本計算在內，擴大參與決策的樣本量，以避免個別數據直接決定決策結果。由此，我們引進K-最近鄰算法。K-最近鄰算法是最近鄰算法的一個延伸。基本思路是：選擇未知樣本一定范圍內確定個數的K個樣本，該K個樣本大多數屬于某一類型，則未知樣本判定為該類型。如何選擇一個最佳的K值取決于數據。一般情況下，在分類時較大的K值能夠減小噪聲的影響，但會使類別之間的界限變得模糊。待測樣本（綠色圓圈）既可能分到紅色三角形類，也可能分到藍色正方形類。如果k取3，從圖可見，待測樣本的3個鄰居在實線的內圓里，按多數投票結果，它屬于紅色三角形類。但是如果k取5，那么待測樣本的最鄰近的5個樣本在虛線的圓里，按表決法，它又屬于藍色正方形類。在實際應用中，K先取一個比較小的數值，再采用交叉驗證法來逐步調整K值，最終選擇適合該樣本的最優的K值。

KNN算法實現　
算法基本步驟：

1）計算待分類點與已知類別的點之間的距離

2）按照距離遞增次序排序

3）選取與待分類點距離最小的k個點

4）確定前k個點所在類別的出現次數

5）返回前k個點出現次數最高的類別作為待分類點的預測分類

　　下面是一個按照算法基本步驟用python實現的簡單例子，根據已分類的4個樣本點來預測未知點(圖中的灰點)的分類：

from numpy import * # create a dataset which contains 4 samples with 2 classes def createDataSet():   # create a matrix: each row as a sample   group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])   labels = ['A', 'A', 'B', 'B'] # four samples and two classes   return group, labels# classify using kNN (k Nearest Neighbors ) # Input:   newInput: 1 x N#       dataSet: M x N (M samples N, features)#       labels:  1 x M  #       k: number of neighbors to use for comparison # Output:   the most popular class label  def kNNClassify(newInput, dataSet, labels, k):   numSamples = dataSet.shape[0] # shape[0] stands for the num of row   ## step 1: calculate Euclidean distance   # tile(A, reps): Construct an array by repeating A reps times   # the following copy numSamples rows for dataSet   diff = tile(newInput, (numSamples, 1)) - dataSet # Subtract element-wise   squaredDiff = diff ** 2 # squared for the subtract   squaredDist = sum(squaredDiff, axis = 1) # sum is performed by row   distance = squaredDist ** 0.5   ## step 2: sort the distance   # argsort() returns the indices that would sort an array in a ascending order   sortedDistIndices = argsort(distance)   classCount = {} # define a dictionary (can be append element)   for i in xrange(k):     ## step 3: choose the min k distance     voteLabel = labels[sortedDistIndices[i]]     ## step 4: count the times labels occur     # when the key voteLabel is not in dictionary classCount, get()     # will return 0     classCount[voteLabel] = classCount.get(voteLabel, 0) + 1   ## step 5: the max voted class will return   maxCount = 0   for key, value in classCount.items():     if value > maxCount:       maxCount = value       maxIndex = key   return maxIndex    if __name__== "__main__":    dataSet, labels = createDataSet()   testX = array([1.2, 1.0])   k = 3   outputLabel = kNNClassify(testX, dataSet, labels, 3)   print "Your input is:", testX, "and classified to class: ", outputLabel   testX = array([0.1, 0.3])   outputLabel = kNNClassify(testX, dataSet, labels, 3)   print "Your input is:", testX, "and classified to class: ", outputLabel

上一篇：Python數據結構與算法之圖的最短路徑(Dijkstra算法)完整實例

下一篇：關于反爬蟲的一些簡單總結