python實(shí)現(xiàn)kMeans算法

2020-02-16 11:15:11

字體：大中小

供稿：網(wǎng)友

聚類是一種無(wú)監(jiān)督的學(xué)習(xí)，將相似的對(duì)象放到同一簇中，有點(diǎn)像是全自動(dòng)分類，簇內(nèi)的對(duì)象越相似，簇間的對(duì)象差別越大，則聚類效果越好。

1、k均值聚類算法

k均值聚類將數(shù)據(jù)分為k個(gè)簇，每個(gè)簇通過(guò)其質(zhì)心，即簇中所有點(diǎn)的中心來(lái)描述。首先隨機(jī)確定k個(gè)初始點(diǎn)作為質(zhì)心，然后將數(shù)據(jù)集分配到距離最近的簇中。然后將每個(gè)簇的質(zhì)心更新為所有數(shù)據(jù)集的平均值。然后再進(jìn)行第二次劃分?jǐn)?shù)據(jù)集，直到聚類結(jié)果不再變化為止。

偽代碼為

隨機(jī)創(chuàng)建k個(gè)簇質(zhì)心
當(dāng)任意一個(gè)點(diǎn)的簇分配發(fā)生改變時(shí)：
    對(duì)數(shù)據(jù)集中的每個(gè)數(shù)據(jù)點(diǎn)：
        對(duì)每個(gè)質(zhì)心：
            計(jì)算數(shù)據(jù)集到質(zhì)心的距離
        將數(shù)據(jù)集分配到最近距離質(zhì)心對(duì)應(yīng)的簇
    對(duì)每一個(gè)簇，計(jì)算簇中所有點(diǎn)的均值并將均值作為質(zhì)心

python實(shí)現(xiàn)

import numpy as npimport matplotlib.pyplot as pltdef loadDataSet(fileName):  dataMat = []  with open(fileName) as f:  for line in f.readlines():   line = line.strip().split('/t')   dataMat.append(line) dataMat = np.array(dataMat).astype(np.float32) return dataMatdef distEclud(vecA,vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2)))def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m):  centmin = min(dataSet[:,i])  centmax = max(dataSet[:,i])  center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return centerdef kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged:  clusterChanged = False  for i in range(m):   minDist = np.inf   minIndex = -1   for j in range(k):    distJI = distMeans(dataSet[i,:],centroids[j,:])    if distJI < minDist:     minDist = distJI     minIndex = j   if clusterAssment[i,0] != minIndex:    clusterChanged = True   clusterAssment[i,:] = minIndex,minDist**2  for cent in range(k):   ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]]   centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssmentdata = loadDataSet('testSet.txt')muCentroids, clusterAssing = kMeans(data,4)fig = plt.figure(0)ax = fig.add_subplot(111)ax.scatter(data[:,0],data[:,1],c = clusterAssing[:,0].A)plt.show()print(clusterAssing)

2、二分k均值算法

K均值算法可能會(huì)收斂到局部最小值，而非全局最小。一種用于度量聚類效果的指標(biāo)為誤差平方和（SSE）。因?yàn)槿×似椒?，更加重視原理中心的點(diǎn)。為了克服k均值算法可能會(huì)收斂到局部最小值的問(wèn)題，有人提出來(lái)二分k均值算法。

上一篇：python實(shí)現(xiàn)決策樹(shù)

下一篇：python+selenium實(shí)現(xiàn)登錄賬戶后自動(dòng)點(diǎn)擊的示例