聚類是一種無(wú)監(jiān)督的學(xué)習(xí),將相似的對(duì)象放到同一簇中,有點(diǎn)像是全自動(dòng)分類,簇內(nèi)的對(duì)象越相似,簇間的對(duì)象差別越大,則聚類效果越好。
1、k均值聚類算法
k均值聚類將數(shù)據(jù)分為k個(gè)簇,每個(gè)簇通過(guò)其質(zhì)心,即簇中所有點(diǎn)的中心來(lái)描述。首先隨機(jī)確定k個(gè)初始點(diǎn)作為質(zhì)心,然后將數(shù)據(jù)集分配到距離最近的簇中。然后將每個(gè)簇的質(zhì)心更新為所有數(shù)據(jù)集的平均值。然后再進(jìn)行第二次劃分?jǐn)?shù)據(jù)集,直到聚類結(jié)果不再變化為止。
偽代碼為
隨機(jī)創(chuàng)建k個(gè)簇質(zhì)心
當(dāng)任意一個(gè)點(diǎn)的簇分配發(fā)生改變時(shí):
對(duì)數(shù)據(jù)集中的每個(gè)數(shù)據(jù)點(diǎn):
對(duì)每個(gè)質(zhì)心:
計(jì)算數(shù)據(jù)集到質(zhì)心的距離
將數(shù)據(jù)集分配到最近距離質(zhì)心對(duì)應(yīng)的簇
對(duì)每一個(gè)簇,計(jì)算簇中所有點(diǎn)的均值并將均值作為質(zhì)心
python實(shí)現(xiàn)
import numpy as npimport matplotlib.pyplot as pltdef loadDataSet(fileName): dataMat = [] with open(fileName) as f: for line in f.readlines(): line = line.strip().split('/t') dataMat.append(line) dataMat = np.array(dataMat).astype(np.float32) return dataMatdef distEclud(vecA,vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2)))def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m): centmin = min(dataSet[:,i]) centmax = max(dataSet[:,i]) center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return centerdef kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): minDist = np.inf minIndex = -1 for j in range(k): distJI = distMeans(dataSet[i,:],centroids[j,:]) if distJI < minDist: minDist = distJI minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 for cent in range(k): ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]] centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssmentdata = loadDataSet('testSet.txt')muCentroids, clusterAssing = kMeans(data,4)fig = plt.figure(0)ax = fig.add_subplot(111)ax.scatter(data[:,0],data[:,1],c = clusterAssing[:,0].A)plt.show()print(clusterAssing)2、二分k均值算法
K均值算法可能會(huì)收斂到局部最小值,而非全局最小。一種用于度量聚類效果的指標(biāo)為誤差平方和(SSE)。因?yàn)槿×似椒?,更加重視原理中心的點(diǎn)。為了克服k均值算法可能會(huì)收斂到局部最小值的問(wèn)題,有人提出來(lái)二分k均值算法。
新聞熱點(diǎn)
疑難解答
圖片精選