Evaluating Clustering Results in sklearn with the Elbow Method

Published: 2019-10-25
Difficulty: Intermediate
Categories: cluster analysis, elbow method, evaluation of clustering results
Tags: Python, scipy.spatial.distance.cdist, matplotlib.pyplot

1. Problem Description

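The program below is an example of using the elbow method with scikit-learn to estimate the best number of clusters. It uses 3,000 two-dimensional data points generated randomly with numpy and arranged in three well-separated groups, so the optimal number of clusters is 3.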
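The quantity the elbow method tracks here is the mean distortion: the average Euclidean distance from each point to its nearest cluster centre. As a minimal sketch, the toy points and centres below are made up for illustration (they are not part of the original program) and are chosen so the result can be checked by hand.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy data: two pairs of points, each pair straddling one centre.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centres_toy = np.array([[0.0, 1.0], [10.0, 1.0]])

# Distance from every point to every centre, shape (4, 2).
distances = cdist(X_toy, centres_toy, "euclidean")

# Keep the distance to the nearest centre for each point, then average.
mean_distortion = distances.min(axis=1).mean()
print(mean_distortion)
```

Every point lies exactly 1 unit from its nearest centre, so the mean distortion is 1.0. Adding more centres can only keep this value the same or lower it, which is why the curve is read for the point where the decrease flattens out rather than for its minimum.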

2. Implementation

# Example: estimating the number of clusters with the elbow method.
# Import the required packages.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
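# Draw three clusters from uniform distributions over the given ranges, each as a 2 x 1000 array, then merge them into X (shape 3000 x 2).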
cluster1 = np.random.uniform(0.5, 1.5, (2, 1000))
cluster2 = np.random.uniform(10.5, 11.5, (2, 1000))
cluster3 = np.random.uniform(20.5, 21.5, (2, 1000))
X = np.hstack((cluster1, cluster2, cluster3)).T
# For each k from 1 to 9, fit KMeans and record the mean Euclidean distance from each point to its nearest cluster centre.
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average Distortion')
plt.title('Selecting K with the Elbow Method')
plt.show()
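Reading the elbow off the plot is a visual judgement. As a rough sketch, the bend can also be located numerically by taking the k with the largest second difference of the distortion curve. This heuristic, the fixed seeds, and the names `mean_distortions`, `curvature` and `elbow_k` are assumptions added for illustration, not part of the original program.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Same layout as the example above: three unit squares centred near 1, 11 and 21.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(c - 0.5, c + 0.5, (1000, 2)) for c in (1.0, 11.0, 21.0)])

ks = list(range(1, 10))
mean_distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distance from each point to its nearest centre, averaged over all points.
    nearest = cdist(X, km.cluster_centers_, "euclidean").min(axis=1)
    mean_distortions.append(nearest.mean())

# Second difference d[k-1] - 2*d[k] + d[k+1]; entry i corresponds to ks[i + 1].
curvature = np.diff(mean_distortions, n=2)
elbow_k = ks[int(np.argmax(curvature)) + 1]
print(elbow_k)
```

On data this well separated, the heuristic should select k = 3, matching the bend in the plot. When clusters overlap, the curve bends more gradually and the second difference becomes less reliable, so the plot is still worth inspecting.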