over 2 years ago



Kmeans的中文是物以類聚,文言一點就是近朱者赤。是一種非監督式學習,容易和KNN搞混。概念是,

  1. 如果我們隨便丟兩個點在空間中,計算待分群的資料點離哪個點較近,就歸類為哪群。
  2. 分群後,計算不同的中心值(平均)。
  3. 以此中心點,更新為新點,再重複作1直到點不再改變。

實做
一群資料分佈在空間中。假定人為指定分作k群,按照以下步驟

  • 隨意在空間中安置k個點,分析空間中每個點,按照(歐幾里得)距離遠近,被分類到第幾個群(cluster)
  • 更新分群後的平均值
  • 檢查所有點集合,若重新計算新點和每個點分群都相同. 則停止迭代,反之重複以上步驟。

此時會求得一組根據初始猜測值,得到的k分群最佳解.
程式碼

## clustering

from __future__ import division
import random
import matplotlib.pyplot as plt


def squared_distance(v,w):
    '''
    v,w is a vector,
    return the value of (v1-w1)^2 + (v2-w2)^2 +...
    '''
    return sum([(vi-wi)**2 for vi,wi in zip(v,w)])
def vector_add(v,w):
    '''add corresponding elements'''
    return [vi+wi for vi,wi in zip(v,w)]

def vector_sum(vectors):
    '''sum over elements '''
    result = vectors[0]
    for vector in vectors[1:]:
        result = vector_add(result,vector)
    return result
def scalar_multiply(c,v):
    '''c is a number, v is vector'''
    return [c*vi for vi in v]
def vector_mean(v_list):
    '''
    v_list is the lists of data sets (x1,y1,z1,..),(x2,y2,z2,...),
    retrun the value of mean of this data sets

    '''
    n = len(v_list)
    
    return scalar_multiply(1/n, vector_sum(v_list))


class KMeans:
    def __init__(self,k):
        self.k = k
        self.means = None ## mean point to be classified


    def classify(self,input):
        """classify the input data(xi,yi) into one of the k-size group
        """
        clf = min(range(self.k), key = lambda i:squared_distance(input, self.means[i]))
        return clf

    def train(self,inputs):
        initial_guess = random.sample(inputs, self.k)
        self.means = initial_guess
        # Find new category 

        category = None
        while True:
            new_category = map(self.classify, inputs)
            if new_category == category:
                return
            category = new_category
            ## find all points asign to cluster i

            for i in range(self.k):
                i_points = [p for p,a in zip(inputs,category) if a==i]

                if i_points: ## make sure no divide by zero lens

                    self.means[i] = vector_mean(i_points)


## drawing 

def draw():
    inputs = [[-14,-5],[13,13],[20,23],[-19,-11],[-9,-16],[21,27],[-49,15],[26,13],[-46,5],[-34,-1],[11,15],[-49,0],[-22,-16],[19,28],[-12,-8],[-13,-19],[-41,8],[-11,-6],[-25,-9],[-18,-3]]

    test = KMeans(3)
    test.train(inputs)
    x0 = [xi for xi,_ in inputs]
    y0 = [yi for _,yi in inputs]

    x_means = [xi for xi,_ in test.means]
    y_means = [yi for _,yi in test.means]

    print test.means

    plt.scatter(x0,y0)
    plt.scatter(x_means,y_means,color ='r')
    plt.show()
← 類神經網路(Artificial Nueral Network) 抽象繪圖-KMeans演算法 →
 
comments powered by Disqus